#3 Project Progress: Import and Clean the Data in Python

I am excited to share my first Python project related to data science. To complete this project, we first need to import the data. Jupyter Notebook is an interactive computing environment that comes with Anaconda, a Python distribution for scientific computing. We will run our code in this notebook.

We will use the read_csv() function in the pandas package to read the CSV file. My file here is Data.csv, which I downloaded from Kaggle (TMDB 5000 Movie Dataset | Kaggle). df.head() prints the first few rows of the data. df.shape gives the shape of the data; note that it is an attribute, so it is written without parentheses. In this data set, we have 8403 rows and 20 different features.
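Here is a minimal sketch of the import step. Since Data.csv is not included with this post, the example reads a tiny inline CSV with made-up rows instead; the column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for Data.csv: a tiny inline CSV with made-up rows.
sample_csv = io.StringIO(
    "title,genres,overview\n"
    "Movie A,Action,A hero saves the city.\n"
    "Movie B,Drama,A family reunites.\n"
    "Movie C,Comedy,Two friends open a bakery.\n"
)

# With the real file this would be: df = pd.read_csv("Data.csv")
df = pd.read_csv(sample_csv)

print(df.head(2))  # first two rows (head() defaults to five)
print(df.shape)    # (rows, columns); an attribute, so no parentheses
```

On the real data set, df.shape would report the row and feature counts mentioned above.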


After importing the data, we need to clean it so that we don't run into problems when we later manipulate it to build a model. We have 20 features in total, but we won't use them all, so we need to decide which features are important and which are not. Because I want to build a content-based movie recommendation system, I will choose the two features that I think are most important: genres and overview.
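One way to look at the available features and keep only the chosen ones is sketched below. The rows and the extra columns (title, budget) are made up for illustration; only the names genres and overview come from this post.

```python
import pandas as pd

# Made-up rows; only the column names genres and overview come from the post.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "budget": [1000, 2000],
    "genres": ["Action", "Drama"],
    "overview": ["A hero saves the city.", "A family reunites."],
})

print(df.columns.tolist())              # list every available feature
features = df[["genres", "overview"]]   # keep only the two chosen features
print(features.shape)
```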



We can use this code to create a new column for the important features, or we can use the shorter form: df['important_feature'] = df['overview'] + df['genres'].
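The short form can be tried on a small made-up table, as sketched below. Two details here are my own assumptions rather than part of the original notebook: the missing values are filled before combining (adding a string to a missing value gives a missing value), and a space is inserted between the two columns so the overview does not run straight into the genre.

```python
import pandas as pd

# Made-up rows with a few missing values for illustration.
df = pd.DataFrame({
    "genres": ["Action", "Drama", None],
    "overview": ["A hero saves the city.", None, "Two friends open a bakery."],
})

# Fill missing text first; otherwise the combined value would also be missing.
for col in ["overview", "genres"]:
    df[col] = df[col].fillna("")

# The " " separator is an addition to the short form shown in the post.
df["important_feature"] = df["overview"] + " " + df["genres"]

print(df["important_feature"].tolist())
```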



After creating the new column, we need to check whether we have any missing data. In this step, I had already filled the missing data with a blank string (' '). So you can see that in line 50 the result is 'False', a boolean that tells us whether any data is missing. Our data looks great so far!
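A check like this can be sketched as follows. The post does not show the exact call used in the notebook, so df.isnull().values.any() is an assumption; it is one common way to get a single True/False answer. The rows are made up.

```python
import pandas as pd

# Made-up rows with one missing genre.
df = pd.DataFrame({
    "genres": ["Action", None],
    "overview": ["A hero saves the city.", "A family reunites."],
})

has_missing_before = bool(df.isnull().values.any())
print(df.isnull().sum())     # number of missing values per column
print(has_missing_before)

df = df.fillna(" ")          # fill missing data with a blank string
has_missing_after = bool(df.isnull().values.any())
print(has_missing_after)
```

The per-column counts from df.isnull().sum() are useful when the answer is True and you want to know which feature needs attention.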

Comments

  1. Compared to what I've learned about cleaning data, the real process is much more complicated than what I did in this project. Also, I already figured out a way to build a basic machine learning model, but I feel like I took it in the wrong direction. That's why I only wrote about importing and cleaning the data in this journal; I need more time to read, research, and learn about the next step. But I guess I'm almost there!

    I would like to hear some feedback and advice from you! Thanks!

  2. Bao,

    Looks like you are making good progress. Indeed, cleaning data is a lengthy process and requires a vast array of knowledge. Take the time to read and plan on the front end so you can keep up a consistent pace in your project. You are doing quite well so far, keep it up!
