#2 Project update - General information regarding tools and packages in Python

In this journal, I will go over some general information and the tools that I will use for this project. This is what I have researched and learned so far.

To complete this project, we will go through 7 steps. These are also generally considered the typical steps for building a machine learning model.

                               1. Import the Data

Kaggle.com is a great source of datasets, where we can find many valuable ones. In this project, I will use the Movie List dataset. In this step, we will use the NumPy package in Python to hold the data in arrays. (The DataFrame structure itself comes from the Pandas package, which is covered in step 2.)

NumPy stands for Numerical Python. It is a general-purpose array-processing package in Python, and we will use it to work with arrays.

Here is a link to a NumPy tutorial in case someone wants to learn more about it:
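To make the idea of array processing a bit more concrete, here is a minimal sketch of a NumPy array. The numbers are made up for illustration and are not taken from the Movie List dataset.

```python
import numpy as np

# Hypothetical ratings: 2 reviewers (rows) x 3 movies (columns), made-up numbers.
ratings = np.array([[4.0, 3.5, 5.0],
                    [3.0, 4.5, 4.0]])

print(ratings.shape)         # (number of rows, number of columns)
print(ratings.mean(axis=0))  # average rating per movie (column-wise mean)
print(ratings * 2)           # element-wise arithmetic, no explicit loop needed
```

The point of the sketch is that one call works on the whole array at once, which is what makes NumPy convenient for numerical data.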

  • https://www.w3schools.com/python/numpy

                               2. Clean the Data

Cleaning the data is a very important step, because our model is unable to work with missing data. Therefore, we need to prepare the data carefully, either by filling in the missing values with the mean value or by dropping the rows that contain them.
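The two options mentioned above (filling with the mean, or dropping) could look roughly like this in Pandas. This is only a sketch: the small table and its column names are hypothetical stand-ins, not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset; column names are hypothetical.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [7.0, np.nan, 9.0, 8.0],
})

# Option 1: replace the missing rating with the mean of the known ratings.
mean_rating = movies["rating"].mean()  # NaN values are skipped by default
filled = movies.assign(rating=movies["rating"].fillna(mean_rating))

# Option 2: drop every row that contains a missing value.
dropped = movies.dropna()

print(filled)
print(dropped)
```

Filling keeps all four rows but introduces an estimated value, while dropping keeps only real values but loses a row. Which one is better depends on how much data is missing.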

In this step, we will use the Pandas package in Python for cleaning and analyzing the data. "Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive."

There are 5 Pandas methods that I plan to use in this project: filter, sort_values, agg, query, and assign.

Here are some links related to the Pandas package in Python that I have learned from:
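As a quick reference, here is a small sketch that calls each of the five methods once. The table and its column names are hypothetical and only serve to show what each method does.

```python
import pandas as pd

# Made-up table; column names are assumptions for illustration only.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [1999, 2005, 2010, 2010],
    "rating": [7.0, 6.5, 9.0, 8.0],
    "votes": [100, 50, 400, 250],
})

# filter: keep columns by label (it selects on names, not on cell values).
subset = movies.filter(items=["title", "rating"])

# sort_values: order the rows by a column, highest rating first.
ranked = movies.sort_values("rating", ascending=False)

# agg: compute one summary value per column.
summary = movies.agg({"rating": "mean", "votes": "sum"})

# query: keep the rows that satisfy a condition written as a string.
recent = movies.query("year >= 2005 and rating > 7")

# assign: return a copy with a new column added.
scored = movies.assign(rating_out_of_5=movies["rating"] / 2)

print(subset, ranked, summary, recent, scored, sep="\n\n")
```

Each of these methods returns a new object and leaves the original `movies` table unchanged, so they can be chained one after another.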

  • https://www.w3schools.com/python/pandas/
  • https://pypi.org/project/pandas/
  • https://www.sharpsightlabs.com/blog/python-pandas/

                               3. Split the Data into Training/Test Sets

A common choice in machine learning is to use 80% of the data for training the model and the remaining 20% for testing it.
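One way to get an 80/20 split is the `train_test_split` function from scikit-learn. The sketch below uses made-up arrays instead of the real dataset, and the `random_state` value is an arbitrary choice so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, plus one target value per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 keeps 20% of the samples for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

The rows are shuffled before splitting by default, so the test set is a random sample rather than just the last rows of the data.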

                               4. Create a Model

                               5. Train the Model

Besides some other packages, we will learn Scikit-learn (sklearn for short) for creating and training the model. There are a ton of interesting and useful modules and methods in the sklearn package that we need to explore. I think this package contributes about 70% to the success of this project. It is a must-learn package for students who want to become proficient in Python in particular and machine learning in general.

I am still learning this package, so I will share more about it in the next journal. Here are some sources that I used for learning the sklearn package:
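To show what "create" and "train" mean in sklearn, here is a minimal sketch. I have not chosen the model for this project yet, so `LinearRegression` is only an assumed example, and the data is synthetic (it follows y = 2x + 1 exactly).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 2x + 1; not related to the movie dataset.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per sample
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()  # Step 4: create the model
model.fit(X, y)             # Step 5: train the model on the data

# The learned slope and intercept should be close to 2 and 1.
print(model.coef_, model.intercept_)
```

Most sklearn models follow this same pattern: create the object, then call `fit` with the training data.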

  • https://www.activestate.com/resources/quick-reads/what-is-scikit-learn-in-python/
  • https://www.machinelearningplus.com/nlp/cosine-similarity/
  • https://www.youtube.com/watch?v=ueKXSupHz6Q&t=154s

                               6. Make Predictions (Not learned and searched yet)

                               7. Evaluate and Improve (Not learned and searched yet)

Steps 4 and 5 are the two steps that will take me the most time to learn and complete, since they are considered the core of this project. I will continue to research and learn about them, and then give a better update in the next journal.

Comments

  1. Bao, great work. Have you implemented these strategies yet or still working out your specific analysis pathway? I can’t wait to see your project in greater detail.

    1. I've never implemented these strategies before this project. But I am working on it, and you can expect to see more detailed information in the next journal.

      Thanks for your feedback, I greatly appreciate it!

      Bao.
