#2 Project update - General information regarding tools and packages in Python
In this journal, I will go over some general information and the tools that I will use for this project. This is what I've researched and learned so far.
To complete this project, we will go through 7 steps. These are also considered the typical steps for building a machine learning model in general.
1. Import the Data
Kaggle.com is a great source of datasets, where we can find many valuable ones. In this project, I will use the Movie List dataset. In this step, we will use the NumPy package in Python to work with the data as arrays; the dataframe that holds the dataset itself comes from the Pandas package, which is covered in the next step.
NumPy stands for Numerical Python. It is a general-purpose array-processing package in Python, and we will use it to work with arrays.
Here is a link about NumPy in case someone wants to learn more about it:
- https://www.w3schools.com/python/numpy
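As a small illustration of the array processing mentioned above, here is a minimal sketch. The `ratings` values are made up for the example and do not come from the Movie List dataset.

```python
import numpy as np

# Hypothetical ratings for five movies (made-up values, not from the real dataset)
ratings = np.array([7.5, 8.1, 6.4, 9.0, 5.8])

# shape reports the number of elements along each dimension
print(ratings.shape)

# mean() computes the average of all elements
print(ratings.mean())

# A boolean mask keeps only the ratings above 7.0
high_ratings = ratings[ratings > 7.0]
print(high_ratings)
```

The last line shows why arrays are convenient: a condition is applied to every element at once, without writing a loop.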
2. Clean the Data
Cleaning the data is a very important step, because most models are unable to work with missing data. Therefore, we need to prepare the data carefully, either by filling in the missing values with the mean value or by dropping them.
In this step, we will use the Pandas package in Python for cleaning and analyzing the data. "Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive."
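The two ways of handling missing data described above could look roughly like this in Pandas. This is only a sketch: the `movies` dataframe and its `title` and `rating` columns are hypothetical stand-ins, not the real Movie List dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rating (np.nan marks the missing value)
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "rating": [8.0, np.nan, 6.0, 7.0],
})

# Option 1: fill the missing rating with the mean of the known ratings
filled = movies.copy()
filled["rating"] = filled["rating"].fillna(filled["rating"].mean())

# Option 2: drop every row whose rating is missing
dropped = movies.dropna(subset=["rating"])

print(filled)
print(dropped)
```

Filling keeps all the rows but introduces an estimated value, while dropping keeps only real values but loses a row. Which one is better depends on how much data is missing.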
There are 5 methods that I plan to use in this project: filter, sort_values, agg, query, and assign.
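To get a feel for these five methods, here is a minimal sketch of each one on a tiny dataframe. The column names (`title`, `genre`, `rating`, `votes`) and the values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical movie data (made-up values for illustration only)
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Drama", "Comedy", "Drama", "Action"],
    "rating": [8.0, 6.5, 7.0, 9.0],
    "votes": [1200, 300, 800, 1500],
})

# filter: select columns by label (it does not filter rows by value)
subset = movies.filter(items=["title", "rating"])

# sort_values: order the rows by a column, highest rating first
ranked = movies.sort_values("rating", ascending=False)

# agg: compute several summary statistics at once
summary = movies["rating"].agg(["mean", "max"])

# query: keep only the rows that satisfy a condition
popular = movies.query("votes > 500")

# assign: return a new dataframe with an extra computed column
scaled = movies.assign(rating_out_of_5=movies["rating"] / 2)

print(subset, ranked, summary, popular, scaled, sep="\n\n")
```

One thing that surprised me: `filter` works on column (or index) labels, so selecting rows by a condition is the job of `query`.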
Here are some links related to the Pandas package in Python that I've learned from:
- https://www.w3schools.com/python/pandas/
- https://pypi.org/project/pandas/
- https://www.sharpsightlabs.com/blog/python-pandas/
3. Split the Data into Training/Test Sets
A common practice in machine learning is to use 80% of the data for training the model and the remaining 20% for testing it.
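An 80/20 split like this can be sketched with the `train_test_split` function from scikit-learn. The arrays `X` and `y` below are placeholder data, and `random_state=42` is just an arbitrary seed chosen so that the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus one label per sample
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 10 samples, 8 go to the training set and 2 go to the test set. The rows are shuffled before splitting, so the test set is not simply the last 2 rows.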
4. Create a Model
5. Train the Model
Besides some other packages, we will learn Scikit-learn (sklearn for short) for creating and training the model. There are a ton of interesting and useful modules and methods in the sklearn package that we need to explore. I think this package contributes 70% to the success of this project. This is a must-learn package for students who want to become proficient in Python in particular and machine learning in general.
I am still learning this package, so I will share more about it in the next journal. Here are some sources that I used for learning the sklearn package:
- https://www.activestate.com/resources/quick-reads/what-is-scikit-learn-in-python/
- https://www.machinelearningplus.com/nlp/cosine-similarity/
- https://www.youtube.com/watch?v=ueKXSupHz6Q&t=154s
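To show what creating and training a model looks like in sklearn, here is a minimal sketch. The choice of `DecisionTreeClassifier` is only an assumption for illustration, since I have not decided on the model for this project yet, and the features and labels below are made up.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per movie: [average rating, runtime in minutes]
X = [[8.5, 120], [6.0, 95], [7.8, 140], [5.5, 100], [9.0, 110], [4.9, 130]]

# Hypothetical labels: 1 = the viewer liked the movie, 0 = did not like it
y = [1, 0, 1, 0, 1, 0]

# Step 4: create the model (random_state makes the result repeatable)
model = DecisionTreeClassifier(random_state=0)

# Step 5: train the model on the features and labels
model.fit(X, y)

# Accuracy on the same data the model was trained on
print(model.score(X, y))
```

Most sklearn models follow this same pattern: create the object, then call `fit` with the features and the labels. The score here is measured on the training data, so it says nothing about how the model performs on new movies.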
6. Make Predictions (not yet researched or learned)
Bao, great work. Have you implemented these strategies yet, or are you still working out your specific analysis pathway? I can’t wait to see your project in greater detail.
I've never implemented these strategies before this project, but I am working on it. You can expect to see more detailed information in the next journal.
Thanks for your feedback, I greatly appreciate it!
Bao.