#7 Sklearn - Python Package - Linear Regression (Part 2)

This week, I continue working on the Linear Regression project about predicting house prices in Washington state. Compared to the previous journal, this one is elaborated with more explanation and steps.

The Linear Regression algorithm generally means we use a straight line to describe the relationship between two variables (the dependent and independent variables, or the features and the target). In other words, we use this algorithm to predict the output (dependent variable) based on the input (independent variables).

For the packages needed, we import LinearRegression from sklearn.linear_model (used for creating the linear regression model and making predictions), train_test_split from sklearn.model_selection (used for splitting our data into four subsets), metrics from sklearn (used for evaluating the accuracy of our model), matplotlib.pyplot (used for visualizing the data), and the %matplotlib inline magic for displaying graphs directly in the Jupyter notebook.
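A minimal sketch of the import cell (the pandas alias and the ordering are my own choices, not taken from the original notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
%matplotlib inline  # Jupyter magic: render plots inside the notebook
```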

I downloaded the dataset from Kaggle and used pandas to import the csv file from my computer. As you can see, our dataset has a lot of features, and some of them hold 0 values. So, we need to prepare and clean our data before building a model.
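Loading the file might look like this (the file name house_data.csv is a placeholder, not the actual path on my machine):

```python
# Load the Kaggle csv into a DataFrame and take a first look
df = pd.read_csv("house_data.csv")  # placeholder file name
print(df.shape)
df.head()
```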


We check whether any columns contain null or duplicate values. Fortunately, our data seems fairly clean already. Then we group the important features, the ones we think might significantly affect the output (y), as the input (X). After that, I use the train_test_split function to split the data into four subsets: train_X, test_X, train_y, and test_y. We pass two parameters here: test_size (the size of the test set) = 0.33, which means we use 33% of the data for testing and 67% for training the model, and random_state, which works as a random seed; its exact value doesn't matter, but we need it to keep the split the same every time we run the notebook. A sketch of this step is shown below.
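This sketch assumes common column names from the King County housing dataset (bedrooms, bathrooms, sqft_living, floors, price); the real notebook may use a different feature subset, and random_state=42 is just an example seed:

```python
# Check for missing and duplicated rows
print(df.isnull().sum())
print(df.duplicated().sum())

# Group the input features (X) and the target (y); column names are assumptions
X = df[["bedrooms", "bathrooms", "sqft_living", "floors"]]
y = df["price"]

# 67% training / 33% testing, with a fixed seed so the split is reproducible
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.33, random_state=42)
```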

We use the LinearRegression.fit(train_X, train_y) function to train our model. Then we need to calculate the training and testing scores (R-squared values) in order to check for overfitting and underfitting. Overfitting happens when the training score is high but the testing score is low, while underfitting means both scores are low. In either case, we cannot expect our model to give accurate predictions. The R-squared value is the proportion of variance explained; it typically falls between 0 and 1, and a higher score is better because it means more of the variance is explained by the model. The training and testing scores that we got here are, in turn, 0.19022455401708283 and 0.263527837097853.
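A sketch of the training and scoring step, continuing from the split above (lr is simply the name I give the model object here):

```python
# Fit the linear model and report R-squared on both splits
lr = LinearRegression()
lr.fit(train_X, train_y)

print("Training score:", lr.score(train_X, train_y))  # ~0.19 in my run
print("Testing score:", lr.score(test_X, test_y))     # ~0.26 in my run
```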




The graph below displays the actual output values (black) and the predicted output values (red). I also use MAE, MSE, and RMSE to evaluate the accuracy of the model.
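A sketch of how the comparison plot and the error metrics might be produced (the figure size and plotting style are my own choices):

```python
# Predict on the test set and compare with the actual prices
pred_y = lr.predict(test_X)

plt.figure(figsize=(10, 5))
plt.plot(range(len(test_y)), test_y.values, color="black", label="actual")
plt.plot(range(len(pred_y)), pred_y, color="red", label="predicted")
plt.legend()
plt.show()

# Error metrics: MAE, MSE, and RMSE (the square root of MSE)
print("MAE:", metrics.mean_absolute_error(test_y, pred_y))
print("MSE:", metrics.mean_squared_error(test_y, pred_y))
print("RMSE:", metrics.mean_squared_error(test_y, pred_y) ** 0.5)
```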




Linear Regression means we use a line to fit the data, while polynomial regression (the PolynomialFeatures function) means we use a curve to fit the data. In theory, the curve might fit our data better than a line. I will build two polynomial models, one with degree 2 and another with degree 7. Let's see how it goes.
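A minimal sketch of one polynomial model (degree 2 shown here; the degree-7 model repeats the same steps with degree=7):

```python
from sklearn.preprocessing import PolynomialFeatures

# Expand the features into degree-2 polynomial terms, then fit a linear model on them
poly = PolynomialFeatures(degree=2)
train_X_poly = poly.fit_transform(train_X)
test_X_poly = poly.transform(test_X)

poly_lr = LinearRegression()
poly_lr.fit(train_X_poly, train_y)

print("Training score:", poly_lr.score(train_X_poly, train_y))
print("Testing score:", poly_lr.score(test_X_poly, test_y))
print("RMSE:", metrics.mean_squared_error(
    test_y, poly_lr.predict(test_X_poly)) ** 0.5)
```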






Comparing the scores across the three models (model 1 is the linear regression, models 2 and 3 are the degree-2 and degree-7 polynomial models):

                |  Model 1              |  Model 2              |  Model 3
Training score  |  0.19022455401708283  |  0.2128389585837115   |  0.32441396420852087
Testing score   |  0.263527837097853    |  0.24681235471813445  |  1
RMSE score      |  408611.92396699404   |  413222.9764941549    |  36804370.97389981

For the training and testing scores, the higher the score, the better the model, so we can see that model 3 holds the highest scores in both training and testing. RMSE is the Root of the Mean of the Squared Error; the smaller the RMSE, the better the model, and from our results among the three models, model 1 got the smallest score.

I think something might have gone wrong here, because model 3 is supposed to give the smallest RMSE score. This is the limit of my journal for this week. I will study these evaluation metrics more and give an update soon to explain this.

Comments

  1. Bao,
    Very interesting blog. What an unexpected result for the RMSE for model 3. I look forward to reading the resolution.
