This report outlines the steps taken to preprocess data, analyze features, and apply various regression techniques to predict bike rental counts. The goal is to understand the impact of different features and models on the prediction accuracy.
- Categorical Features: Seasons, Holiday, Functioning Day
- Encoding: Converted categorical features to numerical values using `LabelEncoder`.
from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
bike['Seasons'] = encoding.fit_transform(bike['Seasons'])
bike['Holiday'] = encoding.fit_transform(bike['Holiday'])
bike['Functioning Day'] = encoding.fit_transform(bike['Functioning Day'])
- Season Adjustment: Added an autumn season to ensure all four seasons are represented.
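To make the encoding step easier to audit, the following minimal sketch shows how `LabelEncoder` assigns integer codes (alphabetically, via its sorted `classes_` attribute). The toy DataFrame and the `mappings` dictionary are illustrative assumptions, not part of the original analysis; using one encoder per column is a design choice that keeps each mapping recoverable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the categorical columns (values are illustrative only).
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday", "Holiday"],
})

mappings = {}
for column in ["Seasons", "Holiday"]:
    encoder = LabelEncoder()  # a fresh encoder per column keeps each mapping
    toy[column] = encoder.fit_transform(toy[column])
    # classes_ is sorted, so the label at position i is encoded as i.
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)
print(toy)
```

Note that the integer codes imply an ordering (for example Autumn < Spring) that has no physical meaning, which matters more for linear models than for tree-based ones.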
- Features with High Ranges: Visibility (10m), Temperature (°C), Dew Point Temperature (°C)
- Scaling: Applied `MinMaxScaler` to scale features to the range [0, 1].
from sklearn.preprocessing import MinMaxScaler
scale = MinMaxScaler()
bike['Temperature (°C)'] = scale.fit_transform(bike[['Temperature (°C)']])
bike['Dew point temperature (°C)'] = scale.fit_transform(bike[['Dew point temperature (°C)']])
bike['Solar Radiation (MJ/m2)'] = scale.fit_transform(bike[['Solar Radiation (MJ/m2)']])
bike['Humidity(%)'] = scale.fit_transform(bike[['Humidity(%)']])
bike['Wind speed (m/s)'] = scale.fit_transform(bike[['Wind speed (m/s)']])
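As a rough illustration of what the scaling step computes, the sketch below applies `MinMaxScaler` to several columns in one call and checks the result against the formula (x - min) / (max - min). The toy values are assumptions chosen for easy arithmetic, not data from the report.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for two of the scaled columns (values are illustrative only).
toy = pd.DataFrame({
    "Temperature (°C)": [-10.0, 0.0, 10.0, 30.0],
    "Humidity(%)": [20.0, 40.0, 60.0, 100.0],
})
columns = list(toy.columns)

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = toy.copy()
# Each column is scaled independently, so one call can cover all of them.
scaled[columns] = scaler.fit_transform(toy[columns])

# The same result computed by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
print(np.allclose(scaled.values, manual.values))
```

Fitting one scaler on all columns at once gives the same numbers as fitting column by column, and it keeps a single fitted object that can later be applied to new data with `transform`.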
- Imputation: Used `SimpleImputer` to replace missing values with the column mean.
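from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
bike[bike.columns] = imputer.fit_transform(bike)  # fit_transform returns a NumPy array; assigning into the columns keeps bike a DataFrame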
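A small worked example may help show what mean imputation does. This sketch uses a toy DataFrame (an assumption, not the report's data) with one missing value and wraps the imputer output back into a DataFrame so that column names survive.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a single missing wind-speed reading (illustrative values).
toy = pd.DataFrame({
    "Hour": [0.0, 1.0, 2.0, 3.0],
    "Wind speed (m/s)": [1.0, np.nan, 3.0, 2.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform returns a plain NumPy array, so rebuild the DataFrame
# to keep the original column names and index.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(filled)
```

The missing entry is replaced by the mean of the observed values in its column, here (1 + 3 + 2) / 3 = 2.0.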
- Key Features Identified: Hour, Temperature (°C), Solar Radiation (MJ/m2)
- Method Used: `SelectFromModel` with `RandomForestRegressor`
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
select2 = SelectFromModel(RandomForestRegressor())
selected = select2.fit_transform(x, y)
print(selected.shape)
print(select2.get_support())
- Feature Importance Results:
- Features selected: Hour, Temperature (°C), Solar Radiation (MJ/m2)
- When only these three features were used, the error was high (~648.11), which indicates that the remaining features carry predictive signal and should be kept.
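The selection step can be sketched on synthetic data to show how the boolean mask from `get_support()` maps back to column names. Everything here is an assumption made for illustration: the data is random, the target depends only on the columns named `hour` and `temperature`, and `n_estimators` and `random_state` are fixed just to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
x = pd.DataFrame(
    rng.uniform(0, 1, size=(300, 5)),
    columns=["hour", "temperature", "noise_a", "noise_b", "noise_c"],
)
# Synthetic target: only the first two columns carry signal.
y = 10 * x["hour"] + 10 * x["temperature"] + rng.normal(0, 0.1, size=300)

# With a tree ensemble, SelectFromModel keeps features whose importance
# is at least the mean importance (its default threshold).
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
selector.fit(x, y)

selected_names = list(x.columns[selector.get_support()])
print(selected_names)
print(selector.transform(x).shape)
```

Because the default threshold is the mean importance, a few dominant features can push useful but weaker ones below the cut, which is consistent with the report's finding that keeping more features worked better.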
- Splitting ratios explored: (80:20), (90:10), (60:40), (50:50)
- Final choice: Default parameters (75:25)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=33)
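For reference, the split ratios listed above correspond to the `test_size` argument of `train_test_split`; leaving it unset gives the default 75:25 split. The sketch below uses a dummy array of 100 rows (an assumption for easy counting) to show the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples, 2 features (illustrative only).
x = np.arange(200).reshape(100, 2)
y = np.arange(100)

split_sizes = {}
# None means "use the default", which is a 25% test split.
for test_size in (None, 0.2, 0.1, 0.4, 0.5):
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=33
    )
    split_sizes[test_size] = (len(x_train), len(x_test))

for test_size, sizes in split_sizes.items():
    print(test_size, sizes)
```

Fixing `random_state` makes the split reproducible, so different models can be compared on exactly the same held-out rows.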
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
LR = LinearRegression().fit(x_train, y_train)
y_predict = LR.predict(x_test)
mse = mean_squared_error(y_test, y_predict)
print('Root Mean Square Error:', np.sqrt(mse))
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
ridge.fit(x_train, y_train)
y_predict = ridge.predict(x_test)
mse = mean_squared_error(y_test, y_predict)
print('Root Mean Square Error:', np.sqrt(mse))
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0)
lasso.fit(x_train, y_train)
y_predict = lasso.predict(x_train)  # note: predictions on the training split, so this is a training error
mse = mean_squared_error(y_train, y_predict)
print('Root Mean Square Error (train):', np.sqrt(mse))
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.05)
elastic_net.fit(x_train, y_train)
y_predict = elastic_net.predict(x_test)
mse = mean_squared_error(y_test, y_predict)
print('Root Mean Square Error:', np.sqrt(mse))
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
y_predict = model.predict(x_train)  # note: predictions on the training split, so this is a training error
mse = mean_squared_error(y_train, y_predict)
print('Root Mean Square Error (train):', np.sqrt(mse))
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(learning_rate=0.05)
model.fit(x_train, y_train)
y_predict = model.predict(x_train)  # note: predictions on the training split, so this is a training error
mse = mean_squared_error(y_train, y_predict)
print('Root Mean Square Error (train):', np.sqrt(mse))
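The per-model snippets above can be folded into one loop that scores every model on the same splits. The sketch below is one possible way to do that, not the procedure used for the reported numbers: it runs on synthetic data from `make_regression` (an assumption standing in for the bike dataset), reuses the hyperparameters shown above, and reports the root mean squared error on both the training and the test split. The `rmse` helper and the `scores` dictionary are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bike data (assumption: not the report's dataset).
x, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=33)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=33)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=0.1),
    "Lasso": Lasso(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=0.05),
    "Random Forest": RandomForestRegressor(random_state=33),
    "Gradient Boosting": GradientBoostingRegressor(learning_rate=0.05, random_state=33),
}


def rmse(model, features, target):
    """Root mean squared error of the model's predictions on the given split."""
    return float(np.sqrt(mean_squared_error(target, model.predict(features))))


scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = {
        "train": rmse(model, x_train, y_train),
        "test": rmse(model, x_test, y_test),
    }
    print(f"{name}: train RMSE={scores[name]['train']:.2f}, test RMSE={scores[name]['test']:.2f}")
```

Reporting both columns makes overfitting visible: tree ensembles in particular tend to show a much lower training error than test error, so only the test column is a fair basis for ranking models.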
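| Model | RMSE |
|---|---|
| Linear Regression | 342.16 |
| Lasso Regression | 428.54 |
| Ridge Regression | 427.86 |
| Elastic Net | 439.25 |
| Random Forest Regressor | 86.56 |
| Gradient Boosting Regressor | 223.37 |

The values are root mean squared errors, since the code prints `np.sqrt(mse)`. In the code above, the Lasso, Random Forest, and Gradient Boosting errors are computed on the training split, while the Linear, Ridge, and Elastic Net errors are computed on the test split, so the rows are not directly comparable.

- Best Performing Model: Random Forest Regressor (lowest RMSE: 86.56, measured on the training split)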
- Feature Importance: Hour, Temperature (°C), and Solar Radiation (MJ/m2) were the most impactful features.
- Scaling: `MinMaxScaler` was effective in normalizing the data.
- Final Thoughts: Data scaling and feature selection helped improve performance, but including more features in the model yielded better results than strict feature selection.
A function was created to visualize the difference between predicted and actual values.
def plotGraph(y_train, y_pred_train, rand):
    """Scatter actual (blue) and predicted (red) values by sample index; rand is the plot title."""
    import matplotlib.pyplot as plt
    plt.scatter(range(len(y_train)), y_train, color='blue', label='Actual')
    plt.scatter(range(len(y_pred_train)), y_pred_train, color='red', label='Predicted')
    plt.title(rand)
    plt.legend()
    plt.show()

This graph helped in analyzing how well the model performed in comparison to actual data.
This report provides a comprehensive overview of the preprocessing steps, feature analysis, and regression techniques applied to the bike rental dataset. The Random Forest Regressor emerged as the most effective model for this prediction task.