This is a fictional project for study purposes; the business context and the insights are not real. The dataset comes from Rossmann, a large drugstore chain with many stores all over Europe, and is available on Kaggle.
Rossmann's CEO wants to invest in renovating all the stores and asked each store manager to report what the income of their store will be in the next 6 weeks, so that he can decide how much to invest in each store. The managers asked Rossmann's data department for a way to answer the CEO accurately, so a regression model would be of great help.
Machine Learning Regression Model: Using the dataset from Kaggle, a machine learning regression model was created to be used for future predictions.
The notebook used to create the model is available here.
Flask Prediction API: The model is hosted on the Render Cloud and can be accessed through an API built with Flask. The API source code is available here.
Telegram Chat Bot: A chat bot on Telegram (a desktop and mobile messaging app) is available so that the CEO can send the id number of a store and receive the sales prediction for the next 6 weeks there. You can click here to check it out by sending a store number of up to 4 digits. The Bot source code is available here.
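As an illustration only, here is a minimal sketch of what such a Flask prediction endpoint could look like; the route name, model path and request format are assumptions, not the actual production code.

```python
# illustrative Flask endpoint; route name, model path and request format
# are assumptions, not the actual production code
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# hypothetical path to the serialized regression model
with open("model/model_rossmann.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/rossmann/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # one store-date record or a list of them
    df = pd.DataFrame(payload if isinstance(payload, list) else [payload])
    df["prediction"] = model.predict(df)  # assumes df already carries the model features
    return jsonify(df.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```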
Information about the attributes can be found here.
Attribute | Description |
---|---|
id | an Id that represents a (Store, Date) duple within the test set |
Store | a unique Id for each store |
Sales | the turnover for any given day (this is what you are predicting) |
Customers | the number of customers on a given day |
Open | an indicator for whether the store was open: 0 = closed, 1 = open |
StateHoliday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None |
SchoolHoliday | indicates if the (Store, Date) was affected by the closure of public schools |
StoreType | differentiates between 4 different store models: a, b, c, d |
Assortment | describes an assortment level: a = basic, b = extra, c = extended |
CompetitionDistance | distance in meters to the nearest competitor store |
CompetitionOpenSince[Month/Year] | gives the approximate year and month of the time the nearest competitor was opened |
Promo | indicates whether a store is running a promo on that day |
Promo2 | Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating |
Promo2Since[Year/Week] | describes the year and calendar week when the store started participating in Promo2 |
PromoInterval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store |
- The model will be available on GitHub, so the model file has to be smaller than 50 MB.
- The extra assortment level is greater than extended, which is greater than basic.
- Stores with no competitor information in the dataset (CompetitionDistance and CompetitionOpenSince[Month/Year] not available) were interpreted as stores with no competitors near them.
- The value 200,000 (meters) was chosen to replace the NA values in the CompetitionDistance column for stores with no competitors, meaning that the nearest competitor is very far away.
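A minimal sketch of how that replacement could be applied, assuming the sales data has already been merged with the store information (the file name is hypothetical):

```python
import pandas as pd

# "df" is assumed to be the sales data already merged with the store information;
# the file name is hypothetical
df = pd.read_csv("train_merged.csv", parse_dates=["Date"])

# a missing CompetitionDistance is read as "no competitor nearby", so it is
# replaced with a very large distance (200,000 m) instead of being dropped
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(200000.0)
```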
- Understand the Business problem.
- Download the dataset from Kaggle.
- Clean the dataset by removing outliers, NA values and unnecessary features.
- Explore the data to create hypotheses, derive a few insights and validate them.
- Prepare the data for the modeling algorithms by encoding variables, splitting train and test datasets and performing other necessary operations (see the sketch after this list).
- Create the models using machine learning algorithms.
- Evaluate the created models to find the one that best fits the problem.
- Tune the model to achieve a better performance.
- Deploy the model in production so that it is available to the user.
- Find possible improvements to be explored in the future.
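As referenced in the preparation step above, here is a minimal sketch of how the encoding and the train/test split could be done, assuming the last 6 weeks are held out to mirror the prediction horizon; the encoding choices and column handling are illustrative, not the project's exact pipeline.

```python
import datetime

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def prepare(df: pd.DataFrame):
    """Illustrative preparation: encode a few variables and split by date.
    Assumes df is the merged dataset and "Date" is already a datetime column."""
    df = df.copy()

    # assortment follows the ordering noted in the assumptions: basic < extended < extra
    df["Assortment"] = df["Assortment"].map({"a": 1, "c": 2, "b": 3})
    df["StoreType"] = LabelEncoder().fit_transform(df["StoreType"])
    df["StateHoliday"] = LabelEncoder().fit_transform(df["StateHoliday"].astype(str))

    # rescale a long-tailed numeric feature
    df[["CompetitionDistance"]] = MinMaxScaler().fit_transform(df[["CompetitionDistance"]])

    # hold out the last 6 weeks as the test set, mirroring the 6-week prediction horizon
    cutoff = df["Date"].max() - datetime.timedelta(weeks=6)
    return df[df["Date"] <= cutoff], df[df["Date"] > cutoff]
```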
I1: Stores with greater assortments should sell more.
True: Stores with greater assortment sell more.
I2: Stores with closer competitors should sell less.
False: Stores with closer competitors sell almost the same average amount as the others.
I3: Stores that have a competitor for longer periods of time should sell more.
False: Stores with competitors for a longer time sell less.
I4: Stores with longer active promotions should sell more.
False: Stores with longer active promotions stop selling more after some time.
I5: Stores with more consecutive promotions should sell more.
False: Stores with more consecutive promotions sell less.
I6: Stores open during the Christmas holiday should sell more.
False: Stores open during Christmas don't sell more.
I7: Stores should sell more over the years.
False: Stores sell less over the years.
I8: Stores should sell more in the second half of the year.
True: The stores sell more in the second half of the year.
I9: Stores should sell more after the 10th of each month.
True: Stores really sell more after the 10th day of each month.
I10: Stores should sell less on weekends.
True: Saturday and Sunday are the worst-selling days.
I11: Stores should sell less during school holidays.
True: Stores sell less during school holidays, except in August.
The final result of this project is a regression model, so several machine learning models were created. In all, 5 models were built; one of them is a simple model that predicts the average sales and serves as a baseline for comparison with the machine learning models. The other models were Linear Regression, Regularized Linear Regression (Lasso), Random Forest and XGBoost.
The Boruta algorithm was used to select features, and 18 features were chosen for the final model. The models were evaluated using three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE); a short sketch of these metrics is given after the table. The initial models' performances are in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 680.19 | 0.10 | 1008.96 |
XGBoost Regressor | 874.26 | 0.13 | 1256.33 |
Average Model | 1354.80 | 0.46 | 1835.14 |
Linear Regression | 1867.09 | 0.29 | 2671.05 |
Lasso | 2198.58 | 0.34 | 3110.51 |
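For reference, the three metrics reported in the tables can be computed as in this minimal sketch (the helper names and example values are illustrative):

```python
import numpy as np

def mae(y, yhat):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    """Mean Absolute Percentage Error (assumes no zero sales in y)."""
    return np.mean(np.abs((y - yhat) / y))

def rmse(y, yhat):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

# toy example with made-up numbers
y = np.array([5263.0, 6064.0, 8314.0])
yhat = np.array([5100.0, 6300.0, 8000.0])
print(mae(y, yhat), mape(y, yhat), rmse(y, yhat))
```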
To decide which would be the final model, cross-validation was carried out to evaluate the performance of the algorithms in a more robust way. The resulting metrics are shown in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 837.52 +/- 216.76 | 0.12 +/- 0.02 | 1254.42 +/- 316.65 |
XGBoost Regressor | 1069.47 +/- 139.48 | 0.15 +/- 0.02 | 1523.41 +/- 182.76 |
Linear Regression | 2081.73 +/- 295.63 | 0.3 +/- 0.02 | 2952.52 +/- 468.37 |
Lasso | 2388.68 +/- 398.48 | 0.34 +/- 0.01 | 3369.37 +/- 567.55 |
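The exact cross-validation scheme is not detailed here; a plausible sketch, assuming a time-aware scheme that validates on successive 6-week windows and trains only on earlier data, could look like this (function and column names are assumptions):

```python
import datetime

import numpy as np
import pandas as pd

def time_series_cv(model, train: pd.DataFrame, features, kfolds: int = 5):
    """Sketch of a time-aware cross-validation: each fold validates on a
    6-week window and trains only on data that comes before that window."""
    maes = []
    last_date = train["Date"].max()
    for k in reversed(range(1, kfolds + 1)):
        val_start = last_date - datetime.timedelta(weeks=6 * k)
        val_end = val_start + datetime.timedelta(weeks=6)

        tr = train[train["Date"] < val_start]
        val = train[(train["Date"] >= val_start) & (train["Date"] < val_end)]

        m = model.fit(tr[features], tr["Sales"])
        pred = m.predict(val[features])
        maes.append(np.mean(np.abs(val["Sales"].values - pred)))

    # report mean +/- standard deviation, as in the table above
    return np.mean(maes), np.std(maes)
```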
The Random Forest model was the best among all the models created. However, XGBoost was chosen for deployment because it tends to take up less disk space than Random Forest. After choosing the final model, random-search hyperparameter optimization was used to improve its performance. The final model's evaluation metrics are in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
XGBoost Regressor | 653.39 | 0.10 | 956.03 |
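A minimal sketch of how such a random search could be set up for the XGBoost regressor; the parameter ranges, number of iterations and evaluation split are assumptions, not the values actually used in the project.

```python
import random

import numpy as np
import xgboost as xgb

def random_search_xgb(train_x, train_y, val_x, val_y, n_iter=10, seed=42):
    """Draws n_iter random parameter combinations and keeps the one
    with the lowest validation MAE. Ranges are illustrative only."""
    rng = random.Random(seed)
    space = {
        "n_estimators": [500, 1000, 1500, 2500, 3000],
        "learning_rate": [0.01, 0.03],
        "max_depth": [3, 5, 9],
        "subsample": [0.1, 0.5, 0.7],
        "colsample_bytree": [0.3, 0.7, 0.9],
        "min_child_weight": [3, 8, 15],
    }
    best_params, best_mae = None, float("inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in space.items()}
        model = xgb.XGBRegressor(objective="reg:squarederror", n_jobs=-1, **params)
        model.fit(train_x, train_y)
        mae = float(np.mean(np.abs(val_y - model.predict(val_x))))
        if mae < best_mae:
            best_params, best_mae = params, mae
    return best_params, best_mae
```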
The XGBoost model was also chosen because it can be trained faster than a Random Forest model when using a GPU. The deployed model is not the best-performing one, but it is considerably smaller than the others because it uses fewer estimators, and its error metrics are not far from those of the best model. A chat bot that answers with the predicted income for the next 6 weeks was also developed as a hands-on tool: the CEO can easily access the income forecast for each store by simply sending a message to the chat bot.
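As an illustration of the bot's reply path, here is a hedged sketch of how a store id message could be turned into a call to the prediction API; the URL, payload format and reply wording are assumptions.

```python
# hedged sketch of the bot's reply path; the URL and payload format are
# assumptions for illustration, not the deployed endpoints
import requests

API_URL = "https://<your-render-app>.onrender.com/rossmann/predict"  # hypothetical

def reply_for_store(store_id: int, records: list) -> str:
    """records: prepared feature rows for this store's next 6 weeks."""
    resp = requests.post(API_URL, json=records, timeout=30)
    resp.raise_for_status()
    total = sum(r["prediction"] for r in resp.json())
    return f"Store {store_id} will sell approximately {total:,.2f} in the next 6 weeks."
```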
- Develop more features for the bot.
- Create an options menu for the Telegram Bot.
- Develop a model to determine the profit of the next day, month and year.
- Improve model prediction capabilities by adding new features.
- Search for stores with a high prediction error and find a way to enhance their predictions.
- Try other machine learning algorithms.