This is a fictional project for study purposes; the business context and the insights are not real. The dataset comes from Rossmann, a large drugstore chain with many stores all over Europe, and is available on Kaggle.
Rossmann's CEO wants to invest in renovating all the stores and asked each store manager to report what the income of their store will be in the next 6 weeks, so that he can decide how much to invest in each store. The managers asked Rossmann's data department for a way to answer the CEO accurately, so a regression model would be of great help.
Machine Learning Regression Model: Using the dataset from Kaggle, a machine learning regression model was created to be used for future predictions.
The notebook used to create the model is available here.
Flask Prediction API: The model is hosted on the Render Cloud and can be accessed through an API built with Flask. The API source code is available here.
Telegram Chat Bot: A chat bot on Telegram (a desktop and mobile messaging app) is available so that the CEO can send the id number of a store and receive the sales prediction for the next 6 weeks there. You can click here to check it out by sending a store number of up to 4 digits. The Bot source code is available here.
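As an illustration only, here is a minimal sketch of what such a Flask prediction endpoint could look like; the route name, model path and request format are assumptions, not the actual production code.

```python
# illustrative Flask endpoint; route name, model path and request format
# are assumptions, not the actual production code
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# hypothetical path to the serialized regression model
with open("model/model_rossmann.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/rossmann/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # one store-date record or a list of them
    df = pd.DataFrame(payload if isinstance(payload, list) else [payload])
    df["prediction"] = model.predict(df)  # assumes df already carries the model features
    return jsonify(df.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```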
Information about the attributes can be found here.
Attribute | Description |
---|---|
id | an Id that represents a (Store, Date) duple within the test set |
Store | a unique Id for each store |
Sales | the turnover for any given day (this is what you are predicting) |
Customers | the number of customers on a given day |
Open | an indicator for whether the store was open: 0 = closed, 1 = open |
StateHoliday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None |
SchoolHoliday | indicates if the (Store, Date) was affected by the closure of public schools |
StoreType | differentiates between 4 different store models: a, b, c, d |
Assortment | describes an assortment level: a = basic, b = extra, c = extended |
CompetitionDistance | distance in meters to the nearest competitor store |
CompetitionOpenSince[Month/Year] | gives the approximate year and month of the time the nearest competitor was opened |
Promo | indicates whether a store is running a promo on that day |
Promo2 | Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating |
Promo2Since[Year/Week] | describes the year and calendar week when the store started participating in Promo2 |
PromoInterval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store |
- The model will be available on GitHub, so the model file has to be smaller than 50 MB.
- The extra assortment level is greater than extended, which is greater than basic.
- Stores with no competitor information in the dataset (CompetitionDistance and CompetitionOpenSince[Month/Year] not available) were interpreted as stores with no competitors near them.
- The value 200,000 (meters) was chosen to replace the NA values in the CompetitionDistance column for stores with no competitors, meaning that the nearest competitor is very far away.
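A minimal sketch of how that replacement could be applied, assuming the sales data has already been merged with the store information (the file name is hypothetical):

```python
import pandas as pd

# "df" is assumed to be the sales data already merged with the store information;
# the file name is hypothetical
df = pd.read_csv("train_merged.csv", parse_dates=["Date"])

# a missing CompetitionDistance is read as "no competitor nearby", so it is
# replaced with a very large distance (200,000 m) instead of being dropped
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(200000.0)
```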
- Understand the Business problem.
- Download the dataset from Kaggle.
- Clean the dataset by removing outliers, NA values and unnecessary features.
- Explore the data to create hypotheses, derive a few insights and validate them.
- Prepare the data for the modeling algorithms by encoding variables, splitting train and test datasets and performing other necessary operations (see the sketch after this list).
- Create the models using machine learning algorithms.
- Evaluate the created models to find the one that best fits the problem.
- Tune the model to achieve a better performance.
- Deploy the model in production so that it is available to the user.
- Find possible improvements to be explored in the future.
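As referenced in the preparation step above, here is a minimal sketch of how the encoding and the train/test split could be done, assuming the last 6 weeks are held out to mirror the prediction horizon; the encoding choices and column handling are illustrative, not the project's exact pipeline.

```python
import datetime

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def prepare(df: pd.DataFrame):
    """Illustrative preparation: encode a few variables and split by date.
    Assumes df is the merged dataset and "Date" is already a datetime column."""
    df = df.copy()

    # assortment follows the ordering noted in the assumptions: basic < extended < extra
    df["Assortment"] = df["Assortment"].map({"a": 1, "c": 2, "b": 3})
    df["StoreType"] = LabelEncoder().fit_transform(df["StoreType"])
    df["StateHoliday"] = LabelEncoder().fit_transform(df["StateHoliday"].astype(str))

    # rescale a long-tailed numeric feature
    df[["CompetitionDistance"]] = MinMaxScaler().fit_transform(df[["CompetitionDistance"]])

    # hold out the last 6 weeks as the test set, mirroring the 6-week prediction horizon
    cutoff = df["Date"].max() - datetime.timedelta(weeks=6)
    return df[df["Date"] <= cutoff], df[df["Date"] > cutoff]
```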
I1: Stores with greater assortments should sell more.
True: Stores with greater assortment sell more.
I2: Stores with closer competitors should sell less.
False: Stores with closer competitors sell almost the same average amount as the others.
I3: Stores that have a competitor for longer periods of time should sell more.
False: Stores with competitors for a longer time sell less.
I4: Stores with longer active promotions should sell more.
False: Stores with longer active promotions stop selling more after some time.
I5: Stores with more consecutive promotions should sell more.
False: Stores with more consecutive promotions sell less.
I6: Stores open during the Christmas holiday should sell more.
False: Stores open during Christmas don't sell more.
I7: Stores should sell more over the years.
False: Stores sell less over the years.
I8: Stores should sell more in the second half of the year.
True: The stores sell more in the second half of the year.
I9: Stores should sell more after the 10th of each month.
True: Stores really sell more after the 10th day of each month.
I10: Stores should sell less on weekends.
True: Saturday and Sunday are the worst-selling days.
I11: Stores should sell less during school holidays.
True: Stores sell less during school holidays, except in August.
The final result of this project is a regression model, so several machine learning models were created. In all, 5 models were built; one of them is a simple model that predicts the average sales and serves as a baseline for comparison with the machine learning models. The other models were Linear Regression, Regularized Linear Regression (Lasso), Random Forest and XGBoost.
The Boruta algorithm was used to select features, and 18 features were chosen for the final model. The models were evaluated using three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE); a short sketch of these metrics is given after the table. The initial models' performances are in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 680.19 | 0.10 | 1008.96 |
XGBoost Regressor | 874.26 | 0.13 | 1256.33 |
Average Model | 1354.80 | 0.46 | 1835.14 |
Linear Regression | 1867.09 | 0.29 | 2671.05 |
Lasso | 2198.58 | 0.34 | 3110.51 |
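For reference, the three metrics reported in the tables can be computed as in this minimal sketch (the helper names and example values are illustrative):

```python
import numpy as np

def mae(y, yhat):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    """Mean Absolute Percentage Error (assumes no zero sales in y)."""
    return np.mean(np.abs((y - yhat) / y))

def rmse(y, yhat):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

# toy example with made-up numbers
y = np.array([5263.0, 6064.0, 8314.0])
yhat = np.array([5100.0, 6300.0, 8000.0])
print(mae(y, yhat), mape(y, yhat), rmse(y, yhat))
```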
To decide which would be the final model, cross-validation was carried out to evaluate the performance of the algorithms in a more robust way. The resulting metrics are shown in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 837.52 +/- 216.76 | 0.12 +/- 0.02 | 1254.42 +/- 316.65 |
XGBoost Regressor | 1069.47 +/- 139.48 | 0.15 +/- 0.02 | 1523.41 +/- 182.76 |
Linear Regression | 2081.73 +/- 295.63 | 0.3 +/- 0.02 | 2952.52 +/- 468.37 |
Lasso | 2388.68 +/- 398.48 | 0.34 +/- 0.01 | 3369.37 +/- 567.55 |
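The exact cross-validation scheme is not detailed here; a plausible sketch, assuming a time-aware scheme that validates on successive 6-week windows and trains only on earlier data, could look like this (function and column names are assumptions):

```python
import datetime

import numpy as np
import pandas as pd

def time_series_cv(model, train: pd.DataFrame, features, kfolds: int = 5):
    """Sketch of a time-aware cross-validation: each fold validates on a
    6-week window and trains only on data that comes before that window."""
    maes = []
    last_date = train["Date"].max()
    for k in reversed(range(1, kfolds + 1)):
        val_start = last_date - datetime.timedelta(weeks=6 * k)
        val_end = val_start + datetime.timedelta(weeks=6)

        tr = train[train["Date"] < val_start]
        val = train[(train["Date"] >= val_start) & (train["Date"] < val_end)]

        m = model.fit(tr[features], tr["Sales"])
        pred = m.predict(val[features])
        maes.append(np.mean(np.abs(val["Sales"].values - pred)))

    # report mean +/- standard deviation, as in the table above
    return np.mean(maes), np.std(maes)
```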
The Random Forest model was the best among all the models created. However, XGBoost was chosen for deployment because it tends to take up less disk space than Random Forest. After choosing the final model, random-search hyperparameter optimization was used to improve its performance. The final model's evaluation metrics are in the table below.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
XGBoost Regressor | 653.39 | 0.10 | 956.03 |
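A minimal sketch of how such a random search could be set up for the XGBoost regressor; the parameter ranges, number of iterations and evaluation split are assumptions, not the values actually used in the project.

```python
import random

import numpy as np
import xgboost as xgb

def random_search_xgb(train_x, train_y, val_x, val_y, n_iter=10, seed=42):
    """Draws n_iter random parameter combinations and keeps the one
    with the lowest validation MAE. Ranges are illustrative only."""
    rng = random.Random(seed)
    space = {
        "n_estimators": [500, 1000, 1500, 2500, 3000],
        "learning_rate": [0.01, 0.03],
        "max_depth": [3, 5, 9],
        "subsample": [0.1, 0.5, 0.7],
        "colsample_bytree": [0.3, 0.7, 0.9],
        "min_child_weight": [3, 8, 15],
    }
    best_params, best_mae = None, float("inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in space.items()}
        model = xgb.XGBRegressor(objective="reg:squarederror", n_jobs=-1, **params)
        model.fit(train_x, train_y)
        mae = float(np.mean(np.abs(val_y - model.predict(val_x))))
        if mae < best_mae:
            best_params, best_mae = params, mae
    return best_params, best_mae
```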
The XGBoost model was also chosen because it can be trained faster than a Random Forest model when using a GPU. The deployed model is not the best-performing one, but it is considerably smaller than the others because it uses fewer estimators, and its error metrics are not far from those of the best model. A chat bot that answers with the predicted income for the next 6 weeks was also developed as a hands-on tool: the CEO can easily access the income forecast for each store by simply sending a message to the chat bot.
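As an illustration of the bot's reply path, here is a hedged sketch of how a store id message could be turned into a call to the prediction API; the URL, payload format and reply wording are assumptions.

```python
# hedged sketch of the bot's reply path; the URL and payload format are
# assumptions for illustration, not the deployed endpoints
import requests

API_URL = "https://<your-render-app>.onrender.com/rossmann/predict"  # hypothetical

def reply_for_store(store_id: int, records: list) -> str:
    """records: prepared feature rows for this store's next 6 weeks."""
    resp = requests.post(API_URL, json=records, timeout=30)
    resp.raise_for_status()
    total = sum(r["prediction"] for r in resp.json())
    return f"Store {store_id} will sell approximately {total:,.2f} in the next 6 weeks."
```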
- Develop more features for the bot.
- Create an options menu for the Telegram Bot.
- Develop a model to determine the profit of the next day, month and year.
- Improve model prediction capabilities by adding new features.
- Search for stores with a high prediction error and find a way to enhance their predictions.
- Try other machine learning algorithms.