Due to dramatic changes in the world financial environment, cryptocurrencies have gained popularity as one of the alternative investments available to most people. The volatility of cryptocurrency assets makes price changes challenging to predict. Using machine learning models, we hope to create a way to predict crypto market data. We will assess and analyze historical data for six of the most popular cryptocurrencies and compare the findings to real-world market data.
- Kaggle: The dataset contains historical trades on several cryptoassets, such as Ethereum, Dogecoin, Bitcoin, Cardano, and more.
- G-Research is a quantitative finance research firm in Europe. It uses machine learning, big data, and advanced technology to predict movements in the financial markets.
- Which machine learning model would best predict future price changes?
- By how much is the cryptocurrency price going to increase in the near future compared to the current market price?
- Which coin/asset would be the most stable of the six cryptocurrencies chosen for this project?
- Which features affect the close price most?
- Description of data preprocessing:
- The data was checked for missing values. Because this is a time series dataset, gaps were handled by either removing the NaN rows or filling them in.
- Convert minute-by-minute data to day-by-day data for each crypto, and merge the six cryptoasset DataFrames into a new DataFrame.
- Visualizations were created to view trends and correlations.
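The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `to_daily`, the forward-fill strategy, the aggregation rules (daily max of "High", min of "Low", last "Close", volume-weighted "VWAP"), the "Volume" column, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd


def to_daily(minute_df: pd.DataFrame) -> pd.DataFrame:
    """Collapse minute-level rows (on a DatetimeIndex) into one row per day."""
    # Forward-fill gaps, then drop rows that are still missing
    # (for example, NaNs at the very start of the series).
    cleaned = minute_df.ffill().dropna()
    # Price * volume lets us rebuild a volume-weighted daily VWAP.
    cleaned = cleaned.assign(_pv=cleaned["VWAP"] * cleaned["Volume"])
    daily = cleaned.resample("1D").agg(
        {"High": "max", "Low": "min", "Close": "last", "Volume": "sum", "_pv": "sum"}
    )
    daily["VWAP"] = daily["_pv"] / daily["Volume"]
    # Days without any trades end up as NaN rows; remove them.
    return daily.drop(columns="_pv").dropna()


# Synthetic two-day minute series standing in for one asset's real data.
idx = pd.date_range("2021-01-01", periods=2 * 24 * 60, freq="min")
rng = np.random.default_rng(0)
price = 100 + rng.normal(0, 0.05, len(idx)).cumsum()
demo = pd.DataFrame(
    {"High": price + 1, "Low": price - 1, "Close": price, "Volume": 10.0, "VWAP": price},
    index=idx,
)
demo.loc[idx[5], "Close"] = np.nan  # simulate a missing value

# Stack the per-asset daily frames into one DataFrame keyed by asset name.
merged = pd.concat(
    {"Bitcoin": to_daily(demo), "Ethereum": to_daily(demo * 0.1)},
    names=["Asset", "Date"],
)
print(merged)
```

Stacking the assets as rows under an "Asset" index level is only one way to merge; a wide layout with one column group per asset would work equally well for plotting correlations.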
- Description of feature engineering and the feature selection, including the decision-making process:
- Use "High", "Low" & "VWAP" columns in the DataFrame as features, and "Close" as target.
- Description of how data was split into training and testing sets:
- An 80-20 training-test split was performed by date. Three years' worth of data was used to train the models, and nine months was used to test them.
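As a rough sketch of the feature selection and the date-ordered split described above, assuming a daily DataFrame indexed by date (the function name `chronological_split` and the synthetic prices are hypothetical):

```python
import numpy as np
import pandas as pd

FEATURES = ["High", "Low", "VWAP"]
TARGET = "Close"


def chronological_split(daily: pd.DataFrame, train_fraction: float = 0.8):
    """Split by time order: the earliest rows train, the latest rows test."""
    daily = daily.sort_index()
    cut = int(len(daily) * train_fraction)
    train, test = daily.iloc[:cut], daily.iloc[cut:]
    return train[FEATURES], test[FEATURES], train[TARGET], test[TARGET]


# Synthetic daily data standing in for one asset.
dates = pd.date_range("2018-01-01", periods=100, freq="D")
close = np.linspace(100.0, 200.0, len(dates))
daily = pd.DataFrame(
    {"High": close + 2, "Low": close - 2, "VWAP": close + 0.5, "Close": close},
    index=dates,
)

X_train, X_test, y_train, y_test = chronological_split(daily)
print(len(X_train), len(X_test))
```

Splitting by position rather than shuffling keeps every test row later in time than every training row, which avoids leaking future prices into training.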
- Explanation of model choice, including limitations and benefits:
- Advantages: LinearRegression is simple to implement and runs very fast.
- Limitations: Outliers can have huge effects on the regression.
- Explanation of changes in model choice (if changes occurred between the Segment 2 and Segment 3 deliverables):
- There are no changes from previous work.
- Description of how the model has been trained thus far, and any additional training that will take place:
- Train the dataset of each crypto using trading data from the previous three years, and test the dataset using the last 9 months of trading data.
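A minimal sketch of this train-then-test flow with scikit-learn's `LinearRegression`, using the "High", "Low", and "VWAP" features and the "Close" target. The synthetic random-walk prices and the choice of RMSE and R² as metrics are assumptions for illustration; the project's real scores are not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic daily prices standing in for one asset's history.
rng = np.random.default_rng(42)
n = 200
close = 100 + rng.normal(0, 1, n).cumsum()
daily = pd.DataFrame(
    {
        "High": close + rng.uniform(0.5, 2.0, n),
        "Low": close - rng.uniform(0.5, 2.0, n),
        "VWAP": close + rng.normal(0, 0.2, n),
        "Close": close,
    },
    index=pd.date_range("2020-01-01", periods=n, freq="D"),
)

features = ["High", "Low", "VWAP"]
cut = int(n * 0.8)  # earliest 80% trains, latest 20% tests
train, test = daily.iloc[:cut], daily.iloc[cut:]

model = LinearRegression().fit(train[features], train["Close"])
pred = model.predict(test[features])

rmse = np.sqrt(mean_squared_error(test["Close"], pred))
r2 = r2_score(test["Close"], pred)
print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```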
- Description of current accuracy score:
- Explanation of model choice, including limitations and benefits:
- Advantages: Highly flexible and faster than Gradient Boosting
- Limitations: The model struggles with very noisy data and cannot correctly interpret meaningless data.
- Explanation of changes in model choice (if changes occurred between the Segment 2 and Segment 3 deliverables):
- There were no major changes from Segment 2 to Segment 3; however, the features were changed to achieve a better score than in the previous segment.
- Description of how they have trained the model thus far and any additional training that will take place:
- The Scikit-Learn API model showed a high RMSE (root-mean-square error) on its predictions. Hyperparameter tuning improved the number slightly, but not as much as hoped. Other search methods could be tried: Grid Search or Randomized Search.
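For illustration, the two search strategies mentioned above might look like this with scikit-learn. This section does not name the model, so `GradientBoostingRegressor` is used purely as a stand-in; the parameter grids, the `TimeSeriesSplit` cross-validation, and the synthetic data are likewise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, TimeSeriesSplit

# Synthetic daily data standing in for one asset.
rng = np.random.default_rng(1)
n = 150
close = 100 + rng.normal(0, 1, n).cumsum()
X = pd.DataFrame(
    {"High": close + 1.0, "Low": close - 1.0, "VWAP": close + rng.normal(0, 0.2, n)}
)
y = pd.Series(close, name="Close")

# Time-ordered folds: each fold validates on data later than its training data.
cv = TimeSeriesSplit(n_splits=3)
scoring = "neg_root_mean_squared_error"  # scikit-learn maximizes, so RMSE is negated

# Grid search tries every combination in the grid.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring=scoring,
    cv=cv,
)
grid.fit(X, y)

# Randomized search samples a fixed number of combinations instead.
rand = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [2, 3, 4],
        "learning_rate": [0.05, 0.1, 0.2],
    },
    n_iter=5,
    scoring=scoring,
    cv=cv,
    random_state=0,
)
rand.fit(X, y)

print("Grid search:", grid.best_params_, "RMSE:", -grid.best_score_)
print("Randomized search:", rand.best_params_, "RMSE:", -rand.best_score_)
```

Grid search is exhaustive and gets expensive as the grid grows, while randomized search caps the cost at `n_iter` fits per fold, which is often a reasonable trade-off for larger search spaces.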
- Description of current accuracy score:
- Explanation of model choice, including limitations and benefits:
- Advantages: This model is great for Classification and Regression tasks.
- Limitations: It cannot extrapolate and can only make a prediction that is an average of previously observed labels.
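The extrapolation limit can be seen in a small experiment. The sketch below assumes a tree ensemble such as scikit-learn's `RandomForestRegressor` (this section does not name the model) and uses a toy linear trend rather than real price data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Train on x in [0, 100] with the exact relationship y = 2x.
X_train = np.arange(0, 101, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# Ask both models about a point far outside the training range.
X_new = np.array([[200.0]])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

forest_pred = forest.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

# The forest averages training labels, so it cannot exceed max(y_train) = 200,
# while the linear model follows the trend out to about 400.
print(f"forest: {forest_pred:.1f}  linear: {linear_pred:.1f}")
```

For prices that move beyond their historical range, this is a real constraint: the prediction is capped by the highest value seen in training.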
- Explanation of changes in model choice (if changes occurred between the Segment 2 and Segment 3 deliverables):
- Added hyperparameter tuning to the model in the hope of improving the score.
- Description of how they have trained the model thus far, and any additional training that will take place:
- Added a neural network using Keras, and the scores improved significantly.
- Description of current accuracy score:
- Model without tuning:
- Model with tuning:
- Model with ANN:
- Explanation of model choice, including limitations and benefits:
- Advantages: More accurate and runs very fast.
- Limitations: May overfit easily.
- Explanation of changes in model choice (if changes occurred between the Segment 2 and Segment 3 deliverables):
- This is a new machine learning model added in Segment 3.
- Description of how they have trained the model thus far, and any additional training that will take place:
- The model uses two hidden layers with five neurons each and one output layer with one neuron.
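To make the layer layout concrete, here is a framework-free NumPy sketch of a forward pass through a network of that shape. It is not the project's Keras code: the three input features (assumed to be "High", "Low", and "VWAP"), the ReLU activations, and the random untrained weights are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: 3 inputs -> 5 hidden -> 5 hidden -> 1 output.
layer_sizes = [3, 5, 5, 1]

# One weight matrix and one bias vector per layer (random, untrained).
weights = [
    rng.normal(0, 0.1, (n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(x: np.ndarray) -> np.ndarray:
    """Run a batch of rows through the network."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x  # linear output, suitable for regression


n_params = sum(W.size + b.size for W, b in zip(weights, biases))
batch = rng.normal(size=(4, 3))  # four rows of three features
out = forward(batch)
print(out.shape, n_params)
```

With three inputs this layout has only 56 trainable parameters, which is small for a neural network.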
- Description of current accuracy score
The dashboard was built using Tableau Public
The Google Slides Presentation is here
- PostgreSQL
- Amazon Web Services (AWS)
- Jupyter Notebook
- Tableau Public
- Google Slides
- Google Colab