Problem statement:
The end goal of this project is to develop a binary classification model that can classify Wikipedia articles as either featured or non-featured.
We use Google Colab as the IDE because it is a powerful and convenient environment for this kind of work.
We use libraries such as pandas, numpy, seaborn, matplotlib, scikit-learn (sklearn), and imbalanced-learn (imblearn).
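A minimal setup cell might look like the following (package names as published on PyPI; version pinning is optional and not part of the original write-up):

```python
# In a Colab cell, install anything not already present:
#   !pip install pandas numpy seaborn matplotlib scikit-learn imbalanced-learn requests

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```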
For dataset creation we also use the Wikipedia API.
First, run the dataset-creation step to build the dataset file.
Because the data is fetched from the API, this step is slow and requests can time out depending on the server's response, so re-run it until all of the data has been saved (a sketch of this collection loop is shown below).
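The collection code itself is not shown in this write-up; the sketch below illustrates one way to pull featured-article titles from the MediaWiki API with continuation, checkpointing after every batch so that a timed-out run can simply be re-run and resumed. The file names (featured_titles.csv, continue_token.json) and the resume logic are assumptions; feature extraction and the non-featured sample are omitted.

```python
import json
import os

import pandas as pd
import requests

API_URL = "https://en.wikipedia.org/w/api.php"   # MediaWiki API endpoint
OUT_FILE = "featured_titles.csv"                 # assumed output file
CONT_FILE = "continue_token.json"                # assumed checkpoint file
HEADERS = {"User-Agent": "wiki-featured-classifier (example)"}  # polite, descriptive UA

def fetch_featured_titles():
    """Fetch members of Category:Featured articles, saving after every batch."""
    # Resume from a previous partial run if checkpoint files exist.
    titles = pd.read_csv(OUT_FILE)["title"].tolist() if os.path.exists(OUT_FILE) else []
    cont = json.load(open(CONT_FILE)) if os.path.exists(CONT_FILE) else {}

    while True:
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": "Category:Featured articles",
            "cmlimit": "500",
            "format": "json",
            **cont,
        }
        try:
            resp = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            break  # timed out or server error: keep the checkpoint and stop

        data = resp.json()
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        # Incremental save, deduplicated in case a batch is refetched after an interruption.
        pd.DataFrame({"title": titles}).drop_duplicates().to_csv(OUT_FILE, index=False)

        if "continue" not in data:
            if os.path.exists(CONT_FILE):
                os.remove(CONT_FILE)  # finished: clear the checkpoint
            break
        cont = data["continue"]
        with open(CONT_FILE, "w") as f:
            json.dump(cont, f)        # checkpoint the continuation token

    return titles

titles = fetch_featured_titles()
print(len(titles), "featured article titles saved")
```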
The dataset, before preprocessing, looks like this:
We performed preprocessing operations such as removing duplicates, dropping NaN values, checking for missing values, and converting data types (see the sketch below).
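A minimal sketch of these cleaning steps, assuming the scraped data has already been written to dataset.csv; the column name is_featured is illustrative, not necessarily the project's actual schema:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")              # assumed file name

print(df.isna().sum())                       # check missing values per column
df = df.drop_duplicates()                    # remove duplicate rows
df = df.dropna()                             # drop rows that still contain NaN
df["is_featured"] = df["is_featured"].astype(int)   # example data-type conversion
print(df.dtypes)                             # verify the resulting types
```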
This is the most important part: when we inspected the dependent (target) feature, we found that the data is highly imbalanced, with a class ratio of about 98:2.
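The write-up does not say which imblearn technique was used to handle the 98:2 imbalance; purely as an illustration, the sketch below checks the class ratio and applies SMOTE to the training split only, so the test set keeps the original distribution. Column names are the same illustrative ones as above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE   # one possible choice; the original technique is not stated

X = df.drop(columns=["is_featured"])       # illustrative feature matrix
y = df["is_featured"]                      # dependent (target) feature

print(y.value_counts(normalize=True))      # confirms the ~98:2 ratio

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample only the training data so evaluation reflects the real-world distribution.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train_res))                # classes are now balanced
```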
We trained three models: Linear Regression, Random Forest, and SVM.
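A sketch of training and scoring the three models. Note that for a binary target, scikit-learn's LogisticRegression is the usual linear choice, so that is what is assumed for the "Linear Regression" entry; all hyperparameters are defaults.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

models = {
    "Linear (Logistic) Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_res, y_train_res)                    # fit on the resampled training data
    train_acc = model.score(X_train_res, y_train_res)      # training accuracy
    test_acc = model.score(X_test, y_test)                 # test accuracy
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: train={train_acc:.4f}  test={test_acc:.4f}  AUROC={auroc:.2f}")
```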
The results of the three models are:
The feature importances are:
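One way to obtain these importances is from the fitted Random Forest's built-in feature_importances_ attribute (the linear model's coefficients can be inspected similarly); a quick sketch:

```python
import matplotlib.pyplot as plt
import pandas as pd

rf = models["Random Forest"]
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()

print(importances)
importances.plot(kind="barh")   # horizontal bar chart of importances
plt.tight_layout()
plt.show()
```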
Based on the results, we can draw the following conclusions:
The Linear Regression model has the highest accuracy on the test set (98.48%), followed by the Random Forest model (94.67%); the SVM model has the lowest test accuracy (94%).
On the training set, the Linear Regression model reaches 97.97% accuracy, the SVM model 97%, and the Random Forest model 94.52%.
The AUROC score is a metric that indicates the quality of the model's ranking rather than the absolute accuracy. Both the Linear Regression and Random Forest models have high AUROC scores of 0.97, indicating that they can effectively separate the two classes. The SVM model has a slightly lower AUROC score of 0.94, indicating that it may not be as effective in ranking the samples.
Overall, the Linear Regression and Random Forest models seem to perform well on this dataset, while the SVM model has room for improvement. However, it's important to note that the choice of model depends on the specific requirements of the problem at hand, and other factors such as interpretability, computational complexity, and scalability should also be considered.