A machine learning project built for the Artificial Intelligence Society DataSci '17 competition. It uses the UCI Student Performance Data Set, which combines school reports and questionnaires, to predict whether a student is expected to pass or fail the course. Several classification models were compared. The best model, Logistic Regression, achieved about 70% accuracy.
This project requires Python 3.4 with the following Python libraries installed: NumPy, SciPy, scikit-learn, and pandas.
On Debian-based Linux systems, the above libraries can be installed with the following commands (use the corresponding ones for other distributions):
sudo apt-get install build-essential python3-dev python3-setuptools python3-numpy python3-scipy libatlas-dev libatlas3gf-base
sudo apt-get install python3-pip
sudo -H pip3 install -U scikit-learn
sudo -H pip3 install pandas
This is a classification problem because we are asked to identify students who might fail the final exam, so that someone can intervene before it is too late.
In other words, each student has to be assigned to one of two classes: those expected to fail the course and those expected to pass it.
- Total number of students: 395
- Total number of features: 30
- Target: 1 ("passed")
- Number of students who passed: 265
- Number of students who failed: 130
- Number of columns with numeric values: 13
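The class counts above also give a useful reference point for the accuracies reported later. The short calculation below is only a sketch: it derives the pass rate and the accuracy of a hypothetical classifier that always predicts "passed". It assumes the test split has roughly the same class balance as the full data set, which the source does not state.

```python
# Class counts taken from the list above
n_passed = 265
n_failed = 130
n_total = n_passed + n_failed

# Share of students who passed
pass_rate = n_passed / n_total

# Accuracy of a hypothetical classifier that always predicts the majority class
majority_baseline = max(n_passed, n_failed) / n_total

print("Total students: {}".format(n_total))
print("Pass rate: {:.2%}".format(pass_rate))
print("Majority-class baseline accuracy: {:.2%}".format(majority_baseline))
```

Under that assumption, a model needs to score noticeably above about 67% to be doing better than always predicting "passed".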
Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason',
'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed
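Separating the feature columns from the target column and preparing them for scikit-learn could look roughly like the sketch below. The helper names `split_features_target` and `encode_features` are hypothetical, the tiny data frame is synthetic, and the `"yes"`/`"no"` label values are an assumption; the project's own preprocessing may differ.

```python
import pandas as pd


def split_features_target(df, target_col="passed"):
    """Separate the feature columns from the target column."""
    feature_cols = [c for c in df.columns if c != target_col]
    return df[feature_cols], df[target_col]


def encode_features(X):
    """One-hot encode non-numeric columns; numeric columns pass through unchanged."""
    return pd.get_dummies(X)


# Tiny synthetic sample using a few of the column names listed above (illustration only)
sample = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [17, 16, 18],
    "internet": ["yes", "no", "yes"],
    "passed": ["yes", "no", "yes"],  # assumed label values
})

X, y = split_features_target(sample)
X_encoded = encode_features(X)

# Map the assumed "yes"/"no" labels to 1/0 so that 1 means "passed"
y_binary = (y == "yes").astype(int)

print(list(X_encoded.columns))
print(y_binary.tolist())
```

One-hot encoding is used here because most of the 30 features are categorical, while the classifiers compared in this project expect numeric input.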
Support Vector Machine - Linear
Total execution time: 23.7 milliseconds
Accuracy is 69.62%
Support Vector Machine - Polynomial
Total execution time: 478.0 milliseconds
Accuracy is 63.92%
Support Vector Machine - Radial Basis Function
Total execution time: 8.7 milliseconds
Accuracy is 66.46%
Logistic Regression
Total execution time: 2.7 milliseconds
Accuracy is 69.62%
Decision Tree
Total execution time: 1.7 milliseconds
Accuracy is 56.33%
Random Forest
Total execution time: 201.7 milliseconds
Accuracy is 69.62%
Random Forest (Optimised)
Total execution time: 47.6 milliseconds
Accuracy is 63.29%
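A comparison like the one above can be produced by timing each model's training and scoring in a loop. The following is a minimal sketch, not the code that generated the reported numbers: it uses synthetic data from `make_classification` with the same shape as the data set, a 40% test split, and default hyperparameters, all of which are assumptions. The settings of the optimised Random Forest are not stated, so it is left out.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the data set (395 rows, 30 features)
X, y = make_classification(n_samples=395, n_features=30, random_state=0)

# The split ratio is an assumption; the source does not state it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

models = [
    ("Support Vector Machine - Linear", SVC(kernel="linear")),
    ("Support Vector Machine - Polynomial", SVC(kernel="poly")),
    ("Support Vector Machine - Radial Basis Function", SVC(kernel="rbf")),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
    ("Random Forest", RandomForestClassifier(random_state=0)),
]

results = {}
for name, model in models:
    # Time training and testing together, as in the figures above
    start = time.perf_counter()
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    elapsed_ms = (time.perf_counter() - start) * 1000
    results[name] = (accuracy, elapsed_ms)
    print("{}\nTotal execution time: {:.1f} milliseconds\nAccuracy is {:.2%}".format(
        name, elapsed_ms, accuracy))
```

The numbers printed by this sketch will not match the results above, since both the data and the hyperparameters differ.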
Based on these results, Logistic Regression gives the best overall performance. It ties with the linear Support Vector Machine and the Random Forest for the highest accuracy (69.62%) and is by far the fastest of the three (2.7 ms). Only the Decision Tree ran faster (1.7 ms), but it was also the least accurate (56.33%).
Although Logistic Regression comes out ahead on both criteria used here, accuracy and combined training and testing time, I would still have traded a longer run time for higher accuracy had another model been more accurate. Failing a course can adversely affect a student's mental and psychological health, which makes accuracy the dominant factor when comparing models.
Because the predictions need to be as accurate as possible, accuracy is weighted more heavily than training and testing time.
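The selection rule described here, accuracy first and execution time only as a tie-breaker, can be written down directly. The sketch below applies it to the figures reported above; the tuple layout and variable names are chosen for illustration only.

```python
# (model, total execution time in ms, accuracy in %) as reported above
results = [
    ("Support Vector Machine - Linear", 23.7, 69.62),
    ("Support Vector Machine - Polynomial", 478.0, 63.92),
    ("Support Vector Machine - Radial Basis Function", 8.7, 66.46),
    ("Logistic Regression", 2.7, 69.62),
    ("Decision Tree", 1.7, 56.33),
    ("Random Forest", 201.7, 69.62),
    ("Random Forest (Optimised)", 47.6, 63.29),
]

# Rank by accuracy (highest first), then by execution time (lowest first)
ranked = sorted(results, key=lambda r: (-r[2], r[1]))

best = ranked[0]
print("Best: {} ({:.2f}%, {:.1f} ms)".format(best[0], best[2], best[1]))
# prints "Best: Logistic Regression (69.62%, 2.7 ms)"
```

Three models share the top accuracy, so the tie-breaker on execution time is what puts Logistic Regression first.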
The data set and a description of its attributes are available from the UCI ML Repository.
This project is licensed under the MIT License. See the LICENSE.md file for details.
- Paulo Cortez - For making the Data Set public by donating it to the UCI ML Repository.