Text classification task on IMDB and Rotten Tomatoes movie reviews.
Inspired by two Kaggle competitions:
- https://www.kaggle.com/c/word2vec-nlp-tutorial
- https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

Models:
- Random Forest
  - Bag-of-words (BoW) features
  - word2vec (W2V) features
- LSTM Recurrent Neural Network
#### IMDB
- Combine IMDB reviews with Rotten Tomatoes reviews:

      cd data/imdb/
      python combine_rottom.py

- Preprocess the IMDB dataset in multiple ways and save all versions in the ./processed/ directory:

      cd ./processed/
      ./runscript.sh
#### Rotten Tomatoes
- Combine Rotten Tomatoes reviews with IMDB reviews:

      cd data/rot_tom/
      python combine_imdb.py

- Preprocess the Rotten Tomatoes dataset in multiple ways and save all versions in the ./processed/ directory:

      cd ./processed/
      ./runscript.sh

Train a Random Forest on bag-of-words (BoW) features:

    cd src/RandomForest/BOW/
    mkdir ModelResponses
    ./bow_forest.sh
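
As a rough illustration of what bag-of-words features look like, the sketch below builds a vocabulary from a few reviews and maps each review to a vector of word counts. It is not the code run by bow_forest.sh; the helper names `build_vocabulary` and `bow_vector`, the whitespace tokenization, and the `max_features` cutoff are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(reviews, max_features=5000):
    """Map the most frequent words to column indices (hypothetical helper)."""
    counts = Counter(word for review in reviews for word in review.lower().split())
    most_common = counts.most_common(max_features)
    return {word: index for index, (word, _) in enumerate(most_common)}


def bow_vector(review, vocabulary):
    """Count how often each vocabulary word occurs in one review."""
    vector = [0] * len(vocabulary)
    for word in review.lower().split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] += 1
    return vector


# Toy data standing in for the preprocessed reviews.
reviews = ["a great great movie", "a dull movie"]
vocabulary = build_vocabulary(reviews)
features = [bow_vector(review, vocabulary) for review in reviews]
```

Each row of `features` has one column per vocabulary word and could be passed, together with the sentiment labels, to a classifier such as scikit-learn's `RandomForestClassifier`.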

Average word embeddings to get review embedding features:

    cd src/RandomForest/W2V/
    mkdir ModelResponses
    ./w2v_forest.sh
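
The averaging step can be sketched as follows. This is a minimal example under stated assumptions, not the code behind w2v_forest.sh: the function name `average_embedding`, the toy 2-dimensional vectors, and the choice to skip unknown words and return a zero vector for reviews with no known words are all illustrative.

```python
import numpy as np


def average_embedding(words, word_vectors, dim):
    """Average the vectors of the words that have an embedding (hypothetical helper)."""
    known = [word_vectors[word] for word in words if word in word_vectors]
    if not known:
        # No known words: fall back to a zero vector (an assumption of this sketch).
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 2-dimensional embeddings; real ones would come from a trained word2vec model.
word_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}
review = "good movie overall".split()
features = average_embedding(review, word_vectors, dim=2)
```

Every review ends up with a fixed-length feature vector regardless of how many words it contains, which is what a Random Forest needs as input.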

Cluster word embeddings to get bag-of-centroids features:

    cd src/RandomForest/W2V/
    mkdir -p ModelResponses/boc
    mkdir -p W2VClusters
    ./boc_forest.sh
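
A bag-of-centroids vector counts how many words of a review fall into each cluster of word embeddings. The sketch below assumes the clustering has already been done (for example with k-means over the word vectors) and that its result is available as a word-to-cluster mapping; the name `bag_of_centroids` and the toy mapping are hypothetical and not taken from boc_forest.sh.

```python
import numpy as np


def bag_of_centroids(words, word_to_cluster, num_clusters):
    """Count the words of one review per embedding cluster (hypothetical helper)."""
    counts = np.zeros(num_clusters, dtype=int)
    for word in words:
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # words without a cluster assignment are skipped
            counts[cluster] += 1
    return counts


# Toy cluster assignments; in practice these come from clustering the word vectors.
word_to_cluster = {"good": 0, "great": 0, "bad": 1, "movie": 2}
review = "good great movie".split()
features = bag_of_centroids(review, word_to_cluster, num_clusters=3)
```

The result is analogous to a bag-of-words vector, except that semantically similar words share a column because they belong to the same cluster.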

Train the LSTM recurrent neural network:

    cd src/RNN/
    mkdir ModelResponses
    mkdir LSTM-Models
    ./runscript.sh

Dependencies:
- theano
- lasagne
- sklearn
- nltk
- numpy
- cPickle (part of the Python 2 standard library)
- pandas
- gensim