- In recent years, image captioning, the task of generating a concise natural language description of an image, has gathered widespread attention.
- The task draws on Computer Vision (CV), Natural Language Generation (NLG), and Machine Learning (ML) / Deep Learning (DL) methods. Deep learning is a rapidly advancing and heavily researched field that has made its way into many aspects of our daily lives. It is a subfield of Machine Learning concerned with algorithms inspired by the structure and functioning of the brain.
- In this project, we developed a model that generates a concise natural language description of an image.
- We used the Microsoft Common Objects in Context (MS COCO) dataset for this project, which provides class labels, labels for different segments of an image, and a set of captions for each image.
MS COCO Dataset - https://cocodataset.org/home
Script to download dataset - Download Script
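As a quick sanity check that the dataset is in place, the caption annotations can be read with the `pycocotools` COCO API. This is only a minimal sketch; the annotation file path below is an assumption and should point at wherever you unpacked the dataset.

```python
from pycocotools.coco import COCO

# Assumed path -- adjust to your local copy of the MS COCO caption annotations.
ANNOTATION_FILE = "annotations/captions_train2014.json"

coco = COCO(ANNOTATION_FILE)                     # load the caption annotations
image_ids = list(coco.imgs.keys())               # ids of all images in this split
ann_ids = coco.getAnnIds(imgIds=image_ids[0])    # caption annotation ids for one image
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])                        # each image has several reference captions
```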
The following setup is needed to train the model.
- A GPU with at least 16 GB of memory
- At least 8 GB of RAM
- Install the latest versions of the external packages: `nltk`, `torch`, `numpy`, `tqdm`, `Pillow` (for `PIL.Image`), and `pycocotools` (for the `COCO` API). The standard-library modules (`os`, `sys`, `math`, `time`, `json`) and the local modules (`vocabulary`, `data_set_loader`, `pipeline_models`) come with Python and this repository respectively, so they do not need to be installed.
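For example, the external dependencies can be installed with pip (a sketch; versions are not pinned here):

```bash
pip install nltk torch numpy tqdm Pillow pycocotools
```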
The CNN (encoder) architecture consists of (see the sketch after this list):
- Conv2D layers
- Max Pooling
- ReLU Activation
- Fully Connected Layer
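A minimal sketch of an encoder built from those components is shown below. It is illustrative only; the actual model lives in `pipeline_models.py`, and the layer sizes and `embed_size` here are assumptions.

```python
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Illustrative encoder: Conv2D -> ReLU -> Max Pooling blocks, then a fully connected layer."""
    def __init__(self, embed_size=256):  # embed_size is an assumed hyperparameter
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # collapse spatial dims to 1x1
        )
        self.fc = nn.Linear(64, embed_size)   # fully connected layer -> image embedding

    def forward(self, images):
        x = self.features(images)             # (batch, 64, 1, 1)
        x = x.flatten(1)                       # (batch, 64)
        return self.fc(x)                      # (batch, embed_size)
```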
The RNN (decoder) architecture consists of (see the sketch after this list):
- LSTM Layers
- Memory Cell
- Forget Gate
- Input Gate
- Input Modulation Gate
- Output Gate
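The memory cell and the forget, input, input modulation, and output gates are all implemented internally by PyTorch's `nn.LSTM`. A minimal illustrative decoder along those lines might look like the following; the names, sizes, and structure are assumptions, and the real model is in `pipeline_models.py`.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Illustrative LSTM decoder: nn.LSTM provides the memory cell and the
    forget / input / input-modulation / output gates internally."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)   # word ids -> embeddings
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)         # hidden state -> word scores

    def forward(self, features, captions):
        # Use the image feature as the first "token" of the input sequence.
        embeddings = self.embed(captions[:, :-1])             # (batch, seq-1, embed)
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)                         # (batch, seq, hidden)
        return self.fc(hiddens)                                # (batch, seq, vocab)
```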
- In `data_set_loader.py` and `vocabulary.py`, set the path to the dataset correctly.
- Run `model_training.py` to train the model.
- If training for the first time, set `vocab_from_file=True` in the `get_data_set_loader` method of `data_set_loader.py`. On subsequent runs, set it to `False` to reuse the vocabulary file created previously.
- Set the parameters `epoch`, `batch_size`, `mode`, and `path_to_files` in `data_set_loader.py`.
- `pipeline_models.py` contains the architecture of the models we are using.
- After each epoch, the encoder (CNN) and decoder (LSTM) models are saved as checkpoints in the `/models` folder (see the sketch below).
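The per-epoch checkpointing could look roughly like this. It is only a sketch; the actual loop and file names in `model_training.py` may differ, and `encoder`, `decoder`, and `epoch` are assumed to come from that script.

```python
import os
import torch

def save_checkpoint(encoder, decoder, epoch, out_dir="models"):
    """Save encoder (CNN) and decoder (LSTM) weights after an epoch.
    Sketch only: file names here are assumptions."""
    os.makedirs(out_dir, exist_ok=True)
    torch.save(encoder.state_dict(), os.path.join(out_dir, f"encoderCNN-{epoch}.pkl"))
    torch.save(decoder.state_dict(), os.path.join(out_dir, f"decoderRNN-{epoch}.pkl"))
```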
- Using the `model_inference.ipynb` notebook, there are two ways to see the captions generated by the model we have built.
- We have uploaded our trained models `encoderCNN.pkl` and `decoderRNN.pkl` to this repository; these models are used for inference.
- Use the `get_caption()` method. A random image is selected each time this method is run.
- Use the `get_image_caption(image_path)` method. Its parameter is the path of the image whose caption you want.
- For example, you can use the sample image `surf_image.jpg` to see the output, or download any image from the internet and pass its path.
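Inside the notebook, the calls would look roughly like this (a usage sketch; both methods are defined in `model_inference.ipynb`):

```python
# Caption a randomly selected image from the dataset.
get_caption()

# Caption a specific image by path, e.g. the sample image in the repository.
get_image_caption("surf_image.jpg")
```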
Out-of-memory issue:
- Try reducing `batch_size`
- Use a GPU with more memory