
Generate natural language captions for images using an encoder-decoder architecture


vamshi-chidara/image-caption-generator


Introduction

  • In recent years, image captioning, the task of generating a concise description of an image, has gathered widespread attention.
  • The task combines Computer Vision (CV), Natural Language Generation (NLG), and Machine Learning (ML)/Deep Learning (DL) methods. Deep learning, a subfield of machine learning whose algorithms are inspired by the structure and functioning of the brain, is rapidly advancing and is one of the most actively researched fields, making its way into many aspects of our daily lives.

Problem Statement

  • In this project, we developed a model that generates a concise natural language description of an image.
  • We used the Microsoft Common Objects in Context (MS COCO) dataset, which provides class labels, labels for different segments of an image, and a set of captions for each image.

Dataset

MS COCO Dataset - https://cocodataset.org/home
Script to download dataset - Download Script
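
For reference, the snippet below is a minimal sketch of inspecting the caption annotations with pycocotools; the annotation file path is a placeholder and depends on where the download script places the data.

```python
# Hypothetical sketch: load MS COCO caption annotations and print the
# captions attached to one image (MS COCO provides several captions per image).
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2014.json")  # placeholder path

img_id = list(coco_caps.imgs.keys())[0]               # pick any image id
ann_ids = coco_caps.getAnnIds(imgIds=img_id)          # caption annotation ids for that image
for ann in coco_caps.loadAnns(ann_ids):
    print(ann["caption"])
```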

Requirements

The following setup is needed to train the model.

  1. A GPU with at least 16 GB of memory (a quick check is sketched below)
  2. At least 8 GB of RAM
  3. The latest versions of the following Python packages: nltk, torch, numpy, tqdm, Pillow (for PIL.Image), and pycocotools (for COCO), along with the standard-library modules os, sys, math, time, and json, and the project modules vocabulary, data_set_loader, and pipeline_models.
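
As a quick sanity check of the GPU requirement, the generic PyTorch snippet below (not part of this repository) prints the name and memory of the first CUDA device.

```python
# Check that a CUDA GPU with enough memory is available before training.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; training on CPU will be extremely slow.")
```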

Network Architecture

CNN Architecture consists of:

  1. Conv2D layers
  2. Max Pooling
  3. ReLU Activation
  4. Fully Connected Layer

RNN Architecture consists of:

  1. LSTM Layers
  2. Memory Cell
  3. Forget Gate
  4. Input Gate
  5. Input Modulation Gate
  6. Output Gate
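
The sketch below illustrates this encoder-decoder pairing in PyTorch. It is not the code in pipeline_models.py: the class names (EncoderCNN, DecoderRNN), the ResNet-50 backbone standing in for the Conv2D/max-pooling/ReLU stack, and the parameter names are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """CNN encoder: conv/pooling/ReLU feature extractor plus a fully connected layer."""
    def __init__(self, embed_size):
        super().__init__()
        # conv + max pooling + ReLU backbone (newer torchvision uses the weights= argument)
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)         # fully connected layer

    def forward(self, images):
        features = self.backbone(images).flatten(1)    # (batch, 2048)
        return self.fc(features)                       # (batch, embed_size)


class DecoderRNN(nn.Module):
    """LSTM decoder: memory cell with forget, input, input-modulation, and output gates."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        embeddings = self.embed(captions[:, :-1])                       # teacher forcing
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)  # image feature as first step
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                                         # word scores per time step
```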

Procedure to Train Model

  1. In data_set_loader.py and vocabulary.py, set the path to the dataset correctly.
  2. If training for the first time, set vocab_from_file=True in the get_data_set_loader method in data_set_loader.py. On subsequent runs, set it to False to reuse the vocabulary file created previously (a minimal sketch of this setup follows the list).
  3. Set the parameters epoch, batch_size, mode, and path_to_files in data_set_loader.py.
  4. pipeline_models.py contains the architectures of the models we use.
  5. Run model_training.py to train the model.
  6. After each epoch, the encoder (CNN) and decoder (LSTM) models are saved as checkpoints in the /models folder.
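
The snippet below sketches the first-run configuration described in steps 2 and 3. The function name get_data_set_loader and the parameter names come from this README, but the exact signature is an assumption, so treat it as illustrative rather than exact.

```python
# Illustrative first-run setup; check data_set_loader.py for the real signature.
from data_set_loader import get_data_set_loader

data_loader = get_data_set_loader(
    mode="train",          # training split
    batch_size=64,         # lower this if the GPU runs out of memory
    vocab_from_file=True,  # True on the first run; False afterwards to reuse the vocab file
)

# Then launch training (checkpoints land in /models after each epoch):
#   python model_training.py
```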

Procedure to Predict

  • Using the model_inference.ipynb notebook, there are two ways to see the captions generated by the model we have built.
  • We have uploaded our trained models encoderCNN.pkl and decoderRNN.pkl to this repository; these models are used for inference.

Option 1 : Caption for a random image from the test dataset.

Use the get_caption() method. A random image is selected each time this method is run (both options are sketched after Option 2).

Option 2 : Caption for an image by passing its absolute path.

Use the get_image_caption(image_path) method. The parameter is the path of the image you want captioned.
You can use the sample image surf_image.jpg to see the output, or download any image from the internet and pass its path.
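
The calls below sketch both options from inside model_inference.ipynb; they assume the notebook exposes get_caption() and get_image_caption(image_path) exactly as named above, and the image path is a placeholder.

```python
# Option 1: caption a random image from the test dataset.
get_caption()

# Option 2: caption a specific image by absolute path (placeholder path shown).
get_image_caption("/path/to/surf_image.jpg")
```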

Frequently encountered problems

Out of memory issues:

  • Try reducing batch_size
  • Use a GPU with more memory or higher compute capability

Output
