An NLP pipeline for text analysis, covering preprocessing, tokenization, lemmatization, POS tagging, chunking, NER, sentiment analysis, topic modeling, keyword extraction, dependency parsing, summarization, spell checking, and visualization.
I developed this pipeline as my final-year undergraduate project in 2022.
- Clone the repository.
- Install dependencies using `requirements.txt`: `pip install -r requirements.txt`
- Run `main.py` to execute the pipeline.
- Text Preprocessing: Lowercasing, punctuation removal, and whitespace normalization.
- Tokenization: Splitting text into tokens and removing stopwords.
- Lemmatization: Reducing words to their base forms.
- POS Tagging: Assigning part-of-speech tags to tokens.
- Chunking: Grouping tokens into meaningful chunks (e.g., noun phrases).
- Named Entity Recognition (NER): Identifying entities like names, dates, and locations.
- Sentiment Analysis: Analyzing the sentiment of the text (positive, negative, neutral).
- Topic Modeling: Identifying topics in the text using Latent Dirichlet Allocation (LDA).
- Keyword Extraction: Extracting important keywords using TF-IDF.
- Dependency Parsing: Analyzing grammatical relationships between words.
- Text Summarization: Generating a summary of the text using Latent Semantic Analysis (LSA).
- Spell Checking: Correcting spelling errors in the text.
- Visualization: Word cloud, bar chart, and network graph for insights.
- Saving Results: Saving processed data and visualizations to files.
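
To make the first stages of the list above concrete, here is a minimal sketch of text preprocessing, tokenization with stopword removal, and TF-IDF keyword extraction. It uses only the Python standard library and is not taken from `main.py`: the function names (`preprocess`, `tokenize`, `tfidf_keywords`), the tiny stopword set, and the smoothed IDF formula are all assumptions chosen for illustration, and the project itself may rely on other libraries for these steps.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the project's actual list).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text, stopwords=STOPWORDS):
    """Split preprocessed text on whitespace and drop stopwords."""
    return [tok for tok in text.split() if tok not in stopwords]


def tfidf_keywords(docs, top_k=3):
    """Return the top_k (term, score) pairs for each document.

    Score = term frequency * smoothed IDF, where
    IDF = ln((1 + n_docs) / (1 + doc_freq)) + 1.
    """
    tokenized = [tokenize(preprocess(doc)) for doc in docs]
    n_docs = len(tokenized)

    # Count how many documents contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_k])
    return results


if __name__ == "__main__":
    sample_docs = ["The cat sat on the mat.", "The dog chased the cat."]
    for doc, keywords in zip(sample_docs, tfidf_keywords(sample_docs, top_k=2)):
        print(doc, "->", [term for term, _ in keywords])
```

Terms that appear in every document (here, "cat") receive a lower IDF weight than terms unique to one document, which is why TF-IDF tends to surface distinguishing words as keywords.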