Kabyle NLP Toolkit

This repository provides a lightweight, modular toolkit for processing the Tatoeba corpus for Kabyle and English. It downloads and extracts the Tatoeba data to generate a bilingual corpus, splits that corpus into separate language files, and automatically replaces non‑standard characters in Kabyle text so that it conforms to a standardized character set.

Create and activate a virtual environment

python3 -m venv env

On Linux/Mac:

source env/bin/activate

Getting Started

Clone the repository using:

git clone https://github.com/BoFFire/kabyle-nlp-toolkit.git

Change into the repository directory:

cd kabyle-nlp-toolkit

Install the dependencies from requirements.txt:

pip install -r requirements.txt

Run the main corpus processing script for English-Kabyle:

python3 get_tatoeba_corpus.py --source_lang eng --target_lang kab

All output files are saved in the "corpus" directory by default.

Expected files:

corpus/
├── eng_kab_sentence_pairs.tsv
├── en.txt
├── kab.txt
├── kab_fixed.txt
└── kab_stopwords.txt

corpus/eng_kab_sentence_pairs.tsv : TSV file containing bilingual sentence pairs
corpus/en.txt : File with English sentences only
corpus/kab.txt : File with original Kabyle sentences
corpus/kab_fixed.txt : File with fixed Kabyle sentences (non-standard characters replaced)
corpus/kab_stopwords.txt : File with a list of candidate Kabyle stopwords
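
The splitting and character-fixing steps described above can be illustrated with a minimal sketch. It is not the toolkit's actual code: the function names split_pairs and fix_kabyle, the two-column TSV layout (English first, Kabyle second), and the replacement table are all assumptions made for illustration. The table covers only a few Greek look-alikes that are commonly typed in place of the Latin letters ɛ and ɣ; the real fixer.py may handle a different or larger set.

```python
from __future__ import annotations

import csv
import io

# Hypothetical mapping (an assumption, not the toolkit's actual table):
# Greek look-alikes that may appear in place of Latin letters.
REPLACEMENTS = {
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e
    "\u03b3": "\u0263",  # Greek small gamma   -> Latin small gamma
    "\u0393": "\u0194",  # Greek capital gamma -> Latin capital gamma
}
_TABLE = str.maketrans(REPLACEMENTS)


def fix_kabyle(text: str) -> str:
    """Replace every character listed in REPLACEMENTS; leave the rest as-is."""
    return text.translate(_TABLE)


def split_pairs(tsv_text: str) -> tuple[list[str], list[str]]:
    """Split two-column TSV text into (english, kabyle) sentence lists.

    Assumes column 0 is English and column 1 is Kabyle. Rows with fewer
    than two columns are skipped.
    """
    english: list[str] = []
    kabyle: list[str] = []
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if len(row) < 2:
            continue
        english.append(row[0])
        kabyle.append(row[1])
    return english, kabyle


if __name__ == "__main__":
    # Placeholder input: the Kabyle word is written with a Greek gamma.
    sample = "bread\ta\u03b3rum\n"
    en_sentences, kab_sentences = split_pairs(sample)
    print(en_sentences)
    print([fix_kabyle(s) for s in kab_sentences])
```

Using str.translate with a prebuilt table keeps the fix to a single pass over each sentence, and adding another replacement only requires a new entry in the dictionary.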
