This repository provides a lightweight, modular toolkit for processing the Tatoeba corpus for Kabyle and English. It downloads and extracts the Tatoeba data to generate a bilingual corpus, splits that corpus into separate language files, and automatically replaces non‑standard characters in Kabyle text so that it conforms to a standardized character set.
python3 -m venv env
source env/bin/activate
Clone the repository using:
git clone https://github.com/BoFFire/kabyle-nlp-toolkit.git
Change into the repository directory:
cd kabyle-nlp-toolkit
pip install -r requirements.txt
python3 get_tatoeba_corpus.py --source_lang eng --target_lang kab
Expected files:
corpus/
├── eng_kab_sentence_pairs.tsv
├── en.txt
├── kab.txt
├── kab_fixed.txt
└── kab_stopwords.txt
corpus/eng_kab_sentence_pairs.tsv: TSV file containing bilingual sentence pairs
corpus/en.txt: File with English sentences only
corpus/kab.txt: File with original Kabyle sentences
corpus/kab_fixed.txt: File with fixed Kabyle sentences (non-standard characters replaced)
corpus/kab_stopwords.txt: File with a list of Kabyle stopword candidates
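
For readers who want a feel for what the splitting and character-fixing steps involve, here is a minimal sketch. It is not the toolkit's actual code. It assumes the TSV has four tab-separated columns (source ID, source sentence, target ID, target sentence), and the names `split_pairs`, `fix_kabyle`, and `LOOKALIKES` are hypothetical. The replacement table is only an illustrative guess at the kind of look-alike characters involved (Greek letters typed in place of the Latin ɣ and ɛ); the repository's real mapping may be different and more complete.

```python
import csv
import io
import unicodedata

# Hypothetical replacement table (an assumption, not the toolkit's own):
# Greek look-alikes mapped to the Latin letters used in standard Kabyle.
LOOKALIKES = str.maketrans({
    "\u03b3": "\u0263",  # Greek small gamma   -> Latin small gamma (ɣ)
    "\u0393": "\u0194",  # Greek capital gamma -> Latin capital gamma (Ɣ)
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e (ɛ)
})


def fix_kabyle(text):
    """Compose combining marks (NFC), then replace look-alike characters."""
    return unicodedata.normalize("NFC", text).translate(LOOKALIKES)


def split_pairs(tsv_file, src_col=1, tgt_col=3):
    """Yield (source, target) sentences from a tab-separated file object.

    The column positions are assumptions about the TSV layout; rows that
    are too short to contain both columns are skipped.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > max(src_col, tgt_col):
            yield row[src_col], row[tgt_col]


if __name__ == "__main__":
    # Tiny in-memory stand-in for corpus/eng_kab_sentence_pairs.tsv.
    sample = io.StringIO("1\tHello.\t2\tAzul.\n3\tBread.\t4\ta\u03b3rum\n")
    for english, kabyle in split_pairs(sample):
        print(english, "|", fix_kabyle(kabyle))
```

To apply the same idea to the real files, open corpus/eng_kab_sentence_pairs.tsv with encoding="utf-8" and newline="" and pass the file object to `split_pairs`, adjusting `src_col` and `tgt_col` if the actual column layout differs.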