Synthetic Q&A Data Generation CLI
StarfishData CLI is a Python command-line tool that generates synthetic Q&A pairs and filters out duplicates using TF-IDF and cosine similarity. It is built on Llama models and is intended for creating datasets for training and research.
✅ Generate Q&A pairs from any topic
✅ Ensures uniqueness (removes duplicates using TF-IDF & cosine similarity)
✅ Works with both remote & local models
✅ CLI-based (easy command-line execution)
✅ Downloads & manages models automatically
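The uniqueness check mentioned above can be illustrated with a small, dependency-free sketch. This is not StarfishData's actual implementation (which presumably relies on scikit-learn, given the install list); the function names and the 0.8 similarity threshold are assumptions made for illustration. It builds a TF-IDF vector per question and keeps a question only if its cosine similarity to every previously kept question stays below the threshold.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    # Smoothed inverse document frequency, so no term gets a zero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(questions, threshold=0.8):
    """Keep each question only if it is not too similar to one already kept.

    The threshold is an assumed value, not one documented by the tool.
    """
    vectors = tfidf_vectors(questions)
    kept = []  # indices of the questions retained so far
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]
```

Lowering the threshold removes more near-duplicates at the risk of discarding questions that merely share vocabulary; raising it keeps more records but lets paraphrases through.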
poetry install
pip install typer torch transformers rich llama-cpp-python huggingface_hub scikit-learn
starfishdata --help
starfishdata generate --prompt "History of AI" --num-records 5
starfishdata download --hf-token <YOUR_HF_TOKEN>
starfishdata cleanup-models
starfishdata generate --prompt "Topic" --num-records 5
Flag | Description | Default |
---|---|---|
--prompt | Topic to generate Q&A about | Required |
--num-records | Number of Q&A pairs (max 100) | 1 |
--file | GGUF model file | Default model |
--output-file | Output JSONL file | output.jsonl |
--cleanup | Delete models after inference | False |
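Because the output is a JSONL file, each line should hold one JSON object. The sketch below shows one way to load such a file for inspection. The helper name `load_records` is hypothetical, and the record schema is not documented here, so the field names in the usage comment (`question`, `answer`) are assumptions.

```python
import json
from pathlib import Path


def load_records(path):
    """Read a JSONL file and return a list with one parsed object per line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


# Example usage (field names are assumed, check your own output file):
# for record in load_records("output.jsonl"):
#     print(record.get("question"), "->", record.get("answer"))
```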
starfishdata download --hf-token <YOUR_HF_TOKEN>
Flag | Description |
---|---|
--name | Hugging Face model name |
--file | GGUF model file |
--hf-token | Hugging Face API Token |
starfishdata cleanup-models
Deletes all downloaded models from cache.
StarfishData lets you switch models or run generation against a local model: pass a different model file with --file, or point it at a GGUF model that is already on disk.
starfishdata generate --prompt "History of AI" --num-records 5 --file "custom_model.gguf"
👉 This allows you to use a different GGUF model file.
If you want to switch models, you can download a different one from Hugging Face:
starfishdata download --name "NewModel/Qwen2" --file "new_model.gguf" --hf-token <YOUR_HF_TOKEN>
You can use a locally stored GGUF model instead of downloading:
starfishdata generate --prompt "Space Exploration" --num-records 10 --file "/path/to/local_model.gguf"
Ensure the model file exists before running this command.
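When the CLI is driven from a script, the existence check can be done up front. The following is a minimal sketch: the helper `build_generate_command` is hypothetical, and the lower bound of 1 for the record count is an assumption (only the maximum of 100 is documented). It uses only the `generate` flags listed in this README.

```python
from pathlib import Path

MAX_RECORDS = 100  # documented upper bound for --num-records


def build_generate_command(prompt, num_records, model_path):
    """Validate the inputs and return the argument list for `starfishdata generate`."""
    # The lower bound of 1 is an assumption; the README only states the maximum.
    if not 1 <= num_records <= MAX_RECORDS:
        raise ValueError(f"num_records must be between 1 and {MAX_RECORDS}")
    model = Path(model_path).expanduser()
    if not model.is_file():
        raise FileNotFoundError(f"GGUF model not found: {model}")
    return [
        "starfishdata", "generate",
        "--prompt", prompt,
        "--num-records", str(num_records),
        "--file", str(model),
    ]
```

The returned list can be handed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with prompts that contain spaces.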
This project is licensed under the MIT License.
Contributions are welcome! To contribute:
- Fork the repo
- Create a feature branch (feature-new-thing)
- Submit a PR 🚀
🎉 Happy Coding! 🚀