feat(chat): fetch examples from chromadb for few-shot learning #21
Conversation
… corresponding queries
🙌 Nice work! I have a few suggestions and comments, mostly about keeping functions simple and reusable.
fix: default prompt templates in `aikg/config/chat.py` need to be adapted to support `{examples_str}`.
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
Thank you for all the comments and suggestions @cmdoret 🙌

Summary of relevant or open changes

Based on your suggestion on making the example parser more flexible:

Clarification 1

clarification/question: the new
Because of this, in:

```python
# aikg/flows/chroma_examples.py
# L110-122

# Create subject documents
docs = get_sparql_examples(
    dir=chroma_input_dir,
)
# Vectorize and index documents by batches to reduce overhead
logger.info(f"Indexing by batches of {chroma_cfg.batch_size} items")
embed_counter = 0
for doc in docs:
    for batch in chunked(doc, chroma_cfg.batch_size):
        embed_counter += len(batch)
        index_batch(batch)
logger.info(f"Indexed {embed_counter} items.")
```

Clarification 2

clarification/question: for #21 (review) should I directly change the
```python
In [2]: from more_itertools import chunked

In [3]: docs = [{'data': i, 'meta': v} for i, v in enumerate(['red', 'blue', 'green'])]

In [4]: docs
Out[4]:
[{'data': 0, 'meta': 'red'},
 {'data': 1, 'meta': 'blue'},
 {'data': 2, 'meta': 'green'}]

In [5]: for batch in chunked(docs, 2):
   ...:     print(f"---\nBATCH: {batch}")
   ...:
---
BATCH: [{'data': 0, 'meta': 'red'}, {'data': 1, 'meta': 'blue'}]
---
BATCH: [{'data': 2, 'meta': 'green'}]
```
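If the intent is to batch across all documents rather than within each one, the loop could chunk `docs` directly. Below is a minimal sketch of that variant; the stdlib-only `chunked` stand-in (in place of `more_itertools.chunked`) and the dummy `index_batch` are illustrative assumptions, not the PR's actual code:

```python
from itertools import islice

def chunked(iterable, n):
    # Stdlib-only stand-in for more_itertools.chunked: yield lists of size n.
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

docs = [{"data": i, "meta": v} for i, v in enumerate(["red", "blue", "green"])]
indexed = []

def index_batch(batch):
    # Dummy indexer for illustration; the real flow embeds and stores the batch.
    indexed.extend(batch)

embed_counter = 0
# Chunk the whole docs list, not each individual doc
for batch in chunked(docs, 2):
    embed_counter += len(batch)
    index_batch(batch)
print(f"Indexed {embed_counter} items.")
```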
Alright, all changes are included as follows:
I tested the code and it runs successfully 👍
I just have one last question regarding the batch-indexing. Other than that, it looks good! :)
Looks good to me :) Great work 🚀
This PR adds the following:

Major:

- New ChromaDB collection `examples/` that includes example questions and corresponding queries. It works as follows:
  - All example questions are indexed in `examples/`, with their corresponding query attached as metadata
  - `generate_examples` queries the collection and outputs a structured prompt which can be embedded into a template prompt

A practical example. Given a file `q1.sparql`: the ChromaDB collection `examples/` will store the question `Who am I?` and attach `SELECT ?me` as metadata. When calling the `generate_example` function on a question (`Who am I working with?`), ChromaDB will look for the closest question(s) and return the following output:
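The structured output itself is not reproduced in this page. As a rough illustration, retrieved question/query pairs might be rendered into the `{examples_str}` template placeholder mentioned above like this; the helper name `format_examples` and the exact layout are assumptions, not the PR's actual format:

```python
# Hypothetical sketch: turn retrieved (question, query) pairs into a
# few-shot block for the {examples_str} slot of a prompt template.
def format_examples(pairs):
    """Render question/query pairs as a few-shot examples block."""
    blocks = [f"Question: {question}\nQuery: {query}" for question, query in pairs]
    return "\n\n".join(blocks)

# Assumed template layout, for illustration only
template = "Answer with SPARQL.\n\n{examples_str}\n\nQuestion: {question}\nQuery:"
examples_str = format_examples([("Who am I?", "SELECT ?me")])
prompt = template.format(examples_str=examples_str, question="Who am I working with?")
print(prompt)
```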
Minor:

- Use `schema` instead of `test`
- Moved the example parser (`get_sparql_examples`) to `io.py` as it seemed more fitting there
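For context on what such a parser could look like: a minimal sketch of a `get_sparql_examples`-style function, assuming (this layout is not stated in the PR) that each `.sparql` file stores the natural-language question as a leading `#` comment followed by the query:

```python
# Hypothetical sketch of an example parser like get_sparql_examples.
# Assumed file layout: first line is "# <question>", rest is the SPARQL query.
from pathlib import Path
import tempfile

def get_sparql_examples(dir):
    """Yield {'data': question, 'meta': query} dicts from *.sparql files."""
    for path in sorted(Path(dir).glob("*.sparql")):
        lines = path.read_text().splitlines()
        question = lines[0].lstrip("# ").strip()
        query = "\n".join(lines[1:]).strip()
        yield {"data": question, "meta": query}

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "q1.sparql").write_text("# Who am I?\nSELECT ?me\n")
    docs = list(get_sparql_examples(dir=tmp))
    print(docs)
```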