feat(chat): fetch examples from chromadb for few-shot learning #21

Merged
cmdoret merged 16 commits into main from feat/examples
Mar 1, 2024

supermaxiste
Member

This PR adds the following:

Major:

New ChromaDB collection that includes example questions and corresponding queries. It works as follows:

  • It creates a collection called examples/ where all the example questions are indexed and their corresponding query attached as metadata
  • A new function generate_examples queries the collection and outputs a structured prompt which can be embedded into a template prompt

A practical example. Given a file q1.sparql:

# Who am I?
SELECT ?me

The ChromaDB collection examples/ stores the question Who am I? and attaches SELECT ?me as metadata. When generate_examples is called on a new question (Who am I working with?), ChromaDB looks up the closest stored question(s) and the function returns the following output:

Question:
Who am I?
Answer:
SELECT ?me
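
The lookup-and-format step can be sketched without ChromaDB. This is a minimal illustration, not the PR's implementation: EXAMPLES, closest_examples and format_examples are hypothetical names, the second example entry is made up, and difflib string similarity is only a toy substitute for the embedding search ChromaDB performs.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory stand-in for the examples/ collection:
# each entry pairs an indexed question with its query (the "metadata").
EXAMPLES = [
    {"question": "Who am I?", "query": "SELECT ?me"},
    {"question": "List all datasets.", "query": "SELECT ?dataset"},
]


def closest_examples(question: str, n_results: int = 1) -> list:
    """Rank stored examples by string similarity to the question.

    ChromaDB would compare embeddings; SequenceMatcher is a toy substitute.
    """
    def score(example: dict) -> float:
        return SequenceMatcher(
            None, question.lower(), example["question"].lower()
        ).ratio()

    return sorted(EXAMPLES, key=score, reverse=True)[:n_results]


def format_examples(examples: list) -> str:
    """Render examples in the Question/Answer layout shown above."""
    return "\n".join(
        f"Question:\n{ex['question']}\nAnswer:\n{ex['query']}" for ex in examples
    )


print(format_examples(closest_examples("Who am I working with?")))
```

With these two entries, the stored question Who am I? ranks first and the printed text matches the output above.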

Minor

  • Renamed default collection to schema instead of test
  • Added function to import examples (get_sparql_examples) to io.py as it seemed more fitting there
  • Updated dependencies to have prefect working (see Update prefect version #18 for longer-term solution)
  • Updated ChromaDB API functions

@supermaxiste supermaxiste added the enhancement New feature or request label Jan 25, 2024
@supermaxiste supermaxiste self-assigned this Jan 25, 2024
@cmdoret cmdoret self-requested a review January 25, 2024 12:21
Member

@cmdoret cmdoret left a comment

🙌 Nice work! I have a few suggestions and comments, mostly about keeping functions simple and reusable.

fix: default prompt templates in aikg/config/chat.py need to be adapted to support {examples_str}.

supermaxiste and others added 7 commits February 5, 2024 13:18
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
Co-authored-by: Cyril Matthey-Doret <cyril.matthey-doret@epfl.ch>
@supermaxiste
Member Author

Thank you for all the comments and suggestions @cmdoret 🙌

Summary of relevant or open changes

Based on your suggestion on making the example parser more flexible:

  • I created a parse_sparql_example function that works with a text stream in aikg/utils/io.py
  • I added get_sparql_examples as a task in aikg/flows/chroma_examples.py as you suggested
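
For readers without the diff at hand, a stream-based parser could look roughly like the sketch below. The actual signature of parse_sparql_example is not shown in this conversation, so parse_example here is a hypothetical stand-in. It assumes the first line is a # comment holding the question and that everything after it is the query, as in the q1.sparql example.

```python
import io
from typing import TextIO


def parse_example(stream: TextIO) -> tuple:
    """Split an example into (question, query).

    Assumes the first line is a '#' comment holding the question and
    everything after it is the SPARQL query.
    """
    first_line = stream.readline()
    if not first_line.startswith("#"):
        raise ValueError("expected the first line to be a '# question' comment")
    question = first_line.lstrip("#").strip()
    query = stream.read().strip()
    return question, query


# Any text stream works: an open file, or an in-memory buffer as here.
question, query = parse_example(io.StringIO("# Who am I?\nSELECT ?me\n"))
print(question)
print(query)
```

Taking a stream rather than a path keeps the parser reusable: the flow task can open files and hand them over, while tests can pass an io.StringIO.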

Clarification 1

clarification/question: the new get_sparql_examples function works slightly differently than before.

  • Before: outputs a list with a single Document
  • After: outputs a list with multiple Documents, one per example

Because of this, in chroma_build_examples_flow I had to loop through the output to be able to batch:

# aikg/flows/chroma_examples.py 
# L110-122

    # Create subject documents
    docs = get_sparql_examples(
        dir=chroma_input_dir,
    )

    # Vectorize and index documents by batches to reduce overhead
    logger.info(f"Indexing by batches of {chroma_cfg.batch_size} items")
    embed_counter = 0
    for doc in docs:
        for batch in chunked(doc, chroma_cfg.batch_size):
            embed_counter += len(batch)
            index_batch(batch)
    logger.info(f"Indexed {embed_counter} items.")

Clarification 2

clarification/question: for #21 (review), should I directly change the sparql_template to match the one we used, or would you like to add another template so that we have one with and one without examples?

@cmdoret
Member

cmdoret commented Feb 5, 2024

  • clarification 1: If docs is a list of Documents, we should be able to loop directly on chunked(docs):
In [2]: from more_itertools import chunked

In [3]: docs = [{'data': i, 'meta': v} for i, v in enumerate(['red', 'blue', 'green'])]

In [4]: docs
Out[4]: 
[{'data': 0, 'meta': 'red'},
 {'data': 1, 'meta': 'blue'},
 {'data': 2, 'meta': 'green'}]

In [5]: for batch in chunked(docs, 2):
   ...:     print(f"---\nBATCH: {batch}")
   ...: 
---
BATCH: [{'data': 0, 'meta': 'red'}, {'data': 1, 'meta': 'blue'}]
---
BATCH: [{'data': 2, 'meta': 'green'}]
  • clarification 2: Indeed, examples should be optional. We can change the template directly, but we should set a default value for the examples parameter of generate_sparql so that the user can ignore it and nothing is inserted. Does that make sense?
    • Also, rather than having the Examples: header in the template, it should probably be inserted dynamically if examples is not empty. Otherwise, not providing examples will result in an Examples: header followed directly by another header.
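
A rough sketch of that idea, under stated assumptions: TEMPLATE, build_examples_section and build_prompt are hypothetical names, and the template text is invented for illustration (the real templates live in aikg/config/chat.py). Only the {examples_str} placeholder comes from the review comment above.

```python
# Hypothetical template; only the {examples_str} placeholder is taken from the review.
TEMPLATE = "Schema:\n{schema}\n{examples_str}Question:\n{question}\n"


def build_examples_section(examples: list) -> str:
    """Return an 'Examples:' section, or an empty string when there are none."""
    if not examples:
        return ""
    return "Examples:\n" + "\n".join(examples) + "\n"


def build_prompt(question: str, schema: str, examples: str = "") -> str:
    """Fill the template; `examples` defaults to "" so callers can omit it."""
    return TEMPLATE.format(schema=schema, examples_str=examples, question=question)


# Without examples, no dangling header appears in the prompt.
print(build_prompt("Who am I?", "<schema here>"))

# With examples, the header arrives together with the examples themselves.
section = build_examples_section(["Question:\nWho am I?\nAnswer:\nSELECT ?me"])
print(build_prompt("Who am I working with?", "<schema here>", section))
```

Because the header is produced together with the examples, the template never shows an empty Examples: section.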

@supermaxiste
Member Author

Alright, all changes included as follows:

  • generate_sparql now defaults examples to an empty string; the argument had to be moved after the arguments without defaults
  • when no examples are provided, {str_examples} no longer injects anything into the prompt. The "Examples" header is now produced by the example generation function
  • I changed the loop over Documents back to a loop over chunks sized by the chroma config; within each chunk I loop through each doc
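
One possible shape for the chunk-level loop is sketched below. It is an illustration only: the local chunked helper stands in for more_itertools.chunked so the snippet has no third-party dependency, index_in_batches is a hypothetical name, and the index_batch callback is assumed to accept a whole batch (any per-document handling would happen inside it).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items (stand-in for more_itertools.chunked)."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def index_in_batches(docs: list, batch_size: int, index_batch: Callable) -> int:
    """Pass docs to `index_batch` one chunk at a time; return how many were indexed."""
    indexed = 0
    for batch in chunked(docs, batch_size):
        index_batch(batch)
        indexed += len(batch)
    return indexed


# Made-up documents; a list's append method plays the role of the indexer.
docs = [{"question": f"q{i}", "query": f"SELECT ?x{i}"} for i in range(5)]
batches = []
total = index_in_batches(docs, batch_size=2, index_batch=batches.append)
print(total, [len(b) for b in batches])  # 5 [2, 2, 1]
```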

I tested the code and it runs successfully 👍

Member

@cmdoret cmdoret left a comment

I just have one last question regarding the batch-indexing. Other than that, it looks good! :)

@cmdoret cmdoret self-requested a review February 29, 2024 13:11
Member

@cmdoret cmdoret left a comment

Looks good to me :) Great work 🚀

@cmdoret cmdoret changed the title feat(few-shot-examples): add new chromadb flow to get similar examples feat(chat): fetch examples from chromadb for few-shot learning Mar 1, 2024
@cmdoret cmdoret merged commit ec3618e into main Mar 1, 2024
@cmdoret cmdoret deleted the feat/examples branch April 29, 2024 11:50