Document extractors #79

Closed
pedrocassalpacheco opened this issue Jan 28, 2025 · 3 comments
Labels: bug, good first issue

Comments

@pedrocassalpacheco
Collaborator

I tried using the document extractors by following the instructions in the docstring.

from langchain_community.document_transformers import GLiNEREntityExtractor

extractor = GLiNEREntityExtractor()
documents = extractor.transform_documents(chuncks)


ImportError Traceback (most recent call last)
Cell In[14], line 1
----> 1 from langchain_community.document_transformers import GLiNEREntityExtractor
4 extractor = GLiNEREntityExtractor()
5 documents = extractor.transform_documents(chuncks)

ImportError: cannot import name 'GLiNEREntityExtractor' from 'langchain_community.document_transformers' (/Users/pedropacheco/Projects/dev/tests/.venv/lib/python3.10/site-packages/langchain_community/document_transformers/__init__.py)

Note that __init__.py is empty. I'm fairly sure it needs something like:

..
from .gliner_entity_extractor import GLiNEREntityExtractor

__all__ = ['GLiNEREntityExtractor']

..

@bjchambers
Collaborator

I think there are a few issues:

The doc strings need to be changed, since this is no longer in langchain_community.

Whether they go in __init__.py is a matter of style. If they do, they can be imported with from langchain_graph_retriever.document_transformers import GLiNEREntityExtractor. If not, they can be imported with from langchain_graph_retriever.document_transformers.gliner import GLiNEREntityExtractor. Nothing needs to be in __init__.py; the question is only whether you want to allow importing the class without naming the file. On the other hand, importing it in __init__.py means that anyone who imports anything from document_transformers also imports everything that __init__.py imports. This is a problem when gliner pulls in an optional dependency that may not be installed. We can handle that either by deferring the actual import until GLiNEREntityExtractor is instantiated, or by leaving it out of __init__.py.

So probably two tasks here:

  1. Update the doc strings
  2. Decide whether we want to include them in __init__.py given the above.
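For illustration, the "defer the import until instantiation" option could look roughly like the sketch below. It is not the project's actual code: LazyEntityExtractor and its module_name parameter are hypothetical stand-ins for GLiNEREntityExtractor and its gliner dependency, chosen so the pattern can run without the optional package installed.

```python
import importlib
from typing import Any


class LazyEntityExtractor:
    """Hypothetical stand-in for GLiNEREntityExtractor (not the real class)."""

    def __init__(self, module_name: str = "gliner") -> None:
        # The optional dependency is imported here, at instantiation time,
        # rather than at module import time. Importing the module that defines
        # this class therefore succeeds even when the dependency is missing.
        try:
            self._backend: Any = importlib.import_module(module_name)
        except ImportError as exc:
            # Re-raise with a message that names the class and the missing module.
            raise ImportError(
                f"{module_name!r} must be installed to use {type(self).__name__}."
            ) from exc
```

With this shape, re-exporting the class from __init__.py no longer forces every user of document_transformers to have the optional dependency; only code that actually constructs the extractor needs it.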

@bjchambers added the bug and good first issue labels on Jan 29, 2025
@epinzur
Collaborator

epinzur commented Feb 5, 2025

Some progress made on #113. I'll finish this up tomorrow on another PR.

@epinzur
Collaborator

epinzur commented Feb 11, 2025

This is now closed via #136.

@epinzur closed this as completed on Feb 11, 2025