Document extractors #79

pedrocassalpacheco · 2025-01-28T23:10:31Z

I tried using the document extractors, following the instructions in the docstring.

from langchain_community.document_transformers import GLiNEREntityExtractor

extractor = GLiNEREntityExtractor()
documents = extractor.transform_documents(chuncks)

ImportError Traceback (most recent call last)
Cell In[14], line 1
----> 1 from langchain_community.document_transformers import GLiNEREntityExtractor
4 extractor = GLiNEREntityExtractor()
5 documents = extractor.transform_documents(chuncks)

ImportError: cannot import name 'GLiNEREntityExtractor' from 'langchain_community.document_transformers' (/Users/pedropacheco/Projects/dev/tests/.venv/lib/python3.10/site-packages/langchain_community/document_transformers/__init__.py)

Note that __init__.py is empty. I'm fairly sure it needs something like:

..
from .gliner_entity_extractor import GLiNEREntityExtractor

__all__ = ['GLiNEREntityExtractor']

..

bjchambers · 2025-01-28T23:29:04Z

I think there are a few issues:

The doc strings need to be changed, since this is no longer in langchain_community.

Whether they go in __init__.py is a matter of style. If they go in, they can be imported with from langchain_graph_retriever.document_transformers import GLiNEREntityExtractor. If not, they can be imported with from langchain_graph_retriever.document_transformers.gliner import GLiNEREntityExtractor. Strictly speaking, nothing needs to be in __init__.py; it only determines whether the class can be imported without naming the file. On the other hand, importing it in __init__.py means that anyone who imports anything from document_transformers also imports everything imported by that __init__.py. This can be a problem when gliner pulls in an optional dependency that may not be installed. That can be handled either by deferring the actual import until GLiNEREntityExtractor is instantiated, or by leaving it out of __init__.py.
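For illustration, the deferred-import option could look roughly like the sketch below. This is not the project's actual code: the class name DeferredImportExtractor and the module_name parameter are hypothetical and exist only so the example runs without gliner installed. A real extractor would more likely hard-code the import inside its constructor.

```python
import importlib
from typing import Any


class DeferredImportExtractor:
    """Hypothetical stand-in for an extractor with an optional dependency."""

    def __init__(self, module_name: str = "gliner") -> None:
        # The optional package is imported here, at instantiation time,
        # rather than at the top of the module. Importing the module that
        # defines this class therefore never fails just because the
        # optional package is missing.
        try:
            self._backend: Any = importlib.import_module(module_name)
        except ImportError as exc:
            # Re-raise with a hint about what to install.
            raise ImportError(
                f"'{module_name}' is required to use this extractor. "
                f"Install it with: pip install {module_name}"
            ) from exc
```

With this pattern, re-exporting the class from __init__.py stays cheap: users who never instantiate it never trigger the optional import, and users who do get an error that names the missing package.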

So probably two tasks here:

1. Update the doc strings.
2. Decide whether we want to include them in __init__.py given the above.

epinzur · 2025-02-05T18:44:19Z

Some progress made on #113. I'll finish this up tomorrow on another PR.

epinzur · 2025-02-11T23:23:57Z

this is now closed via #136

bjchambers added the bug and good first issue labels Jan 29, 2025

bjchambers assigned epinzur Feb 3, 2025

epinzur closed this as completed Feb 11, 2025
