dvg is incompatible with CPython 3.11 because some of its dependencies are incompatible. Reference: Python 3.11 Readiness https://pyreadiness.org/3.11/
ℹ️ Released an alpha version of stng (PyPI, GitHub), a CLI tool similar to dvg that uses a Sentence-Transformer model instead. It is heavy for typical PCs, though this depends on GPU performance.
dvg is an off-the-shelf, grep-like tool that performs semantic similarity search on Windows, macOS, and Ubuntu.
Using SCDV models, it searches for document files that contain parts similar to the query. It supports searching within text files (.txt), PDF files (.pdf), and MS Word files (.docx).
In most cases, it can be installed with pip install dvg, but if you want to target PDF files or Japanese documents in addition to English, you need to install an additional option.
→ Installation on Ubuntu / macOS
→ Installation on Windows
Search for the document files similar to the query phrase.
dvg -v -m en <query_phrase> <document_files>...
Each line of output is, from left to right, similarity (the closer the number is to 1, the higher the similarity), length (characters) of the paragraph, file name, and range of line numbers.
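One way to read the similarity column is as a cosine-style score between a query vector and a paragraph vector. The sketch below is only an illustration under that assumption: the exact formula dvg uses is not stated here, and `cosine_similarity` and the toy vectors are hypothetical, not part of dvg.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative only)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example.
query = [1.0, 2.0, 0.0]
near = [2.0, 4.0, 0.0]  # points in the same direction as the query
far = [0.0, 0.0, 3.0]   # orthogonal to the query

print(cosine_similarity(query, near))  # close to 1: very similar
print(cosine_similarity(query, far))   # 0: nothing in common
```

Under this reading, a paragraph whose vector points in nearly the same direction as the query scores close to 1, regardless of how long the paragraph is.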
dvg has several options. Here are some that are likely to be used frequently.
-v, --verbose
Verbose option. If specified, dvg shows the documents with the highest similarity found so far while the search is in progress.
-m MODEL, --model=MODEL
The available models are en (for English documents) and ja (for Japanese documents).
-k NUM, --top-k=NUM
Show the top NUM documents as results. The default value is 20.
Specify 0 to show all the documents searched, sorted by the degree of match to the query.
-p, --paragraph
If this option is specified, each paragraph in one document file will be considered as a document. Multiple paragraphs of a single document file will be output in the search results.
If this option is not specified, one document file will be considered as one document. A single document file will be displayed in the search results only once at most.
-w NUM, --window=NUM
A chunk of this many lines is treated as one paragraph.
The default value is 20.
-f QUERYFILE, --query-file=QUERYFILE
Read the query text from a file.
Like document files, the query file can be a PDF as well as a text file.
(As far as I have tried, when the query is specified as a file, better results tend to be obtained by increasing the paragraph size with the --window option, e.g. -w 80)
-i TEXT, --include=TEXT
Only paragraphs that contain the specified string will be included in the search results.
-e TEXT, --exclude=TEXT
Only paragraphs that do not contain the specified string will be included in the search results.
-l CHARS, --min-length=CHARS
Paragraphs shorter than this value get their similarity values lowered. You can use this to exclude short paragraphs from the search results. The default value is 80.
-t CHARS, --excerpt-length=CHARS
The length of the excerpt displayed in the rightmost column of the search results. The default value is 80.
-q, --quote
Show the entire paragraph (without excerpts) of search results.
-H, --header
Add a heading line to the output.
-j NUM, --worker=NUM
Number of worker processes. Use this option to run the search in parallel.
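To make the interplay of -w, -i, -e, -k, and -t more concrete, here is a minimal Python sketch of the behavior described above. It is not dvg's implementation: all function names are hypothetical, the chunks are assumed to be non-overlapping (dvg may split lines differently), and the scores are made-up numbers rather than real similarity values.

```python
import heapq


def split_into_windows(lines, window=20):
    """Group lines into consecutive chunks of `window` lines (cf. -w).

    Assumption: non-overlapping chunks. Returns tuples of
    (first_line_number, last_line_number, text) with 1-based line numbers.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    chunks = []
    for start in range(0, len(lines), window):
        part = lines[start:start + window]
        chunks.append((start + 1, start + len(part), "\n".join(part)))
    return chunks


def keep_paragraph(text, include=None, exclude=None):
    """Apply -i / -e style substring filters to one paragraph."""
    if include is not None and include not in text:
        return False
    if exclude is not None and exclude in text:
        return False
    return True


def top_k(scored, k=20):
    """Return the k best (score, item) pairs, best first; k == 0 means all (cf. -k)."""
    if k == 0:
        return sorted(scored, key=lambda pair: pair[0], reverse=True)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def excerpt(text, length=80):
    """Cut a paragraph down to at most `length` characters (cf. -t)."""
    return text[:length]


lines = ["alpha", "beta", "gamma", "delta", "epsilon"]
chunks = split_into_windows(lines, window=2)
kept = [c for c in chunks if keep_paragraph(c[2], include="a", exclude="gamma")]
print(kept)

# Made-up scores standing in for similarity values.
scored = [(0.2, "x.txt"), (0.9, "y.txt"), (0.5, "z.txt")]
print(top_k(scored, k=2))
print(top_k(scored, k=0))
print(excerpt("a rather long paragraph of text", length=8))
```

The filters run before ranking in this sketch, so -k counts only paragraphs that passed -i and -e. Whether dvg applies them in the same order is an assumption.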
→ Search individual lines of a text file
→ An error message like: "ModuleNotFoundError: No module named 'docopt'"
→ An error message like: "dvg: command not found"
→ A warning message "None of PyTorch, TensorFlow >= 2.0, or Flax have been found..."
→ Aborted by segmentation fault (SIGSEGV)
Thanks to Wikipedia for releasing a huge corpus of languages:
https://dumps.wikimedia.org/
dvg is distributed under the BSD-2 license.
- PyPI page https://pypi.org/project/dvg/
- D. Mekala et al., "SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations," https://arxiv.org/abs/1612.06778
- Change the PDF text extraction tool to Ghostscript for easier installation on Windows