Okapi BM25

A Python implementation of the BM25 ranking function for file retrieval.

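Given a query Q containing keywords q1,...,qn, the BM25 score of a document D is, in its standard form,

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}$$

where f(qi, D) is the frequency of qi in D, |D| is the length of D in words, avgdl is the average document length in the collection, and k1 and b are free parameters. The IDF weight of the query term qi is computed as:

$$\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where N is the total number of documents in the collection and n(qi) is the number of documents containing qi.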
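As a worked illustration of the scoring function, the sketch below computes the contribution of a single query term to one document's score. It assumes the commonly used form of BM25 with IDF = log((N - n + 0.5) / (n + 0.5)); the function names `idf` and `term_score` and the defaults k1 = 1.5 and b = 0.75 are illustrative choices, not taken from this repository.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """IDF of a term found in doc_freq out of num_docs documents (assumed classic form)."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.5, b=0.75):
    """Contribution of one query term to the BM25 score of one document."""
    # Length normalisation: equals 1 for an average-length document,
    # grows for longer documents and shrinks for shorter ones.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + k1 * norm)

# A term occurring 3 times in an average-length document,
# and present in 10 of the 1000 documents in the collection.
print(term_score(tf=3, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=10))
```

The full score of a document is the sum of `term_score` over all query terms. Note that with this form of IDF, a term that occurs in more than half of the documents receives a negative weight.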

Implementation

There are two main modules:

QueryParser parses the query into a list of terms.

BuildIndex builds an inverted index and computes the scores of the documents according to the BM25 ranking function.

  • process_files: processes the corpus files to produce a dictionary
  • index_one_file & regular_index: map each word to its positions in the corresponding document
  • inverted_index: returns a dictionary keyed by word, whose values are dictionaries mapping each filename to the word's positions in that file
  • inverse_df: returns a dictionary mapping each word to its IDF
  • docLen and avgdocl: compute the length of each document and the average document length in the text collection, respectively
  • BM25scores: returns the BM25 scores of the documents
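The steps listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the corpus is an in-memory dictionary rather than files on disk, the tokenizer is a simple regular expression, and all function names (`tokenize`, `build_inverted_index`, `doc_lengths`, `inverse_df`, `bm25_scores`) and the defaults k1 = 1.5, b = 0.75 are hypothetical.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Return word -> {filename -> [positions of the word in that file]}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(pos)
    return index

def doc_lengths(docs):
    """Return ({filename -> length in tokens}, average length)."""
    lengths = {name: len(tokenize(text)) for name, text in docs.items()}
    return lengths, sum(lengths.values()) / len(lengths)

def inverse_df(index, num_docs):
    """Return word -> IDF, using log((N - n + 0.5) / (n + 0.5))."""
    return {w: math.log((num_docs - len(files) + 0.5) / (len(files) + 0.5))
            for w, files in index.items()}

def bm25_scores(query, index, idf, lengths, avg_len, k1=1.5, b=0.75):
    """Return filename -> BM25 score for documents matching at least one query term."""
    scores = defaultdict(float)
    for term in tokenize(query):
        for name, positions in index.get(term, {}).items():
            tf = len(positions)  # term frequency = number of recorded positions
            norm = 1 - b + b * lengths[name] / avg_len
            scores[name] += idf[term] * tf * (k1 + 1) / (tf + k1 * norm)
    return dict(scores)

# Toy corpus standing in for the files that would be read from disk.
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog sleeps",
    "c.txt": "a quick quick rabbit",
    "d.txt": "birds sing at dawn",
    "e.txt": "rain falls on hills",
}
index = build_inverted_index(docs)
lengths, avg_len = doc_lengths(docs)
idf = inverse_df(index, len(docs))
scores = bm25_scores("quick fox", index, idf, lengths, avg_len)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, scores[name])
```

Storing positions rather than bare counts keeps the term frequency available as the length of each position list, and leaves room for phrase or proximity queries later.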