Implement Reverse Indexing in C++ #1

Closed
pedrobiqua opened this issue Oct 2, 2024 · 2 comments · Fixed by #2 or #13
Labels: C++, enhancement, good first issue

Comments

@pedrobiqua
Collaborator

We need to implement a reverse index (more commonly called an inverted index) in C++ to speed up document retrieval in our search engine. The index associates each word with the list of documents or pages in which it appears, which makes keyword-based search efficient.

Tasks:

  • Define the Data Structure:
    Use an efficient structure to store the index (e.g., std::unordered_map or std::map), where the key is a word and the value is a list of documents/pages.

  • Document Parsing:
    Implement a function to process documents or web pages, tokenizing them into words and populating the reverse index.
    Remove punctuation and normalize the text to lowercase.

  • Update the Index:
    Implement logic to update the index as new documents are added or removed.

  • Search Query:
    Implement a function that, given a word, returns the corresponding documents/pages using the reverse index.

  • Testing:
    Create unit tests to ensure that the index works correctly and that queries return the expected results.
    Test with different dataset sizes to evaluate performance.
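The first four tasks above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the names `ReverseIndex`, `DocId`, `add_document`, `remove_document`, `search`, and `tokenize` are all hypothetical, documents are assumed to be identified by an integer, and the tokenizer only understands ASCII (in the default C locale, any other byte acts as a separator).

```cpp
#include <cctype>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical names throughout; not the project's actual API.
using DocId = int;

class ReverseIndex {
public:
    // Tokenize `text` and record that each word occurs in document `id`.
    void add_document(DocId id, const std::string& text) {
        for (const std::string& word : tokenize(text)) {
            postings_[word].insert(id);
            words_by_doc_[id].insert(word);
        }
    }

    // Remove every posting that belongs to document `id`.
    void remove_document(DocId id) {
        auto doc = words_by_doc_.find(id);
        if (doc == words_by_doc_.end()) return;  // unknown document
        for (const std::string& word : doc->second) {
            auto entry = postings_.find(word);
            if (entry == postings_.end()) continue;
            entry->second.erase(id);
            // Drop words that no longer appear in any document.
            if (entry->second.empty()) postings_.erase(entry);
        }
        words_by_doc_.erase(doc);
    }

    // Return the sorted ids of the documents containing `word`.
    // The query is normalized the same way as indexed text.
    std::vector<DocId> search(const std::string& word) const {
        std::vector<std::string> tokens = tokenize(word);
        if (tokens.size() != 1) return {};  // single-word queries only
        auto entry = postings_.find(tokens.front());
        if (entry == postings_.end()) return {};
        return std::vector<DocId>(entry->second.begin(), entry->second.end());
    }

    // Lowercase ASCII letters and digits; every other byte
    // (punctuation, whitespace, non-ASCII) ends the current token.
    static std::vector<std::string> tokenize(const std::string& text) {
        std::vector<std::string> tokens;
        std::string current;
        for (unsigned char c : text) {
            if (std::isalnum(c)) {
                current.push_back(static_cast<char>(std::tolower(c)));
            } else if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        }
        if (!current.empty()) tokens.push_back(current);
        return tokens;
    }

private:
    // word -> ids of the documents that contain it
    std::unordered_map<std::string, std::set<DocId>> postings_;
    // document id -> words it contributed, so removal avoids a full scan
    std::unordered_map<DocId, std::unordered_set<std::string>> words_by_doc_;
};
```

With this sketch, indexing "Hello, World!" as document 1 and "hello again" as document 2 makes `search("HELLO")` return both ids, and calling `remove_document(1)` afterwards leaves only document 2 for that word. Keeping a second map from document to words is one possible design choice: it costs extra memory but lets removal touch only the affected postings.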

Requirements:

  • Familiarity with STL data structures (maps and lists).
  • Basic knowledge of string manipulation and text processing in C++.


pedrobiqua added a commit that referenced this issue Oct 11, 2024
Add a comment where the reverse index implementation will be called
pedrobiqua added a commit that referenced this issue Dec 10, 2024
@pedrobiqua
Collaborator Author

The following tasks still need to be completed:

  • Implement the Python interface using Cython.
  • Fix bugs in handling Latin characters with diacritics, such as ç and letters carrying a tilde (~), grave (`), or acute (') accent.
  • Improve preprocessing by adding features like stop word removal.
  • Handle hyphenated words properly.
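As a rough idea of what the preprocessing items in this list could look like, here is a hedged sketch. The function name `preprocess` and its stop-word parameter are hypothetical, the input is assumed to be UTF-8, and the handling of accented characters is deliberately naive: bytes outside ASCII are kept as part of the word but are not lowercased, so uppercase accented letters would still need real Unicode case folding.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not the project's actual API.
// Splits `text` into lowercase tokens, keeps hyphenated words whole,
// and drops any token found in `stop_words`.
std::vector<std::string> preprocess(
    const std::string& text,
    const std::unordered_set<std::string>& stop_words) {
    std::vector<std::string> tokens;
    std::string current;

    // Finish the current token: trim a trailing hyphen, filter stop words.
    auto flush = [&]() {
        while (!current.empty() && current.back() == '-') current.pop_back();
        if (!current.empty() && stop_words.count(current) == 0) {
            tokens.push_back(current);
        }
        current.clear();
    };

    for (unsigned char c : text) {
        if (c >= 0x80) {
            // Assumption: UTF-8 input. Keep multi-byte characters
            // (e.g. ç, ã) inside the word, unchanged.
            current.push_back(static_cast<char>(c));
        } else if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (c == '-' && !current.empty() && current.back() != '-') {
            // A single hyphen between word characters stays in the token.
            current.push_back('-');
        } else {
            flush();
        }
    }
    flush();
    return tokens;
}
```

For example, with the stop words "the", "of", and "a", the text "The state-of-the-art index -- a test!" yields the tokens `state-of-the-art`, `index`, and `test`. Stop words are matched against whole tokens, so the "of" and "the" inside the hyphenated word are kept.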

@pedrobiqua pedrobiqua added the enhancement and good first issue labels Dec 10, 2024
@pedrobiqua pedrobiqua assigned pedrobiqua and unassigned pedrobiqua Dec 10, 2024
@pedrobiqua pedrobiqua reopened this Dec 10, 2024
@pedrobiqua pedrobiqua linked a pull request Dec 10, 2024 that will close this issue
pedrobiqua added a commit that referenced this issue Dec 11, 2024
pedrobiqua added a commit that referenced this issue Dec 11, 2024
@pedrobiqua
Collaborator Author

pedrobiqua commented Dec 17, 2024

To-do list:

  • Implement the Python interface using Cython.

@pedrobiqua pedrobiqua linked a pull request Dec 17, 2024 that will close this issue
16 tasks
@pedrobiqua pedrobiqua added the C++ label Dec 24, 2024
1 participant