[feat] Add Preprocessing Module with RSPL Stemmer Algorithm for Portuguese Language #39

Closed
pedrobiqua opened this issue Jan 10, 2025 · 0 comments · Fixed by #41
Labels
C++ enhancement New feature or request

Comments

@pedrobiqua
Collaborator

Feature Request

Description

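Introduce a preprocessing module to the search_engine library that implements the RSPL Stemmer algorithm. The RSPL Stemmer (published under the name RSLP, Removedor de Sufixos da Língua Portuguesa) is a well-known rule-based stemmer for Portuguese that strips suffixes in 8 steps. By reducing inflected forms of a word to a common stem, it lets a query match documents that use a different inflection, which typically improves recall in information retrieval over Portuguese documents.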

This feature would improve the library's ability to process and normalize text data in Portuguese, making it more versatile for multilingual search engine applications.

Proposed Solution

Implement the RSPL Stemmer algorithm in C++ within the preprocessing module of the library. The new module should:

  1. Accept text input in Portuguese.
  2. Normalize the text through the 8 steps of the RSPL Stemmer.
  3. Return the stemmed version of the input text for further indexing or querying.

Example usage (in C++):

#include <string>
#include "preprocessing.h"  // header proposed in this issue

int main() {
    // Example of stemming a Portuguese sentence, word by word
    std::string input_text = "correrá correndo correu";
    std::string stemmed_text = Preprocessing::stemRSPL(input_text);
    // Expected output: "corr corr corr"
    return 0;
}

Alternatives Considered

  • Snowball Stemmer: Snowball provides stemmers for many languages, including Portuguese, but it may not be as effective as RSPL for Portuguese because RSPL's rules were written specifically around Portuguese morphology.
  • Lemmatization: Though more accurate, lemmatization requires additional resources such as a morphological dictionary, which makes it more expensive in memory and processing time.

Additional Context

The RSPL algorithm is described in the following reference:

V. M. Orengo and C. Huyck, "A Stemming Algorithm for the Portuguese Language", SPIRE 2001.

This implementation aligns with the library's goal of providing efficient and accurate tools for text preprocessing and search engine optimization.

@pedrobiqua pedrobiqua added enhancement New feature or request C++ labels Jan 10, 2025
@pedrobiqua pedrobiqua self-assigned this Jan 10, 2025
@pedrobiqua pedrobiqua linked a pull request Jan 19, 2025 that will close this issue
16 tasks