FuzzyMatcher

The FuzzyMatcher component matches OCR-extracted book spine text against a reference database of book titles. It uses RapidFuzz fuzzy string matching to tolerate OCR errors, text variations, and partial matches.

Overview

This component processes the output of the Text Extractor module, matching each extracted text segment against a reference database of book titles and authors. Candidates are scored with a token set ratio, so word-order variations, minor OCR errors, and formatting differences do not prevent a successful match.
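To make the token set idea concrete, here is a minimal standard-library sketch of that style of comparison. It is not RapidFuzz's implementation: `difflib.SequenceMatcher` stands in for the underlying string similarity, and the names `similarity` and `token_set_score` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity between two strings (stdlib stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def token_set_score(text: str, candidate: str) -> float:
    """Compare two strings by their sets of words, ignoring order and repeats."""
    tokens_a = set(text.lower().split())
    tokens_b = set(candidate.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0

    # Split the words into those both strings share and those unique to each.
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    combined_a = f"{shared} {only_a}".strip()
    combined_b = f"{shared} {only_b}".strip()

    # The best of the three pairings is the final score.
    return max(
        similarity(shared, combined_a),
        similarity(shared, combined_b),
        similarity(combined_a, combined_b),
    )


# Word order does not matter: both strings contain the same set of words.
print(token_set_score("Asimov Isaac Foundation", "Foundation Isaac Asimov"))

# A single misread character ("l" for "i") still scores highly.
print(token_set_score("Foundatlon Asimov", "Asimov Foundation"))
```

Note that in this scheme a string whose words are a subset of the other's scores 1.0, which is why a short spine fragment can still match a longer title.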

Core Features

The matcher provides:

Configurable confidence thresholds for both OCR and match quality
Adjustable maximum matches per text segment
Normalized scoring system (0.0 to 1.0)
Text preprocessing to handle punctuation and case variations
Detailed logging of the matching process
Structured JSON output for downstream processing
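The sketch below shows one way these features could fit together: preprocessing, an OCR confidence cutoff, a match score cutoff, and a cap on matches per segment. It is an illustration under assumptions, not the component's actual code. The `(text, confidence)` segment shape and the names `preprocess` and `find_matches` are hypothetical, and `difflib.SequenceMatcher` stands in for the RapidFuzz scorer so the example runs with the standard library alone.

```python
import string
from difflib import SequenceMatcher


def preprocess(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.lower().translate(table).split())


def find_matches(
    segments,                  # hypothetical shape: list of (text, ocr_confidence)
    titles,                    # reference titles to match against
    min_ocr_confidence=0.1,
    min_match_score=0.8,
    max_matches=3,
):
    """Return {segment text: [{"title", "score"}, ...]} with best matches first."""
    results = {}
    for text, confidence in segments:
        # Skip segments the OCR step was not confident about.
        if confidence < min_ocr_confidence:
            continue

        query = preprocess(text)
        scored = []
        for title in titles:
            # Stand-in scorer; it already returns a value in 0.0-1.0.
            score = SequenceMatcher(None, query, preprocess(title)).ratio()
            if score >= min_match_score:
                scored.append({"title": title, "score": round(score, 2)})

        # Keep only the highest-scoring candidates for this segment.
        scored.sort(key=lambda match: match["score"], reverse=True)
        results[text] = scored[:max_matches]
    return results


segments = [("FOUNDATION.", 0.9), ("xyz", 0.05)]
print(find_matches(segments, ["Foundation", "Dune"]))
```

The low-confidence segment `"xyz"` is dropped before any matching happens, so it never appears in the results.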

Usage

The matcher can be run directly using Poetry:

poetry run fuzzy-matcher

For programmatic use within your Python code:

from bookshelf_scanner import FuzzyMatcher

matcher = FuzzyMatcher(
    min_ocr_confidence = 0.1,  # Minimum confidence for OCR results
    min_match_score    = 0.8,  # Minimum score to consider a match valid
    max_matches        = 3     # Maximum number of matches per text segment
)
matcher.match_books()

Configuration

In addition to the matching thresholds, the constructor accepts custom paths for the reference database, the OCR results, and the output file:

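from pathlib import Path

from bookshelf_scanner import FuzzyMatcher

matcher = FuzzyMatcher(
    reference_db_path  = Path("path/to/custom/database.duckdb"),  # Reference database of titles and authors
    ocr_results_path   = Path("path/to/custom/results.json"),     # Output of the Text Extractor module
    output_file        = Path("path/to/custom/output.json"),      # Where the match results are written
    max_matches        = 3,    # Maximum number of matches per text segment
    min_ocr_confidence = 0.1,  # Minimum confidence for OCR results
    min_match_score    = 0.8   # Minimum score to consider a match valid
)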

Output Format

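The matcher writes a JSON file keyed by image name. Each entry lists the extracted texts and the matches found for them, with scores normalized to the range 0.0 to 1.0: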

{
    "bookshelf_image_1.jpg": {
        "texts": [
            "Foundation",
            "Isaac Asimov"
        ],
        "matches": [
            {
                "title": "Foundation",
                "author": "Isaac Asimov",
                "score": 0.95
            },
            {
                "title": "Foundation and Empire",
                "author": "Isaac Asimov",
                "score": 0.85
            }
        ]
    }
}
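For downstream processing, a file in this format can be consumed with the standard `json` module. The following is a small sketch that picks the highest-scoring match per image. The helper name `best_matches` is hypothetical, and the sample string simply repeats the structure above. In practice you would load the file at your configured output path with `json.load` instead.

```python
import json

# Sample data mirroring the documented output structure.
sample_output = """
{
    "bookshelf_image_1.jpg": {
        "texts": ["Foundation", "Isaac Asimov"],
        "matches": [
            {"title": "Foundation", "author": "Isaac Asimov", "score": 0.95},
            {"title": "Foundation and Empire", "author": "Isaac Asimov", "score": 0.85}
        ]
    }
}
"""


def best_matches(results: dict) -> dict:
    """Map each image name to its highest-scoring match, or None if it has none."""
    best = {}
    for image_name, entry in results.items():
        matches = entry.get("matches", [])
        best[image_name] = max(matches, key=lambda m: m["score"]) if matches else None
    return best


results = json.loads(sample_output)
for image_name, match in best_matches(results).items():
    if match is None:
        print(f"{image_name}: no match")
    else:
        print(f"{image_name}: {match['title']} by {match['author']} ({match['score']:.2f})")
```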

Dependencies

This module requires:

DuckDB for database operations
RapidFuzz for fuzzy string matching
Project utilities for logging and path management
