A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features

xamgore/segtok

segtok

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features. Ported from the Python package (no longer maintained), with the contractions bug fixed.

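use segtok::{segmenter::*, tokenizer::*};

fn main() {
    // Embed the sample text into the binary at compile time.
    let input = include_str!("../tests/test_google.txt");

    // Split the text into sentence spans, then tokenize each span and
    // split contractions into separate tokens.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}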
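
To give a feel for what "rule-based" and "orthographic features" mean here, the following is a minimal sketch that uses only the standard library. It is not segtok's rule set or API: `naive_split` is a hypothetical helper that applies a single rule, splitting after `.`, `!` or `?` only when the terminal is followed by whitespace and then an uppercase letter.

```rust
/// Hypothetical illustration, not part of segtok: split `text` after a
/// sentence terminal when the following text starts with whitespace and
/// the next non-whitespace character is uppercase.
fn naive_split(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0; // byte offset where the current sentence begins

    for (i, c) in text.char_indices() {
        if !matches!(c, '.' | '!' | '?') {
            continue;
        }
        let end = i + c.len_utf8();
        let rest = &text[end..];

        // Orthographic cues: whitespace right after the terminal,
        // then an uppercase letter opening the next sentence.
        let space_follows = rest.chars().next().map_or(false, char::is_whitespace);
        let upper_follows = rest
            .trim_start()
            .chars()
            .next()
            .map_or(false, char::is_uppercase);

        if space_follows && upper_follows {
            sentences.push(text[start..end].trim());
            start = end;
        }
    }

    // Whatever remains after the last split is the final sentence.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

With this single rule, the period in `2.5` is left alone because no whitespace follows it, but an abbreviation such as `Dr. Smith` would still be split incorrectly. Handling cases like that takes a larger rule set, which is what a dedicated segmenter provides.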