Chinese Words Segmentation

Dataset

  • 36,497 Chinese news articles in total, collected from the Internet

Purpose

Unlike English, Chinese has no spaces between words. This project aims to implement Chinese word segmentation without a dictionary.

  • develop a Chinese word segmentation algorithm based on entropy
  • find new Chinese words (which are not in the corpus)

Method

Chinese Words Segmentation

Internal Aggregation

  • see WordInfo.calculateAggregation()
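This README does not spell out the formula behind WordInfo.calculateAggregation(). As a rough sketch only, one common definition of internal aggregation compares the probability of a candidate word with the product of the probabilities of its two parts, and takes the minimum over all split points. The helper names count_ngrams and aggregation, and the use of the total character count as the denominator, are assumptions for illustration and may differ from the project's code.

```python
from collections import Counter


def count_ngrams(text, max_len):
    """Count every substring of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def aggregation(word, counts, total_chars):
    """Assumed definition: min over all binary splits of
    P(word) / (P(left) * P(right)).

    Probabilities are approximated as count / total_chars.
    A single-character word has no split, so this returns infinity.
    """
    p_word = counts[word] / float(total_chars)
    best = float("inf")
    for k in range(1, len(word)):
        p_left = counts[word[:k]] / float(total_chars)
        p_right = counts[word[k:]] / float(total_chars)
        best = min(best, p_word / (p_left * p_right))
    return best


text = "ababab"
counts = count_ngrams(text, 2)
score_ab = aggregation("ab", counts, len(text))
score_ba = aggregation("ba", counts, len(text))
```

A higher score suggests the characters co-occur more often than chance would predict, so the string is more likely to be a real word.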

External Collocation Richness

  • see WordInfo.calculateEntropy()
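The exact computation in WordInfo.calculateEntropy() is not shown here. A minimal sketch, assuming that external collocation richness is measured as the Shannon entropy of the characters that appear immediately to the left and right of a candidate, with the smaller of the two values used as the score. The function names entropy and neighbor_entropy are hypothetical.

```python
import math
from collections import Counter


def entropy(neighbor_counts):
    """Shannon entropy (natural log) of a Counter of neighbor characters."""
    total = float(sum(neighbor_counts.values()))
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in neighbor_counts.values())


def neighbor_entropy(text, word):
    """Assumed score: min of the left-neighbor and right-neighbor entropy."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        # Move one character forward so overlapping matches are counted.
        start = text.find(word, start + 1)
    return min(entropy(left), entropy(right))


varied = neighbor_entropy("xabyzabwqabx", "ab")  # three distinct neighbors on each side
fixed = neighbor_entropy("cabdcabd", "ab")       # always preceded by "c"
```

A candidate that appears in many different contexts gets a high score, while a string that always follows the same character is more likely a fragment of a longer word.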

Word Frequency

If a word's frequency is too high, it may be a stop word.
If a word's frequency is too low, it may not be a word at all.
  • In this project, I set FREQ_MIN = 5 and FREQ_MAX = 10. Words whose frequency falls between FREQ_MIN and FREQ_MAX are new-word candidates.
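The frequency filter can be sketched as below. FREQ_MIN and FREQ_MAX come from this README; the function name filter_by_frequency and the choice to treat both bounds as inclusive are assumptions, since the README does not say whether the limits themselves are kept.

```python
FREQ_MIN = 5
FREQ_MAX = 10


def filter_by_frequency(counts, freq_min=FREQ_MIN, freq_max=FREQ_MAX):
    """Keep candidates whose count lies in [freq_min, freq_max] (inclusive, assumed)."""
    return {word: c for word, c in counts.items()
            if freq_min <= c <= freq_max}


# Hypothetical counts, for illustration only.
example_counts = {"a": 4, "b": 5, "c": 10, "d": 11}
candidates = filter_by_frequency(example_counts)
```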

Usage

  • Set INPUT_FILE in wordSegment.py to the path of your input file
  • Run with Python 2