Commit

write new readme
neurlang authored and Your Name committed Feb 9, 2025
1 parent 26597e0 commit 97f78bf
Showing 1 changed file with 42 additions and 68 deletions.
110 changes: 42 additions & 68 deletions dicts/README.md
@@ -14,25 +14,26 @@ Create a `dirty.tsv` text file in the language folder. This is a TSV file that s

Example `dirty.tsv` content (Romanian language):

```
frumos fruˈmos
mâncare mɨnˈkare
apă ˈapə
om om
femeie feˈmeje
dragoste ˈdraɡoste
copil koˈpil
floare ˈfloare
pădure pəˈdure
soare ˈsoare
```
0 | TAB | 1
---------|-|----------
frumos | | fruˈmos
mâncare | | mɨnˈkare
apă | | ˈapə
om | | om
femeie | | feˈmeje
dragoste | | ˈdraɡoste
copil | | koˈpil
floare | | ˈfloare
pădure | | pəˈdure
soare | | ˈsoare

In case there are multiple possible IPA pronunciations for a specific language word, use multiple rows in `dirty.tsv` (Romanian language):

```
înțelege ɨntseˈledʒe
înțelege ɨntseˈleʒe
```
0 | TAB | 1
----------|-|----------
înțelege | | ɨntseˈledʒe
înțelege | | ɨntseˈleʒe
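
The two-column layout above can be checked before running any of the repository scripts. The following is a minimal sketch, not part of the repository: it assumes each row of `dirty.tsv` is exactly `word<TAB>IPA`, and the function name `load_pronunciations` is hypothetical. It groups repeated words so that multiple pronunciations stay together.

```python
from collections import defaultdict


def load_pronunciations(tsv_text):
    """Group IPA pronunciations by word from dirty.tsv-style text.

    Assumption: every non-blank row is `word<TAB>ipa` (two columns).
    """
    pronunciations = defaultdict(list)
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(columns)}"
            )
        word, ipa = columns
        if ipa not in pronunciations[word]:
            pronunciations[word].append(ipa)  # keep first-seen order
    return dict(pronunciations)


# Rows taken from the Romanian examples above.
sample = "om\tom\nînțelege\tɨntseˈledʒe\nînțelege\tɨntseˈleʒe\n"
entries = load_pronunciations(sample)
```

A row that was typed with spaces instead of a tab raises `ValueError` with the offending line number, which is usually the quickest way to find a malformed entry.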


## language.json

@@ -80,40 +81,32 @@ Run `./clean_language.sh romanian` in Git Bash, replacing "romanian" with your d

After this, the `clean.tsv` file will appear in your language folder. Check the number of rows. A significant majority of the rows from the original `dirty.tsv` should be aligned in your `clean.tsv` file:

```
f r u m o s f r u m o s
m â n c a r e m ɨ n k a r e
a p ă a p ə
```
0 | 1 | 2 | 3 | 4 | 5 | 6 | TAB | 0 | 1 | 2 | 3 | 4 | 5 | 6
--|---|---|---|---|---|---|-----|---|---|---|---|---|---|---
f | r | u | m | o | s | | | f | r | u | m | o | s |
m | â | n | c | a | r | e | | m | ɨ | n | k | a | r | e
a | p | ă | | | | | | a | p | ə | | | |
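
To check the number of rows, a rough sketch like the one below may help. It assumes, based only on the sample rows above, that each `clean.tsv` row holds space-separated letters, a tab, then space-separated phonemes; the function name `aligned_share` is hypothetical and the real file may contain cases this does not handle.

```python
def aligned_share(dirty_words, clean_lines):
    """Return the fraction of distinct dirty.tsv words found in clean.tsv.

    Assumption: column 0 of clean.tsv is the word with its letters
    separated by single spaces, as in the sample rows.
    """
    clean_words = set()
    for line in clean_lines:
        if not line.strip():
            continue  # ignore blank lines
        letters, _phonemes = line.rstrip("\n").split("\t")
        clean_words.add(letters.replace(" ", ""))  # "f r u m o s" -> "frumos"
    dirty = set(dirty_words)
    if not dirty:
        return 0.0
    return len(dirty & clean_words) / len(dirty)


dirty_words = ["frumos", "mâncare", "apă", "om"]
clean_lines = [
    "f r u m o s\tf r u m o s",
    "m â n c a r e\tm ɨ n k a r e",
    "a p ă\ta p ə",
]
share = aligned_share(dirty_words, clean_lines)  # 3 of 4 words aligned
```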

### coverage.sh

If more than 90% of the words are aligned, you can proceed. If not, run `study_language.sh` again.
1. Navigate to the `cmd/backtest` directory.
2. Run `./coverage.sh romanian`
3. If more than 90% of the words are covered, you can proceed. If not, run
`study_language.sh` and then `clean_language.sh` again.
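
The 90% rule can be written down as a small check. This is only an illustration of the threshold: the counts are hypothetical inputs, `can_proceed` is not a function in the repository, and how `coverage.sh` reports its numbers is not assumed here.

```python
def can_proceed(covered_words, total_words, threshold=0.90):
    """Return True when strictly more than `threshold` of the words are covered."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return covered_words / total_words > threshold


ready = can_proceed(931, 1000)       # hypothetical counts: 93.1% covered
needs_study = not can_proceed(850, 1000)  # 85% covered, study further
```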

## Train phonemizer
## train_language.sh

The prerequisite for this step is the `clean.tsv` file.

1. Checkout this repo: `https://github.com/neurlang/classifier`
2. Navigate to the `cmd/train_phonemizer` subdirectory.
3. Compile the program using `go build`.
4. Run train_phonemizer with `-cleantsv PATH_TO_YOUR_CLEAN_TSV_FILE`:
`./train_phonemizer -cleantsv ../../../goruut/dicts/romanian/clean.tsv`

The algorithm will run for a while. After each retraining of the hashtron network,
files with the pattern `output.*.json.t.lzw` will start appearing.
The number (`*`) means the percentage of how successful the resulting model is.

I got a number of files:
* `output.60.json.t.lzw`
* `output.71.json.t.lzw`
* `output.88.json.t.lzw`
* `output.80.json.t.lzw`
* `output.93.json.t.lzw`
4. Run `train_language.sh romanian`

The `output.93.json.t.lzw` is the best file as its success rate is 93%.
The algorithm will run for a while. After each improvement of the hashtron network,
a file named `weights1.json.lzw` is saved in your language's folder.

5. Move the `output.93.json.t.lzw` into the language dir and rename it to
`weights1.json.lzw`.
6. Delete the files with lower success rates.
To resume training later, use `train_language.sh romanian -resume`
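
If training is driven from a script, the choice between a fresh run and a resumed one could be sketched as below. This is an assumption-laden illustration: it assumes the script is invoked as `./train_language.sh` from its own directory, and it assumes that an existing `weights1.json.lzw` is a sensible signal to pass `-resume`. The helper `train_command` is hypothetical.

```python
from pathlib import Path


def train_command(language, language_dir):
    """Build the argument list for train_language.sh.

    Assumption: if weights1.json.lzw already exists in the language
    folder, the user wants to continue training with -resume.
    """
    command = ["./train_language.sh", language]
    if (Path(language_dir) / "weights1.json.lzw").exists():
        command.append("-resume")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the directory that contains the script.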

## Adding the glue code (language.go)

@@ -127,34 +120,15 @@ The `output.93.json.t.lzw` is the best file as its success rate is 93%.
* Add new case to the switch statement:
* `case "UserFriendlyLanguageName":`
* `return yourlanguage.Language.ReadFile(lzw(filename))`
* In the second function, `LangName`, add the langname that corresponds to your
`UserFriendlyLanguageName` and to the actual folder.

## Testing the model
## Backtesting the model

1. Navigate to the `cmd/goruut` directory.
1. Navigate to the `cmd/backtest` directory.
2. Recompile using `go build`
3. Run it, pointing it to the default config file: `./goruut -configfile ../../configs/config.json`
4. Issue an HTTP POST REQUEST:

POST http://127.0.0.1:18080/tts/phonemize/sentence
```json
{
"Language": "Norwegian",
"Sentence": "bjornson"
}
```
You should see a response like:
```json
{
"Words": [
{
"CleanWord": "bjornson",
"Linguistic": "bjornson",
"Phonetic": "bjɔɳsɔn"
}
]
}
```
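
The request above can also be issued from a script. The sketch below uses only the Python standard library; the URL and the `Language`, `Sentence`, `Words` and `Phonetic` field names come from the example above, while the `Content-Type` header and the function names are assumptions made for illustration.

```python
import json
import urllib.request

PHONEMIZE_URL = "http://127.0.0.1:18080/tts/phonemize/sentence"


def build_phonemize_request(language, sentence, url=PHONEMIZE_URL):
    """Build the POST request; sending a JSON Content-Type is an assumption."""
    payload = json.dumps({"Language": language, "Sentence": sentence})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def phonemize(language, sentence):
    """Send the request to a locally running server and decode the JSON reply."""
    request = build_phonemize_request(language, sentence)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def phonetic_forms(response_body):
    """Extract the Phonetic field of every word in a response."""
    return [word["Phonetic"] for word in response_body["Words"]]
```

With the server running, `phonetic_forms(phonemize("Norwegian", "bjornson"))` returns the phonetic form of each word in the sentence.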

You can test words that were included in your `clean.tsv`, as only those will work.
Furthermore, if your phonemizer model has a success rate below 100%, some
words from `clean.tsv` may not work.
3. Run the backtest, providing the parameter `-langname`
4. Example: `./backtest -langname romanian`
5. It will run for a while, printing the end-to-end success rate on words
in your language.
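
As an illustration of what an end-to-end success rate on words means, the sketch below counts a word as a success when the predicted IPA matches any accepted pronunciation. This is not how `backtest` is implemented; `success_rate` and the `predict` callable are hypothetical names used for the example.

```python
def success_rate(expected, predict):
    """Fraction of words whose predicted IPA is an accepted pronunciation.

    expected: dict mapping each word to a set of accepted IPA strings.
    predict:  callable taking a word and returning one IPA string
              (a stand-in for a trained phonemizer).
    """
    if not expected:
        raise ValueError("expected must contain at least one word")
    hits = sum(1 for word, accepted in expected.items() if predict(word) in accepted)
    return hits / len(expected)
```

A word with several rows in `dirty.tsv` counts as correct if the model produces any one of them.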
