write new readme

neurlang · Feb 9, 2025 · 97f78bf
1 parent 26597e0
commit 97f78bf
Showing 1 changed file with 42 additions and 68 deletions.
diff --git a/dicts/README.md b/dicts/README.md
@@ -14,25 +14,26 @@ Create a `dirty.tsv` text file in the language folder. This is a TSV file that s
 
 Example `dirty.tsv` content (Romanian language):
 
-```
-frumos     fruˈmos
-mâncare    mɨnˈkare
-apă        ˈapə
-om         om
-femeie     feˈmeje
-dragoste   ˈdraɡoste
-copil      koˈpil
-floare     ˈfloare
-pădure     pəˈdure
-soare      ˈsoare
-```
+0 | TAB | 1
+---------|-|----------
+frumos   | | fruˈmos
+mâncare  | | mɨnˈkare
+apă      | | ˈapə
+om       | | om
+femeie   | | feˈmeje
+dragoste | | ˈdraɡoste
+copil    | | koˈpil
+floare   | | ˈfloare
+pădure   | | pəˈdure
+soare    | | ˈsoare
 
 In case there are multiple possible IPA pronunciations for a specific language word, use multiple rows in `dirty.tsv` (Romanian language):
 
-```
-înțelege   ɨntseˈledʒe
-înțelege   ɨntseˈleʒe
-```
+0 | TAB | 1
+----------|-|----------
+înțelege  | | ɨntseˈledʒe
+înțelege  | | ɨntseˈleʒe
+
 
 ## language.json
 
@@ -80,40 +81,32 @@ Run `./clean_language.sh romanian` in Git Bash, replacing "romanian" with your d
 
 After this, the `clean.tsv` file will appear in your language folder. Check the number of rows. A significant majority of the rows from the original `dirty.tsv` should be aligned in your `clean.tsv` file:
 
-```
-f r u m o s   f r u m o s
-m â n c a r e m ɨ n k a r e
-a p ă        a p ə
-```
+0 | 1 | 2 | 3 | 4 | 5 | 6 | TAB | 0 | 1 | 2 | 3 | 4 | 5 | 6
+--|---|---|---|---|---|---|-----|---|---|---|---|---|---|---
+f | r | u | m | o | s |   |     | f | r | u | m | o | s |
+m | â | n | c | a | r | e |     | m | ɨ | n | k | a | r | e
+a | p | ă |   |   |   |   |     | a | p | ə |   |   |   |
+
+### coverage.sh
 
-If more than 90% of the words are aligned, you can proceed. If not, you are recommended to run study_language.sh further.
+1. Navigate to the `cmd/backtest` directory.
+2. Run `./coverage.sh romanian`
+3. If more than 90% of the words are covered, you can proceed. If not, rerunning
+   `study_language.sh` and then `clean_language.sh` is recommended.
 
-## Train phonemizer
+## train_language.sh
 
 The prerequisite for this step is the clean.tsv file.
 
 1. Checkout this repo: `https://github.com/neurlang/classifier`
 2. Navigate to the `cmd/train_phonemizer` subdirectory.
 3. Compile the program using `go build`.
-4. Run train_phonemizer with `-cleantsv PATH_TO_YOUR_CLEAN_TSV_FILE`:
-   `./train_phonemizer -cleantsv ../../../goruut/dicts/romanian/clean.tsv`
-
-The algorithm will run for a while. After each retraining of the hashtron network,
-files with the pattern `output.*.json.t.lzw` will start appearing.
-The number (`*`) means the percentage of how successful the resulting model is.
-
-I got a number of files:
-* `output.60.json.t.lzw`
-* `output.71.json.t.lzw`
-* `output.88.json.t.lzw`
-* `output.80.json.t.lzw`
-* `output.93.json.t.lzw`
+4. Run `train_language.sh romanian` 
 
-The `output.93.json.t.lzw` is the best file as its success rate is 93%.
+The algorithm will run for a while. After each improvement of the hashtron network,
+a file named `weights1.json.lzw` is written to your language's folder.
 
-5. Move the `output.93.json.t.lzw` into the language dir and rename it to
-   `weights1.json.lzw`.
-6. Delete the files with lower success rates.
+To resume training later, use `train_language.sh romanian -resume` 
 
 ## Adding the glue code (language.go)
 
@@ -127,34 +120,15 @@ The `output.93.json.t.lzw` is the best file as its success rate is 93%.
 * Add new case to the switch statement:
   * `case "UserFriendlyLanguageName":`
   * `return yourlanguage.Language.ReadFile(lzw(filename))`
+* In the second function, `LangName`, add an entry that maps your
+  UserFriendlyLanguageName to the actual folder name.
 
-## Testing the model
+## Backtesting the model
 
-1. Navigate to the `cmd/goruut` directory.
+1. Navigate to the `cmd/backtest` directory.
 2. Recompile using `go build`
-3. Run it, pointing it to the default config file: `./goruut -configfile ../../configs/config.json`
-4. Issue an HTTP POST REQUEST:
-
-POST http://127.0.0.1:18080/tts/phonemize/sentence
-```json
-{
-    "Language": "Norwegian",
-    "Sentence": "bjornson"
-}
-```
-You should see a response like:
-```json
-{
-	"Words": [
-		{
-			"CleanWord": "bjornson",
-			"Linguistic": "bjornson",
-			"Phonetic": "bjɔɳsɔn"
-		}
-	]
-}
-```
-
-You can test words that were included in your `clean.tsv`, as only those will work.
-Furthermore, if your phonemizer model does have less than 100% success rate, some
-words from `clean.tsv` may not work.
+3. Run the backtest, passing the `-langname` parameter.
+4. Example: `./backtest -langname romanian`
+5. It will run for a while, printing the end-to-end success rate on words
+   in your language.
+