0.0.5alpha LASLA+

PonteIneptique · Dec 2, 2020 · 8697a87
1 parent 9b47ab5
Showing 5 changed files with 92 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
+*.tar
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

diff --git a/LASLA-plus.md b/LASLA-plus.md
@@ -0,0 +1,82 @@
+Model LASLA+
+============
+
+## lemma
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9708   | 0.8237    | 0.8186 | 169819  |
+| known-tokens     | 0.9753   | 0.8906    | 0.8884 | 161865  |
+| unknown-tokens   | 0.8793   | 0.7416    | 0.7377 | 7954    |
+| ambiguous-tokens | 0.9207   | 0.6793    | 0.6802 | 42700   |
+| unknown-targets  | 0.6087   | 0.4403    | 0.4389 | 1104    |
+
+
+## pos
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9585   | 0.9425    | 0.9288 | 169819  |
+| known-tokens     | 0.9613   | 0.9461    | 0.9337 | 161865  |
+| unknown-tokens   | 0.9009   | 0.7065    | 0.6273 | 7954    |
+| ambiguous-tokens | 0.8944   | 0.8848    | 0.8424 | 52025   |
+
+
+## Gend
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9583   | 0.8983    | 0.8976 | 169819  |
+| known-tokens     | 0.9606   | 0.901     | 0.9016 | 161865  |
+| unknown-tokens   | 0.9134   | 0.8568    | 0.8329 | 7954    |
+| ambiguous-tokens | 0.8607   | 0.8608    | 0.8607 | 40191   |
+
+
+## Numb
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9708   | 0.9705    | 0.9675 | 169819  |
+| known-tokens     | 0.9723   | 0.972     | 0.9689 | 161865  |
+| unknown-tokens   | 0.9385   | 0.9156    | 0.9099 | 7954    |
+| ambiguous-tokens | 0.9041   | 0.9053    | 0.893  | 39600   |
+
+## Case
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9204   | 0.8878    | 0.7958 | 169819  |
+| known-tokens     | 0.9229   | 0.8953    | 0.8016 | 161865  |
+| unknown-tokens   | 0.868    | 0.6393    | 0.7006 | 7954    |
+| ambiguous-tokens | 0.8277   | 0.8465    | 0.7429 | 64272   |
+
+
+## Deg
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9797   | 0.9708    | 0.968  | 169819  |
+| known-tokens     | 0.9817   | 0.9725    | 0.972  | 161865  |
+| unknown-tokens   | 0.9375   | 0.9359    | 0.9052 | 7954    |
+| ambiguous-tokens | 0.9171   | 0.9227    | 0.9257 | 29785   |
+
+
+## Person
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.997    | 0.9877    | 0.9759 | 169819  |
+| known-tokens     | 0.9978   | 0.9892    | 0.9815 | 161865  |
+| unknown-tokens   | 0.9809   | 0.9753    | 0.9432 | 7954    |
+| ambiguous-tokens | 0.978    | 0.94      | 0.915  | 10188   |
+
+
+## Dis
+
+|                  | accuracy | precision | recall | support |
+|------------------|----------|-----------|--------|---------|
+| all              | 0.9711   | 0.8789    | 0.8639 | 169819  |
+| known-tokens     | 0.9722   | 0.8797    | 0.8665 | 161865  |
+| unknown-tokens   | 0.9492   | 0.6889    | 0.5681 | 7954    |
+| ambiguous-tokens | 0.908    | 0.8573    | 0.8487 | 43148   |
+
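A quick way to sanity-check the tables in LASLA-plus.md: the `known-tokens` and `unknown-tokens` rows should partition the `all` row, so their supports should add up and the overall accuracy should be the support-weighted mean of the two subset accuracies. The sketch below checks this for the `lemma` table only, with figures copied from the diff; the dictionary layout and the `weighted_accuracy` helper are illustrative names, not part of the repository. Precision and recall are left out because they would not combine this way if they are macro-averaged over classes, which is an assumption here.

```python
# Figures copied from the "lemma" table of LASLA-plus.md.
subsets = {
    "known-tokens": {"accuracy": 0.9753, "support": 161865},
    "unknown-tokens": {"accuracy": 0.8793, "support": 7954},
}
reported_all = {"accuracy": 0.9708, "support": 169819}


def weighted_accuracy(rows):
    """Return (support-weighted accuracy, total support) for disjoint subsets."""
    total = sum(row["support"] for row in rows.values())
    correct = sum(row["accuracy"] * row["support"] for row in rows.values())
    return correct / total, total


combined_accuracy, combined_support = weighted_accuracy(subsets)

print(combined_support)             # 169819
print(round(combined_accuracy, 4))  # 0.9708
```

Both values match the reported `all` row, which suggests the two subsets are disjoint and cover the whole test set.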
diff --git a/README.md b/README.md
@@ -4,6 +4,10 @@
 
 Repository for LASLA Latin models: the models were fine-tuned by Thibault Clérice, data are based on LASLA data but some adaptation might be found. 
 
+## Download models
+
+[Check latest release, under assets](https://github.com/PonteIneptique/latin-lasla-models/releases/latest)
+
 ## Information about the model
 
 *Note:* the model is currently being fine-tuned in the context of my PhD. I'll fill this part when it will be done.
@@ -17,6 +21,11 @@ The training set is roughly **1.5M tokens**, dev test roughly 10k and test 16982
 - All punctuation signs are unknown, including the one used in abbr. `token[C]` == `lemma[Gaius]`
 - Lemma and tokens now accept lower and uppercasing. Noise was introduced in the dataset for better results.
 
+### Model LASLA+ (model-plus.tar)
+
+The model LASLA+ is trained on additional data, creating some noise in the original dataset and resulting in apparently worse results on classical data (approximately -0.3%). Its results are 
+detailed in [LASLA-plus.md](LASLA-plus.md).
+
 ## Scores
 
 ### Table of Content

diff --git a/model-vulgate.tar b/model-vulgate.tar
diff --git a/model.tar b/model.tar
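The README change in this commit points readers to the assets of the latest release. A minimal sketch of fetching one of them from Python follows. It relies on GitHub's `releases/latest/download/<asset>` URL pattern and assumes the release assets carry the file names seen in this commit (`model-plus.tar` from the README heading, `model.tar`, `model-vulgate.tar`); that naming is an assumption, and `latest_asset_url` and `download_model` are hypothetical helpers, not part of the repository.

```python
from urllib.request import urlretrieve

# Repository slug taken from the release link added to the README.
REPO = "PonteIneptique/latin-lasla-models"


def latest_asset_url(asset_name: str, repo: str = REPO) -> str:
    """Build the download URL of an asset attached to the latest release."""
    return f"https://github.com/{repo}/releases/latest/download/{asset_name}"


def download_model(asset_name: str = "model-plus.tar") -> str:
    """Download the asset into the current directory and return its local path.

    The default asset name is an assumption based on the README heading.
    """
    path, _headers = urlretrieve(latest_asset_url(asset_name), asset_name)
    return path
```

Calling `download_model()` needs network access and fails with an HTTP error if the latest release has no asset with that name, in which case the release page lists the actual file names.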