0.0.5alpha LASLA+
PonteIneptique committed Dec 2, 2020
1 parent 9b47ab5 commit 8697a87
Showing 5 changed files with 92 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
*.tar
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
82 changes: 82 additions & 0 deletions LASLA-plus.md
@@ -0,0 +1,82 @@
Model LASLA+
============

## lemma

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9708 | 0.8237 | 0.8186 | 169819 |
| known-tokens | 0.9753 | 0.8906 | 0.8884 | 161865 |
| unknown-tokens | 0.8793 | 0.7416 | 0.7377 | 7954 |
| ambiguous-tokens | 0.9207 | 0.6793 | 0.6802 | 42700 |
| unknown-targets | 0.6087 | 0.4403 | 0.4389 | 1104 |


## pos

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9585 | 0.9425 | 0.9288 | 169819 |
| known-tokens | 0.9613 | 0.9461 | 0.9337 | 161865 |
| unknown-tokens | 0.9009 | 0.7065 | 0.6273 | 7954 |
| ambiguous-tokens | 0.8944 | 0.8848 | 0.8424 | 52025 |


## Gend

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9583 | 0.8983 | 0.8976 | 169819 |
| known-tokens | 0.9606 | 0.901 | 0.9016 | 161865 |
| unknown-tokens | 0.9134 | 0.8568 | 0.8329 | 7954 |
| ambiguous-tokens | 0.8607 | 0.8608 | 0.8607 | 40191 |


## Numb

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9708 | 0.9705 | 0.9675 | 169819 |
| known-tokens | 0.9723 | 0.972 | 0.9689 | 161865 |
| unknown-tokens | 0.9385 | 0.9156 | 0.9099 | 7954 |
| ambiguous-tokens | 0.9041 | 0.9053 | 0.893 | 39600 |

## Case

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9204 | 0.8878 | 0.7958 | 169819 |
| known-tokens | 0.9229 | 0.8953 | 0.8016 | 161865 |
| unknown-tokens | 0.868 | 0.6393 | 0.7006 | 7954 |
| ambiguous-tokens | 0.8277 | 0.8465 | 0.7429 | 64272 |


## Deg

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9797 | 0.9708 | 0.968 | 169819 |
| known-tokens | 0.9817 | 0.9725 | 0.972 | 161865 |
| unknown-tokens | 0.9375 | 0.9359 | 0.9052 | 7954 |
| ambiguous-tokens | 0.9171 | 0.9227 | 0.9257 | 29785 |


## Person

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.997 | 0.9877 | 0.9759 | 169819 |
| known-tokens | 0.9978 | 0.9892 | 0.9815 | 161865 |
| unknown-tokens | 0.9809 | 0.9753 | 0.9432 | 7954 |
| ambiguous-tokens | 0.978 | 0.94 | 0.915 | 10188 |


## Dis

| | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all | 0.9711 | 0.8789 | 0.8639 | 169819 |
| known-tokens | 0.9722 | 0.8797 | 0.8665 | 161865 |
| unknown-tokens | 0.9492 | 0.6889 | 0.5681 | 7954 |
| ambiguous-tokens | 0.908 | 0.8573 | 0.8487 | 43148 |
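
The `known-tokens` and `unknown-tokens` rows appear to partition the test set: their supports add up to the `all` support (161865 + 7954 = 169819). If that reading is right, the `all` accuracy should be the support-weighted mean of the two subset accuracies. The sketch below checks this for the lemma and pos tables, using values copied from the tables above. The helper name `recombined_accuracy` is hypothetical, and the partition is an assumption rather than something this file states. Precision and recall are left out because they do not recombine this way if they are averaged per class.

```python
# Values copied from the lemma and pos tables: (accuracy, support).
ROWS = {
    "lemma": {
        "all": (0.9708, 169819),
        "known-tokens": (0.9753, 161865),
        "unknown-tokens": (0.8793, 7954),
    },
    "pos": {
        "all": (0.9585, 169819),
        "known-tokens": (0.9613, 161865),
        "unknown-tokens": (0.9009, 7954),
    },
}


def recombined_accuracy(rows):
    """Support-weighted mean of the known and unknown accuracies.

    Assumes known-tokens and unknown-tokens are disjoint and together
    cover the whole test set.
    """
    known_acc, known_n = rows["known-tokens"]
    unknown_acc, unknown_n = rows["unknown-tokens"]
    return (known_acc * known_n + unknown_acc * unknown_n) / (known_n + unknown_n)


for task, rows in ROWS.items():
    reported_acc, reported_n = rows["all"]
    subset_n = rows["known-tokens"][1] + rows["unknown-tokens"][1]
    print(task, subset_n == reported_n, round(recombined_accuracy(rows), 4), reported_acc)
```

Small differences in the last digit are expected, since the inputs are already rounded to four decimals.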

9 changes: 9 additions & 0 deletions README.md
@@ -4,6 +4,10 @@

Repository for LASLA Latin models. The models were fine-tuned by Thibault Clérice; the data are based on LASLA data, with some adaptations.

## Download models

[Check latest release, under assets](https://github.com/PonteIneptique/latin-lasla-models/releases/latest)
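
As a minimal sketch, the latest release can also be queried through the GitHub REST API, which lists each release asset with a `name` and a `browser_download_url`. The helper names `find_asset_url` and `download_asset` are hypothetical, and whether a given archive is actually attached to the latest release is an assumption to check on the release page.

```python
import json
import urllib.request

# GitHub REST API endpoint for the latest release of this repository.
API_URL = (
    "https://api.github.com/repos/"
    "PonteIneptique/latin-lasla-models/releases/latest"
)


def find_asset_url(release, asset_name):
    """Return the download URL of the named asset, or None if it is absent."""
    for asset in release.get("assets", []):
        if asset.get("name") == asset_name:
            return asset["browser_download_url"]
    return None


def download_asset(asset_name, destination):
    """Fetch the latest release metadata and download one asset to disk."""
    with urllib.request.urlopen(API_URL) as response:
        release = json.load(response)
    url = find_asset_url(release, asset_name)
    if url is None:
        raise FileNotFoundError(f"{asset_name} not found in the latest release")
    urllib.request.urlretrieve(url, destination)
    return destination
```

For example, `download_asset("model-plus.tar", "model-plus.tar")` would save that archive to the working directory, assuming an asset with that name is attached to the latest release.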

## Information about the model

*Note:* the model is currently being fine-tuned in the context of my PhD. I'll fill in this part once that is done.
@@ -17,6 +21,11 @@ The training set is roughly **1.5M tokens**, dev test roughly 10k and test 16982
- All punctuation signs are unknown, including the one used in abbr. `token[C]` == `lemma[Gaius]`
- Lemma and tokens now accept lower and uppercasing. Noise was introduced in the dataset for better results.

### Model LASLA+ (model-plus.tar)

The LASLA+ model is trained on additional data, which adds some noise to the original dataset and results in apparently worse scores on classical data (approximately -0.3%). Its results are
detailed in [LASLA-plus.md](LASLA-plus.md).

## Scores

### Table of Content
Binary file removed model-vulgate.tar
Binary file not shown.
Binary file removed model.tar
Binary file not shown.
