Merge pull request #5 from esrel/dev
Dev
esrel authored Jan 27, 2023
2 parents 668e504 + ed5d5f5 commit e78eb36
Showing 67 changed files with 6,033 additions and 35,100 deletions.
268 changes: 134 additions & 134 deletions README.md
@@ -15,10 +15,9 @@ Each dialog (file) is stored as a JSON file that has the following structure:
{
"DOC_ID": "numeric part of a filename",
"tokens": "flat list of tokens",
"blocks": "list of token start & end indices for blocks from parser",
"chunks": "list of token start & end indices for blocks in text file (tab-separated)",
"groups": "list of token start & end indices for groups in text file (nl-separated)",
"relations": "list of relations"
"blocks": "list of token start & end indices for blocks in text file (tab-separated)",
"groups": "list of token start & end indices for groups in text file (newline-separated)",
"relations": "list of discourse relations"
}
```

@@ -28,46 +27,38 @@ For example (reduced):
{
"DOC_ID": "0703000001",
"tokens": [
"helpdesk",
"buongiorno",
"sono",
"<PER>"
],
"blocks": [
[0, 1],
[1, 4]
],
"chunks": [
[0, 4]
],
"groups": [
[0, 4]
"helpdesk", "buongiorno", "sono", "<PER>",
"s\u00ec", "sono", "<PER>", "un", "collega",
"ho", "il",
"PC",
"che", "presumibilmente", "non", "funziona", "da",
"s\u00ec", "stamattina"
],
"blocks": [[0, 4], [4, 9], [9, 11], [11, 12], [12, 17]],
"groups": [[0, 4], [4, 17]],
"relations": [
{
"label": "Explicit",
"sense": [{"connective": null, "sense": "Expansion.Restatement.Equivalence"}],
"conn": [[59, 60]],
"arg1": [[5, 7]],
"arg2": [[60, 63]],
"label": "Implicit",
"sense": "Expansion.Conjunction",
"conns": "e",
"conn": [],
"arg1": [[5, 9]],
"arg2": [[9, 17], [18, 19]],
"sup1": [],
"sup2": []
},
{
"label": "Implicit",
"sense": [
{"connective": "poi", "class": "Temporal.Asynchronous"},
{"connective": "e", "class": "Expansion.Conjunction"}
],
"conn": [],
"arg1": [[20, 31]],
"arg2": [[31, 32], [33, 38]],
"label": "Explicit",
"sense": "Expansion.Restatement.Equivalence",
"conn": [[59, 60]],
"arg1": [[5, 7]],
"arg2": [[60, 63]],
"sup1": [],
"sup2": []
},
{
"label": "AltLex",
"sense": [{"connective": null, "sense": "Expansion.Restatement"}],
"sense": "Expansion.Restatement",
"conn": [[159, 161]],
"arg1": [[141, 144], [151, 154], [169, 171]],
"arg2": [[157, 164]],
@@ -87,7 +78,8 @@ import typing as t
class DiscourseRelation:
# label(s)
label: str # type
sense: t.List[t.Dict[str, str]] = None
sense: str # relation sense
conns: str # connective string (for Implicit)
# spans
conn: t.List[t.Tuple[int, int]] = None
arg1: t.List[t.Tuple[int, int]] = None
@@ -100,42 +92,22 @@ class Dialog:
doc_id: str
tokens: t.List[str]
blocks: t.List[t.Tuple[int, int]] = None
chunks: t.List[t.Tuple[int, int]] = None
groups: t.List[t.Tuple[int, int]] = None
relations: t.List[DiscourseRelation] = None
```

All spans (for a **connective**, **Arg1** and **Arg2**, **Sup1** and **Sup2**) are lists of start & end indices
with respect to `tokens`.
## Spans

A relation can have several senses.
Each sense has the `connective` & `sense` fields; where `connective` field is only populated for Implicit relations.
A Discourse Relation can contain 5 spans: a discourse **connective** (`conn`),
its **arguments** (`arg1` and `arg2`), and supplementary materials to the arguments (`sup1` and `sup2`).
Each span can be composed of 0 or more non-adjacent segments.
Consequently, all spans are lists of start & end indices with respect to `tokens`;
e.g. `[[141, 144], [151, 154], [169, 171]]`.
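
As a minimal sketch of how spans map back to `tokens`, the helpers below slice out each segment of a span and join them. `span_tokens` and `span_text` are hypothetical names, not part of this repository, and the end index is assumed to be exclusive; that is inferred from the contiguous `blocks` in the reduced example, not stated in this README.

```python
import typing as t

Span = t.List[t.Tuple[int, int]]


def span_tokens(tokens: t.List[str], span: Span) -> t.List[t.List[str]]:
    """Return the tokens of each segment (ASSUMPTION: end index is exclusive)."""
    return [tokens[start:end] for start, end in span]


def span_text(tokens: t.List[str], span: Span, sep: str = " ... ") -> str:
    """Join segments with a visible separator that marks the gaps between them."""
    return sep.join(" ".join(segment) for segment in span_tokens(tokens, span))


# Tokens and spans copied from the reduced example above.
tokens = [
    "helpdesk", "buongiorno", "sono", "<PER>",
    "s\u00ec", "sono", "<PER>", "un", "collega",
    "ho", "il", "PC",
    "che", "presumibilmente", "non", "funziona", "da",
    "s\u00ec", "stamattina",
]
arg1 = [[5, 9]]
arg2 = [[9, 17], [18, 19]]

print(span_text(tokens, arg1))
print(span_text(tokens, arg2))
```

An empty span (e.g. `conn` of an `Implicit` relation) yields an empty string under this sketch.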

## Anonymization
## LUNA Relation Types (Labels)

The data has been anonymized at token-level using the following conversions:

| Replacement | Freq | Description |
|:--------------|-----:|:------------------------------------------------|
| `<NUM>` | 337 | number-words; e.g. `duomilasei` |
| `<ORD>` | 29 | ordinals; e.g. `quarto` |
| `<DIGIT>` | 740 | digit-words; e.g. `due` |
| `<CHAR>` | 86 | letter; e.g. `C` |
| `<PUNC>` | 18 | punctuation; e.g. `barra` |
| `<WORD>` | 11 | a word to be masked; e.g. password, spelling |
| `<CHARS>` | 5 | a sequence of letters (abbreviation); e.g. `SG` |
| `<BRAND>` | 36 | brands (hardware); e.g. `Fujitsu` |
| `<SW>` | 159 | software; e.g. `Windows` |
| `<PER>` | 278 | person names; e.g. `Monica` |
| `<ORG>` | 54 | named organizations; e.g. `CSI` |
| `<LOC>` | 126 | locations; e.g. `Italia` |
| `<LOC.SPELL>` | 25 | locations for spelling; e.g. `Ancona` |
| `<WD>` | 13 | week days; e.g. `domenica` |
| `<MM>` | 13 | month names; e.g. `gennaio` |
| `<MISC>` | 2 | other; not covered above |


## Relation Types
Since LUNA follows the PDTB format, the Discourse Relation types are the same.
The distribution is given below.

| Type | ALL | TRN | DEV | TST |
|:---------|------:|------:|------:|------:|
@@ -145,46 +117,57 @@ The data has been anonymized at token-level using the following conversions:
| EntRel | 56 | 33 | 7 | 16 |


## Relation Senses as Sense 1
## LUNA Relation Senses

### Additional Senses
A Discourse Relation can have several senses, depending on its Relation Type:

- Discourse Marker
- Interrupted
- Repetition
- `Explicit` relations can have at most 2 senses.
- `Implicit` relations can have up to 4 senses: 2 connectives with 2 senses each.
- `AltLex` relations follow the same rule as `Explicit` relations.
- `EntRel` relations have no senses.

The observed sense counts are the following:

### Sense Counts
- `0` - no sense (errors)
- `1s` - 1 sense
- `2s` - 2 senses
- `2c` - 2 connectives, 1 sense each

LUNA (and PDTB) Discourse Relations Senses are 3 level:
e.g. `Comparison.Concession.Epistemic concession`.
It is often the case that relations are annotated up to certain level;
i.e. not all relations have all 3 levels.
| Type | ALL | 0 | 1s | 2s | 2c |
|:---------|------:|----:|------:|----:|----:|
| Explicit | 1,052 | 4 | 1,045 | 3 | NA |
| Implicit | 490 | 3 | 481 | 3 | 3 |
| AltLex | 11 | 1 | 10 | NA | NA |
| EntRel | 56 | NA | NA | NA | NA |

`EntRel` has no senses.

There are 3 *Implicit* discourse relations that contain 2nd connective and its sense;
the counts below do not include those:
### Relation Sense Selection

- Comparison.Concession.Semantic concession (1)
- Expansion.Conjunction (1)
Since the number of discourse relations having a second sense is very small
(3 `Explicit` & 3 `Implicit` with a second sense and 3 `Implicit` with a second connective),
all the discourse relations have been "simplified" to have exactly 1 sense (or 0, if missing).

There are 3 *Explicit* and 3 *Implicit* discourse relations that contain 2 senses
(for the 1st connective).

- Explicit
- Comparison
- Expansion.Conjunction
- Temporal.Asynchronous

- Implicit
- Comparison.Concession
- Contingency.Cause
- Contingency.Goal
In case more than 1 sense is available, the selected sense is the first one.
For `Implicit` 2 connective relations it is the 1st sense of the 1st connective.
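
The selection rule can be sketched as below. This is an illustration only, not the script used to produce the data: `select_sense` is a hypothetical name, the input is assumed to be the pre-simplification list of `{"connective": ..., "sense": ...}` entries, and list order is assumed to put the 1st sense of the 1st connective first.

```python
import typing as t

SenseEntry = t.Dict[str, t.Optional[str]]


def select_sense(senses: t.Optional[t.List[SenseEntry]]) -> t.Optional[str]:
    """Return the first annotated sense, or None when no sense is available."""
    if not senses:
        return None
    first = senses[0]
    # ASSUMPTION: the older examples store the sense under "sense" or "class".
    return first.get("sense") or first.get("class")


two_connectives = [
    {"connective": "poi", "class": "Temporal.Asynchronous"},
    {"connective": "e", "class": "Expansion.Conjunction"},
]
print(select_sense(two_connectives))
```

Relations with no sense at all map to `None`, which corresponds to the `0` column of the sense-count table.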

### Relation Sense Levels
LUNA (and PDTB) Discourse Relation Senses have 3 or more levels,
e.g. `Comparison.Concession.Epistemic concession`.
Relations are often annotated only up to a certain level;
i.e. not all relations have all 3 levels.
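
Because levels are separated by `.`, a sense can be truncated to a chosen level with a simple split. The helper below is a minimal sketch; `sense_at_level` is a hypothetical name, and it assumes that `.` never occurs inside a level name (names may contain spaces, as in `Epistemic concession`).

```python
import typing as t


def sense_at_level(sense: t.Optional[str], level: int) -> t.Optional[str]:
    """Truncate a dotted sense string to at most `level` levels."""
    if level < 1:
        raise ValueError("level must be 1 or greater")
    if not sense:
        return None  # missing sense stays missing
    # ASSUMPTION: "." is used only as the level separator.
    return ".".join(sense.split(".")[:level])


full = "Comparison.Concession.Epistemic concession"
print(sense_at_level(full, 1))
print(sense_at_level(full, 2))
```

Senses annotated to fewer levels than requested, such as the single-level `Interrupted`, are returned unchanged.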

#### Level 1 Senses

PDTB has 4 Level 1 senses: `Comparison`, `Contingency`, `Expansion` and `Temporal`.
LUNA adds 3 more which have only 1 level:

- `Discourse Marker`
- `Interrupted`
- `Repetition`

While `Interrupted` and `Repetition` senses are quite frequent, `Discourse Marker` appears only once.


| Sense | Explicit | Implicit | AltLex |
|:-----------------|---------:|---------:|-------:|
| Comparison | 187 | 47 | 0 |
@@ -199,32 +182,11 @@ There are 3 *Explicit* and 3 *Implicit* discourse relations that contain 2 sense

#### Level 2 Senses

| Sense | Explicit | Implicit | AltLex |
|:------------------------|---------:|---------:|-------:|
| Comparison (no L2) | 1 | 0 | 0 |
| Comparison.Concession | 144 | 27 | 0 |
| Comparison.Contrast | 42 | 20 | 0 |
| Contingency (no L2) | 1 | 0 | 0 |
| Contingency.Cause | 265 | 88 | 2 |
| Contingency.Condition | 124 | 8 | 1 |
| Contingency.Goal | 73 | 10 | 1 |
| Expansion (no L2) | 1 | 0 | 0 |
| Expansion.Alternative | 28 | 3 | 1 |
| Expansion.Conjunction | 111 | 70 | 1 |
| Expansion.Instantiation | 8 | 3 | 1 |
| Expansion.Restatement | 65 | 85 | 3 |
| Temporal (no L2) | 0 | 0 | 0 |
| Temporal.Asynchronous | 128 | 55 | 3 |
| Temporal.Synchrony | 28 | 9 | 3 |
| Interrupted | 29 | 1 | 0 |
| Repetition | 0 | 108 | 0 |
| Discourse Marker | 1 | 0 | 0 |
| MISSING | 4 | 3 | 1 |
Even though most relations have a level 2 sense, a relation can have a level 1 sense only.


#### Level 3+ Senses

Level 3 senses are better ignored due to data sparsity.
The 3rd level further categorizes L2 relations into the following types:
(as `Comparison.Concession.Epistemic concession`, `Contingency.Cause.Semantic cause`, etc.).
Refer to Tonelli et al. (2010) for further detail.
@@ -236,37 +198,75 @@ Refer to Tonelli et al. (2010) for further detail.
- Semantic
- Speech act

`Expansion.Restatement` is further categorized into:
- `Expansion.Restatement.Equivalence`
- `Expansion.Restatement.Specification`

`Temporal` sense has no 3rd level, i.e. only

## Known Issues, Peculiarities and TODOs
- `Temporal.Asynchronous`
- `Temporal.Synchrony`

- `070400_0020`: `conn` and `arg2` spans overlap in `Explicit` relation
`Expansion.Restatement` on level 3 is further categorized into:

- 2 sense relations (6):
- `Expansion.Restatement.Equivalence`
- `Expansion.Restatement.Specification`

- Relation Types:
- Explicit: 3
- Implicit: 3

- IDs
- `0704000014`: 1
- `0704000015`: 1
- `0704000040`: 1
- `0705000004`: 1
- `0705230003`: 1
- `0705230007`: 1
#### Sense Counts

- 2 connective relations (3):
The table below contains sense counts as they appear in the data.

- Relation Types
- Implicit: 3

- IDs
- `0703000001`: 2
- `0704000001`: 1
| Sense | Explicit | Implicit | AltLex |
|:------------------------------------|---------:|---------:|-------:|
| Comparison (no L2) | 1 | 0 | 0 |
| Comparison.Concession | 144 | 27 | 0 |
| Comparison.Contrast | 42 | 20 | 0 |
| Contingency (no L2) | 1 | 0 | 0 |
| Contingency.Cause | 265 | 88 | 2 |
| Contingency.Condition | 124 | 8 | 1 |
| Contingency.Goal | 73 | 10 | 1 |
| Expansion (no L2) | 1 | 0 | 0 |
| Expansion.Alternative | 28 | 3 | 1 |
| Expansion.Conjunction | 111 | 70 | 1 |
| Expansion.Instantiation | 8 | 3 | 1 |
| Expansion.Restatement (no L3) | 4 | 8 | 1 |
| Expansion.Restatement.Equivalence | 25 | 22 | 0 |
| Expansion.Restatement.Specification | 36 | 55 | 2 |
| Temporal (no L2) | 0 | 0 | 0 |
| Temporal.Asynchronous | 128 | 55 | 3 |
| Temporal.Synchrony | 28 | 9 | 3 |
| Interrupted | 29 | 1 | 0 |
| Repetition | 0 | 108 | 0 |
| Discourse Marker | 1 | 0 | 0 |
| MISSING | 4 | 3 | 1 |


## Anonymization

The data has been anonymized at token-level using the following conversions:

| Replacement | Freq | Description |
|:--------------|-----:|:------------------------------------------------|
| `<NUM>` | 337 | number-words; e.g. `duomilasei` |
| `<ORD>` | 29 | ordinals; e.g. `quarto` |
| `<DIGIT>` | 740 | digit-words; e.g. `due` |
| `<CHAR>` | 86 | letter; e.g. `C` |
| `<PUNC>` | 18 | punctuation; e.g. `barra` |
| `<WORD>` | 11 | a word to be masked; e.g. password, spelling |
| `<CHARS>` | 5 | a sequence of letters (abbreviation); e.g. `SG` |
| `<BRAND>` | 36 | brands (hardware); e.g. `Fujitsu` |
| `<SW>` | 159 | software; e.g. `Windows` |
| `<PER>` | 278 | person names; e.g. `Monica` |
| `<ORG>` | 54 | named organizations; e.g. `CSI` |
| `<LOC>` | 126 | locations; e.g. `Italia` |
| `<LOC.SPELL>` | 25 | locations for spelling; e.g. `Ancona` |
| `<WD>` | 13 | week days; e.g. `domenica` |
| `<MM>` | 13 | month names; e.g. `gennaio` |
| `<MISC>` | 2 | other; not covered above |
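
To check placeholder frequencies in a dialog, the replacements can be counted directly from `tokens`. This is a small sketch under the assumption that each placeholder occupies a whole token (as `<PER>` does in the example above); `count_placeholders` is a hypothetical name.

```python
import re
import typing as t
from collections import Counter

# Matches tokens such as <PER> or <LOC.SPELL>: upper-case names, optionally dotted.
PLACEHOLDER = re.compile(r"<[A-Z]+(?:\.[A-Z]+)*>")


def count_placeholders(tokens: t.Iterable[str]) -> t.Counter[str]:
    """Count tokens that consist entirely of an anonymization placeholder."""
    return Counter(tok for tok in tokens if PLACEHOLDER.fullmatch(tok))


sample = ["helpdesk", "buongiorno", "sono", "<PER>", "da", "<LOC.SPELL>", "<PER>"]
print(count_placeholders(sample))
```

Summing these counters over all dialog files should give a table comparable to the one above, provided the assumption about whole-token placeholders holds.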



## Notes, Known Issues, Peculiarities and TODOs

- `0704000020`: `conn` and `arg2` spans overlap in `Explicit` relation (DONE)

- 0 sense relations (8):

Binary file removed data/.DS_Store
Binary file not shown.