Commit 08da2af

Angel Chang authored and Stanford NLP committed
Merge branch 'master' of jamie.stanford.edu:/u/nlp/git/javanlp
1 parent d53d5c0 commit 08da2af

4 files changed: +81554 -81380 lines changed

README.md

+3 -3
@@ -38,13 +38,13 @@ At present [the current released version of the code](https://stanfordnlp.github
 #### Build with Maven

 1. Make sure you have Maven installed, details here: [https://maven.apache.org/](https://maven.apache.org/)
-2. If you run this command in the CoreNLP directory: `mvn package` , it should run the tests and build this jar file: `CoreNLP/target/stanford-corenlp-3.7.0.jar`
+2. If you run this command in the CoreNLP directory: `mvn package` , it should run the tests and build this jar file: `CoreNLP/target/stanford-corenlp-3.9.2.jar`
 3. When using the latest version of the code make sure to download the latest versions of the [corenlp-models](http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar), [english-models](http://nlp.stanford.edu/software/stanford-english-corenlp-models-current.jar), and [english-models-kbp](http://nlp.stanford.edu/software/stanford-english-kbp-corenlp-models-current.jar) and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
-4. If you want to use Stanford CoreNLP as part of a Maven project you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar. For other languages just change the language name in the command. To install `stanford-corenlp-models-current.jar` you will need to set `-Dclassifier=models`. Here is the sample command for Spanish: `mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=3.9.1 -Dclassifier=models-spanish -Dpackaging=jar`
+4. If you want to use Stanford CoreNLP as part of a Maven project you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar. For other languages just change the language name in the command. To install `stanford-corenlp-models-current.jar` you will need to set `-Dclassifier=models`. Here is the sample command for Spanish: `mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=3.9.2 -Dclassifier=models-spanish -Dpackaging=jar`

 ### Useful resources

-You can find releases of Stanford CoreNLP on [Maven Central](https://search.maven.org/#artifactdetails%7Cedu.stanford.nlp%7Cstanford-corenlp%7C3.7.0%7Cjar).
+You can find releases of Stanford CoreNLP on [Maven Central](https://search.maven.org/artifact/edu.stanford.nlp/stanford-corenlp/3.9.2/jar).

 You can find more explanation and documentation on [the Stanford CoreNLP homepage](http://stanfordnlp.github.io/CoreNLP/).

src/edu/stanford/nlp/process/PTBLexer.flex

+32 -10
@@ -249,7 +249,7 @@ import edu.stanford.nlp.util.logging.Redwood;


 /** Turn on to find out how things were tokenized. */
-private static final boolean DEBUG = false;
+private static final boolean DEBUG = true;

 /** A logger for this class */
 private static final Redwood.RedwoodChannels logger = Redwood.channels(PTBLexer.class);
@@ -756,11 +756,10 @@ ABCOMP2 = Invt|Elec|Natl|M[ft]g|Dept|Blvd|Rd|Ave|[P][l]|viz
 /* ABRREV2 abbreviations are normally followed by an upper case word.
  * We assume they aren't used sentence finally. Ph is in there for Ph. D Sc for B.Sc.
  */
-ABBREV4 = {ABTITLE}|vs|[v]|Alex|Wm|Jos|Cie|a\.k\.a|cf|TREAS|Ph|[S][c]|{ACRO}|{ABCOMP2}
+ABBREV4 = {ABTITLE}|vs|[v]|Wm|Jos|Cie|a\.k\.a|cf|TREAS|Ph|[S][c]|{ACRO}|{ABCOMP2}
 ABBREV2 = {ABBREV4}\.
 ACRONYM = ({ACRO})\.
 /* Cie. is used by French companies sometimes before and sometimes at end as in English Co. But we treat as allowed to have Capital following without being sentence end. Cia. is used in Spanish/South American company abbreviations, which come before the company name, but we exclude that and lose, because in a caseless segmenter, it's too confusable with CIA. */
-/* in the WSJ Alex. is generally an abbreviation for Alex. Brown, brokers! */
 /* Added Wm. for William and Jos. for Joseph */
 /* In tables: Mkt. for market Div. for division of company, Chg., Yr.: year */

@@ -873,6 +872,7 @@ CP1252_MISC_SYMBOL = [\u0086\u0087\u0089\u0095\u0098\u0099]
 if (normalizeSpace) {
   txt = SINGLE_SPACE_PATTERN.matcher(txt).replaceAll("\u00A0"); // change to non-breaking space
 }
+if (DEBUG) { logger.info("Used {SGML1} to recognize " + origTxt + " as " + txt); }
 return getNext(txt, origTxt);
 }
 <YyTokenizePerLine>{SGML2}
@@ -881,6 +881,7 @@ CP1252_MISC_SYMBOL = [\u0086\u0087\u0089\u0095\u0098\u0099]
 if (normalizeSpace) {
   txt = txt.replace(' ', '\u00A0'); // change space to non-breaking space
 }
+if (DEBUG) { logger.info("Used {SGML2} to recognize " + origTxt + " as " + txt); }
 return getNext(txt, origTxt);
 }
 {SPMDASH} { if (ptb3Dashes) {
@@ -970,12 +971,16 @@ CP1252_MISC_SYMBOL = [\u0086\u0087\u0089\u0095\u0098\u0099]
 "; probablyLeft=" + false); }
 return getNext(norm, tok);
 }
-{DATE} { String txt = yytext();
+{DATE} { String origTxt = yytext();
+String txt;
 if (escapeForwardSlashAsterisk) {
-  txt = LexerUtils.escapeChar(txt, '/');
+  txt = LexerUtils.escapeChar(origTxt, '/');
+} else {
+  txt = origTxt;
 }
-return getNext(txt, yytext());
-}
+if (DEBUG) { logger.info("Used {DATE} to recognize " + origTxt + " as " + txt); }
+return getNext(txt, origTxt);
+}
 /* Malaysian currency */
 RM/{NUM} { String txt = yytext();
 return getNext(txt, txt);
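
The reworked `{DATE}` action above keeps the matched text in `origTxt` and rewrites only the copy that becomes the token's normalized form. A minimal Java sketch of that pattern follows. `DateRuleSketch`, `dateToken`, and the local `escapeChar` are hypothetical stand-ins written for illustration; the real `LexerUtils.escapeChar` is not shown in this diff and may behave differently.

```java
// Hypothetical illustration only; not part of PTBLexer.flex.
final class DateRuleSketch {

  /**
   * Assumed stand-in for LexerUtils.escapeChar: puts a backslash in front of
   * every occurrence of c that is not already preceded by a backslash.
   */
  static String escapeChar(String s, char c) {
    StringBuilder sb = new StringBuilder(s.length() + 4);
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      if (ch == c && (i == 0 || s.charAt(i - 1) != '\\')) {
        sb.append('\\');
      }
      sb.append(ch);
    }
    return sb.toString();
  }

  /**
   * Mirrors the shape of the new {DATE} action: the original text is kept
   * untouched and only the normalized copy is (optionally) escaped.
   * Returns a two-element array: { normalized, original }.
   */
  static String[] dateToken(String matched, boolean escapeForwardSlashAsterisk) {
    String origTxt = matched;
    String txt;
    if (escapeForwardSlashAsterisk) {
      txt = escapeChar(origTxt, '/');
    } else {
      txt = origTxt;
    }
    return new String[] { txt, origTxt };
  }

  public static void main(String[] args) {
    String[] tok = dateToken("12/25/2018", true);
    System.out.println(tok[0] + " <- " + tok[1]);
  }
}
```

Assigning `txt` in both branches is what lets the debug line and the final `getNext(txt, origTxt)` call refer to the two values without calling `yytext()` a second time.
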
@@ -1073,8 +1078,24 @@ RM/{NUM} { String txt = yytext();
 // since the last one matches two things, even newlines (if not tokenize per line)
 return processAbbrev1();
 }
-{ABBREV2} { return getNext(); }
-{ABBREV4}/{SPACE} { return getNext(); }
+{ABBREV2} { String tok = yytext();
+if (DEBUG) { logger.info("Used {ABBREV2} to recognize " + tok); }
+return getNext(tok, tok);
+}
+/* Last millenium (in the WSJ) "Alex." is generally an abbreviation for Alex. Brown, brokers! Recognize just this case. */
+<YyNotTokenizePerLine>Alex\./{SPACENL}Brown { String tok = yytext();
+if (DEBUG) { logger.info("Used {ALEX} to recognize " + tok); }
+return getNext(tok, tok);
+}
+
+<YyTokenizePerLine>Alex\./{SPACE}Brown { String tok = yytext();
+if (DEBUG) { logger.info("Used {ALEX} (2) to recognize " + tok); }
+return getNext(tok, tok);
+}
+{ABBREV4}/{SPACE} { String tok = yytext();
+if (DEBUG) { logger.info("Used {ABBREV4} to recognize " + tok); }
+return getNext(tok, tok);
+}
 {ACRO}/{SPACENL} { return getNext(); }
 {TBSPEC2}/{SPACENL} { return getNext(); }
 {ISO8601DATETIME} { return getNext(); }
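
The two new `Alex\./…Brown` rules use JFlex trailing context: in `r1/r2`, the lexer matches `r1` only when it is followed by `r2`, and `r2` is not consumed. The sketch below approximates that behaviour with a `java.util.regex` lookahead. It is an illustration under stated assumptions, not the lexer's implementation: `AlexBrownSketch` and `match` are hypothetical names, and the character classes used for `{SPACE}` (space or tab) and `{SPACENL}` (space, tab, CR, LF) are simplifications, since the real macros are defined elsewhere in `PTBLexer.flex` and are not shown in this diff.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of PTBLexer.flex.
final class AlexBrownSketch {

  // Assumed simplification of {SPACE}: a single space or tab.
  private static final Pattern PER_LINE =
      Pattern.compile("Alex\\.(?=[ \\t]Brown)");

  // Assumed simplification of {SPACENL}: also allows a line break.
  private static final Pattern ACROSS_LINES =
      Pattern.compile("Alex\\.(?=[ \\t\\r\\n]Brown)");

  /**
   * Returns "Alex." if the text starts with "Alex." followed by one
   * whitespace character and "Brown"; returns null otherwise.
   * The lookahead is checked but not included in the returned token.
   */
  static String match(String text, boolean tokenizePerLine) {
    Pattern p = tokenizePerLine ? PER_LINE : ACROSS_LINES;
    Matcher m = p.matcher(text);
    return m.lookingAt() ? m.group() : null;
  }

  public static void main(String[] args) {
    System.out.println(match("Alex. Brown & Sons", true));
    System.out.println(match("Alex. Smith", true));
  }
}
```

This also shows why the commit splits the rule by lexer state: when tokenizing per line, a newline between `Alex.` and `Brown` must not count as trailing context, so that state uses `{SPACE}` rather than `{SPACENL}`.
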
@@ -1118,9 +1139,10 @@ RM/{NUM} { String txt = yytext();
   txt = LEFT_PAREN_PATTERN.matcher(txt).replaceAll(openparen);
   txt = RIGHT_PAREN_PATTERN.matcher(txt).replaceAll(closeparen);
 }
+if (DEBUG) { logger.info("Used {SMILEY} to recognize " + origText + " as " + txt); }
 return getNext(txt, origText);
 }
-{ASIANSMILEY} { String txt = yytext();
+{ASIANSMILEY} { String txt = yytext();
 String origText = txt;
 if (normalizeParentheses) {
   txt = LEFT_PAREN_PATTERN.matcher(txt).replaceAll(openparen);
