Skip to content

Commit ed4142d

Browse files
rursprungakolarkunnu
authored andcommitted
phone-search analyzer: don't emit sip/tel prefix, int'l prefix, extension & unformatted input (opensearch-project#16993)
* `phone-search` analyzer: don't emit int'l prefix this was an oversight in the initial implementation: if the tokenizer emits the international calling prefix in the search analyzer then all documents with the same international calling prefix will match. e.g. when searching for `+1-555-123-4567` not only documents with this number would match but also any other document with a `1` token (i.e. any other number with this prefix). thus the search functionality is currently broken for this analyzer, making it useless. the test coverage has now been extended to cover these and other use-cases. Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * `phone-search` analyzer: don't emit extension & unformatted input if these tokens are emitted it meant that phone numbers with other international dialling prefixes still matched. e.g. searching for `+1 1234` would also match a number stored as `+2 1234`, which was wrong. the tokens still need to be emited for the `phone` analyzer, e.g. when the user only enters the extension / local number it should still match, the same is with the other ngrams: these are needed for search-as-you-type style queries where the user input needs to match against partial phone numbers. Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> * `phone-search` analyzer: don't emit sip/tel prefix in line with the previous two commits, this is something else the search analyzer shouldn't emit since otherwise searching for any number with such a prefix will match _any_ document with the same prefix. Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com> --------- Signed-off-by: Ralph Ursprung <Ralph.Ursprung@avaloq.com>
1 parent 9d02bab commit ed4142d

File tree

4 files changed

+166
-15
lines changed

4 files changed

+166
-15
lines changed

CHANGELOG.md

+1
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
9292
- Always use `constant_score` query for `match_only_text` field ([#16964](https://github.com/opensearch-project/OpenSearch/pull/16964))
9393
- Fix Shallow copy snapshot failures on closed index ([#16868](https://github.com/opensearch-project/OpenSearch/pull/16868))
9494
- Fix multi-value sort for unsigned long ([#16732](https://github.com/opensearch-project/OpenSearch/pull/16732))
95+
- The `phone-search` analyzer no longer emits the tel/sip prefix, international calling code, extension numbers and unformatted input as a token ([#16993](https://github.com/opensearch-project/OpenSearch/pull/16993))
9596

9697
### Security
9798

plugins/analysis-phonenumber/src/main/java/org/opensearch/analysis/phone/PhoneNumberTermTokenizer.java

+17-6
Original file line numberDiff line numberDiff line change
@@ -98,7 +98,9 @@ private Set<String> getTokens() throws IOException {
9898

9999
// Rip off the "tel:" or "sip:" prefix
100100
if (input.indexOf("tel:") == 0 || input.indexOf("sip:") == 0) {
101-
tokens.add(input.substring(0, 4));
101+
if (addNgrams) {
102+
tokens.add(input.substring(0, 4));
103+
}
102104
input = input.substring(4);
103105
}
104106

@@ -128,14 +130,23 @@ private Set<String> getTokens() throws IOException {
128130
countryCode = Optional.of(String.valueOf(numberProto.getCountryCode()));
129131
input = String.valueOf(numberProto.getNationalNumber());
130132

131-
// Add Country code, extension, and the number as tokens
132-
tokens.add(countryCode.get());
133+
// add full number as tokens
133134
tokens.add(countryCode.get() + input);
134-
if (!Strings.isEmpty(numberProto.getExtension())) {
135-
tokens.add(numberProto.getExtension());
135+
136+
if (addNgrams) {
137+
// Consider the country code as an ngram - it makes no sense in the search analyzer as it'd match all values with the
138+
// same country code
139+
tokens.add(countryCode.get());
140+
141+
// Add extension without country code (not done for search analyzer as that might match numbers in other countries as
142+
// well!)
143+
if (!Strings.isEmpty(numberProto.getExtension())) {
144+
tokens.add(numberProto.getExtension());
145+
}
146+
// Add unformatted input (most likely the same as the extension now since the prefix has been removed)
147+
tokens.add(input);
136148
}
137149

138-
tokens.add(input);
139150
}
140151
} catch (final NumberParseException | StringIndexOutOfBoundsException e) {
141152
// Libphone didn't like it, no biggie. We'll just ngram the number as it is.

plugins/analysis-phonenumber/src/test/java/org/opensearch/analysis/phone/PhoneNumberAnalyzerTests.java

+9-9
Original file line numberDiff line numberDiff line change
@@ -87,11 +87,7 @@ public void testEuropeDetailled() throws IOException {
8787
* Test for all tokens which are emitted by the "phone" analyzer.
8888
*/
8989
public void testEuropeDetailledSearch() throws IOException {
90-
assertTokensAreInAnyOrder(
91-
phoneSearchAnalyzer,
92-
"tel:+441344840400",
93-
Arrays.asList("tel:+441344840400", "tel:", "441344840400", "44", "1344840400")
94-
);
90+
assertTokensAreInAnyOrder(phoneSearchAnalyzer, "tel:+441344840400", Arrays.asList("tel:+441344840400", "441344840400"));
9591
}
9692

9793
public void testEurope() throws IOException {
@@ -166,6 +162,10 @@ public void testTelPrefix() throws IOException {
166162
assertTokensInclude("tel:+1228", Arrays.asList("1228", "122", "228"));
167163
}
168164

165+
public void testTelPrefixSearch() throws IOException {
166+
assertTokensInclude("tel:+1228", Arrays.asList("1228"));
167+
}
168+
169169
public void testNumberPrefix() throws IOException {
170170
assertTokensInclude("+1228", Arrays.asList("1228", "122", "228"));
171171
}
@@ -189,21 +189,21 @@ public void testLocalNumberWithCH() throws IOException {
189189
}
190190

191191
public void testSearchInternationalPrefixWithZZ() throws IOException {
192-
assertTokensInclude(phoneSearchAnalyzer, "+41583161010", Arrays.asList("41", "41583161010", "583161010"));
192+
assertTokensAreInAnyOrder(phoneSearchAnalyzer, "+41583161010", Arrays.asList("+41583161010", "41583161010"));
193193
}
194194

195195
public void testSearchInternationalPrefixWithCH() throws IOException {
196-
assertTokensInclude(phoneSearchCHAnalyzer, "+41583161010", Arrays.asList("41", "41583161010", "583161010"));
196+
assertTokensAreInAnyOrder(phoneSearchCHAnalyzer, "+41583161010", Arrays.asList("+41583161010", "41583161010"));
197197
}
198198

199199
public void testSearchNationalPrefixWithCH() throws IOException {
200200
// + is equivalent to 00 in Switzerland
201-
assertTokensInclude(phoneSearchCHAnalyzer, "0041583161010", Arrays.asList("41", "41583161010", "583161010"));
201+
assertTokensAreInAnyOrder(phoneSearchCHAnalyzer, "0041583161010", Arrays.asList("0041583161010", "41583161010"));
202202
}
203203

204204
public void testSearchLocalNumberWithCH() throws IOException {
205205
// when omitting the international prefix swiss numbers must start with '0'
206-
assertTokensInclude(phoneSearchCHAnalyzer, "0583161010", Arrays.asList("41", "41583161010", "583161010"));
206+
assertTokensAreInAnyOrder(phoneSearchCHAnalyzer, "0583161010", Arrays.asList("0583161010", "41583161010"));
207207
}
208208

209209
/**

plugins/analysis-phonenumber/src/yamlRestTest/resources/rest-api-spec/test/analysis-phone/20_search.yml

+139
Original file line numberDiff line numberDiff line change
@@ -32,9 +32,37 @@
3232
index: test
3333
id: 1
3434
body: { "phone": "+41 58 316 10 10", "phone-ch": "058 316 10 10" }
35+
- do:
36+
index:
37+
index: test
38+
id: 2
39+
body: { "phone": "+41 58 316 99 99", "phone-ch": "058 316 99 99" }
40+
- do:
41+
index:
42+
index: test
43+
id: 3
44+
# number not used in the examples below, just present to make sure that it's never matched
45+
body: { "phone": "+41 12 345 67 89", "phone-ch": "012 345 67 89" }
46+
- do:
47+
index:
48+
index: test
49+
id: 4
50+
# germany has a different phone number length, but for this test we ignore it and pretend they're the same
51+
body: { "phone": "+49 58 316 10 10", "phone-ch": "+49 58 316 10 10" }
52+
- do:
53+
index:
54+
index: test
55+
id: 5
56+
body: { "phone": "+1-888-280-4331", "phone-ch": "+1-888-280-4331" }
57+
- do:
58+
index:
59+
index: test
60+
id: 6
61+
body: { "phone": "tel:+441344840400", "phone-ch": "tel:+441344840400" }
3562
- do:
3663
indices.refresh: {}
3764

65+
# international format in document & search will always work
3866
- do:
3967
search:
4068
rest_total_hits_as_int: true
@@ -45,6 +73,7 @@
4573
"phone": "+41583161010"
4674
- match: { hits.total: 1 }
4775

76+
# correct national format & international format in search will always work
4877
- do:
4978
search:
5079
rest_total_hits_as_int: true
@@ -54,3 +83,113 @@
5483
match:
5584
"phone-ch": "+41583161010"
5685
- match: { hits.total: 1 }
86+
87+
# national format without country specified won't work
88+
- do:
89+
search:
90+
rest_total_hits_as_int: true
91+
index: test
92+
body:
93+
query:
94+
match:
95+
"phone": "0583161010"
96+
- match: { hits.total: 0 }
97+
98+
# correct national format with country specified in document & search will always work
99+
- do:
100+
search:
101+
rest_total_hits_as_int: true
102+
index: test
103+
body:
104+
query:
105+
match:
106+
"phone-ch": "0583161010"
107+
- match: { hits.total: 1 }
108+
109+
# search-as-you-type style query
110+
- do:
111+
search:
112+
rest_total_hits_as_int: true
113+
index: test
114+
body:
115+
query:
116+
match:
117+
"phone": "+4158316"
118+
- match: { hits.total: 2 }
119+
120+
# search-as-you-type style query
121+
- do:
122+
search:
123+
rest_total_hits_as_int: true
124+
index: test
125+
body:
126+
query:
127+
match:
128+
"phone-ch": "058316"
129+
- match: { hits.total: 2 }
130+
131+
# international format in document & search will always work
132+
- do:
133+
search:
134+
rest_total_hits_as_int: true
135+
index: test
136+
body:
137+
query:
138+
match:
139+
"phone": "+1 888 280 4331"
140+
- match: { hits.total: 1 }
141+
142+
# international format in document & search will always work
143+
- do:
144+
search:
145+
rest_total_hits_as_int: true
146+
index: test
147+
body:
148+
query:
149+
match:
150+
"phone-ch": "+1 888 280 4331"
151+
- match: { hits.total: 1 }
152+
153+
# national format in search won't work if no country is specified
154+
- do:
155+
search:
156+
rest_total_hits_as_int: true
157+
index: test
158+
body:
159+
query:
160+
match:
161+
"phone": "888 280 4331"
162+
- match: { hits.total: 0 }
163+
164+
# document & search have a tel: prefix
165+
- do:
166+
search:
167+
rest_total_hits_as_int: true
168+
index: test
169+
body:
170+
query:
171+
match:
172+
"phone": "tel:+441344840400"
173+
- match: { hits.total: 1 }
174+
175+
# only document has a tel: prefix
176+
- do:
177+
search:
178+
rest_total_hits_as_int: true
179+
index: test
180+
body:
181+
query:
182+
match:
183+
"phone": "+441344840400"
184+
- match: { hits.total: 1 }
185+
186+
# only search has a tel: prefix
187+
- do:
188+
search:
189+
rest_total_hits_as_int: true
190+
index: test
191+
body:
192+
query:
193+
match:
194+
"phone": "tel:+1 888 280 4331"
195+
- match: { hits.total: 1 }

0 commit comments

Comments
 (0)