This document describes the process of configuring Elasticsearch templates for Mimirsbrunn.
We can picture Elasticsearch as a black box in which we store JSON documents. These documents are of different kinds, depending on our business. Since we deal with geospatial data, and Navitia in particular works with public transportation, the types of documents we store are:
- administrative regions
- addresses
- streets
- points of interest (POIs)
- stops (public transportation)
We first submit configuration files to Elasticsearch to describe how we want each document type to be handled. These are so-called component templates and index templates, which include:
- settings: how do we want the text to be handled? Do we want to use synonyms, lowercasing, stemming, …
- mappings: how each field of each document type listed above is handled.
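As an illustration, a single component template can carry both concerns. The following sketch is hypothetical (the analyzer name and the field choices are ours, not Mimirsbrunn's actual configuration), but it uses standard Elasticsearch template syntax:

```json
{
  "template": {
    "settings": {
      "analysis": {
        "analyzer": {
          "label_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "label": { "type": "text", "analyzer": "label_analyzer" },
        "coord": { "type": "geo_point" }
      }
    }
  }
}
```

The settings block defines how text is analyzed; the mappings block assigns a type and an analyzer to each field.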
When the documents are indexed according to our settings and mappings, we can then query Elasticsearch, and play with lots of parameters to push the ranking of documents up or down.
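For example, a query can combine full-text matching on the label with the document's weight to influence ranking. This is only a sketch (the boost mode and the missing value are arbitrary choices, not Mimirsbrunn's actual query), using Elasticsearch's `function_score`:

```json
{
  "query": {
    "function_score": {
      "query": { "match": { "label": "rue de la paix" } },
      "functions": [
        { "field_value_factor": { "field": "weight", "missing": 0 } }
      ],
      "boost_mode": "sum"
    }
  }
}
```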
This document describes how we establish a baseline for these templates, and the process of updating them.
Configuring Elasticsearch templates is an iterative process, which, when done right, results in:
- reduced memory consumption in Elasticsearch, by reducing the size / number of indices.
- reduced search duration, by simplifying the query
- better ranking
We'll construct a table with all the fields, for each type of document. The source of information is the document itself, which is a Rust structure serialized to JSON. When building this resource, be sure to exclude fields that would be skipped by the serializer (marked as skip).
field | type | description |
---|---|---|
administrative_regions | Vec<Arc> | A list of parent administrative regions |
approx_coord | Option | Coordinates of (the center??) of the region, similar to coord. Given in lat/lon |
bbox | Option<Rect> | Bounding Box |
boundary | Option<MultiPolygon> | Describes the shape of the admin region |
codes | BTreeMap<String, String> | Some codes used in OSM, like ISO3166, ref:nuts, wikidata |
context | Option | Used for debugging |
coord | Coord | Coordinates of the region |
country_codes | Vec | Country Codes |
id | String | Unique id created by cosmogony |
insee | String | A code used to identify regions in France. From OSM |
label | String | ?? |
labels | I18nProperties | ?? |
level | u32 | Position of the region in the admin hierarchy |
name | String | Name |
names | I18nProperties | Name, but internationalized, eg name:en, name:ru, name:es |
parent_id | Option | id of the parent admin region (or none if root) |
weight | f64 | A number associated with the population in that region |
zip_codes | Vec | Zip codes (can be more than one) |
zone_type | Option | Describes the type, eg city, suburb, country,… |
Addresses, compared to administrative regions, have very few fields of their own, just the house number and the street:
field | type | description |
---|---|---|
approx_coord | Option | |
context | Option | |
coord | Coord | |
country_codes | Vec | |
house_number | String | Identifier in the street |
id | String | Unique identifier |
label | String | |
name | String | |
street | Street | Reference to the street the address belongs to. |
weight | f64 | |
zip_codes | Vec |
No particular fields for streets:
field | type | description |
---|---|---|
administrative_regions | Vec<Arc> | |
approx_coord | Option | |
context | Option | |
coord | Coord | |
country_codes | Vec | |
id | String | |
label | String | |
name | String | |
weight | f64 | |
zip_codes | Vec |
Points of interest (POIs):
field | type | description |
---|---|---|
address | Option | Address associated with that POI Can be an address or a street |
administrative_regions | Vec<Arc> | |
approx_coord | Option | |
context | Option | |
coord | Coord | |
country_codes | Vec | |
id | String | |
label | String | |
labels | I18nProperties | |
name | String | |
names | I18nProperties | |
poi_type | PoiType | id / name references in NTFS |
properties | BTreeMap<String, String> | |
weight | f64 | |
zip_codes | Vec |
Stops (Public Transportation):
field | type | description |
---|---|---|
administrative_regions | Vec<Arc> | |
approx_coord | Option | |
codes | BTreeMap<String, String> | |
comments | Vec | |
commercial_modes | Vec | |
context | Option | |
coord | Coord | |
country_codes | Vec | |
coverages | Vec | |
feed_publishers | Vec | |
id | String | |
label | String | |
lines | Vec | |
name | String | |
physical_modes | Vec | |
properties | BTreeMap<String, String> | |
timezone | String | |
weight | f64 | The weight depends on the number of lines, and other parameters. |
zip_codes | Vec |
When we combine together all the fields from the previous documents, we obtain the following table, which shows all the fields in use, and by what type of document.
field | type | description | adm | add | poi | stp | str |
---|---|---|---|---|---|---|---|
address | Option | Address associated with that POI | ✓ | ||||
administrative_regions | Vec<Arc> | A list of parent administrative regions | ✓ | ✓ | ✓ | ✓ | |
approx_coord | Option | Coordinates of the object, similar to coord | ✓ | ✓ | ✓ | ✓ | ✓ |
bbox | Option<Rect> | Bounding Box | ✓ | ||||
boundary | Option<MultiPolygon> | Describes the shape of the admin region | ✓ | ||||
codes | BTreeMap<String, String> | Some codes used in OSM, like ISO3166, ref:nuts, wikidata | ✓ | ✓ | |||
comments | Vec | ✓ | |||||
commercial_modes | Vec | ✓ | |||||
context | Option<Context> | Used to return information (debugging) | ✓ | ✓ | ✓ | ✓ | ✓ |
coord | Coord | ✓ | ✓ | ✓ | ✓ | ✓ | |
country_codes | Vec | Country Codes | ✓ | ✓ | ✓ | ✓ | ✓ |
coverages | Vec | ✓ | |||||
feed_publishers | Vec | ✓ | |||||
house_number | String | Identifier in the street | ✓ | ||||
id | String | Unique identifier | ✓ | ✓ | ✓ | ✓ | ✓ |
insee | String | A code used to identify regions in France. | ✓ | ||||
label | String | ?? | ✓ | ✓ | ✓ | ✓ | ✓ |
labels | I18nProperties | ?? | ✓ | ✓ | |||
level | u32 | Position of the region in the admin hierarchy | ✓ | ||||
lines | Vec | ✓ | |||||
name | String | Name | ✓ | ✓ | ✓ | ✓ | ✓ |
names | I18nProperties | Name, but internationalized, eg name:en, name:ru, name:es | ✓ | ✓ | |||
parent_id | Option | id of the parent admin region (or none if root) | ✓ | ||||
physical_modes | Vec | ✓ | |||||
poi_type | PoiType | id / name references in NTFS | ✓ | ||||
properties | BTreeMap<String, String> | ✓ | ✓ | ||||
street | Street | Reference to the street the address belongs to. | ✓ | ||||
timezone | String | ✓ | |||||
weight | f64 | ✓ | ✓ | ✓ | ✓ | ✓ | |
zip_codes | Vec | ✓ | ✓ | ✓ | ✓ | ✓ | |
zone_type | Option | Describes the type, eg city, suburb, country,… | ✓ |
Talk about type, indexed_at (and pipeline).
We can extract from this table a list of fields that are (almost) common to all the documents. In this table of common fields, we indicate what type is used for Elasticsearch, whether we should index the field, and some comments.
field | type | adm | add | poi | stp | str | Elasticsearch | Index | Comment |
---|---|---|---|---|---|---|---|---|---|
administrative_regions | Vec<Arc<Admin>> | ✓ | | ✓ | ✓ | ✓ | | ✗ | large object |
approx_coord | Option<Geometry> | ✓ | ✓ | ✓ | ✓ | ✓ | ?? | ✗ | Improved geo_point in Elasticsearch may render approx_coord obsolete |
context | Option<Context> | ✓ | ✓ | ✓ | ✓ | ✓ | | ✗ | Output |
coord | Coord | ✓ | ✓ | ✓ | ✓ | ✓ | geo_point | ✓ | Index for reverse API |
country_codes | Vec<String> | ✓ | ✓ | ✓ | ✓ | ✓ | ?? | ✗ | Are we searching with these? |
id | String | ✓ | ✓ | ✓ | ✓ | ✓ | keyword | ✓ | Index for features API. Really need to index?? |
label | String | ✓ | ✓ | ✓ | ✓ | ✓ | SAYT | ✓ | Field created by binaries (contains name and other information, like admin, country code, …) |
name | String | ✓ | ✓ | ✓ | ✓ | ✓ | text | ✓ | copy to full label |
weight | f64 | ✓ | ✓ | ✓ | ✓ | ✓ | float | ✗ | used for ranking |
zip_codes | Vec<String> | ✓ | ✓ | ✓ | ✓ | ✓ | text | ?? | copy to full label |
Now we'll turn this table into an actual component template, responsible for handling all the common fields.
A few points are important to notice:
- The text-based search happens on the label. The label is created by the indexing program, and contains the name, some information about the administrative region the object belongs to, and possibly a country code. So we're not indexing the name, because the search happens on the label.
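One way to realize the "copy to full label" comments from the common-fields table is Elasticsearch's copy_to mapping parameter. This is only a sketch: the full_label field name is inferred from the table comments and may differ in the actual templates.

```json
{
  "properties": {
    "name": { "type": "text", "copy_to": "full_label" },
    "zip_codes": { "type": "text", "copy_to": "full_label" },
    "full_label": { "type": "text" }
  }
}
```

With copy_to, the values of name and zip_codes are copied into full_label at index time, so a single field can be queried for autocomplete-style search.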
The component template also contains additional fields, that are not present in the document sent by the binaries:
field | type | adm | add | poi | stp | str | Elasticsearch | Index | Comment |
---|---|---|---|---|---|---|---|---|---|
indexed_at | | ✓ | ✓ | ✓ | ✓ | ✓ | date | ✗ | Generated by an Elasticsearch pipeline |
type | | ✓ | ✓ | ✓ | ✓ | ✓ | constant_keyword | ✗ | Set in individual index templates |
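The indexed_at field can be generated by an Elasticsearch ingest pipeline using the standard set processor with the _ingest.timestamp metadata field. The sketch below is hypothetical (the pipeline description is ours), but uses the documented processor syntax; such a pipeline would typically be referenced from the index settings via index.default_pipeline:

```json
{
  "description": "Stamp each document with its indexing time",
  "processors": [
    {
      "set": {
        "field": "indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```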
The search template has to reflect the information found in the common template.
If we look back at the list of fields present in the administrative region document, and remove all the fields that are part of the common template, we have the following list of remaining fields:
field | type | Elasticsearch | Index | Comment |
---|---|---|---|---|
bbox | Option<Rect<f64>> | | ✗ | Bounding Box |
boundary | Option<MultiPolygon<f64>> | geo_shape | ✗ | |
codes | BTreeMap<String, String> | | ✗ | |
insee | String | | ✗ | |
labels | I18nProperties | ?? | ✓ | used in dynamic templates |
level | u32 | | ✗ | used for ranking |
names | I18nProperties | | ✓ | used in dynamic templates |
parent_id | Option<String> | | ✗ | |
zone_type | Option<ZoneType> | keyword | ✓ | used for filtering |
The treatment of labels and names is done in a separate template, using dynamic templates.
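Since the keys under names and labels vary by language (name:en, name:ru, …), a dynamic template matching on the field path is a natural fit. The following sketch is hypothetical (the template names are ours), but uses Elasticsearch's documented dynamic_templates syntax:

```json
{
  "mappings": {
    "dynamic_templates": [
      {
        "i18n_names": {
          "path_match": "names.*",
          "mapping": { "type": "text" }
        }
      },
      {
        "i18n_labels": {
          "path_match": "labels.*",
          "mapping": { "type": "text" }
        }
      }
    ]
  }
}
```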
This leaves the remaining fields to be indexed with the mimir-admin.json index template.
If we look back at the list of fields present in the address document, and remove all the fields that are part of the common template, we have the following list of remaining fields:
field | type | Elasticsearch | Index | Comment |
---|---|---|---|---|
house_number | String | text | ✓ | ?? Should we index it? |
street | Street | | ✗ | Reference to the street the address belongs to. |
This leaves the remaining fields to be indexed with the mimir-addr.json index template.
For streets, it's quite easy, because all the fields can be indexed with the base template, leaving the mimir-street.json index template with nothing specific to add.
If we look back at the list of fields present in the poi document, and remove all the fields that are part of the common template, we have the following list of remaining fields:
field | type | Elasticsearch | Index | Comment |
---|---|---|---|---|
address | Option | object | ✗ | |
boundary | Option<MultiPolygon<f64>> | geo_shape | ✗ | |
labels | I18nProperties | ?? | ✓ | used in dynamic templates |
names | I18nProperties | | ✓ | used in dynamic templates |
poi_type | PoiType | keyword | ✓ | used for filtering |
properties | BTreeMap<String, String> | object | ✓ | used for filtering |
This leaves the remaining fields to be indexed with the mimir-poi.json index template.
If we look back at the list of fields present in the stop document, and remove all the fields that are part of the common template, we have the following list of remaining fields:
field | type | Elasticsearch | Index | Comment |
---|---|---|---|---|
comments | Vec | | ✗ | |
commercial_modes | Vec | | ✗ | |
coverages | Vec | | ✗ | |
feed_publishers | Vec | | ✗ | |
lines | Vec | | ✗ | |
physical_modes | Vec | | ✗ | |
properties | BTreeMap<String, String> | flattened | ✓ | |
timezone | String | | ✗ | |
This leaves the remaining fields to be indexed with the mimir-stop.json index template.
For now there is a single binary used to insert templates into Elasticsearch. It must be run prior to the creation of any index. This binary uses the same configuration files and command-line conventions as the other binaries.
./target/release/ctlmimir -c ./config -m testing run
This program will look in the directories <config>/ctlmimir and <config>/elasticsearch to read some configuration values. It then scans <config>/elasticsearch/templates/components and imports all the templates found there, and does the same for <config>/elasticsearch/templates/indices.
You can check all the templates directly in Elasticsearch. Since Mimirsbrunn's templates are prefixed with mimir-, you can run:
curl -X GET 'http://localhost:9200/_component_template/mimir-*' | jq '.'
Same thing for index templates:
curl -X GET 'http://localhost:9200/_index_template/mimir-*' | jq '.'
There are scenarios in which you may want to override certain values.
Let's say you want to make sure that all administrative region indices have a certain number of replicas, different from the default. Prior to importing the templates, you can edit the index template in config/elasticsearch/templates/indices/mimir-admin.json and change the settings:
{
"elasticsearch": {
"index_patterns": ["munin_admin*"],
"template": {
"settings": {
"number_of_replicas": "2"
}
...
}
}
}
Then, when you run ctlmimir, all indices whose names start with munin_admin* will have that number of replicas. You can then verify that when you create a new index with cosmogony2mimir, it has the correct number of replicas.
Let's say that, following the previous scenario, you want to create a new admin index, but with a different number of replicas than the one found in the index template.
In that case you can still use command line overrides:
cosmogony2mimir -s elasticsearch.settings.number_of_replicas=9 ...
Updating templates is essentially an iterative process, and we try to use a TDD approach:
- For a new feature or a bug, we create a new scenario in the features directory.
- We run the end-to-end tests (cargo test --test end_to_end); they fail.
- We update the templates, and run the tests again.
Playing with templates, analyzers, tokenizers, and so on, or boosting some results relative to others, requires an intimate knowledge of how Elasticsearch works. These measures should be taken into account when modifying the templates: like most iterative processes, we make a change, evaluate the results, estimate what needs to change to improve the measure, and loop again.
Evaluating the templates can be done with:
- ctlmimir, a binary used to import the templates found in /config/elasticsearch/templates. With this tool, we just check that the templates can actually be imported.
- import2mimir.sh, which can be used to evaluate the whole indexing process, using ctlmimir and the other indexing tools.
- end-to-end tests, used to make sure that the indexing process is correct, and that predefined search queries return the expected results.
- benchmarks, used to estimate the time it takes to index or to search.