A command-line application that generates measurement data for the MIND Foods Hub Data Lake.
Data is generated in two formats: CSV and ndjson.
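As a rough illustration of how the two formats differ, the sketch below serializes one record as a CSV line and as an ndjson line. The helper names (toCsvLine, toNdjsonLine) and the sample values are hypothetical, and the handling of NULL (an empty CSV field, a JSON null) is an assumption rather than the generator's confirmed behaviour.

```javascript
// Hypothetical helpers: the generator's real serialization code is not shown in this README.
// Assumption: NULL becomes an empty field in CSV and a JSON null in ndjson.
function toCsvLine(row, columns) {
  return columns
    .map((name) => {
      const value = row[name];
      if (value === null || value === undefined) return '';
      const text = String(value);
      // Quote fields that contain commas, double quotes, or line breaks.
      return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
    })
    .join(',');
}

function toNdjsonLine(row) {
  // ndjson is one JSON object per line; the caller writes '\n' after each record.
  return JSON.stringify(row);
}

// Illustrative subset of the table columns, with made-up values.
const columns = ['id', 'double_value', 'str_value', 'unit_of_measure'];
const row = { id: 'm-1', double_value: 21.5, str_value: null, unit_of_measure: 'C' };

const csvLine = toCsvLine(row, columns);
const ndjsonLine = toNdjsonLine(row);
```

The CSV form depends on a fixed column order, while each ndjson line carries its own field names.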
MIND Foods Hub data are stored in a single table, named dl_measurements, that follows a denormalized data model to avoid expensive join operations.
This means that each row of the table can have missing (NULL) values, depending on the type of measurement.
This is the table schema:
CREATE TABLE dl_measurements
(
    id string,
    double_value double,
    str_value string,
    unit_of_measure string,
    sensor_id string,
    sensor_type string,
    sensor_desc_name string,
    location_id string,
    location_name string,
    location_description string,
    location_botanic_name string,
    location_cultivation_name string,
    location_latitude double,
    location_longitude double,
    location_altitude double,
    measure_timestamp timestamp,
    start_timestamp timestamp,
    end_timestamp timestamp,
    insertion_agent string,
    insertion_timestamp timestamp,
    CONSTRAINT dl_measurements_pk
        PRIMARY KEY (id) DISABLE NOVALIDATE
)
PARTITIONED BY (partion_date string)
MIND Foods Hub sensors are of three types:
- Measurement sensors, which register discrete floating-point values (for example temperature, humidity, wind speed, etc.). This type of measurement is stored in the double_value column, while the time of the measurement is stored in the measure_timestamp column.
- Phase sensors, which register a range of floating-point values in a given period. This type of measurement is stored in the str_value column, while the start and end times of the measurement are stored in the start_timestamp and end_timestamp columns, respectively.
- Tag sensors, which register string-based values. This type of measurement is stored in the str_value column, while the time of the measurement is stored in the measure_timestamp column.
To randomly generate data for dl_measurements we need to mock this relation between a sensor type and its measurement, and guarantee these logical constraints:
- For float-based measurements, double_value is populated while str_value is NULL. measure_timestamp is calculated, while start_timestamp and end_timestamp are NULL.
- For phase-based measurements, str_value is populated while double_value is NULL. Both start_timestamp and end_timestamp are calculated, while measure_timestamp is NULL.
- For tag-based measurements, str_value is populated while double_value is NULL. measure_timestamp is calculated, while start_timestamp and end_timestamp are NULL.
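The constraints above can be sketched as a small mock function plus a checker. This is only an illustration under stated assumptions: the sensor type keys (measure, phase, tag), the function names, the placeholder string values, and the one-hour phase duration are all hypothetical and not taken from the generator's source.

```javascript
// Hypothetical sketch: type keys, names, and value ranges are illustrative assumptions.
// Columns that must be populated for each kind of measurement; all others stay NULL.
const EXPECTED_POPULATED = {
  measure: ['double_value', 'measure_timestamp'],
  phase: ['str_value', 'start_timestamp', 'end_timestamp'],
  tag: ['str_value', 'measure_timestamp'],
};

const CONSTRAINED_COLUMNS = [
  'double_value', 'str_value', 'measure_timestamp', 'start_timestamp', 'end_timestamp',
];

function mockMeasurement(sensorType, now = new Date()) {
  // Start with every constrained column set to NULL, then fill in what the type requires.
  const base = Object.fromEntries(CONSTRAINED_COLUMNS.map((col) => [col, null]));
  switch (sensorType) {
    case 'measure':
      return { ...base, double_value: Math.random() * 100, measure_timestamp: now.toISOString() };
    case 'phase': {
      // Assumed phase length of one hour, ending at `now`.
      const start = new Date(now.getTime() - 60 * 60 * 1000);
      return {
        ...base,
        str_value: 'phase-value',
        start_timestamp: start.toISOString(),
        end_timestamp: now.toISOString(),
      };
    }
    case 'tag':
      return { ...base, str_value: 'tag-value', measure_timestamp: now.toISOString() };
    default:
      throw new Error(`Unknown sensor type: ${sensorType}`);
  }
}

function satisfiesConstraints(sensorType, row) {
  const populated = EXPECTED_POPULATED[sensorType];
  if (!populated) return false;
  // A column must be non-NULL exactly when the type expects it to be populated.
  return CONSTRAINED_COLUMNS.every((col) => (row[col] !== null) === populated.includes(col));
}
```

Keeping the expected columns in one lookup table means the generator and any validation step read the rules from the same place.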
First, install "MFH Measurements data generator" dependencies:
$ npm i
Then run the application with the following command:
$ node index.js
By default "MFH Measurements data generator" generates 5 million rows for both the CSV and ndjson files.
To configure the number of rows to generate, use the NUMBER_OF_ROWS environment variable:
$ NUMBER_OF_ROWS=100 node index.js
Other configuration environment variables can be found in the config.js file.
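For illustration, an environment variable such as NUMBER_OF_ROWS could be read with a fallback to the documented default of 5 million rows, as in the sketch below. The function name readNumberOfRows and the validation rules are assumptions; the actual config.js may be organized differently.

```javascript
// Hypothetical sketch: the real config.js may read and validate this value differently.
const DEFAULT_NUMBER_OF_ROWS = 5000000; // default stated in this README

function readNumberOfRows(env = process.env) {
  const raw = env.NUMBER_OF_ROWS;
  // Fall back to the default when the variable is unset or empty.
  if (raw === undefined || raw === '') return DEFAULT_NUMBER_OF_ROWS;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`NUMBER_OF_ROWS must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}
```

Rejecting invalid values early avoids starting a long generation run with an unintended row count.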