From 8fa7b2f6185fc19b98a937f983ade52fb353e55d Mon Sep 17 00:00:00 2001 From: Costa Tsaousis Date: Thu, 6 Feb 2025 12:28:42 +0000 Subject: [PATCH] protection against extreme cardinality (#19486) * protection against extreme cardinality * do not cleanup everything, but only what is required to reach 50% * check for collection flags and settings to prevent deletion * 1000 to 10000 * 1000 old instances * log extreme cardinality cleanup * add configuration for extreme cardinality * add documentation about extreme cardinality protection * allow retention to be configured in durations less than a day * min db size 32MiB * extreme cardinality working * fixed cleanup * check for extreme cardinality on rotation, not on startup * added option to enable/disable extreme cardinality protection * enable it by default * make unknown UUID errors and debug log * extreme cardinality logs more information about the current status * enable protection conditions * never delete more than requested --- docs/extreme-cardinality-protection.md | 114 +++++++++++ .../systemd-journal-annotations.c | 1 + src/daemon/config/netdata-conf-db.c | 10 +- .../contexts/api_v2_contexts_agents.c | 44 +++-- src/database/contexts/contexts-loading.c | 2 +- src/database/contexts/internal.h | 3 +- src/database/contexts/metric.c | 7 +- src/database/contexts/worker.c | 184 ++++++++++++++++-- src/database/engine/datafile.h | 2 +- src/database/engine/metric.c | 8 +- src/database/engine/metric.h | 5 +- src/database/engine/pdc.c | 14 +- src/database/engine/rrdengine.c | 2 +- src/database/engine/rrdengineapi.c | 15 ++ src/database/engine/rrdengineapi.h | 3 +- src/database/ram/rrddim_mem.c | 4 + src/database/ram/rrddim_mem.h | 1 + src/database/storage-engine.c | 4 + src/database/storage-engine.h | 1 + src/libnetdata/inicfg/inicfg.h | 3 +- src/libnetdata/inicfg/inicfg_api.c | 30 ++- src/libnetdata/inicfg/inicfg_conf_file.c | 2 +- src/libnetdata/inicfg/inicfg_internals.h | 2 +- src/libnetdata/parsers/duration.c | 20 +- 
src/libnetdata/parsers/duration.h | 16 +- src/libnetdata/uuid/uuid.h | 1 + src/registry/registry_init.c | 3 +- 27 files changed, 412 insertions(+), 89 deletions(-) create mode 100644 docs/extreme-cardinality-protection.md diff --git a/docs/extreme-cardinality-protection.md b/docs/extreme-cardinality-protection.md new file mode 100644 index 00000000000000..2a25f037e1d06a --- /dev/null +++ b/docs/extreme-cardinality-protection.md @@ -0,0 +1,114 @@ +# Extreme Cardinality Protection in Netdata + Netdata’s tiered storage is designed to efficiently retain metric data and metadata for long periods. However, when extreme cardinality occurs—often unintentionally through misconfigurations or inadvertent practices (e.g., spawning many short-lived docker containers or using unbounded label values)—the long-term retention of metadata can lead to excessive resource consumption. + To protect against extreme cardinality, Netdata includes an automated protection mechanism. This document explains **why** this protection is needed, **how** it works, **how** to configure it, and **how** to verify its operation. + ## Why Extreme Cardinality Protection is Needed + Extreme cardinality refers to the explosion in the number of unique time series generated when metrics are combined with a wide range of labels or dimensions. In modern observability platforms like Netdata, metrics aren’t just simple numeric values—they come with metadata (labels, tags, dimensions) that help contextualize the data. When these labels are overly dynamic or unbounded (for example, when using unique identifiers such as session IDs, user IDs, or ephemeral container names) and are combined with the very long retention Netdata provides, the system ends up tracking an enormous number of unique series. 
+ Although Netdata performs better than most other observability solutions, extreme cardinality still has several implications: + - **Resource Consumption:** The system needs to remember and index vast amounts of metadata, increasing its memory footprint. + - **Performance Degradation:** When performing long-term queries (days, weeks, months), the system needs to query a vast number of time-series, leading to slower query responses. + - **Operational Complexity:** High cardinality makes it harder to manage, visualize, and analyze data. Dashboards can become cluttered. + - **Scalability Challenges:** As the number of time series grows, the resources required for maintaining aggregation points (Netdata parents) increase too. + ## Definition of Metrics Ephemerality + Metrics ephemerality is the percentage of metrics that are no longer actively collected (old) compared to the total metrics available (the sum of currently collected and old metrics). + - **Old Metrics** = The number of unique time-series that were once collected, but are not currently. + - **Current Metrics** = The number of unique time-series actively being collected. + High Ephemerality (close to 100%): The system frequently generates new unique metrics for a short period, indicating a high turnover in metrics. + Low Ephemerality (close to 0%): The system maintains a stable set of metrics over time, with little change in the total number of unique series. + ## How The Netdata Protection Works + The mechanism kicks in during tier0 (high-resolution) database rotations (i.e., when the oldest tier0 samples are deleted) and proceeds as follows: + 1. **Counting Instances with Zero Tier0 Retention:** + - For each context (e.g., containers, disks, network interfaces), Netdata counts the number of instances that have **ZERO** retention in tier0. + 2. 
**Threshold Verification:** + - If the number of instances with zero tier0 retention is **greater than or equal to 1000** (the default threshold) **and** these instances make up at least **50%** (the default ephemerality threshold) of the total instances in that context, further action is taken. + 3. **Forceful Clearing in Long-Term Storage:** + - The system forcefully clears the retention of the excess time-series. This action automatically triggers the deletion of the associated metadata, so Netdata "forgets" them. Their samples remain on disk, but they are no longer accessible. + 4. **Retention Rules:** + - **Protected Data:** + - Metrics that are actively collected (and thus present in tier0) are never deleted. + - A context with fewer than 1000 instances (as presented in the Netdata dashboards at the NIDL bar of the charts) is considered safe and is not modified. + - **Clean-up Trigger:** + - Only metrics that have lost their tier0 retention, in a context that meets the thresholds (≥1000 zero-tier0 instances and ≥50% ephemerality), will have their long-term retention cleared. + ## Configuration + You can control the protection mechanism via the following settings in the `netdata.conf` file under the `[db]` section: + ```ini +[db] + extreme cardinality protection = yes + extreme cardinality keep instances = 1000 + extreme cardinality min ephemerality = 50 +``` + - **extreme cardinality protection:** + Enables or disables the mechanism. By default it is enabled when the dbengine is used with more than one storage tier. + - **extreme cardinality keep instances:** + The minimum number of instances per context that should be kept. The default value is **1000**. + - **extreme cardinality min ephemerality:** + The minimum percentage of instances in a context with zero tier0 retention required to trigger the cleanup. The default value is **50%**. + + **Recommendations:** + - If you have samples in tier0, you also have their corresponding long-term data and metadata. Ensure that tier0 retention is configured properly. 
+ - If you expect to have more than 1000 instances per context per node (for example, more than 1000 containers, disks, network interfaces, database tables, etc.), adjust these settings to suit your specific environment. + ## How to Verify Its Operation + When the protection mechanism is activated, Netdata logs a detailed message. The log entry includes: + - The host name. + - The context affected. + - The number of metrics and instances that had their retention forcefully cleared. + - The time range for which the non-tier0 retention was deleted. + ### Example Log Message + ``` +EXTREME CARDINALITY PROTECTION: host '<hostname>', context '<context>', <status>: forcefully cleared the retention of <N> metrics and <M> instances, having non-tier0 retention from <from-date> to <to-date>. +``` + This log message is tagged with the following message ID for easy identification: + ``` +MESSAGE_ID=d1f59606dd4d41e3b217a0cfcae8e632 +``` + ### Verification Steps + 1. **Using System Logs:** + You can use `journalctl` (or your system’s log viewer) to search for the message ID: + ``` +journalctl --namespace=netdata MESSAGE_ID=d1f59606dd4d41e3b217a0cfcae8e632 +``` + 2. **Netdata Logs Dashboard:** + Navigate to the Netdata Logs dashboard. On the right side under `MESSAGE_ID`, select **"Netdata extreme cardinality"** to filter only those messages. + ## Summary + The extreme cardinality protection mechanism in Netdata is designed to automatically safeguard your system against the potential issues caused by excessive metric metadata retention. It does so by: + - Automatically counting instances without tier0 retention. + - Checking against configurable thresholds. + - Forcefully clearing long-term retention (and metadata) when thresholds are exceeded. + By properly configuring tier0 and adjusting the `extreme cardinality` settings in `netdata.conf`, you can ensure that your system remains both efficient and protected, even when extreme cardinality issues occur. 
diff --git a/src/collectors/systemd-journal.plugin/systemd-journal-annotations.c b/src/collectors/systemd-journal.plugin/systemd-journal-annotations.c index ef9204db30a271..4808d2577c1da7 100644 --- a/src/collectors/systemd-journal.plugin/systemd-journal-annotations.c +++ b/src/collectors/systemd-journal.plugin/systemd-journal-annotations.c @@ -618,6 +618,7 @@ static void netdata_systemd_journal_message_ids_init(void) { msgid_into_dict("8ddaf5ba33a74078b609250db1e951f3", "Sensor state transition"); msgid_into_dict("ec87a56120d5431bace51e2fb8bba243", "Netdata log flood protection"); msgid_into_dict("acb33cb95778476baac702eb7e4e151d", "Netdata Cloud connection"); + msgid_into_dict("d1f59606dd4d41e3b217a0cfcae8e632", "Netdata extreme cardinality"); } void netdata_systemd_journal_transform_message_id(FACETS *facets __maybe_unused, BUFFER *wb, FACETS_TRANSFORMATION_SCOPE scope __maybe_unused, void *data __maybe_unused) { diff --git a/src/daemon/config/netdata-conf-db.c b/src/daemon/config/netdata-conf-db.c index 22c31ac2bce273..98060154c17b41 100644 --- a/src/daemon/config/netdata-conf-db.c +++ b/src/daemon/config/netdata-conf-db.c @@ -3,12 +3,13 @@ #include "netdata-conf-db.h" #include "daemon/common.h" +#define DAYS 86400 int default_rrd_history_entries = RRD_DEFAULT_HISTORY_ENTRIES; bool dbengine_enabled = false; // will become true if and when dbengine is initialized bool dbengine_use_direct_io = true; static size_t storage_tiers_grouping_iterations[RRD_STORAGE_TIERS] = {1, 60, 60, 60, 60}; -static double storage_tiers_retention_days[RRD_STORAGE_TIERS] = {14, 90, 2 * 365, 2 * 365, 2 * 365}; +static time_t storage_tiers_retention_time_s[RRD_STORAGE_TIERS] = {14 * DAYS, 90 * DAYS, 2 * 365 * DAYS, 2 * 365 * DAYS, 2 * 365 * DAYS}; time_t rrdset_free_obsolete_time_s = 3600; time_t rrdhost_free_orphan_time_s = 3600; @@ -275,12 +276,13 @@ void netdata_conf_dbengine_init(const char *hostname) { disk_space_mb = inicfg_get_size_mb(&netdata_config, CONFIG_SECTION_DB, 
dbengineconfig, disk_space_mb); snprintfz(dbengineconfig, sizeof(dbengineconfig) - 1, "dbengine tier %zu retention time", tier); - storage_tiers_retention_days[tier] = inicfg_get_duration_days(&netdata_config, - CONFIG_SECTION_DB, dbengineconfig, new_dbengine_defaults ? storage_tiers_retention_days[tier] : 0); + storage_tiers_retention_time_s[tier] = inicfg_get_duration_days_to_seconds( + &netdata_config, CONFIG_SECTION_DB, + dbengineconfig, new_dbengine_defaults ? storage_tiers_retention_time_s[tier] : 0); tiers_init[tier].disk_space_mb = (int) disk_space_mb; tiers_init[tier].tier = tier; - tiers_init[tier].retention_seconds = (size_t) (86400.0 * storage_tiers_retention_days[tier]); + tiers_init[tier].retention_seconds = (size_t) storage_tiers_retention_time_s[tier]; strncpyz(tiers_init[tier].path, dbenginepath, FILENAME_MAX); tiers_init[tier].ret = 0; diff --git a/src/database/contexts/api_v2_contexts_agents.c b/src/database/contexts/api_v2_contexts_agents.c index 48ef753eb73b11..4339f54640c589 100644 --- a/src/database/contexts/api_v2_contexts_agents.c +++ b/src/database/contexts/api_v2_contexts_agents.c @@ -5,6 +5,17 @@ void build_info_to_json_object(BUFFER *b); +static time_t round_retention(time_t retention_seconds) { + if(retention_seconds > 60 * 86400) + retention_seconds = HOWMANY(retention_seconds, 86400) * 86400; + else if(retention_seconds > 86400) + retention_seconds = HOWMANY(retention_seconds, 3600) * 3600; + else + retention_seconds = HOWMANY(retention_seconds, 60) * 60; + + return retention_seconds; +} + void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now_s, bool info, bool array) { if(!now_s) now_s = now_realtime_sec(); @@ -128,9 +139,9 @@ void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now buffer_json_add_array_item_object(wb); buffer_json_member_add_uint64(wb, "tier", tier); - char human_retention[128]; - duration_snprintf_time_t(human_retention, sizeof(human_retention), 
(stime_t)group_seconds); - buffer_json_member_add_string(wb, "granularity", human_retention); + char human_duration[128]; + duration_snprintf_time_t(human_duration, sizeof(human_duration), (stime_t)group_seconds); + buffer_json_member_add_string(wb, "granularity", human_duration); buffer_json_member_add_uint64(wb, "metrics", storage_engine_metrics(eng->seb, localhost->db[tier].si)); buffer_json_member_add_uint64(wb, "samples", storage_engine_samples(eng->seb, localhost->db[tier].si)); @@ -148,18 +159,10 @@ void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now buffer_json_member_add_time_t(wb, "to", now_s); buffer_json_member_add_time_t(wb, "retention", retention); - if(retention < 60) - duration_snprintf_time_t(human_retention, sizeof(human_retention), retention); - else if(retention < 24 * 60 * 60) { - int64_t rounded_retention_mins = duration_round_to_resolution(retention, 60); - duration_snprintf_mins(human_retention, sizeof(human_retention), rounded_retention_mins); - } - else { - int64_t rounded_retention_hours = duration_round_to_resolution(retention, 3600); - duration_snprintf_hours(human_retention, sizeof(human_retention), rounded_retention_hours); - } + duration_snprintf(human_duration, sizeof(human_duration), + round_retention(retention), "s", false); - buffer_json_member_add_string(wb, "retention_human", human_retention); + buffer_json_member_add_string(wb, "retention_human", human_duration); if(used || max) { // we have disk space information time_t time_retention = 0; @@ -169,17 +172,18 @@ void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now time_t space_retention = (time_t)((NETDATA_DOUBLE)(now_s - first_time_s) * 100.0 / percent); time_t actual_retention = MIN(space_retention, time_retention ? 
time_retention : space_retention); - duration_snprintf_hours(human_retention, sizeof(human_retention), - (int)duration_round_to_resolution(time_retention, 3600)); + duration_snprintf( + human_duration, sizeof(human_duration), + (int)time_retention, "s", false); buffer_json_member_add_time_t(wb, "requested_retention", time_retention); - buffer_json_member_add_string(wb, "requested_retention_human", human_retention); + buffer_json_member_add_string(wb, "requested_retention_human", human_duration); - duration_snprintf_hours(human_retention, sizeof(human_retention), - (int)duration_round_to_resolution(actual_retention, 3600)); + duration_snprintf(human_duration, sizeof(human_duration), + (int)round_retention(actual_retention), "s", false); buffer_json_member_add_time_t(wb, "expected_retention", actual_retention); - buffer_json_member_add_string(wb, "expected_retention_human", human_retention); + buffer_json_member_add_string(wb, "expected_retention_human", human_duration); } } buffer_json_object_close(wb); diff --git a/src/database/contexts/contexts-loading.c b/src/database/contexts/contexts-loading.c index 5c52c5fcdabfe0..a6b5aeb2867477 100644 --- a/src/database/contexts/contexts-loading.c +++ b/src/database/contexts/contexts-loading.c @@ -18,7 +18,7 @@ static void rrdinstance_load_dimension_callback(SQL_DIMENSION_DATA *sd, void *da UUIDMAP_ID id = uuidmap_create(sd->dim_id); time_t min_first_time_t = LONG_MAX, max_last_time_t = 0; - get_metric_retention_by_id(host, id, &min_first_time_t, &max_last_time_t); + get_metric_retention_by_id(host, id, &min_first_time_t, &max_last_time_t, NULL); if((!min_first_time_t || min_first_time_t == LONG_MAX) && !max_last_time_t) { uuidmap_free(id); th_zero_retention_metrics++; diff --git a/src/database/contexts/internal.h b/src/database/contexts/internal.h index 17a6cab11c8898..5e5794cea42c56 100644 --- a/src/database/contexts/internal.h +++ b/src/database/contexts/internal.h @@ -60,6 +60,7 @@ typedef enum __attribute__ 
((__packed__)) { RRD_FLAG_UPDATE_REASON_UNUSED = (1 << 22), // this context is not used anymore RRD_FLAG_UPDATE_REASON_DB_ROTATION = (1 << 23), // this context changed because of a db rotation + RRD_FLAG_NO_TIER0_RETENTION = (1 << 28), RRD_FLAG_MERGED_COLLECTED_RI_TO_RC = (1 << 29), // action to perform on an object @@ -475,7 +476,7 @@ void rrdcontext_update_from_collected_rrdinstance(RRDINSTANCE *ri); void rrdcontext_garbage_collect_single_host(RRDHOST *host, bool worker_jobs); -void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_time_t, time_t *max_last_time_t); +void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_time_t, time_t *max_last_time_t, bool *tier0_retention); void rrdcontext_delete_after_loading(RRDHOST *host, RRDCONTEXT *rc); void rrdcontext_initial_processing_after_loading(RRDCONTEXT *rc); diff --git a/src/database/contexts/metric.c b/src/database/contexts/metric.c index c1b424a32d6e69..c417a2191eb459 100644 --- a/src/database/contexts/metric.c +++ b/src/database/contexts/metric.c @@ -111,6 +111,7 @@ static void rrdmetric_delete_callback(const DICTIONARY_ITEM *item __maybe_unused static bool rrdmetric_conflict_callback(const DICTIONARY_ITEM *item __maybe_unused, void *old_value, void *new_value, void *rrdinstance __maybe_unused) { RRDMETRIC *rm = old_value; RRDMETRIC *rm_new = new_value; + rm_new->ri = rm->ri; internal_error(rm->id != rm_new->id, "RRDMETRIC: '%s' cannot change id to '%s'", @@ -131,9 +132,9 @@ static bool rrdmetric_conflict_callback(const DICTIONARY_ITEM *item __maybe_unus time_t new_first_time_s = 0; time_t new_last_time_s = 0; - if(rrdmetric_update_retention(rm)) { - new_first_time_s = rm->first_time_s; - new_last_time_s = rm->last_time_s; + if(rrdmetric_update_retention(rm_new)) { + new_first_time_s = rm_new->first_time_s; + new_last_time_s = rm_new->last_time_s; } internal_error(true, diff --git a/src/database/contexts/worker.c b/src/database/contexts/worker.c index 
3afb2a10156071..d9f66b41119dd3 100644 --- a/src/database/contexts/worker.c +++ b/src/database/contexts/worker.c @@ -2,6 +2,18 @@ #include "internal.h" +static struct { + bool enabled; + size_t db_rotations; + size_t instances_count; + size_t active_vs_archived_percentage; +} extreme_cardinality = { + .enabled = true, // this value is ignored - there is a dynamic condition to enable it + .db_rotations = 0, + .instances_count = 1000, + .active_vs_archived_percentage = 50, +}; + static uint64_t rrdcontext_get_next_version(RRDCONTEXT *rc); static bool check_if_cloud_version_changed_unsafe(RRDCONTEXT *rc, bool sending __maybe_unused); @@ -9,7 +21,7 @@ static bool check_if_cloud_version_changed_unsafe(RRDCONTEXT *rc, bool sending _ static void rrdcontext_delete_from_sql_unsafe(RRDCONTEXT *rc); static void rrdcontext_dequeue_from_post_processing(RRDCONTEXT *rc); -static void rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAGS reason, bool worker_jobs); +static bool rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAGS reason, bool worker_jobs); static void rrdcontext_garbage_collect_for_all_hosts(void); @@ -100,7 +112,10 @@ static void rrdhost_update_cached_retention(RRDHOST *host, time_t first_time_s, } void rrdcontext_recalculate_context_retention(RRDCONTEXT *rc, RRD_FLAGS reason, bool worker_jobs) { - rrdcontext_post_process_updates(rc, true, reason, worker_jobs); + bool forcefully_removed_instances = false; + do { + forcefully_removed_instances = rrdcontext_post_process_updates(rc, true, reason, worker_jobs); + } while(forcefully_removed_instances); } void rrdcontext_recalculate_host_retention(RRDHOST *host, RRD_FLAGS reason, bool worker_jobs) { @@ -137,7 +152,7 @@ static void rrdcontext_recalculate_retention_all_hosts(void) { // ---------------------------------------------------------------------------- // garbage collector -void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_time_t, time_t 
*max_last_time_t) { +void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_time_t, time_t *max_last_time_t, bool *tier0_retention) { *min_first_time_t = LONG_MAX; *max_last_time_t = 0; @@ -152,6 +167,9 @@ void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_ if (last_time_t > *max_last_time_t) *max_last_time_t = last_time_t; } + + if(tier == 0 && tier0_retention) + *tier0_retention = first_time_t || last_time_t; } } @@ -161,21 +179,24 @@ bool rrdmetric_update_retention(RRDMETRIC *rm) { if(rm->rrddim) { min_first_time_t = rrddim_first_entry_s(rm->rrddim); max_last_time_t = rrddim_last_entry_s(rm->rrddim); + rrd_flag_clear(rm, RRD_FLAG_NO_TIER0_RETENTION); } - else - get_metric_retention_by_id(rm->ri->rc->rrdhost, rm->uuid, &min_first_time_t, &max_last_time_t); + else { + bool tier0_retention; + get_metric_retention_by_id(rm->ri->rc->rrdhost, rm->uuid, &min_first_time_t, &max_last_time_t, &tier0_retention); - if((min_first_time_t == LONG_MAX || min_first_time_t == 0) && max_last_time_t == 0) - return false; + if(tier0_retention) + rrd_flag_clear(rm, RRD_FLAG_NO_TIER0_RETENTION); + else + rrd_flag_set(rm, RRD_FLAG_NO_TIER0_RETENTION); + } if(min_first_time_t == LONG_MAX) min_first_time_t = 0; if(min_first_time_t > max_last_time_t) { internal_error(true, "RRDMETRIC: retention of '%s' is flipped, first_time_t = %ld, last_time_t = %ld", string2str(rm->id), min_first_time_t, max_last_time_t); - time_t tmp = min_first_time_t; - min_first_time_t = max_last_time_t; - max_last_time_t = tmp; + SWAP(min_first_time_t, max_last_time_t); } // check if retention changed @@ -402,7 +423,7 @@ static void rrdinstance_post_process_updates(RRDINSTANCE *ri, bool force, RRD_FL worker_is_busy(WORKER_JOB_PP_INSTANCE); time_t min_first_time_t = LONG_MAX, max_last_time_t = 0; - size_t metrics_active = 0, metrics_deleted = 0; + size_t metrics_active = 0, metrics_deleted = 0, metrics_no_tier0 = 0; bool live_retention = true, 
currently_collected = false; if(dictionary_entries(ri->rrdmetrics) > 0) { RRDMETRIC *rm; @@ -418,6 +439,9 @@ static void rrdinstance_post_process_updates(RRDINSTANCE *ri, bool force, RRD_FL if(unlikely(!rrd_flag_check(rm, RRD_FLAG_LIVE_RETENTION))) live_retention = false; + if(unlikely(rrd_flag_check(rm, RRD_FLAG_NO_TIER0_RETENTION))) + metrics_no_tier0++; + if (unlikely((rrdmetric_should_be_deleted(rm)))) { metrics_deleted++; continue; @@ -437,6 +461,11 @@ static void rrdinstance_post_process_updates(RRDINSTANCE *ri, bool force, RRD_FL dfe_done(rm); } + if(metrics_no_tier0 && metrics_no_tier0 == metrics_active) + rrd_flag_set(ri, RRD_FLAG_NO_TIER0_RETENTION); + else + rrd_flag_clear(ri, RRD_FLAG_NO_TIER0_RETENTION); + if(unlikely(live_retention && !rrd_flag_check(ri, RRD_FLAG_LIVE_RETENTION))) rrd_flag_set(ri, RRD_FLAG_LIVE_RETENTION); else if(unlikely(!live_retention && rrd_flag_check(ri, RRD_FLAG_LIVE_RETENTION))) @@ -500,7 +529,98 @@ static void rrdinstance_post_process_updates(RRDINSTANCE *ri, bool force, RRD_FL rrd_flag_unset_updated(ri); } -static void rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAGS reason, bool worker_jobs) { +static bool rrdinstance_forcefully_clear_retention(RRDCONTEXT *rc, size_t count, const char *descr) { + if(!count) return false; + + RRDHOST *host = rc->rrdhost; + + time_t from_s = LONG_MAX; + time_t to_s = 0; + + size_t instances_deleted = 0; + size_t metrics_deleted = 0; + RRDINSTANCE *ri; + dfe_start_read(rc->rrdinstances, ri) { + if(!rrd_flag_check(ri, RRD_FLAG_NO_TIER0_RETENTION) || rrd_flag_is_collected(ri) || ri->rrdset) + continue; + + size_t metrics_cleared = 0; + RRDMETRIC *rm; + dfe_start_read(ri->rrdmetrics, rm) { + if(!rrd_flag_check(rm, RRD_FLAG_NO_TIER0_RETENTION) || rrd_flag_is_collected(rm) || rm->rrddim) + continue; + + rrdmetric_update_retention(rm); + + if(rm->first_time_s < from_s) + from_s = rm->first_time_s; + + if(rm->last_time_s > to_s) + to_s = rm->last_time_s; + + for (size_t tier = 0; 
tier < nd_profile.storage_tiers; tier++) { + STORAGE_ENGINE *eng = host->db[tier].eng; + eng->api.metric_retention_delete_by_id(host->db[tier].si, rm->uuid); + } + + metrics_cleared++; + metrics_deleted++; + rrdmetric_update_retention(rm); + rrdmetric_trigger_updates(rm, __FUNCTION__ ); + } + dfe_done(rm); + + if(metrics_cleared) { + rrdinstance_trigger_updates(ri, __FUNCTION__ ); + instances_deleted++; + + if(--count == 0) + break; + } + } + dfe_done(ri); + + if(metrics_deleted) { + char from_txt[128], to_txt[128]; + + if(!from_s || from_s == LONG_MAX) + snprintfz(from_txt, sizeof(from_txt), "%s", "NONE"); + else + rfc3339_datetime_ut(from_txt, sizeof(from_txt), from_s * USEC_PER_SEC, 0, true); + + if(!to_s) + snprintfz(to_txt, sizeof(to_txt), "%s", "NONE"); + else + rfc3339_datetime_ut(to_txt, sizeof(to_txt), to_s * USEC_PER_SEC, 0, true); + + ND_LOG_STACK lgs[] = { + ND_LOG_FIELD_TXT(NDF_MODULE, "extreme cardinality protection"), + ND_LOG_FIELD_STR(NDF_NIDL_NODE, rc->rrdhost->hostname), + ND_LOG_FIELD_STR(NDF_NIDL_CONTEXT, rc->id), + ND_LOG_FIELD_UUID(NDF_MESSAGE_ID, &extreme_cardinality_msgid), + ND_LOG_FIELD_END(), + }; + ND_LOG_STACK_PUSH(lgs); + + nd_log(NDLS_DAEMON, NDLP_NOTICE, + "EXTREME CARDINALITY PROTECTION: host '%s', context '%s', %s: " + "forcefully cleared the retention of %zu metrics and %zu instances, " + "having non-tier0 retention from %s to %s.", + rrdhost_hostname(rc->rrdhost), + string2str(rc->id), + descr, + metrics_deleted, instances_deleted, + from_txt, to_txt); + + return true; + } + + return false; +} + +static bool rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAGS reason, bool worker_jobs) { + bool ret = false; + if(reason != RRD_FLAG_NONE) rrd_flag_set_updated(rc, reason); @@ -511,7 +631,7 @@ static void rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAG size_t min_priority_not_collected = LONG_MAX; size_t min_priority = LONG_MAX; time_t min_first_time_t = LONG_MAX, max_last_time_t = 0; - size_t 
instances_active = 0, instances_deleted = 0; + size_t instances_active = 0, instances_deleted = 0, instances_no_tier0 = 0; bool live_retention = true, currently_collected = false, hidden = true; if(dictionary_entries(rc->rrdinstances) > 0) { RRDINSTANCE *ri; @@ -535,6 +655,9 @@ static void rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAG continue; } + if(unlikely(rrd_flag_check(ri, RRD_FLAG_NO_TIER0_RETENTION))) + instances_no_tier0++; + bool ri_collected = rrd_flag_is_collected(ri); if(ri_collected && !rrd_flag_check(ri, RRD_FLAG_MERGED_COLLECTED_RI_TO_RC)) { @@ -571,6 +694,25 @@ static void rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAG } dfe_done(ri); + if(extreme_cardinality.enabled && + extreme_cardinality.db_rotations && + instances_no_tier0 >= extreme_cardinality.instances_count) { + size_t percent = (100 * instances_no_tier0 / instances_active); + if(percent >= extreme_cardinality.active_vs_archived_percentage) { + size_t to_keep = extreme_cardinality.active_vs_archived_percentage * instances_active / 100; + to_keep = MAX(to_keep, extreme_cardinality.instances_count); + size_t to_remove = instances_no_tier0 > to_keep ? 
instances_no_tier0 - to_keep : 0; + + if(to_remove) { + char buf[256]; + snprintfz(buf, sizeof(buf), + "total active instances %zu, not in tier0 %zu, ephemerality %zu%%", + instances_active, instances_no_tier0, percent); + ret = rrdinstance_forcefully_clear_retention(rc, to_remove, buf); + } + } + } + if(min_priority_collected != LONG_MAX) // use the collected priority min_priority = min_priority_collected; @@ -669,6 +811,8 @@ static void rrdcontext_post_process_updates(RRDCONTEXT *rc, bool force, RRD_FLAG rrd_flag_unset_updated(rc); rrdcontext_unlock(rc); + + return ret; } void rrdcontext_queue_for_post_processing(RRDCONTEXT *rc, const char *function __maybe_unused, RRD_FLAGS flags __maybe_unused) { @@ -1016,6 +1160,19 @@ void *rrdcontext_main(void *ptr) { heartbeat_t hb; heartbeat_init(&hb, RRDCONTEXT_WORKER_THREAD_HEARTBEAT_USEC); + extreme_cardinality.enabled = inicfg_get_boolean( + &netdata_config, CONFIG_SECTION_DB, "extreme cardinality protection", + nd_profile.storage_tiers > 1 && default_rrd_memory_mode == RRD_DB_MODE_DBENGINE + ); + + extreme_cardinality.instances_count = inicfg_get_number_range( + &netdata_config, CONFIG_SECTION_DB, "extreme cardinality keep instances", + (long long)extreme_cardinality.instances_count, 1, 1000000); + + extreme_cardinality.active_vs_archived_percentage = inicfg_get_number_range( + &netdata_config, CONFIG_SECTION_DB, "extreme cardinality min ephemerality", + (long long)extreme_cardinality.active_vs_archived_percentage, 0, 100); + while (service_running(SERVICE_CONTEXT)) { worker_is_idle(); heartbeat_next(&hb); @@ -1025,6 +1182,7 @@ void *rrdcontext_main(void *ptr) { usec_t now_ut = now_realtime_usec(); if(rrdcontext_next_db_rotation_ut && now_ut > rrdcontext_next_db_rotation_ut) { + extreme_cardinality.db_rotations++; rrdcontext_recalculate_retention_all_hosts(); rrdcontext_garbage_collect_for_all_hosts(); rrdcontext_next_db_rotation_ut = 0; diff --git a/src/database/engine/datafile.h b/src/database/engine/datafile.h index 
index 54857947ec5214..69bea06b04e585 100644
--- a/src/database/engine/datafile.h
+++ b/src/database/engine/datafile.h
@@ -20,7 +20,7 @@ struct rrdengine_instance;
 #error MIN_DATAFILE_SIZE > MAX_DATAFILE_SIZE
 #endif
 
-#define MIN_DATAFILE_SIZE (4LU * 1024LU * 1024LU)
+#define MIN_DATAFILE_SIZE (512LU * 1024LU)
 
 #define MAX_DATAFILES (65536 * 4) /* Supports up to 64TiB for now */
 #define TARGET_DATAFILES (100)
diff --git a/src/database/engine/metric.c b/src/database/engine/metric.c
index 6ba5cc833fd116..c0765b2c8f34a5 100644
--- a/src/database/engine/metric.c
+++ b/src/database/engine/metric.c
@@ -459,6 +459,12 @@ ALWAYS_INLINE time_t mrg_metric_get_first_time_s(MRG *mrg __maybe_unused, METRIC
     return mrg_metric_get_first_time_s_smart(mrg, metric);
 }
 
+void mrg_metric_clear_retention(MRG *mrg __maybe_unused, METRIC *metric) {
+    __atomic_store_n(&metric->first_time_s, 0, __ATOMIC_RELAXED);
+    __atomic_store_n(&metric->latest_time_s_clean, 0, __ATOMIC_RELAXED);
+    __atomic_store_n(&metric->latest_time_s_hot, 0, __ATOMIC_RELAXED);
+}
+
 ALWAYS_INLINE_HOT void mrg_metric_get_retention(MRG *mrg __maybe_unused, METRIC *metric, time_t *first_time_s, time_t *last_time_s, uint32_t *update_every_s) {
     time_t clean = __atomic_load_n(&metric->latest_time_s_clean, __ATOMIC_RELAXED);
     time_t hot = __atomic_load_n(&metric->latest_time_s_hot, __ATOMIC_RELAXED);
@@ -490,7 +496,7 @@ ALWAYS_INLINE bool mrg_metric_set_clean_latest_time_s(MRG *mrg __maybe_unused, M
 }
 
 // returns true when metric still has retention
-ALWAYS_INLINE bool mrg_metric_zero_disk_retention(MRG *mrg __maybe_unused, METRIC *metric) {
+ALWAYS_INLINE bool mrg_metric_has_zero_disk_retention(MRG *mrg __maybe_unused, METRIC *metric) {
     Word_t section = mrg_metric_section(mrg, metric);
     bool do_again = false;
     size_t countdown = 5;
diff --git a/src/database/engine/metric.h b/src/database/engine/metric.h
index 5ab2aa9e18fb57..a1dfa6504d60be 100644
--- a/src/database/engine/metric.h
+++ b/src/database/engine/metric.h
@@ -53,7 +53,7 @@ bool mrg_metric_release_and_delete(MRG *mrg, METRIC *metric);
 Word_t mrg_metric_id(MRG *mrg, METRIC *metric);
 nd_uuid_t *mrg_metric_uuid(MRG *mrg, METRIC *metric);
-UUIDMAP_ID mrg_metric_uuidmap_id_dup(MRG *mrg __maybe_unused, METRIC *metric);
+UUIDMAP_ID mrg_metric_uuidmap_id_dup(MRG *mrg, METRIC *metric);
 Word_t mrg_metric_section(MRG *mrg, METRIC *metric);
 
 bool mrg_metric_set_first_time_s(MRG *mrg, METRIC *metric, time_t first_time_s);
@@ -71,7 +71,8 @@ uint32_t mrg_metric_get_update_every_s(MRG *mrg, METRIC *metric);
 void mrg_metric_expand_retention(MRG *mrg, METRIC *metric, time_t first_time_s, time_t last_time_s, uint32_t update_every_s);
 void mrg_metric_get_retention(MRG *mrg, METRIC *metric, time_t *first_time_s, time_t *last_time_s, uint32_t *update_every_s);
-bool mrg_metric_zero_disk_retention(MRG *mrg __maybe_unused, METRIC *metric);
+bool mrg_metric_has_zero_disk_retention(MRG *mrg, METRIC *metric);
+void mrg_metric_clear_retention(MRG *mrg, METRIC *metric);
 
 #ifdef NETDATA_INTERNAL_CHECKS
 bool mrg_metric_set_writer(MRG *mrg, METRIC *metric);
diff --git a/src/database/engine/pdc.c b/src/database/engine/pdc.c
index 796b1f99a72701..74ec62a3bb2df7 100644
--- a/src/database/engine/pdc.c
+++ b/src/database/engine/pdc.c
@@ -895,7 +895,7 @@ static ALWAYS_INLINE struct page_details *epdl_get_pd_load_link_list_from_metric
     return pd_list;
 }
 
-static void epdl_extent_loading_error_log(struct rrdengine_instance *ctx, EPDL *epdl, struct rrdeng_extent_page_descr *descr, const char *msg) {
+static void epdl_extent_loading_error_log(struct rrdengine_instance *ctx, EPDL *epdl, struct rrdeng_extent_page_descr *descr, const char *msg, ND_LOG_FIELD_PRIORITY priority) {
     char uuid[UUID_STR_LEN] = "";
     time_t start_time_s = 0;
     time_t end_time_s = 0;
@@ -950,7 +950,7 @@ static void epdl_extent_loading_error_log(struct rrdengine_instance *ctx, EPDL *
     log_date(end_time_str, LOG_DATE_LENGTH, end_time_s);
 
     nd_log_limit_static_global_var(erl, 1, 0);
-    nd_log_limit(&erl, NDLS_DAEMON, NDLP_ERR,
+    nd_log_limit(&erl, NDLS_DAEMON, priority,
                  "DBENGINE: error while reading extent from datafile %u of tier %d, at offset %" PRIu64 " (%u bytes) "
                  "%s from %ld (%s) to %ld (%s) %s%s: "
                  "%s",
@@ -1011,7 +1011,7 @@ static bool epdl_populate_pages_from_extent_data(
        (payload_length != trailer_offset - payload_offset) ||
        (data_length != payload_offset + payload_length + sizeof(*trailer))
        ) {
-        epdl_extent_loading_error_log(ctx, epdl, NULL, "header is INVALID");
+        epdl_extent_loading_error_log(ctx, epdl, NULL, "header is INVALID", NDLP_ERR);
         return false;
     }
 
@@ -1020,7 +1020,7 @@ static bool epdl_populate_pages_from_extent_data(
     if (unlikely(crc32cmp(trailer->checksum, crc))) {
         ctx_io_error(ctx);
         have_read_error = true;
-        epdl_extent_loading_error_log(ctx, epdl, NULL, "CRC32 checksum FAILED");
+        epdl_extent_loading_error_log(ctx, epdl, NULL, "CRC32 checksum FAILED", NDLP_ERR);
     }
 
     if(worker)
@@ -1081,7 +1081,7 @@ static bool epdl_populate_pages_from_extent_data(
         if(!page_length || !start_time_s) {
             char log[200 + 1];
             snprintfz(log, sizeof(log) - 1, "page %u (out of %u) is EMPTY", i, count);
-            epdl_extent_loading_error_log(ctx, epdl, &header->descr[i], log);
+            epdl_extent_loading_error_log(ctx, epdl, &header->descr[i], log, NDLP_ERR);
             continue;
         }
 
@@ -1090,7 +1090,7 @@ static bool epdl_populate_pages_from_extent_data(
         if(!metric) {
             char log[200 + 1];
             snprintfz(log, sizeof(log) - 1, "page %u (out of %u) has unknown UUID", i, count);
-            epdl_extent_loading_error_log(ctx, epdl, &header->descr[i], log);
+            epdl_extent_loading_error_log(ctx, epdl, &header->descr[i], log, NDLP_DEBUG);
             continue;
         }
         mrg_metric_release(main_mrg, metric);
@@ -1126,7 +1126,7 @@ static bool epdl_populate_pages_from_extent_data(
             snprintfz(log, sizeof(log) - 1,
                       "page %u (out of %u) offset %u + page length %zu, "
                       "exceeds the uncompressed buffer size %u",
                       i, count, page_offset, vd.page_length, uncompressed_payload_length);
-            epdl_extent_loading_error_log(ctx, epdl, &header->descr[i], log);
+            epdl_extent_loading_error_log(ctx, epdl, &header->descr[i], log, NDLP_ERR);
 
             pgd = PGD_EMPTY;
             stats_load_invalid_page++;
diff --git a/src/database/engine/rrdengine.c b/src/database/engine/rrdengine.c
index a2944e22034455..19dcea484fe088 100644
--- a/src/database/engine/rrdengine.c
+++ b/src/database/engine/rrdengine.c
@@ -1181,7 +1181,7 @@ static void update_metrics_first_time_s(struct rrdengine_instance *ctx, struct r
         zero_disk_retention++;
 
         // there is no retention for this metric
-        bool has_retention = mrg_metric_zero_disk_retention(main_mrg, uuid_first_t_entry->metric);
+        bool has_retention = mrg_metric_has_zero_disk_retention(main_mrg, uuid_first_t_entry->metric);
         if (!has_retention) {
             time_t first_time_s = mrg_metric_get_first_time_s(main_mrg, uuid_first_t_entry->metric);
             time_t last_time_s = mrg_metric_get_latest_time_s(main_mrg, uuid_first_t_entry->metric);
diff --git a/src/database/engine/rrdengineapi.c b/src/database/engine/rrdengineapi.c
index a1dd7408e68d1f..795b47394e793c 100755
--- a/src/database/engine/rrdengineapi.c
+++ b/src/database/engine/rrdengineapi.c
@@ -1005,6 +1005,21 @@ bool rrdeng_metric_retention_by_id(STORAGE_INSTANCE *si, UUIDMAP_ID id, time_t *
     return true;
 }
 
+void rrdeng_metric_retention_delete_by_id(STORAGE_INSTANCE *si, UUIDMAP_ID id) {
+    struct rrdengine_instance *ctx = (struct rrdengine_instance *)si;
+    if (unlikely(!ctx)) {
+        netdata_log_error("DBENGINE: invalid STORAGE INSTANCE to %s()", __FUNCTION__);
+        return;
+    }
+
+    METRIC *metric = mrg_metric_get_and_acquire_by_id(main_mrg, id, (Word_t)ctx);
+    if (unlikely(!metric))
+        return;
+
+    mrg_metric_clear_retention(main_mrg, metric);
+    mrg_metric_release(main_mrg, metric);
+}
+
 uint64_t rrdeng_disk_space_max(STORAGE_INSTANCE *si) {
     struct rrdengine_instance *ctx = (struct rrdengine_instance *)si;
     return ctx->config.max_disk_space;
diff --git a/src/database/engine/rrdengineapi.h b/src/database/engine/rrdengineapi.h
index 1f7dabaeb2ee25..0dbbc115cecf81 100644
--- a/src/database/engine/rrdengineapi.h
+++ b/src/database/engine/rrdengineapi.h
@@ -6,7 +6,7 @@
 #include "rrdengine.h"
 
 #define RRDENG_MIN_PAGE_CACHE_SIZE_MB (8)
-#define RRDENG_MIN_DISK_SPACE_MB (256)
+#define RRDENG_MIN_DISK_SPACE_MB (25)
 #define RRDENG_DEFAULT_TIER_DISK_SPACE_MB (1024)
 
 #define RRDENG_NR_STATS (38)
@@ -78,6 +78,7 @@ void rrdeng_quiesce(struct rrdengine_instance *ctx);
 
 bool rrdeng_metric_retention_by_id(STORAGE_INSTANCE *si, UUIDMAP_ID id, time_t *first_entry_s, time_t *last_entry_s);
 bool rrdeng_metric_retention_by_uuid(STORAGE_INSTANCE *si, nd_uuid_t *dim_uuid, time_t *first_entry_s, time_t *last_entry_s);
+void rrdeng_metric_retention_delete_by_id(STORAGE_INSTANCE *si, UUIDMAP_ID id);
 
 extern STORAGE_METRICS_GROUP *rrdeng_metrics_group_get(STORAGE_INSTANCE *si, nd_uuid_t *uuid);
 extern void rrdeng_metrics_group_release(STORAGE_INSTANCE *si, STORAGE_METRICS_GROUP *smg);
diff --git a/src/database/ram/rrddim_mem.c b/src/database/ram/rrddim_mem.c
index 2c4bc3385abacc..f53b6efea4d5e6 100644
--- a/src/database/ram/rrddim_mem.c
+++ b/src/database/ram/rrddim_mem.c
@@ -155,6 +155,10 @@ bool rrddim_metric_retention_by_id(STORAGE_INSTANCE *si __maybe_unused, UUIDMAP_
     return true;
 }
 
+void rrddim_retention_delete_by_id(STORAGE_INSTANCE *si __maybe_unused, UUIDMAP_ID id __maybe_unused) {
+    ;
+}
+
 void rrddim_store_metric_change_collection_frequency(STORAGE_COLLECT_HANDLE *sch, int update_every) {
     struct mem_collect_handle *ch = (struct mem_collect_handle *)sch;
     struct mem_metric_handle *mh = (struct mem_metric_handle *)ch->smh;
diff --git a/src/database/ram/rrddim_mem.h b/src/database/ram/rrddim_mem.h
index c8a941e39d47aa..c01db97fb4d65b 100644
--- a/src/database/ram/rrddim_mem.h
+++ b/src/database/ram/rrddim_mem.h
@@ -30,6 +30,7 @@ void rrddim_metric_release(STORAGE_METRIC_HANDLE *smh);
 
 bool rrddim_metric_retention_by_id(STORAGE_INSTANCE *si, UUIDMAP_ID id, time_t *first_entry_s, time_t *last_entry_s);
 bool rrddim_metric_retention_by_uuid(STORAGE_INSTANCE *si, nd_uuid_t *uuid, time_t *first_entry_s, time_t *last_entry_s);
+void rrddim_retention_delete_by_id(STORAGE_INSTANCE *si, UUIDMAP_ID id);
 
 STORAGE_METRICS_GROUP *rrddim_metrics_group_get(STORAGE_INSTANCE *si, nd_uuid_t *uuid);
 void rrddim_metrics_group_release(STORAGE_INSTANCE *si, STORAGE_METRICS_GROUP *smg);
diff --git a/src/database/storage-engine.c b/src/database/storage-engine.c
index c0501e50fde1dc..e34a32f863c1d6 100644
--- a/src/database/storage-engine.c
+++ b/src/database/storage-engine.c
@@ -19,6 +19,7 @@ static STORAGE_ENGINE engines[] = {
             .metric_release = rrddim_metric_release,
             .metric_retention_by_id = rrddim_metric_retention_by_id,
             .metric_retention_by_uuid = rrddim_metric_retention_by_uuid,
+            .metric_retention_delete_by_id = rrddim_retention_delete_by_id,
         }
     },
     {
@@ -33,6 +34,7 @@ static STORAGE_ENGINE engines[] = {
             .metric_release = rrddim_metric_release,
             .metric_retention_by_id = rrddim_metric_retention_by_id,
             .metric_retention_by_uuid = rrddim_metric_retention_by_uuid,
+            .metric_retention_delete_by_id = rrddim_retention_delete_by_id,
         }
     },
     {
@@ -47,6 +49,7 @@ static STORAGE_ENGINE engines[] = {
             .metric_release = rrddim_metric_release,
             .metric_retention_by_id = rrddim_metric_retention_by_id,
             .metric_retention_by_uuid = rrddim_metric_retention_by_uuid,
+            .metric_retention_delete_by_id = rrddim_retention_delete_by_id,
         }
     },
 #ifdef ENABLE_DBENGINE
@@ -62,6 +65,7 @@ static STORAGE_ENGINE engines[] = {
             .metric_release = rrdeng_metric_release,
             .metric_retention_by_id = rrdeng_metric_retention_by_id,
             .metric_retention_by_uuid = rrdeng_metric_retention_by_uuid,
+            .metric_retention_delete_by_id = rrdeng_metric_retention_delete_by_id,
         }
     },
 #endif
diff --git a/src/database/storage-engine.h b/src/database/storage-engine.h
index 2d9244fb0b853d..e4113285cd7663 100644
--- a/src/database/storage-engine.h
+++ b/src/database/storage-engine.h
@@ -68,6 +68,7 @@ typedef struct storage_engine_api {
     STORAGE_METRIC_HANDLE *(*metric_dup)(STORAGE_METRIC_HANDLE *);
     bool (*metric_retention_by_id)(STORAGE_INSTANCE *si, UUIDMAP_ID id, time_t *first_entry_s, time_t *last_entry_s);
     bool (*metric_retention_by_uuid)(STORAGE_INSTANCE *si, nd_uuid_t *uuid, time_t *first_entry_s, time_t *last_entry_s);
+    void (*metric_retention_delete_by_id)(STORAGE_INSTANCE *si, UUIDMAP_ID id);
 } STORAGE_ENGINE_API;
 
 typedef struct storage {
diff --git a/src/libnetdata/inicfg/inicfg.h b/src/libnetdata/inicfg/inicfg.h
index be9035e12b3e5d..ec9a71f8504496 100644
--- a/src/libnetdata/inicfg/inicfg.h
+++ b/src/libnetdata/inicfg/inicfg.h
@@ -208,7 +208,6 @@ msec_t inicfg_set_duration_ms(struct config *root, const char *section, const ch
 time_t inicfg_get_duration_seconds(struct config *root, const char *section, const char *name, time_t default_value);
 time_t inicfg_set_duration_seconds(struct config *root, const char *section, const char *name, time_t value);
-unsigned inicfg_get_duration_days(struct config *root, const char *section, const char *name, unsigned default_value);
-unsigned inicfg_set_duration_days(struct config *root, const char *section, const char *name, unsigned value);
+time_t inicfg_get_duration_days_to_seconds(struct config *root, const char *section, const char *name, unsigned default_value_seconds);
 
 #endif // LIBNETDATA_INICFG_H
diff --git a/src/libnetdata/inicfg/inicfg_api.c b/src/libnetdata/inicfg/inicfg_api.c
index bf3cbc2313588d..ae4740d94f2ba0 100644
--- a/src/libnetdata/inicfg/inicfg_api.c
+++ b/src/libnetdata/inicfg/inicfg_api.c
@@ -166,13 +166,13 @@ msec_t inicfg_set_duration_ms(struct config *root, const char *section, const ch
     return value;
 }
 
-static STRING *reformat_duration_days(STRING *value) {
+static STRING *reformat_duration_days_to_seconds(STRING *value) {
     int64_t result = 0;
-    if(!duration_parse_days(string2str(value), &result))
+    if(!duration_parse(string2str(value), &result, "d", "s"))
         return value;
 
     char buf[128];
-    if(duration_snprintf_days(buf, sizeof(buf), result) > 0 && string_strcmp(value, buf) != 0) {
+    if(duration_snprintf(buf, sizeof(buf), result, "s", false) > 0 && string_strcmp(value, buf) != 0) {
         string_freez(value);
         return string_strdupz(buf);
     }
@@ -180,34 +180,30 @@ static STRING *reformat_duration_days(STRING *value) {
     return value;
 }
 
-unsigned inicfg_get_duration_days(struct config *root, const char *section, const char *name, unsigned default_value) {
+time_t inicfg_get_duration_days_to_seconds(struct config *root, const char *section, const char *name, unsigned default_value_seconds) {
     char default_str[128];
-    duration_snprintf_days(default_str, sizeof(default_str), (int)default_value);
+    duration_snprintf(default_str, sizeof(default_str), (int)default_value_seconds, "s", false);
 
     struct config_option *opt = inicfg_get_raw_value(
-        root, section, name, default_str, CONFIG_VALUE_TYPE_DURATION_IN_DAYS, reformat_duration_days);
+        root, section, name, default_str,
+        CONFIG_VALUE_TYPE_DURATION_IN_DAYS_TO_SECONDS,
+        reformat_duration_days_to_seconds);
+
     if(!opt)
-        return default_value;
+        return default_value_seconds;
 
     const char *s = string2str(opt->value);
     int64_t result = 0;
-    if(!duration_parse_days(s, &result)) {
-        inicfg_set_raw_value(root, section, name, default_str, CONFIG_VALUE_TYPE_DURATION_IN_DAYS);
+    if(!duration_parse(s, &result, "d", "s")) {
+        inicfg_set_raw_value(root, section, name, default_str, CONFIG_VALUE_TYPE_DURATION_IN_DAYS_TO_SECONDS);
         netdata_log_error("config option '[%s].%s = %s' is configured with an invalid duration", section, name, s);
-        return default_value;
+        return default_value_seconds;
     }
 
     return (unsigned)ABS(result);
 }
 
-unsigned inicfg_set_duration_days(struct config *root, const char *section, const char *name, unsigned value) {
-    char str[128];
-    duration_snprintf_days(str, sizeof(str), value);
-    inicfg_set_raw_value(root, section, name, str, CONFIG_VALUE_TYPE_DURATION_IN_DAYS);
-    return value;
-}
-
 long long inicfg_get_number(struct config *root, const char *section, const char *name, long long value) {
     char buffer[100];
     sprintf(buffer, "%lld", value);
diff --git a/src/libnetdata/inicfg/inicfg_conf_file.c b/src/libnetdata/inicfg/inicfg_conf_file.c
index 3a552071213dd7..c56371f8f21734 100644
--- a/src/libnetdata/inicfg/inicfg_conf_file.c
+++ b/src/libnetdata/inicfg/inicfg_conf_file.c
@@ -19,7 +19,7 @@ ENUM_STR_MAP_DEFINE(CONFIG_VALUE_TYPES) = {
     { .id = CONFIG_VALUE_TYPE_BOOLEAN_ONDEMAND, .name ="yes, no, or auto", },
     { .id = CONFIG_VALUE_TYPE_DURATION_IN_SECS, .name ="duration (seconds)", },
     { .id = CONFIG_VALUE_TYPE_DURATION_IN_MS, .name ="duration (ms)", },
-    { .id = CONFIG_VALUE_TYPE_DURATION_IN_DAYS, .name ="duration (days)", },
+    { .id = CONFIG_VALUE_TYPE_DURATION_IN_DAYS_TO_SECONDS, .name ="duration (days)", },
     { .id = CONFIG_VALUE_TYPE_SIZE_IN_BYTES, .name ="size (bytes)", },
     { .id = CONFIG_VALUE_TYPE_SIZE_IN_MB, .name ="size (MiB)", },
 };
diff --git a/src/libnetdata/inicfg/inicfg_internals.h b/src/libnetdata/inicfg/inicfg_internals.h
index 9f3a4b397bbff2..fd08a7aa8c7370 100644
--- a/src/libnetdata/inicfg/inicfg_internals.h
+++ b/src/libnetdata/inicfg/inicfg_internals.h
@@ -22,7 +22,7 @@ typedef enum __attribute__((packed)) {
     CONFIG_VALUE_TYPE_BOOLEAN_ONDEMAND,
     CONFIG_VALUE_TYPE_DURATION_IN_SECS,
     CONFIG_VALUE_TYPE_DURATION_IN_MS,
-    CONFIG_VALUE_TYPE_DURATION_IN_DAYS,
+    CONFIG_VALUE_TYPE_DURATION_IN_DAYS_TO_SECONDS,
     CONFIG_VALUE_TYPE_SIZE_IN_BYTES,
     CONFIG_VALUE_TYPE_SIZE_IN_MB,
 } CONFIG_VALUE_TYPES;
diff --git a/src/libnetdata/parsers/duration.c b/src/libnetdata/parsers/duration.c
index 16dc5170c37e83..9da36f28bb0f06 100644
--- a/src/libnetdata/parsers/duration.c
+++ b/src/libnetdata/parsers/duration.c
@@ -82,7 +82,7 @@ inline int64_t duration_round_to_resolution(int64_t value, int64_t resolution) {
 // -------------------------------------------------------------------------------------------------------------------
 // parse a duration string
 
-bool duration_parse(const char *duration, int64_t *result, const char *default_unit) {
+bool duration_parse(const char *duration, int64_t *result, const char *default_unit, const char *output_unit) {
     if (!duration || !*duration) {
         *result = 0;
         return false;
@@ -94,6 +94,12 @@ bool duration_parse(const char *duration, int64_t *result, const char *default_u
         return false;
     }
 
+    const struct duration_unit *du_out = duration_find_unit(output_unit);
+    if(!du_out) {
+        *result = 0;
+        return false;
+    }
+
    int64_t sign = 1;
     const char *s = duration;
     while (isspace((uint8_t)*s)) s++;
@@ -155,10 +161,16 @@ bool duration_parse(const char *duration, int64_t *result, const char *default_u
 
     v *= sign;
 
-    if(du_def->multiplier == 1)
+    // Convert the final value from nanoseconds to the desired output unit
+    // and apply appropriate rounding
+    if(du_out->multiplier == 1)
         *result = v;
-    else
-        *result = duration_round_to_resolution(v, du_def->multiplier);
+    else {
+        // First convert to the output unit
+        NETDATA_DOUBLE converted = (NETDATA_DOUBLE)v / (NETDATA_DOUBLE)du_out->multiplier;
+        // Then round to nearest integer in the output unit
+        *result = (int64_t)round(converted);
+    }
 
     return true;
 }
diff --git a/src/libnetdata/parsers/duration.h b/src/libnetdata/parsers/duration.h
index b95da5d2f5b03b..140bf54090c210 100644
--- a/src/libnetdata/parsers/duration.h
+++ b/src/libnetdata/parsers/duration.h
@@ -8,14 +8,14 @@ int64_t duration_round_to_resolution(int64_t value, int64_t resolution);
 
 // duration (string to number)
 
-bool duration_parse(const char *duration, int64_t *result, const char *default_unit);
-#define duration_parse_nsec_t(duration, ns_ptr) duration_parse(duration, ns_ptr, "ns")
-#define duration_parse_usec_t(duration, us_ptr) duration_parse(duration, us_ptr, "us")
-#define duration_parse_msec_t(duration, ms_ptr) duration_parse(duration, ms_ptr, "ms")
-#define duration_parse_time_t(duration, secs_ptr) duration_parse(duration, secs_ptr, "s")
-#define duration_parse_mins(duration, mins_ptr) duration_parse(duration, mins_ptr, "m")
-#define duration_parse_hours(duration, hours_ptr) duration_parse(duration, hours_ptr, "h")
-#define duration_parse_days(duration, days_ptr) duration_parse(duration, days_ptr, "d")
+bool duration_parse(const char *duration, int64_t *result, const char *default_unit, const char *output_unit);
+#define duration_parse_nsec_t(duration, ns_ptr) duration_parse(duration, ns_ptr, "ns", "ns")
+#define duration_parse_usec_t(duration, us_ptr) duration_parse(duration, us_ptr, "us", "us")
+#define duration_parse_msec_t(duration, ms_ptr) duration_parse(duration, ms_ptr, "ms", "ms")
+#define duration_parse_time_t(duration, secs_ptr) duration_parse(duration, secs_ptr, "s", "s")
+#define duration_parse_mins(duration, mins_ptr) duration_parse(duration, mins_ptr, "m", "m")
+#define duration_parse_hours(duration, hours_ptr) duration_parse(duration, hours_ptr, "h", "h")
+#define duration_parse_days(duration, days_ptr) duration_parse(duration, days_ptr, "d", "d")
 
 // duration (number to string)
 ssize_t duration_snprintf(char *dst, size_t dst_size, int64_t value, const char *unit, bool add_spaces);
diff --git a/src/libnetdata/uuid/uuid.h b/src/libnetdata/uuid/uuid.h
index c0cee1c0d9305f..c21b719f96d1c6 100644
--- a/src/libnetdata/uuid/uuid.h
+++ b/src/libnetdata/uuid/uuid.h
@@ -36,6 +36,7 @@ ND_UUID_DEFINE(sensors_state_transition_msgid, 0x8d, 0xda, 0xf5, 0xba, 0x33, 0xa
 ND_UUID_DEFINE(log_flood_protection_msgid, 0xec, 0x87, 0xa5, 0x61, 0x20, 0xd5, 0x43, 0x1b, 0xac, 0xe5, 0x1e, 0x2f, 0xb8, 0xbb, 0xa2, 0x43);
 ND_UUID_DEFINE(netdata_startup_msgid, 0x1e, 0x60, 0x61, 0xa9, 0xfb, 0xd4, 0x45, 0x01, 0xb3, 0xcc, 0xc3, 0x68, 0x11, 0x9f, 0x2b, 0x69);
 ND_UUID_DEFINE(aclk_connection_msgid, 0xac, 0xb3, 0x3c, 0xb9, 0x57, 0x78, 0x47, 0x6b, 0xaa, 0xc7, 0x02, 0xeb, 0x7e, 0x4e, 0x15, 0x1d);
+ND_UUID_DEFINE(extreme_cardinality_msgid, 0xd1, 0xf5, 0x96, 0x06, 0xdd, 0x4d, 0x41, 0xe3, 0xb2, 0x17, 0xa0, 0xcf, 0xca, 0xe8, 0xe6, 0x32);
 
 ND_UUID UUID_generate_from_hash(const void *payload, size_t payload_len);
diff --git a/src/registry/registry_init.c b/src/registry/registry_init.c
index 20e1a90c8bd00a..1df64dbf9c0a7c 100644
--- a/src/registry/registry_init.c
+++ b/src/registry/registry_init.c
@@ -93,7 +93,8 @@ int registry_init(void) {
     // configuration options
     registry.save_registry_every_entries = (unsigned long long)inicfg_get_number(&netdata_config, CONFIG_SECTION_REGISTRY, "registry save db every new entries", 1000000);
-    registry.persons_expiration = inicfg_get_duration_days(&netdata_config, CONFIG_SECTION_REGISTRY, "registry expire idle persons", 365) * 86400;
+    registry.persons_expiration = inicfg_get_duration_days_to_seconds(
+        &netdata_config, CONFIG_SECTION_REGISTRY, "registry expire idle persons", 365 * 86400);
     registry.registry_domain = inicfg_get(&netdata_config, CONFIG_SECTION_REGISTRY, "registry domain", "");
     registry.registry_to_announce = inicfg_get(&netdata_config, CONFIG_SECTION_REGISTRY, "registry to announce", "https://registry.my-netdata.io");
     registry.hostname = inicfg_get(&netdata_config, CONFIG_SECTION_REGISTRY, "registry hostname", netdata_configured_hostname);