protection against extreme cardinality (netdata#19486)
* protection against extreme cardinality

* do not cleanup everything, but only what is required to reach 50%

* check for collection flags and settings to prevent deletion

* 1000 to 10000

* 1000 old instances

* log extreme cardinality cleanup

* add configuration for extreme cardinality

* add documentation about extreme cardinality protection

* allow retention to be configured in durations less than a day

* min db size 32MiB

* extreme cardinality working

* fixed cleanup

* check for extreme cardinality on rotation, not on startup

* added option to enable/disable extreme cardinality protection

* enable it by default

* make unknown UUID errors a debug log

* extreme cardinality logs more information about the current status

* enable protection conditions

* never delete more than requested
ktsaou authored Feb 6, 2025
1 parent 95ffceb commit 8fa7b2f
Showing 27 changed files with 412 additions and 89 deletions.
114 changes: 114 additions & 0 deletions docs/extreme-cardinality-protection.md
@@ -0,0 +1,114 @@
# Extreme Cardinality Protection in Netdata

Netdata’s tiered storage is designed to efficiently retain metric data and metadata for long periods. However, when extreme cardinality occurs, often unintentionally through misconfigurations or inadvertent practices (e.g., spawning many short-lived Docker containers or using unbounded label values), the long-term retention of metadata can lead to excessive resource consumption.

To protect against extreme cardinality, Netdata includes an automated protection mechanism. This document explains **why** this protection is needed, **how** it works, **how** to configure it, and **how** to verify its operation.

## Why Extreme Cardinality Protection is Needed

Extreme cardinality refers to the explosion in the number of unique time series that occurs when metrics are combined with a wide range of labels or dimensions. In modern observability platforms like Netdata, metrics are not just simple numeric values: they carry metadata (labels, tags, dimensions) that contextualizes the data. When these labels are overly dynamic or unbounded (for example, unique identifiers such as session IDs, user IDs, or ephemeral container names), combined with the very long retention Netdata provides, the system ends up tracking an enormous number of unique series.

Although Netdata performs better than most other observability solutions, extreme cardinality still has several implications:

- **Resource Consumption:** The system needs to remember and index vast amounts of metadata, increasing its memory footprint.
- **Performance Degradation:** When performing long-term queries (days, weeks, months), the system needs to query a vast number of time series, leading to slower query responses.
- **Operational Complexity:** High cardinality makes it harder to manage, visualize, and analyze data. Dashboards can become cluttered.
- **Scalability Challenges:** As the number of time series grows, the resources required for maintaining aggregation points (Netdata parents) increase too.

## Definition of Metrics Ephemerality

Metrics ephemerality is the percentage of metrics that are no longer actively collected (old) relative to the total metrics available (the sum of currently collected metrics and old metrics).

- **Old Metrics** = The number of unique time-series that were once collected, but not currently.
- **Current Metrics** = The number of unique time-series actively being collected.

- **High Ephemerality (close to 100%):** The system frequently generates new unique metrics that are collected only for a short period, indicating a high turnover in metrics.
- **Low Ephemerality (close to 0%):** The system maintains a stable set of metrics over time, with little change in the total number of unique series.
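As a quick sketch of this definition (a hypothetical helper, not part of Netdata's code):

```python
def ephemerality_pct(old_metrics: int, current_metrics: int) -> float:
    """Percentage of all known unique time series that are no longer collected."""
    total = old_metrics + current_metrics
    if total == 0:
        return 0.0
    return 100.0 * old_metrics / total

# e.g. 9000 old series + 1000 currently collected series -> 90.0% ephemerality
```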

## How The Netdata Protection Works

The mechanism kicks in during tier0 (high-resolution) database rotations (i.e., when the oldest tier0 samples are deleted) and proceeds as follows:

1. **Counting Instances with Zero Tier0 Retention:**
   - For each context (e.g., containers, disks, network interfaces, etc.), Netdata counts the number of instances that have **ZERO** retention in tier0.

2. **Threshold Verification:**
   - If the number of instances with zero tier0 retention is **greater than or equal to 1000** (the default instance threshold) **and** these instances make up more than **50%** (the default ephemerality threshold) of the total instances in that context, further action is taken.

3. **Forceful Clearing in Long-Term Storage:**
- The system forcefully clears the retention of the excess time-series. This action automatically triggers the deletion of the associated metadata. So, Netdata "forgets" them. Their samples are still on disk, but they are no longer accessible.

4. **Retention Rules:**
- **Protected Data:**
- Metrics that are actively collected (and thus present in tier0) are never deleted.
     - A context with fewer than 1000 instances (as shown in the NIDL bar of the charts on the Netdata dashboards) is considered safe and is not modified.
- **Clean-up Trigger:**
- Only metrics that have lost their tier0 retention in a context that meets the thresholds (≥1000 instances and >50% ephemerality) will have their long-term retention cleared.
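The threshold check and partial cleanup above can be sketched as follows. Note that this is an illustrative model of the documented behavior (including the commit's "do not cleanup everything, but only what is required to reach 50%"), not Netdata's actual implementation; all names are hypothetical:

```python
# Illustrative defaults, mirroring the documented settings.
KEEP_INSTANCES = 1000       # "extreme cardinality keep instances"
MIN_EPHEMERALITY = 50.0     # "extreme cardinality min ephemerality" (percent)

def instances_to_clear(zero_tier0_instances: int, total_instances: int) -> int:
    """Return how many zero-tier0 instances of a context to 'forget'.

    Returns 0 unless BOTH thresholds are exceeded; otherwise clears only
    enough old instances to bring ephemerality back down to the threshold.
    Actively collected instances are never candidates for deletion.
    """
    # Contexts below the keep-instances threshold are considered safe.
    if total_instances < KEEP_INSTANCES or zero_tier0_instances < KEEP_INSTANCES:
        return 0
    # Ephemerality must exceed the configured minimum to trigger cleanup.
    if 100.0 * zero_tier0_instances / total_instances <= MIN_EPHEMERALITY:
        return 0
    current = total_instances - zero_tier0_instances
    # Keep enough old instances so old / (old + current) equals the threshold.
    keep_old = int(current * MIN_EPHEMERALITY / (100.0 - MIN_EPHEMERALITY))
    return max(0, zero_tier0_instances - keep_old)
```

For example, a context with 10000 instances of which 9000 have no tier0 retention (90% ephemerality) would have 8000 old instances cleared, leaving 1000 old plus 1000 current, i.e. 50% ephemerality.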

## Configuration

You can control the protection mechanism via the following settings in the `netdata.conf` file under the `[db]` section:

```ini
[db]
extreme cardinality protection = yes
extreme cardinality keep instances = 1000
extreme cardinality min ephemerality = 50
```

- **extreme cardinality protection:**
  Enables or disables the protection mechanism. It is enabled (**yes**) by default.

- **extreme cardinality keep instances:**
  The minimum number of instances per context that should be kept. The default value is **1000**.

- **extreme cardinality min ephemerality:**
  The minimum percentage of instances in a context with zero tier0 retention required to trigger the cleanup. The default value is **50%**.


**Recommendations:**

- As long as metrics have samples in tier0, their corresponding long-term data and metadata are preserved. Ensure that tier0 retention is configured properly.
- If you expect to have more than 1000 instances per context per node (for example, more than 1000 containers, disks, network interfaces, database tables, etc.), adjust these settings to suit your specific environment.
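For example, on a node that legitimately runs several thousand containers, you might raise the thresholds (the values below are illustrative, not recommendations):

```ini
[db]
extreme cardinality protection = yes
extreme cardinality keep instances = 10000
extreme cardinality min ephemerality = 80
```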

## How to Verify Its Operation

When the protection mechanism is activated, Netdata logs a detailed message. The log entry includes:

- The host name.
- The context affected.
- The number of metrics and instances that had their retention forcefully cleared.
- The time range for which the non-tier0 retention was deleted.

### Example Log Message

```
EXTREME CARDINALITY PROTECTION: on host '<HOST>', for context '<CONTEXT>': forcefully cleared the retention of <METRICS_COUNT> metrics and <INSTANCES_COUNT> instances, having non-tier0 retention from <START_TIME> to <END_TIME>.
```

This log message is tagged with the following message ID for easy identification:

```
MESSAGE_ID=d1f59606dd4d41e3b217a0cfcae8e632
```

### Verification Steps

1. **Using System Logs:**

You can use `journalctl` (or your system’s log viewer) to search for the message ID:

```
journalctl --namespace=netdata MESSAGE_ID=d1f59606dd4d41e3b217a0cfcae8e632
```

2. **Netdata Logs Dashboard:**

Navigate to the Netdata Logs dashboard. On the right side under `MESSAGE_ID`, select **"Netdata extreme cardinality"** to filter only those messages.

## Summary

The extreme cardinality protection mechanism in Netdata is designed to automatically safeguard your system against the potential issues caused by excessive metric metadata retention. It does so by:

- Automatically counting instances without tier0 retention.
- Checking against configurable thresholds.
- Forcefully clearing long-term retention (and metadata) when thresholds are exceeded.

By properly configuring tier0 and adjusting the `extreme cardinality` settings in `netdata.conf`, you can ensure that your system remains both efficient and protected, even when extreme cardinality issues occur.
@@ -618,6 +618,7 @@ static void netdata_systemd_journal_message_ids_init(void) {
msgid_into_dict("8ddaf5ba33a74078b609250db1e951f3", "Sensor state transition");
msgid_into_dict("ec87a56120d5431bace51e2fb8bba243", "Netdata log flood protection");
msgid_into_dict("acb33cb95778476baac702eb7e4e151d", "Netdata Cloud connection");
+msgid_into_dict("d1f59606dd4d41e3b217a0cfcae8e632", "Netdata extreme cardinality");
}

void netdata_systemd_journal_transform_message_id(FACETS *facets __maybe_unused, BUFFER *wb, FACETS_TRANSFORMATION_SCOPE scope __maybe_unused, void *data __maybe_unused) {
10 changes: 6 additions & 4 deletions src/daemon/config/netdata-conf-db.c
@@ -3,12 +3,13 @@
#include "netdata-conf-db.h"
#include "daemon/common.h"

+#define DAYS 86400
int default_rrd_history_entries = RRD_DEFAULT_HISTORY_ENTRIES;

bool dbengine_enabled = false; // will become true if and when dbengine is initialized
bool dbengine_use_direct_io = true;
static size_t storage_tiers_grouping_iterations[RRD_STORAGE_TIERS] = {1, 60, 60, 60, 60};
-static double storage_tiers_retention_days[RRD_STORAGE_TIERS] = {14, 90, 2 * 365, 2 * 365, 2 * 365};
+static time_t storage_tiers_retention_time_s[RRD_STORAGE_TIERS] = {14 * DAYS, 90 * DAYS, 2 * 365 * DAYS, 2 * 365 * DAYS, 2 * 365 * DAYS};

time_t rrdset_free_obsolete_time_s = 3600;
time_t rrdhost_free_orphan_time_s = 3600;
@@ -275,12 +276,13 @@ void netdata_conf_dbengine_init(const char *hostname) {
disk_space_mb = inicfg_get_size_mb(&netdata_config, CONFIG_SECTION_DB, dbengineconfig, disk_space_mb);

snprintfz(dbengineconfig, sizeof(dbengineconfig) - 1, "dbengine tier %zu retention time", tier);
-storage_tiers_retention_days[tier] = inicfg_get_duration_days(&netdata_config,
-CONFIG_SECTION_DB, dbengineconfig, new_dbengine_defaults ? storage_tiers_retention_days[tier] : 0);
+storage_tiers_retention_time_s[tier] = inicfg_get_duration_days_to_seconds(
+&netdata_config, CONFIG_SECTION_DB,
+dbengineconfig, new_dbengine_defaults ? storage_tiers_retention_time_s[tier] : 0);

tiers_init[tier].disk_space_mb = (int) disk_space_mb;
tiers_init[tier].tier = tier;
-tiers_init[tier].retention_seconds = (size_t) (86400.0 * storage_tiers_retention_days[tier]);
+tiers_init[tier].retention_seconds = (size_t) storage_tiers_retention_time_s[tier];
strncpyz(tiers_init[tier].path, dbenginepath, FILENAME_MAX);
tiers_init[tier].ret = 0;

44 changes: 24 additions & 20 deletions src/database/contexts/api_v2_contexts_agents.c
@@ -5,6 +5,17 @@

void build_info_to_json_object(BUFFER *b);

+static time_t round_retention(time_t retention_seconds) {
+if(retention_seconds > 60 * 86400)
+retention_seconds = HOWMANY(retention_seconds, 86400) * 86400;
+else if(retention_seconds > 86400)
+retention_seconds = HOWMANY(retention_seconds, 3600) * 3600;
+else
+retention_seconds = HOWMANY(retention_seconds, 60) * 60;
+
+return retention_seconds;
+}
+
void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now_s, bool info, bool array) {
if(!now_s)
now_s = now_realtime_sec();
@@ -128,9 +139,9 @@ void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now

buffer_json_add_array_item_object(wb);
buffer_json_member_add_uint64(wb, "tier", tier);
-char human_retention[128];
-duration_snprintf_time_t(human_retention, sizeof(human_retention), (stime_t)group_seconds);
-buffer_json_member_add_string(wb, "granularity", human_retention);
+char human_duration[128];
+duration_snprintf_time_t(human_duration, sizeof(human_duration), (stime_t)group_seconds);
+buffer_json_member_add_string(wb, "granularity", human_duration);

buffer_json_member_add_uint64(wb, "metrics", storage_engine_metrics(eng->seb, localhost->db[tier].si));
buffer_json_member_add_uint64(wb, "samples", storage_engine_samples(eng->seb, localhost->db[tier].si));
@@ -148,18 +159,10 @@ void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now
buffer_json_member_add_time_t(wb, "to", now_s);
buffer_json_member_add_time_t(wb, "retention", retention);

-if(retention < 60)
-duration_snprintf_time_t(human_retention, sizeof(human_retention), retention);
-else if(retention < 24 * 60 * 60) {
-int64_t rounded_retention_mins = duration_round_to_resolution(retention, 60);
-duration_snprintf_mins(human_retention, sizeof(human_retention), rounded_retention_mins);
-}
-else {
-int64_t rounded_retention_hours = duration_round_to_resolution(retention, 3600);
-duration_snprintf_hours(human_retention, sizeof(human_retention), rounded_retention_hours);
-}
+duration_snprintf(human_duration, sizeof(human_duration),
+round_retention(retention), "s", false);

-buffer_json_member_add_string(wb, "retention_human", human_retention);
+buffer_json_member_add_string(wb, "retention_human", human_duration);

if(used || max) { // we have disk space information
time_t time_retention = 0;
@@ -169,17 +172,18 @@ void buffer_json_agents_v2(BUFFER *wb, struct query_timings *timings, time_t now
time_t space_retention = (time_t)((NETDATA_DOUBLE)(now_s - first_time_s) * 100.0 / percent);
time_t actual_retention = MIN(space_retention, time_retention ? time_retention : space_retention);

-duration_snprintf_hours(human_retention, sizeof(human_retention),
-(int)duration_round_to_resolution(time_retention, 3600));
+duration_snprintf(
+human_duration, sizeof(human_duration),
+(int)time_retention, "s", false);

buffer_json_member_add_time_t(wb, "requested_retention", time_retention);
-buffer_json_member_add_string(wb, "requested_retention_human", human_retention);
+buffer_json_member_add_string(wb, "requested_retention_human", human_duration);

-duration_snprintf_hours(human_retention, sizeof(human_retention),
-(int)duration_round_to_resolution(actual_retention, 3600));
+duration_snprintf(human_duration, sizeof(human_duration),
+(int)round_retention(actual_retention), "s", false);

buffer_json_member_add_time_t(wb, "expected_retention", actual_retention);
-buffer_json_member_add_string(wb, "expected_retention_human", human_retention);
+buffer_json_member_add_string(wb, "expected_retention_human", human_duration);
}
}
buffer_json_object_close(wb);
2 changes: 1 addition & 1 deletion src/database/contexts/contexts-loading.c
@@ -18,7 +18,7 @@ static void rrdinstance_load_dimension_callback(SQL_DIMENSION_DATA *sd, void *da

UUIDMAP_ID id = uuidmap_create(sd->dim_id);
time_t min_first_time_t = LONG_MAX, max_last_time_t = 0;
-get_metric_retention_by_id(host, id, &min_first_time_t, &max_last_time_t);
+get_metric_retention_by_id(host, id, &min_first_time_t, &max_last_time_t, NULL);
if((!min_first_time_t || min_first_time_t == LONG_MAX) && !max_last_time_t) {
uuidmap_free(id);
th_zero_retention_metrics++;
3 changes: 2 additions & 1 deletion src/database/contexts/internal.h
@@ -60,6 +60,7 @@ typedef enum __attribute__ ((__packed__)) {
RRD_FLAG_UPDATE_REASON_UNUSED = (1 << 22), // this context is not used anymore
RRD_FLAG_UPDATE_REASON_DB_ROTATION = (1 << 23), // this context changed because of a db rotation

+RRD_FLAG_NO_TIER0_RETENTION = (1 << 28),
RRD_FLAG_MERGED_COLLECTED_RI_TO_RC = (1 << 29),

// action to perform on an object
@@ -475,7 +476,7 @@ void rrdcontext_update_from_collected_rrdinstance(RRDINSTANCE *ri);

void rrdcontext_garbage_collect_single_host(RRDHOST *host, bool worker_jobs);

-void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_time_t, time_t *max_last_time_t);
+void get_metric_retention_by_id(RRDHOST *host, UUIDMAP_ID id, time_t *min_first_time_t, time_t *max_last_time_t, bool *tier0_retention);

void rrdcontext_delete_after_loading(RRDHOST *host, RRDCONTEXT *rc);
void rrdcontext_initial_processing_after_loading(RRDCONTEXT *rc);
7 changes: 4 additions & 3 deletions src/database/contexts/metric.c
@@ -111,6 +111,7 @@ static void rrdmetric_delete_callback(const DICTIONARY_ITEM *item __maybe_unused
static bool rrdmetric_conflict_callback(const DICTIONARY_ITEM *item __maybe_unused, void *old_value, void *new_value, void *rrdinstance __maybe_unused) {
RRDMETRIC *rm = old_value;
RRDMETRIC *rm_new = new_value;
+rm_new->ri = rm->ri;

internal_error(rm->id != rm_new->id,
"RRDMETRIC: '%s' cannot change id to '%s'",
@@ -131,9 +132,9 @@

time_t new_first_time_s = 0;
time_t new_last_time_s = 0;
-if(rrdmetric_update_retention(rm)) {
-new_first_time_s = rm->first_time_s;
-new_last_time_s = rm->last_time_s;
+if(rrdmetric_update_retention(rm_new)) {
+new_first_time_s = rm_new->first_time_s;
+new_last_time_s = rm_new->last_time_s;
}

internal_error(true,
