Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global error handler cleanup - Metrics SDK #2185

Closed
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
704b848
initial commit
lalitb Oct 8, 2024
b8cb6af
change name for exponential histogram
lalitb Oct 8, 2024
a0b6eee
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 9, 2024
acf97fa
rever changes
lalitb Oct 9, 2024
7b48f14
Update opentelemetry-sdk/src/metrics/internal/mod.rs
lalitb Oct 9, 2024
ac61b79
convert otel_error to otel_warn for cardinality limit
lalitb Oct 9, 2024
a42d516
log both existing and new instrument values
lalitb Oct 9, 2024
dbaa7f5
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 9, 2024
73fca4d
build error
lalitb Oct 9, 2024
ee5c5f5
Merge branch 'global-error-handler-cleanup-metrics-sdk' of github.com…
lalitb Oct 9, 2024
3aa97cf
remove formatting in wrapper macros
lalitb Oct 10, 2024
de54afe
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 23, 2024
e62de04
Merge branch 'main' into global-error-handler-cleanup-metrics-sdk
lalitb Oct 23, 2024
bda9faa
review comments
lalitb Oct 23, 2024
4a34922
resolve conflict
lalitb Oct 23, 2024
0087c24
lint error
lalitb Oct 23, 2024
115e73f
add comment for exponential histogram
lalitb Oct 23, 2024
56682a4
fix
lalitb Oct 23, 2024
7e48cbf
Update opentelemetry-sdk/src/metrics/internal/mod.rs
lalitb Oct 23, 2024
d81b374
use message for otel_warn
lalitb Oct 23, 2024
4b09c92
merge conflict
lalitb Oct 23, 2024
949930e
Update opentelemetry-sdk/src/metrics/meter.rs
lalitb Oct 23, 2024
6dcc9eb
Update opentelemetry-sdk/src/metrics/internal/mod.rs
lalitb Oct 23, 2024
4b42fe8
Update opentelemetry-sdk/src/metrics/meter.rs
lalitb Oct 23, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions opentelemetry-sdk/src/metrics/internal/aggregate.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,11 @@
size < STREAM_CARDINALITY_LIMIT as usize
}

/// Checks whether aggregator has hit cardinality limit for metric streams
pub(crate) fn cardinality_limit() -> usize {
STREAM_CARDINALITY_LIMIT as usize
}

Check warning on line 22 in opentelemetry-sdk/src/metrics/internal/aggregate.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/internal/aggregate.rs#L20-L22

Added lines #L20 - L22 were not covered by tests

/// Receives measurements to be aggregated.
pub(crate) trait Measure<T>: Send + Sync + 'static {
fn call(&self, measurement: T, attrs: &[KeyValue]);
Expand Down
14 changes: 10 additions & 4 deletions opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
use std::{collections::HashMap, f64::consts::LOG2_E, sync::Mutex, time::SystemTime};

use once_cell::sync::Lazy;
use opentelemetry::{metrics::MetricsError, KeyValue};
use opentelemetry::{otel_error, KeyValue};

use crate::{
metrics::data::{self, Aggregation, Temporality},
Expand Down Expand Up @@ -100,9 +100,15 @@
if (self.scale - scale_delta as i8) < EXPO_MIN_SCALE {
// With a scale of -10 there is only two buckets for the whole range of f64 values.
// This can only happen if there is a max size of 1.
opentelemetry::global::handle_error(MetricsError::Other(
"exponential histogram scale underflow".into(),
));
otel_error!(

Check warning on line 103 in opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs#L103

Added line #L103 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the inner workings enough to give a strong opinion - but unless this is a auto recoverable error, this can flood the error log.

Copy link
Member Author

@lalitb lalitb Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As what I can understand this part of code, this error occurs with restrictive max_size configuration, while the application is recording measurements with values that are far apart than what allowed by max_size. And error would be logged whenever the faulty measurement is recorded. If these faulty measurements are not frequent, the error log won't be flooded, else it can. Again, either some kind of throttling or simply flag to log only once need to be added. Let me know what you suggest, else I can keep TODO to revisit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we are 100% sure this cannot cause flooding of logs, lets remove the log from here, and leave a TODO to add logging once we understand more.

name: "ExponentialHistogramDataPoint.Scale.Underflow",
current_scale = self.scale,
scale_delta = scale_delta,
max_size = self.max_size,
min_scale = EXPO_MIN_SCALE,
record_min_max = self.record_min_max,
record_sum = self.record_sum

Check warning on line 110 in opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs#L105-L110

Added lines #L105 - L110 were not covered by tests
);
return;
}
// Downscale
Expand Down
11 changes: 6 additions & 5 deletions opentelemetry-sdk/src/metrics/internal/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,12 @@
use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

use aggregate::is_under_cardinality_limit;
use aggregate::{cardinality_limit, is_under_cardinality_limit};
pub(crate) use aggregate::{AggregateBuilder, ComputeAggregation, Measure};
pub(crate) use exponential_histogram::{EXPO_MAX_SCALE, EXPO_MIN_SCALE};
use once_cell::sync::Lazy;
use opentelemetry::metrics::MetricsError;
use opentelemetry::{global, otel_warn, KeyValue};
use opentelemetry::{otel_warn, KeyValue};

use crate::metrics::AttributeSet;

Expand Down Expand Up @@ -146,9 +146,10 @@
let new_tracker = AU::new_atomic_tracker(self.buckets_count);
O::update_tracker(&new_tracker, measurement, index);
trackers.insert(STREAM_OVERFLOW_ATTRIBUTES.clone(), Arc::new(new_tracker));
global::handle_error(MetricsError::Other("Warning: Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged.".into()));
otel_warn!( name: "ValueMap.measure",
message = "Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged."
//TODO - include name of meter, instrument
otel_warn!( name: "MetricCardinalityLimitReached",
error = format!("{}", MetricsError::Other("Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged.".into())),
cardinality_limit = cardinality_limit() as u64,

Check warning on line 152 in opentelemetry-sdk/src/metrics/internal/mod.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/internal/mod.rs#L151-L152

Added lines #L151 - L152 were not covered by tests
);
}
}
Expand Down
6 changes: 2 additions & 4 deletions opentelemetry-sdk/src/metrics/manual_reader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,8 @@
};

use opentelemetry::{
global,
metrics::{MetricsError, Result},
otel_error,
};

use super::{
Expand Down Expand Up @@ -84,9 +84,7 @@
if inner.sdk_producer.is_none() {
inner.sdk_producer = Some(pipeline);
} else {
global::handle_error(MetricsError::Config(
"duplicate reader registration, did not register manual reader".into(),
))
otel_error!(name: "ManualReader.RegisterPipeline.DuplicateRegistration");

Check warning on line 87 in opentelemetry-sdk/src/metrics/manual_reader.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/manual_reader.rs#L87

Added line #L87 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

info/debug only. Even if a user gets this message, they won't know what to do. Its helpful to us only.

}
});
}
Expand Down
27 changes: 15 additions & 12 deletions opentelemetry-sdk/src/metrics/meter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@
use std::{borrow::Cow, sync::Arc};

use opentelemetry::{
global,
metrics::{
noop::{NoopAsyncInstrument, NoopSyncInstrument},
AsyncInstrumentBuilder, Counter, Gauge, Histogram, HistogramBuilder, InstrumentBuilder,
InstrumentProvider, MetricsError, ObservableCounter, ObservableGauge,
ObservableUpDownCounter, Result, UpDownCounter,
},
otel_error,
};

use crate::instrumentation::Scope;
Expand Down Expand Up @@ -74,7 +74,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateCounter.ValidationError", error = format!("{}", err));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree Error is the right severity for this, but this is still not so user-friendly.
Since this is user facing, the internal log should be clearly actionable, along with the impact of this error.
In this case, the impact is - measurements reported using this instrument will be ignore.
The "why" is also not so obvious.
As an end-user, I'd prefer to see something like

Name: InstrumentCreationFailed
InstrumentName: A_non_asccii_Metric
MeterName: MeterName
Reason: Instrument Name contains non-ascii charactors. Link to spec, optionally.
Message: Measurements from this instruments will be ignored.

return Ok(Counter::new(Arc::new(NoopSyncInstrument::new())));
}

Expand All @@ -90,7 +90,7 @@
{
Ok(counter) => Ok(counter),
Err(err) => {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateCounter.Error", error = format!("{}", err));

Check warning on line 93 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L93

Added line #L93 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as https://github.com/open-telemetry/opentelemetry-rust/pull/2185/files#r1796240076
I dont think a user would know the difference between ValidationError vs Error.

Ok(Counter::new(Arc::new(NoopSyncInstrument::new())))
}
}
Expand All @@ -106,7 +106,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateObservableCounter.ValidationError", error = format!("{}", err));
return Ok(ObservableCounter::new(Arc::new(NoopAsyncInstrument::new())));
}

Expand All @@ -119,6 +119,7 @@
)?;

if ms.is_empty() {
otel_error!(name: "SdkMeter.CreateObservableCounter.Error", error = format!("{}", MetricsError::Other("no measures found".into())));

Check warning on line 122 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L122

Added line #L122 was not covered by tests
return Ok(ObservableCounter::new(Arc::new(NoopAsyncInstrument::new())));
}

Expand All @@ -143,7 +144,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateObservableUpDownCounter.ValidationError", error = format!("{}", err));
return Ok(ObservableUpDownCounter::new(Arc::new(
NoopAsyncInstrument::new(),
)));
Expand All @@ -158,6 +159,7 @@
)?;

if ms.is_empty() {
otel_error!(name: "SdkMeter.CreateObservableUpDownCounter.Error", error = format!("{}",MetricsError::Other("no measures found".into())));

Check warning on line 162 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L162

Added line #L162 was not covered by tests
return Ok(ObservableUpDownCounter::new(Arc::new(
NoopAsyncInstrument::new(),
)));
Expand All @@ -184,7 +186,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateObservableGauge.ValidationError", error = format!("{}", err));
return Ok(ObservableGauge::new(Arc::new(NoopAsyncInstrument::new())));
}

Expand All @@ -197,6 +199,7 @@
)?;

if ms.is_empty() {
otel_error!(name: "SdkMeter.CreateObservableGauge.Error",error = format!("{}", MetricsError::Other("no measures found".into())));

Check warning on line 202 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L202

Added line #L202 was not covered by tests
return Ok(ObservableGauge::new(Arc::new(NoopAsyncInstrument::new())));
}

Expand All @@ -221,7 +224,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateUpDownCounter.ValidationError", error = format!("{}",err));
return Ok(UpDownCounter::new(Arc::new(NoopSyncInstrument::new())));
}

Expand All @@ -237,7 +240,7 @@
{
Ok(updown_counter) => Ok(updown_counter),
Err(err) => {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateUpDownCounter.Error", error = format!("{}", err));

Check warning on line 243 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L243

Added line #L243 was not covered by tests
Ok(UpDownCounter::new(Arc::new(NoopSyncInstrument::new())))
}
}
Expand All @@ -253,7 +256,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateGauge.ValidationError", error = format!("{}", err));
return Ok(Gauge::new(Arc::new(NoopSyncInstrument::new())));
}

Expand All @@ -269,7 +272,7 @@
{
Ok(gauge) => Ok(gauge),
Err(err) => {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateGauge.Error", error = format!("{}",err));

Check warning on line 275 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L275

Added line #L275 was not covered by tests
Ok(Gauge::new(Arc::new(NoopSyncInstrument::new())))
}
}
Expand All @@ -285,7 +288,7 @@
{
let validation_result = validate_instrument_config(builder.name.as_ref(), &builder.unit);
if let Err(err) = validation_result {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateHistogram.ValidationError", error = format!("{}", err));
return Ok(Histogram::new(Arc::new(NoopSyncInstrument::new())));
}

Expand All @@ -301,7 +304,7 @@
{
Ok(histogram) => Ok(histogram),
Err(err) => {
global::handle_error(err);
otel_error!(name: "SdkMeter.CreateHistogram.Error", error = format!("{}",err));

Check warning on line 307 in opentelemetry-sdk/src/metrics/meter.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter.rs#L307

Added line #L307 was not covered by tests
Ok(Histogram::new(Arc::new(NoopSyncInstrument::new())))
}
}
Expand Down
5 changes: 2 additions & 3 deletions opentelemetry-sdk/src/metrics/meter_provider.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,8 @@
};

use opentelemetry::{
global,
metrics::{noop::NoopMeter, Meter, MeterProvider, MetricsError, Result},
KeyValue,
otel_error, KeyValue,
};

use crate::{instrumentation::Scope, Resource};
Expand Down Expand Up @@ -137,7 +136,7 @@
// shutdown(), then we don't need to call shutdown again.
if !self.is_shutdown.load(Ordering::Relaxed) {
if let Err(err) = self.shutdown() {
global::handle_error(err);
otel_error!(name: "SdkMeterProvider.Drop.ShutdownError", error = format!("{}",err));

Check warning on line 139 in opentelemetry-sdk/src/metrics/meter_provider.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/meter_provider.rs#L139

Added line #L139 was not covered by tests
}
}
}
Expand Down
24 changes: 13 additions & 11 deletions opentelemetry-sdk/src/metrics/pipeline.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,8 @@
};

use opentelemetry::{
global,
metrics::{MetricsError, Result},
KeyValue,
otel_warn, KeyValue,
};

use crate::{
Expand Down Expand Up @@ -414,15 +413,18 @@
if existing == id {
return;
}

global::handle_error(MetricsError::Other(format!(
"duplicate metric stream definitions, names: ({} and {}), descriptions: ({} and {}), kinds: ({:?} and {:?}), units: ({:?} and {:?}), and numbers: ({} and {})",
existing.name, id.name,
existing.description, id.description,
existing.kind, id.kind,
existing.unit, id.unit,
existing.number, id.number,
)))
otel_warn!(name: "Instrument.DuplicateMetricStreamDefinitions",
name = format!("{}",id.name),
description = format!("{}", id.description),
kind = format!("{:?}", id.kind),
unit = format!("{}",id.unit),
number = format!("{}", id.number),
existing_name = format!("{}", existing.name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip logging this set of attributes twice. The event name says that we have a duplicate stream so it should already be understood that we have two identical sets of attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very sure of the logic here. Can you check once. As we return at line 414 if they are same.

existing_desc = format!("{}", existing.description),
existing_kind = format!("{:?}", existing.kind),
existing_unit = format!("{}", existing.unit),
existing_number = format!("{}", existing.number),

Check warning on line 426 in opentelemetry-sdk/src/metrics/pipeline.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/pipeline.rs#L416-L426

Added lines #L416 - L426 were not covered by tests
);
}
}
}
Expand Down
27 changes: 16 additions & 11 deletions opentelemetry-sdk/src/metrics/view.rs
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
use super::instrument::{Instrument, Stream};
use glob::Pattern;
use opentelemetry::{
global,
metrics::{MetricsError, Result},
otel_error,
};

fn empty_view(_inst: &Instrument) -> Option<Stream> {
Expand Down Expand Up @@ -102,19 +102,23 @@
/// ```
pub fn new_view(criteria: Instrument, mask: Stream) -> Result<Box<dyn View>> {
if criteria.is_empty() {
global::handle_error(MetricsError::Config(format!(
"no criteria provided, dropping view. mask: {mask:?}"
)));
otel_error!(name: "CreateView.ValidationError", error = format!("{}",MetricsError::Config(
"no criteria provided, dropping view".to_string())),
criteria = format!("{:?}", criteria),
mask = format!("{:?}", mask)

Check warning on line 108 in opentelemetry-sdk/src/metrics/view.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/view.rs#L105-L108

Added lines #L105 - L108 were not covered by tests
);
return Ok(Box::new(empty_view));
}
let contains_wildcard = criteria.name.contains(['*', '?']);
let err_msg_criteria = criteria.clone();

let match_fn: Box<dyn Fn(&Instrument) -> bool + Send + Sync> = if contains_wildcard {
if mask.name != "" {
global::handle_error(MetricsError::Config(format!(
"name replacement for multiple instruments, dropping view, criteria: {criteria:?}, mask: {mask:?}"
)));
otel_error!(name: "CreateView.ValidationError", error = format!("{}", MetricsError::Config(
"name replacement for wildcard instrument, dropping view".to_string())),
criteria = format!("{:?}", criteria),
mask = format!("{:?}", mask)

Check warning on line 120 in opentelemetry-sdk/src/metrics/view.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/view.rs#L117-L120

Added lines #L117 - L120 were not covered by tests
);
return Ok(Box::new(empty_view));
}

Expand All @@ -138,10 +142,11 @@
match ma.validate() {
Ok(_) => agg = Some(ma.clone()),
Err(err) => {
global::handle_error(MetricsError::Other(format!(
"{}, proceeding as if view did not exist. criteria: {:?}, mask: {:?}",
err, err_msg_criteria, mask
)));
otel_error!(name: "CreateView.ValidationError",
error = format!("{}",err),
criteria = format!("{:?}", err_msg_criteria),
mask = format!("{:?}", mask),

Check warning on line 148 in opentelemetry-sdk/src/metrics/view.rs

View check run for this annotation

Codecov / codecov/patch

opentelemetry-sdk/src/metrics/view.rs#L146-L148

Added lines #L146 - L148 were not covered by tests
);
return Ok(Box::new(empty_view));
}
}
Expand Down
Loading
Loading