Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Global metrics provider gets shutdown when local provider is dropped #1661

Closed
bwalk-at-ibm opened this issue Apr 13, 2024 · 4 comments · Fixed by #1663
Closed

[Bug]: Global metrics provider gets shutdown when local provider is dropped #1661

bwalk-at-ibm opened this issue Apr 13, 2024 · 4 comments · Fixed by #1663
Assignees
Labels
bug Something isn't working triage:todo Needs to be traiged.

Comments

@bwalk-at-ibm
Copy link

bwalk-at-ibm commented Apr 13, 2024

What happened?

#1623 introduced Drop implementation for SdkMeterProvider for proper shutdown semantics. This however introduced a bug with the preferred usage pattern like it is demonstrated in the demo.

Here's what I assume happens: when building the metrics provider we register a copy of the provider as the global metrics provider. The SdkMeterProvider holds references to the underlying data like the pipelines in an Arc, so this behaviour is all good. However, when any provider (like the original provider) is dropped (like it happens in the example code and which seems to be the expected pattern because it is encouraged to use the global metrics provider) and since Drop is implemented on the SdkMeterProvider will call shutdown() which will terminate the associated pipelines which are shared with the registered global metrics provider. Hence, we see no metrics and get an error when the global metrics provider gets dropped at application end.

API Version

0.22

SDK Version

0.22.1

What Exporters are you seeing the problem on?

OTLP

Relevant log output

No response

@bwalk-at-ibm bwalk-at-ibm added bug Something isn't working triage:todo Needs to be traiged. labels Apr 13, 2024
@bwalk-at-ibm
Copy link
Author

bwalk-at-ibm commented Apr 13, 2024

Here's the output of the collector when running the demo basic-otlp as is: collector-output-broken.log. Notably see that there are no metrics being collected.

Here's a workaround patch that keeps the provider created in the build() function in scope and passes it to the global metrics provider:

diff --git i/opentelemetry-otlp/examples/basic-otlp/src/main.rs w/opentelemetry-otlp/examples/basic-otlp/src/main.rs
index 4f4320ca..ff64cfe3 100644
--- i/opentelemetry-otlp/examples/basic-otlp/src/main.rs
+++ w/opentelemetry-otlp/examples/basic-otlp/src/main.rs
@@ -49,7 +49,10 @@ fn init_metrics() -> Result<(), MetricsError> {
         )]))
         .build();
     match provider {
-        Ok(_provider) => Ok(()),
+        Ok(provider) => {
+            global::set_meter_provider(provider);
+            Ok(())
+        }
         Err(err) => Err(err),
     }
 }
diff --git i/opentelemetry-otlp/src/metric.rs w/opentelemetry-otlp/src/metric.rs
index 75e968b0..5c12f5d3 100644
--- i/opentelemetry-otlp/src/metric.rs
+++ w/opentelemetry-otlp/src/metric.rs
@@ -241,8 +241,6 @@ where
 
         let provider = provider.build();
 
-        global::set_meter_provider(provider.clone());
-
         Ok(provider)
     }
 }

With this patch applied, the collector log looks as expected and metrics are shown: collector-output-with-patch.log

This is all against current master (d5392dc).

I think the intended behaviour is to call shutdown() only when the last reference to SdkMeterProvider is dropped. To my knowledge this is not possible to do in Rust in a thread-safe manner. Instead Drop should not be implemented for SdkMeterProvider, but instead for Pipeline (or the relevant object that requires proper shutdown semantics, as fas as I can see only the reader is involved so far) so that multiple references to the same SdkMeterProvider can be hold and dropped without side-effects.

@lalitb
Copy link
Member

lalitb commented Apr 15, 2024

@bwalk-at-ibm Your analysis is correct. We shouldn't have Drop implemented for SdkMeterProvider.
One of the option could be to replace the existing code:

#[derive(Clone, Debug)]
pub struct SdkMeterProvider {
    pipes: Arc<Pipelines>,
    meters: Arc<Mutex<HashMap<Scope, Arc<SdkMeter>>>>,
    is_shutdown: Arc<AtomicBool>,
}
impl Drop for SdkMeterProvider {
    fn drop(&mut self) {
        if let Err(err) = self.shutdown() {
            global::handle_error(err);
        }
    }
}

with

#[derive(Clone, Debug)]
pub struct SdkMeterProvider {
    inner: Arc<SdkMeterProviderInner>,
}

pub struct SdkMeterProviderInner {
    pipes: Pipelines,
    meters: Mutex<HashMap<Scope, Arc<SdkMeter>>>,
    is_shutdown: AtomicBool,
}

impl Drop for SdkMeterProviderInner {
    fn drop(&mut self) {
        if let Err(err) = self.shutdown() {
            global::handle_error(err);
        }
    }
}

This will ensure that the shutdown is only invoked when all the references to SdkMeterProvider are dropped.

Apart from this, separate from this issue, we also need to discuss -

  • The function opentelemetry_otlp::new_pipeline() call chain also sets the global static provider. Should it maintain this behavior, or should it be adjusted to simply create and return the provider/meter without modifying the static provider? This will get rid of cloning.

  • Should the convenience interface opentelemetry_otlp::new_pipeline() be removed altogether, and let the application use the standard way of creating the MeterProvider and Meters?

@bwalk-at-ibm
Copy link
Author

That looks like an acceptable solution.

  • The function opentelemetry_otlp::new_pipeline() call chain also sets the global static provider. Should it maintain this behavior, or should it be adjusted to simply create and return the provider/meter without modifying the static provider? This will get rid of cloning.

I think it would be preferable to remove the implicit call to global::set_meter_provider() and make it mandatory and explicit to the caller. I got very confused the first time I used the OTLP exporter.

  • Should the convenience interface opentelemetry_otlp::new_pipeline() be removed altogether, and let the application use the standard way of creating the MeterProvider and Meters?

The only other reference I have is from the opentelemetry-stdout exporter and if we could manage to align the required setup for both, then that would be helpful.

@cijothomas
Copy link
Member

Apart from this, separate from this issue, we also need to discuss -

The function opentelemetry_otlp::new_pipeline() call chain also sets the global static provider. Should it maintain this behavior, or should it be adjusted to simply create and return the provider/meter without modifying the static provider? This will get rid of cloning.

Should the convenience interface opentelemetry_otlp::new_pipeline() be removed altogether, and let the application use the standard way of creating the MeterProvider and Meters?

Yes to both! Lets continue discussing in #1592

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage:todo Needs to be traiged.
Projects
None yet
3 participants