Background Information
For additional context: opensearch-project/k-NN#2545
Overview
Following up from the Index Build Service Data Plane design, this document proposes a design for publishing metrics from our service to pluggable metric stores (CloudWatch, etc.). We will first discuss how we plan to publish metrics, including which existing tools we can leverage and the pros and cons of each. Then we will expand on how we leverage the recommended tool for exporting metrics and break down metric collection: what each metric is used for, how it is computed, and how the service is instrumented to collect it.
Assumptions
The Index Build Service has three supported endpoints: _build, _status, _cancel
The Index Build Service will be vended as a single worker image, deployed via Docker compose or Kubernetes
User (OS distribution provider) is responsible for control plane functions
This includes deploying the service, load balancing and throttling requests to the REST API interface, and API authentication and authorization. The client communication between KNN and the remote build service supports basic auth out of the box.
Metric Publish Requirements
The metric tool used must be easily containerized to run within the service image. The metric tool will collect metrics only from this single service image.
The metric component should be lightweight and minimize the additional memory and storage needed to operate. This is one of the reasons for publishing metrics to an external metric store rather than storing them locally.
Metrics should be publishable to various external metric stores via simple configuration, avoiding single-vendor lock-in
The metric publishing workflow to external metric stores should be decoupled from the build service process. This ensures metric publishing doesn’t block the build service critical path
Users should be able to disable metric aggregation using a Docker environment variable. More details in Appendix.
Recommended Approach: Use OpenTelemetry to forward metrics to various external metric stores
OpenTelemetry is an open-source observability framework licensed under Apache 2.0 and provides a set of APIs, libraries, agents, and instrumentation for capturing and exporting metrics, logs, and traces to various metric backends, including Prometheus, CloudWatch, etc. This approach will leverage OpenTelemetry for performing asynchronous metric forwarding to various pluggable metric stores.
The high-level flow will be to use the OpenTelemetry SDK for creating/updating metrics (counters, gauges, histograms, etc) and periodically forwarding metrics to custom exporters corresponding to metric backends (CloudWatch, etc). The custom exporters will then publish the metrics to external metric stores. The process of forwarding to the exporter and the exporter sending to metric stores happens in the background of the build service and is handled by OpenTelemetry.
OpenTelemetry also provides the Collector (an agent that runs alongside our service) to forward metrics to multiple metric backends, but we avoided that option because it would require an additional side-car container to run the Collector. We want to minimize the overhead of collecting metrics, so we decided on using the SDK with a custom exporter for a lightweight setup.
Pros:
SDK: The SDK helps standardize metrics (counters, histograms, etc) and manages the entire end-to-end flow from when metrics are created, to updated, to pushed to external metric store. It takes care of all the complexities of maintaining custom metrics
Credible/Standardization: The project is backed by the Cloud Native Computing Foundation (CNCF), with a large and active open-source community ensuring broad industry support (from major cloud providers AWS, Azure, etc) and continuous development. It’s designed to be a standardized, vendor-agnostic observability framework
Flexibility to support multiple backends: OpenTelemetry is designed to support multiple backends (Prometheus, Datadog, CloudWatch, etc.) via exporters/Collector and is highly extensible.
Async metric forwarding: OpenTelemetry supports asynchronous metric collection and forwarding natively, with the Python SDK providing async capabilities, including built-in batching
Async error retries: OpenTelemetry offers customizable retry mechanisms and error handling when metrics fail to be sent.
Forwarding custom metrics: OpenTelemetry has a flexible system for defining and forwarding custom metrics. We can define custom instruments and track custom data in multiple formats.
Rich Ecosystem: Can handle tracing, metrics, and logging within a single system, making it future-proof in case we want to enable logging/tracing in the future.
Support for FastAPI: OpenTelemetry has built-in support for instrumenting FastAPI endpoints
Cons:
Overhead to support multiple backends: OpenTelemetry doesn’t support exporters to all external metric stores in its SDK (such as CloudWatch), so there are two alternatives: build custom exporter or use OpenTelemetry Collector to forward metrics. The drawback is a more complex setup compared to directly using an out-of-the-box exporter from the SDK
Custom Exporter: This approach extends the base Exporter class to export metrics to an external store. It would require us to implement error handling and transforming metrics from OTel format to a format that’s compatible with the external store.
OpenTelemetry Collector: The Collector is an agent that runs alongside our service as a separate Docker image and needs to be configured (listen to endpoints, error retry) to listen for metrics from the SDK and then forward them to metric stores. For instance, the AWS Distro of OpenTelemetry (ADOT) offers the ADOT Collector to forward metrics to CloudWatch. It is well documented and a standard practice to use Collectors in production environments. However, using the Collector requires additional configuration/overhead compared to directly using a custom SDK exporter to forward metrics to external stores
Collecting GPU Utilization: OpenTelemetry doesn’t natively support collecting GPU system metrics, so it would require additional setup (background thread, or another tool/library) to fetch GPU utilization metrics
Why Choose?
Provides SDK to create/update/export metrics, managing entire end-to-end flow of metric updating/publishing
Is the industry/universal vendor-agnostic standard for collecting and forwarding telemetry, making it more future-proof against potentially changing requirements. Meanwhile, Telegraf, the closest alternative, is created by InfluxData and is not as universal due to its tight integration with the InfluxData ecosystem
Apache 2.0 license
Alternatives Considered
1. Telegraf with Log File Input Plugin
Telegraf is an open-source agent licensed under MIT that collects metrics and sends them to a variety of outputs. It supports multiple input plugins from which it collects metrics, such as log files, HTTP, message queues, etc. It also supports output plugins for numerous metric stores like InfluxDB, Prometheus, Datadog, CloudWatch, and more. Telegraf is designed to be lightweight and asynchronous. It can be run as a standalone service, collecting and forwarding metrics in parallel with our service.
The high-level flow is that the build service writes custom metrics to log files, and the Telegraf agent reads the metrics from the log files and publishes them to the external store. Telegraf supports various input plugins other than log files, such as HTTP, message queues, Unix sockets, etc. The reason for choosing log files as the input plugin is to avoid the additional failure scenarios of pushing metrics over the network. A sketch of what this could look like is shown below.
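For illustration only, a minimal sketch of how the build service could append a metric line to a log file in InfluxDB line-protocol format (assuming Telegraf's tail input plugin configured to parse that format); the file path and measurement names are hypothetical:
import time

METRICS_LOG_PATH = "/var/log/build-service/metrics.log"  # hypothetical path

def write_metric_line(measurement: str, fields: dict, tags: dict = None):
    # Emit one line-protocol record: measurement,tag=value field=value timestamp
    tag_str = "".join(f",{k}={v}" for k, v in (tags or {}).items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    timestamp_ns = time.time_ns()
    with open(METRICS_LOG_PATH, "a") as f:
        f.write(f"{measurement}{tag_str} {field_str} {timestamp_ns}\n")

# Example: record a successful GPU index build
write_metric_line("gpu_index_build", {"success_count": 1}, {"service": "build-service"})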
Pros:
Most of the pros of OpenTelemetry: Flexibility to support multiple backends, Async metric forwarding, Async error retries, Forwarding custom metrics
GPU Utilization Metric: Supports input plugin for collecting GPU system metrics
Cons:
Standardization: Telegraf is a specific tool created by InfluxData with tight integration with that ecosystem. While it supports common protocols like StatsD, Prometheus, and Graphite, it is not a universal standard like OpenTelemetry. This makes Telegraf less flexible when trying to integrate with diverse observability tools or industry-wide instrumentation efforts.
Point of Failure: Telegraf relies on input plugins (like log files, HTTP endpoints, etc.), which can introduce additional points of failure. If an input plugin fails (ex. network or file write failures), metrics may be lost. We need to manually set up a way for Telegraf to fetch the metrics, such as a log file, an endpoint, or a queue, which introduces more complexity, more failure scenarios, and inconsistent metric formats.
Not Fully Asynchronous: Some parts of Telegraf, like the input plugins, might block or introduce latency if there are issues with metric collection.
Why Not?
Is not a universal standard like OpenTelemetry as Telegraf is optimized and tightly coupled to the InfluxData ecosystem
Exposes more points of failure than OpenTelemetry
2. StatsD
StatsD is a simple, fast network daemon that listens for metrics, like counters and timers, sent over UDP or TCP. It can forward metrics to different stores such as Datadog, Graphite, and others. The high level flow is the StatsD client will send metrics to the StatsD server (via UDP/TCP). The StatsD server aggregates the metrics and forwards them to the external metric store.
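For illustration, a minimal sketch using the statsd Python client, assuming a StatsD server reachable on localhost:8125; the metric names and prefix are hypothetical:
import statsd

# Connect to a StatsD server (fire-and-forget over UDP by default)
client = statsd.StatsClient("localhost", 8125, prefix="build-service")

client.incr("gpu_index_build.success")   # counter increment
client.timing("api.build.latency", 320)  # timer value in milliseconds
client.gauge("gpu.utilization", 75)      # gauge value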
Pros:
Most of the pros of Telegraf: Flexibility to support multiple backends, Async metric forwarding, Forwarding custom metrics
Simple and Lightweight: StatsD is very lightweight and easy to integrate with applications. It’s particularly suited for quick, minimal setup.
Cons:
Basic Features: StatsD is a simple protocol and lacks the full feature set of other tools like OpenTelemetry or Telegraf, especially around complex metrics, observability, batching and retries.
Async error retries: Built-in retry mechanisms are not available out-of-the-box. We would need to implement custom retry logic for failed metric forwarding, which could block the service. If the StatsD server is unreachable, metrics can be lost unless additional logic is implemented to handle retries.
Why Not?
Not a fully-featured observability framework like OpenTelemetry, so it’s often used in conjunction with Telegraf or OpenTelemetry for more robust observability solutions; it is therefore not sufficient as a future-proof solution on its own
3. Custom Metric Component
This approach builds the metric component from scratch without utilizing existing tools. It would involve the build service temporarily pushing/storing metrics into a temp storage (ex. queue), and then background processes/threads consuming that queue to push metrics to external stores. We would need to design how to standardize collecting counter, gauge, histogram metrics and the generic interface to push metrics to various pluggable backends.
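A rough sketch of what this custom approach would entail (an in-memory queue plus a background publisher thread); publish_to_backend is a placeholder for a pluggable backend client, not an existing component:
import queue
import threading

metric_queue = queue.Queue()

def publish_to_backend(batch):
    # Placeholder: push the batch to CloudWatch, Prometheus, etc.
    pass

def metric_publisher_loop():
    # Background consumer: drain the queue and publish batches off the critical path
    while True:
        batch = [metric_queue.get()]
        while not metric_queue.empty():
            batch.append(metric_queue.get())
        publish_to_backend(batch)

threading.Thread(target=metric_publisher_loop, daemon=True).start()

# Build service threads only enqueue, never block on publishing
metric_queue.put({"name": "gpu_index_build_success", "value": 1})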
High Level Design:
Pros:
Control/Customization: We would have complete control over the entire code flow of how metrics are computed, collected and published to external stores
Cons:
Complexity: Building this system from scratch introduces a lot of complexities that we need to manage, such as how many background threads to deploy, how to efficiently store metrics in a temp location, handling manual error retries, and ensuring that metric publishing is decoupled from the build service critical path
Re-Inventing the Wheel: There are existing tools that cover exactly this use-case with more robust testing, functionality and interfaces
4. Others
CloudWatch Agent, MetricBeat, Prometheus Agent, Graphite, DataDog Client, and New Relic do not support multiple backends.
Other options like FluentD (logging) or M3 (time-series database) can support metric forwarding to multiple backends, but are not built for our use-case
Metrics
The metrics for the build service can be divided into three categories: Counter, Gauge and Histogram. Use-cases of each metric are also listed.
Counter: A value that accumulates over time; it only ever goes up
4xx and 5xx API errors count
Troubleshooting: Helps with investigating failures, as error codes may give hints about the failures
Service Health/Availability: Monitor health of build service, as high error counts can indicate critical failures in service health and availability.
Object Store Upload/Download success/failure count
External Store Availability: Monitors the availability of the external object store and whether its interactions with the build service are healthy. Frequent upload/download failures may signal issues like service outages or misconfigurations
Data Integrity Monitoring: Tracking success/failure counts also reveals whether the build service may be experiencing data loss due to failed uploads
GPU Index Build success/failure count
Operational Health: Frequent build failures might indicate issues with GPU resources, model configurations, or underlying hardware.
Performance Optimization: High failure rates might necessitate reviewing and optimizing the GPU-based indexing process.
CPU to GPU Index Conversion success/failure count
System Health: High failure rates can indicate issues like poor memory management, slow communication between CPU and GPU, or bugs in the conversion algorithm.
Build Request Success/Failure count
High-Level Success/Failure Tracking: Monitors the overall success or failure of build requests (entire gpu build workflow), which is crucial for ensuring that our entire build system is working as intended.
User Experience: The success/failure ratio impacts the reliability and user satisfaction of our service.
Gauge: An instantaneous value that can go up or down over time. It reflects a value at a particular moment in time
GPU utilization
Resource Optimization: Helps ensure that our GPUs are being utilized effectively. Low utilization may indicate that the GPU is underused, potentially resulting in wasted resources, and helps us make decisions about scaling GPU resources up or down
Cost Management: Since GPU resources are expensive, monitoring utilization helps ensure that we're getting the most out of them and can adjust workloads accordingly.
Histogram: A record of the distribution of measurements over time. It allows us to track the count, sum and the distribution of values in defined buckets
Total API request latency per API
Performance Monitoring: The latency of our API requests is crucial for ensuring that the system meets performance requirements defined in SLAs
Troubleshooting: Latency metrics help identify bottlenecks in our system
Total Object Store Upload/Download time
Operational Optimization: Helps optimize the efficiency of data uploads and downloads by identifying slow external dependencies.
Total GPU Index Build time
Resource Efficiency: Tracking GPU index build time helps ensure that GPUs are being utilized effectively and helps identify performance bottlenecks.
Benchmark: Comparing the build times between GPU and CPU gives insight into how our new service compares to existing build process
Total CPU to GPU Index Conversion time
Performance Bottleneck Detection: Long conversion times could indicate inefficiencies in transferring data between CPU and GPU, affecting performance.
Total Build Request time to completion or failure
High-Level Performance Optimization: Gives insight into the duration of a build request (the entire workflow), helping identify system bottlenecks and ensuring that end-to-end build times are within acceptable limits
How Metrics are Created/Updated in OpenTelemetry
Sample code creating/updating counters/histograms/gauges:
from opentelemetry.sdk.metrics import MeterProvider

# Set up the Meter Provider
meter_provider = MeterProvider()
meter = meter_provider.get_meter(__name__)

# Create a Counter metric
request_counter = meter.create_counter(
    "requests_total", description="Total number of requests"
)

# Create a Histogram metric
latency_histogram = meter.create_histogram(
    "request_latency_seconds", description="Histogram of request latencies"
)

# Create a Gauge metric
memory_gauge = meter.create_gauge("memory_usage", description="Gauge of memory utilization")

# Example of recording metrics
request_counter.add(1, {"service": "build-service"})         # Increment the counter
latency_histogram.record(0.2, {"service": "build-service"})  # Record latency in seconds
latency_histogram.record(0.5, {"service": "build-service"})
memory_gauge.set(75.4)  # Set value for gauge
How Metrics are Exported in OpenTelemetry
This section explains how OpenTelemetry publishes metrics to the backends and what role our custom CloudWatch exporter plays in the workflow. As mentioned before, OpenTelemetry doesn’t provide CloudWatch exporters in the SDK, so we will build a custom exporter CloudWatchMetricsExporter to forward metrics. For the sake of clarity in the initial implementation, we’re going to be using the CloudWatchMetricsExporter in the diagram below, but we are supporting other pluggable exporters to different backends.
High Level Diagram
Component Definition
Build Service Process/Threads
Refers to the threads of the index build service that perform the graph build, upload/download vectors, etc. These threads are responsible for collecting and updating metrics using the SDK
BuildServiceMetrics
This is a singleton class that will house all the metrics we’re collecting and also configure the SDK exporters to push batched metrics
Provides a factory method to initialize the desired metric backend exporter based on Docker environment variable. For the initial implementation, we will be defaulting to the CloudWatch exporter.
MeterProvider
Is responsible for managing the Meter that is used to create, update and manage lifecycles of the metric instruments (Counter, Gauge, Histogram, etc)
It essentially acts as the internal SDK state of the metrics we’re collecting, so when we call counter.add(1) we’re updating the internal state within MeterProvider
The PeriodicExportingMetricReader is registered/configured with the MeterProvider so that the metric reader can collect metrics from the SDK on demand
Provides a force_flush method to immediately flush buffered metrics by invoking the Exporter. This is useful when our service is terminating in between export intervals and there are still metrics in the buffer
PeriodicExportingMetricReader
Metric reader that periodically collects and exports the buffered metrics data to a specified Exporter
The periodic export allows us to batch collected metrics to forward together for efficient data transfer
The metric reader also has an explicit timeout parameter export_timeout_millis for collecting and exporting metrics. If the timeout is reached, the metric reader will abort and try again on the next periodic interval
For the POC, the export interval will be 10 seconds and the export timeout will be 3 seconds
CloudWatchMetricsExporter
Responsible for parsing/transforming the batched metrics to publish to external metric backends (CloudWatch)
The exporter also has the same explicit timeout export_timeout_millis as the metric reader for exporting batched metrics to CloudWatch
The exporter will use exponential backoff for error retries, and the amount of time allowed for retries is capped by export_timeout_millis, ensuring exporting is not blocked indefinitely
External Metric Store
The external metric stores (CloudWatch, etc) that we’re publishing metrics to
Code Flow
First we need to instrument the service with OpenTelemetry SDK. Metrics will be created via MeterProvider and SDK PeriodicExportingMetricReader/CloudWatchMetricsExporter will be configured in the BuildServiceMetrics
Build Service Process/Threads update a metric, for instance counter.add(1), and the SDK will update the internal state of that metric
Once the batch interval is reached, the PeriodicExportingMetricReader will collect buffered metrics and forward them to the CloudWatchMetricsExporter
CloudWatchMetricsExporter will export the batch of metrics to the backend and perform error retry with exponential backoff
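As a minimal sketch of how BuildServiceMetrics could wire these components together and select the exporter from a Docker environment variable (the variable name METRICS_EXPORTER, the instance() accessor, and the metric names are assumptions; CloudWatchMetricsExporter refers to the class shown later in this document):
import os

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

class BuildServiceMetrics:
    _instance = None

    @classmethod
    def instance(cls):
        # Singleton accessor so all build service threads share the same instruments
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        exporter = self._create_exporter(os.getenv("METRICS_EXPORTER", "cloudwatch"))
        reader = PeriodicExportingMetricReader(
            exporter, export_interval_millis=10000, export_timeout_millis=3000
        )
        self.meter_provider = MeterProvider(metric_readers=[reader])
        meter = self.meter_provider.get_meter("build-service")
        self.build_success_counter = meter.create_counter("gpu_index_build_success")

    @staticmethod
    def _create_exporter(backend: str):
        # Factory for pluggable metric backends; CloudWatch is the initial default
        if backend == "cloudwatch":
            return CloudWatchMetricsExporter(namespace="IndexBuildService")
        raise ValueError(f"Unsupported metrics backend: {backend}")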
How OpenTelemetry metric types are transformed to CloudWatch format in the CloudWatchMetricsExporter
This section covers how the Exporter transforms OpenTelemetry metrics (Counter, Gauge, Histogram) to formats that are compatible with CloudWatch. The CloudWatch Unit of each metric is provided in sections below.
For Counter and Gauge, the OpenTelemetry data structure provides a value that can be parsed like:
metric_dict = {
    'MetricName': metric.name,
    'Dimensions': [{'Name': 'MetricScope', 'Value': 'ServiceMetric'}],
}

# Counters
if isinstance(metric.data, Sum):
    metric_dict['Value'] = metric.data.data_points[0].value  # parse count value
    metric_dict['Unit'] = 'Count'
    metric_data_batch.append(metric_dict.copy())
# Gauges
elif isinstance(metric.data, Gauge):
    metric_dict['Value'] = metric.data.data_points[0].value  # parse gauge value
    metric_dict['Unit'] = 'Percent'
    metric_data_batch.append(metric_dict.copy())

...

if metric_data_batch:
    self.client.put_metric_data(
        Namespace=self.namespace,
        MetricData=metric_data_batch
    )
    print(f"Successfully sent {len(metric_data_batch)} metrics to CloudWatch.")
For Histogram, the OpenTelemetry data structure provides the sum of the distribution. Using OpenTelemetry’s Histogram helps avoid the complexities of computing/maintaining a separate running sum during runtime. It also provides flexibility in the future if we want to compute additional metrics using the count, buckets, etc that the Histogram provides.
# Histograms
if isinstance(metric.data, Histogram):
    histogram_sum = metric.data.data_points[0].sum  # parse distribution sum
    metric_dict['Value'] = histogram_sum
    metric_dict['Unit'] = 'Seconds'
    metric_data_batch.append(metric_dict.copy())

...

if metric_data_batch:
    self.client.put_metric_data(
        Namespace=self.namespace,
        MetricData=metric_data_batch
    )
    print(f"Successfully sent {len(metric_data_batch)} metrics to CloudWatch.")
Sample Code of Simple Metric Collection/Exporting
This section gives a skeleton structure of a functional setup in which metrics are collected, updated and exported to CloudWatch
import time
from typing import Dict

import boto3

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality,
    Gauge,
    Histogram,
    MetricExporter,
    MetricExportResult,
    MetricsData,
    PeriodicExportingMetricReader,
    Sum,
)

# Custom CloudWatch exporter class
class CloudWatchMetricsExporter(MetricExporter):
    def __init__(
        self,
        namespace: str,
        preferred_temporality: Dict[type, AggregationTemporality] = None,
        preferred_aggregation: Dict[type, "opentelemetry.sdk.metrics.view.Aggregation"] = None,
    ):
        super().__init__(
            preferred_temporality=preferred_temporality,
            preferred_aggregation=preferred_aggregation,
        )
        self.namespace = namespace
        # Region/credential values are placeholders; in practice they are
        # provided via environment variables set in the Docker image/compose
        self.client = boto3.client(
            'cloudwatch',
            region_name="us-east-1",
            aws_access_key_id='',
            aws_secret_access_key='',
            aws_session_token=''
        )
        self.max_retries = 3

    def export(
        self,
        metrics_data: MetricsData,
        timeout_millis: float = 10_000,
        **kwargs,
    ) -> MetricExportResult:
        try:
            metric_data_batch = []
            for resource_metric in metrics_data.resource_metrics:
                for scope_metric in resource_metric.scope_metrics:
                    for metric in scope_metric.metrics:
                        # Counters
                        if isinstance(metric.data, Sum):
                            ...  # parse count value and append to metric_data_batch
                        # Gauge
                        elif isinstance(metric.data, Gauge):
                            ...  # parse gauge value and append to metric_data_batch
                        # Histogram
                        elif isinstance(metric.data, Histogram):
                            ...  # parse distribution sum and append to metric_data_batch
            if metric_data_batch:
                self.client.put_metric_data(
                    Namespace=self.namespace,
                    MetricData=metric_data_batch
                )
                print(f"Successfully sent {len(metric_data_batch)} metrics to CloudWatch.")
            return MetricExportResult.SUCCESS
        except Exception as e:
            print(f"Failed to export to CloudWatch: {e}")
            return MetricExportResult.FAILURE

    def force_flush(self, timeout_millis: float = 10_000) -> bool:
        return True

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        # Cleanup any resources, close connections, etc.
        pass

# Create the custom CloudWatch exporter
cloudwatch_exporter = CloudWatchMetricsExporter(namespace="TestingMetrics")
# Create a PeriodicExportingMetricReader to export metrics at set intervals
reader = PeriodicExportingMetricReader(
    cloudwatch_exporter,
    export_interval_millis=10000,  # Export every 10 seconds
    export_timeout_millis=3000,    # Abort export after 3 seconds
)
# Initialize MeterProvider with a reader
meter_provider = MeterProvider(metric_readers=[reader])

# Create metrics
meter = meter_provider.get_meter("my-meter")
counter = meter.create_counter("my_counter", description="A sample counter metric")
counter2 = meter.create_counter("my_counter2", description="A sample counter metric2")
gauge = meter.create_gauge("memory_usage", description="A sample gauge metric")
histogram = meter.create_histogram("response_time", description="A sample histogram metric")

# Simulate metric updates
def simulate_metrics():
    for _ in range(20):
        counter.add(1)
        counter2.add(2)
        gauge.set(75.4)        # Example memory usage in MB
        histogram.record(150)  # Record a response time of 150ms
        print("Metrics Updated")
        time.sleep(1)          # Wait for 1 second before the next increment

# Start simulating metric collection
simulate_metrics()
Failure Scenarios
This section will focus on failure scenarios for metric publishing. At a high level, we should handle error retries asynchronously to not block the build service critical path.
1. Service updating metric fails
This occurs when a build service thread tries to update a metric, such as counter.add(1), and the operation fails. When we update a metric via the SDK, we update some internal state of the metric and add it to a buffer for export. Failure in this operation is unlikely, but it can occur in the following cases
Out of Memory or Exhausted System Resources: The system runs out of memory due to a high volume of metrics being batched and exported. However, this is unlikely as the SDK is designed to handle high throughput without blocking the service
Invalid Argument: An invalid argument passed to a metric update could lead to an exception, but this is unlikely as the code to update metrics will be tested thoroughly
In these cases, there isn’t a graceful way to retry incrementing a counter, so the metric may not be updated in the SDK, but the chance of this happening is very low
2. SDK exporter sending metric fails
This occurs when the batching interval is met and the SDK exporter attempts to push metrics to the metric backend
Metric Backend is down or congested: The external backend (ex. CloudWatch, Prometheus) is down or temporarily unreachable.
Authorization/Authentication failures: The exporter may fail to authenticate or authorize the request to push data to the backend (ex. incorrect AWS credentials).
API rate limits: The external backend may hit its rate limit, causing the exporter to fail to push metrics.
Connection issues: Network issues preventing the exporter from connecting to the backend, or timeouts during the connection attempt.
If the exporter detects a failure, it will retry with exponential backoff until the export timeout is reached. If the export timeout is reached, the exporter will drop the metrics and try again on the next export interval. Having an export timeout ensures that export will not block indefinitely. It is also crucial that the export timeout is strictly less than the export interval.
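As a sketch of the retry behavior described above (the helper name, retry count, and backoff base are illustrative, not the final implementation):
import random
import time

def put_with_retries(client, namespace, metric_data_batch, export_timeout_millis, max_retries=3):
    # Exponential backoff with jitter, bounded by the export timeout
    deadline = time.monotonic() + export_timeout_millis / 1000.0
    for attempt in range(max_retries):
        try:
            client.put_metric_data(Namespace=namespace, MetricData=metric_data_batch)
            return True
        except Exception:
            backoff = (2 ** attempt) * 0.1 + random.uniform(0, 0.1)
            if time.monotonic() + backoff >= deadline:
                break  # timeout reached: drop this batch and retry on the next export interval
            time.sleep(backoff)
    return False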
3. Service Terminates In-Between Export Intervals
Since the build service is transient, it might terminate in between export intervals, leaving some metrics in the SDK buffer that have not yet been exported to the metric backend
The MeterProvider provides a force_flush method that will trigger an immediate/synchronous export of any metrics that are currently buffered in the SDK, without waiting for the periodic export interval. When the service cleans up for termination, this force_flush will be called to ensure no metrics are lost.
force_flush is a synchronous method, but it also provides a configurable timeout, meaning failures will not block service termination indefinitely.
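A minimal sketch of hooking force_flush into service shutdown, using atexit here as one possible hook (the timeout value is illustrative, and meter_provider refers to the provider configured in the earlier samples):
import atexit

def flush_metrics_on_shutdown():
    # Synchronously export any buffered metrics before the process exits;
    # the timeout ensures termination is never blocked indefinitely
    meter_provider.force_flush(timeout_millis=5000)
    meter_provider.shutdown()

atexit.register(flush_metrics_on_shutdown)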
How to Compute Metrics
This section covers how each metric is computed in the build service workflow and then updated via the SDK.
FastAPI Metrics
4xx and 5xx API errors count
OpenTelemetry: Counter
CloudWatch Unit: Count
Total API request latency per API
OpenTelemetry: Histogram
CloudWatch Unit: Seconds
We want control over which custom metrics to collect and how they are collected, so we’re going to utilize the FastAPI-native decorator @app.middleware("http") to compute our API metrics. The @app.middleware("http") decorator is provided by FastAPI and will run before and after each request is processed by an endpoint, allowing us to compute request latency and error counts.
Alternative: OpenTelemetry offers built-in/automated instrumentation for FastAPI via FastAPIInstrumentor, but it is primarily built for tracing. It automatically sets up traces for FastAPI, all configured via environment variables without explicit code changes. It’s possible to configure FastAPIInstrumentor for collecting metrics, but since we don’t plan on collecting traces, we will not be using FastAPIInstrumentor
Sample code for computing and collecting metric using @app.middleware("http"):
import time

from fastapi import APIRouter, FastAPI, Request

app = FastAPI()
router = APIRouter()

# request_latency, error_4xx_counter, and error_5xx_counter are the
# OpenTelemetry instruments created via the meter (see earlier samples)

# Middleware to track 4xx/5xx errors and latency using @app.middleware
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    end_time = time.time()
    # Track latency (Seconds)
    latency = end_time - start_time
    # Record latency to histogram
    request_latency.record(latency)
    # Increment error counts based on status code
    if 400 <= response.status_code < 500:
        error_4xx_counter.add(1)
    elif 500 <= response.status_code < 600:
        error_5xx_counter.add(1)
    return response

# Example FastAPI endpoints
@router.post("/_build")
async def build_index():
    pass

app.include_router(router)
GPU Utilization Metric
GPU utilization
OpenTelemetry: Async Gauge
CloudWatch Unit: Percent
Unlike other metrics, GPU Utilization (Async Gauge) isn’t explicitly computed and collected as part of the build service workflow. This means that this metric needs to be fetched in periodic intervals and then exported.
OpenTelemetry provides an instrument called the Asynchronous Gauge, which is the same as the Gauge but is collected once per export. It’s used when we don’t have access to the continuous changes, only to the aggregated value. We provide the Async Gauge with a callback method that fetches GPU utilization using the pynvml library. More implementation details on Async Gauge usage can be found in the OpenTelemetry documentation; a sketch of this wiring is shown after the pynvml sample below.
Alternative: Use a Gauge and spin up a background thread that will utilize the pynvml library to collect the metric in 10 second intervals. However, Async Gauge is an instrument built for collecting metrics like GPU utilization, so why re-invent the wheel. Also, it doesn’t make sense to update GPU utilization metric multiple times within an export interval, since only the last updated value of the Gauge will be exported.
Sample code for using pynvml to fetch metric:
import pynvml

# Initialize NVML
pynvml.nvmlInit()
try:
    # Get the number of GPUs
    device_count = pynvml.nvmlDeviceGetCount()
    if device_count == 0:
        pass
    else:
        # Iterate through each GPU to get utilization info
        for i in range(device_count):
            # Get handle for the GPU
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # Get the GPU utilization (percentage)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            # GPU utilization -> utilization.gpu
            # GPU memory utilization -> utilization.memory
            # Get memory usage
            memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            # Total memory -> memory_info.total
            # Free memory -> memory_info.free
            # Used memory -> memory_info.used
except pynvml.NVMLError as error:
    print(f"An error occurred: {error}")
finally:
    # Shutdown NVML after usage
    pynvml.nvmlShutdown()
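A minimal sketch of wiring the pynvml calls above into an Asynchronous (Observable) Gauge, assuming a meter created as in the earlier samples; the metric name and attribute key are illustrative:
from opentelemetry.metrics import CallbackOptions, Observation
import pynvml

def gpu_utilization_callback(options: CallbackOptions):
    # Called once per export interval by the SDK
    pynvml.nvmlInit()
    try:
        observations = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            observations.append(Observation(utilization.gpu, {"gpu_index": str(i)}))
        return observations
    finally:
        pynvml.nvmlShutdown()

gpu_utilization_gauge = meter.create_observable_gauge(
    "gpu_utilization",
    callbacks=[gpu_utilization_callback],
    description="GPU utilization percentage",
)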
Object Store Metrics
Object Store Upload/Download success/failure count
OpenTelemetry: Counter
CloudWatch Unit: Count
The metric will be incremented as part of the build service code flow each time object store operations succeed/fail.
Total Object Store Upload/Download time
OpenTelemetry: Histogram
CloudWatch Unit: Seconds
The metric will be computed by a timer in the code flow of object store operations, then recorded in the metric’s Histogram
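The timer-plus-Histogram pattern described here (and reused for the GPU build, conversion, and build request timings below) could look like the following sketch; the helper and instrument names are illustrative:
import time
from contextlib import contextmanager

@contextmanager
def record_duration(histogram, attributes=None):
    # Times the wrapped operation and records the elapsed seconds to the given Histogram
    start = time.time()
    try:
        yield
    finally:
        histogram.record(time.time() - start, attributes or {})

# Example usage (object_store_download_histogram created via the meter):
# with record_duration(object_store_download_histogram, {"operation": "download"}):
#     download_vectors_from_object_store()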
GPU Index Build Metrics
GPU Index Build success/failure count
OpenTelemetry: Counter
CloudWatch Unit: Count
The metric will be incremented as part of build service code flow each time GPU index build operation succeeds/fails
Total GPU Index Build time
OpenTelemetry: Histogram
CloudWatch Unit: Seconds
The metric will be computed by a timer in the code flow of GPU index build operation, then recorded in the metric’s Histogram
CPU/GPU Conversion Metrics
CPU to GPU Index Conversion success/failure count
OpenTelemetry: Counter
CloudWatch Unit: Count
The metric will be incremented as part of the build service code flow each time the conversion operation succeeds/fails
Total CPU to GPU Index Conversion time
OpenTelemetry: Histogram
CloudWatch Unit: Seconds
The metric will be computed by a timer in the code flow of CPU/GPU conversion operation, then recorded in the metric’s Histogram
Build Job Metrics
Build Request Success/Failure count
OpenTelemetry: Counter
CloudWatch Unit: Count
The metric will be incremented as part of the build service code flow at any potential failure point in the entire build workflow. For instance, if object store download fails, causing the build to fail, then failure count will increment. If entire workflow succeeds and we mark build job as completed, then success count will increment.
Total Build Request time to completion or failure
OpenTelemetry: Histogram
CloudWatch Unit: Seconds
The metric will be computed by a timer that starts when the build job is started by the build service background thread and stops either on failure or when the build completes and the job is marked as completed. The duration is then recorded in the metric’s Histogram
Authentication/Configuration
This section mainly focuses on the credentials/auth/configuration needed to publish metrics to external stores.
For instance, AWS credentials for CloudWatch will be accessed via environment variables set in the Docker image/compose. The user is responsible for providing these credentials and configuration settings.
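For example, boto3 automatically reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN from the environment, so the exporter only needs a region; shown here as a sketch, with the fallback region being illustrative:
import os

import boto3

# Credentials are picked up from the container environment set in Docker compose;
# nothing secret is hard-coded in the exporter itself
cloudwatch_client = boto3.client(
    "cloudwatch",
    region_name=os.environ.get("AWS_REGION", "us-east-1"),
)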
Language and Service Choice
Language Choice: Python
As mentioned in the Remote Index Build Service Design, the Index Build Service will be written in Python, so the metric component will maintain this choice for direct compatibility.
Metric Services: CloudWatch
The metric component is designed to be pluggable by various external metric services, but we will be implementing the POC using CloudWatch.
Future Improvements
Appendix
POC for Metric Component
This section goes over a simple POC of the metrics component that collects API metrics: error counts and latency.
The flow of this POC is that whenever a request comes in for the /build, /cancel, or /status endpoints, the FastAPI middleware will be triggered to compute and collect metrics. Metrics and exporters are initialized in BuildServiceMetrics, and the custom exporter for CloudWatch is implemented in CloudWatchMetricsExporter.
Docker setting to disable metrics
This allows customers to enable or disable metrics for remote index build service. This is for the use-case when customers don’t want to spend additional compute/storage resources for metric aggregation
The approach will have a Docker environment variable that the index build service reads when spinning up. If metric aggregation is enabled, the build service will set up metrics and exporters via the SDK; otherwise, the metric setup is skipped.
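A minimal sketch of that toggle, assuming an environment variable named METRICS_ENABLED (the variable name and default are assumptions, and BuildServiceMetrics.instance() refers to the singleton accessor sketched earlier):
import os

METRICS_ENABLED = os.getenv("METRICS_ENABLED", "true").lower() == "true"

if METRICS_ENABLED:
    # Set up MeterProvider, PeriodicExportingMetricReader, and the exporter via the SDK
    metrics = BuildServiceMetrics.instance()
else:
    # Skip metric setup entirely; no additional compute/storage is spent on metrics
    metrics = None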