
Commit 34b7690

[chore] System Semantic Conventions Non-Normative Guidance (#1618)
Co-authored-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
1 parent f965a22 commit 34b7690

5 files changed: +392 −1 lines changed

.github/CODEOWNERS (+1)

```diff
@@ -53,6 +53,7 @@
 /model/os/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers
 /model/process/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers @open-telemetry/semconv-security-approvers
 /model/system/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers
+/docs/non-normative/groups/system @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-system-approvers

 # Mobile semantic conventions
 /docs/mobile/ @open-telemetry/specs-semconv-approvers @open-telemetry/semconv-mobile-approvers
```

.prettierignore (+4)

```diff
@@ -7,6 +7,10 @@
 !/docs/cloud*/**
 !/docs/attributes-registry*
 !/docs/attributes-registry*/**
+!/docs/non-normative*
+!/docs/non-normative/groups*
+!/docs/non-normative/groups/system*
+!/docs/non-normative/groups/system*/**
 /model
 /schemas
```
CHANGELOG.md
@@ -0,0 +1,260 @@
# System Semantic Conventions: Instrumentation Design Philosophy

The System Semantic Conventions are caught in a strange dichotomy that is unique
among semconv groups. While we want to make sure we cover obvious generic use
cases, monitoring system health is a very old practice with many different
existing strategies. While we can cover the basic use cases in cross-platform
ways, we want to make sure that users who specialize in certain platforms aren't
left in the lurch; if users aren't given recommendations for particular types of
data that aren't cross-platform and universal, they may come up with their own
disparate ideas for how that instrumentation should look, leading to the kind of
fracturing that the semantic conventions exist to avoid.

The following sections address some of the most common instrumentation design
questions and how we as a working group have opted to address them. In some
cases our answers diverge from the common semantic conventions guidance due to
our unique circumstances; those cases are called out specifically.
## Namespaces

Relevant discussions:
[\#1161](https://github.com/open-telemetry/semantic-conventions/issues/1161)

The System Semantic Conventions generally cover the following namespaces:

- `system`
- `process`
- `host`
- `memory`
- `network`
- `disk`
- `os`
Deciding on the namespace of a metric/attribute is generally informed by the
following belief:

**The namespace of a metric/attribute should logically map to the Operating
System concept being considered as the instrumentation source.**

The most obvious example of this is with language runtime metrics and `process`
namespace metrics. Many of these metrics are very similar; most language
runtimes provide some manner of `cpu.time`, `memory.usage`, and similar metrics.
If we were considering de-duplication as the top value in our design, it would
follow that `process.cpu.time` and `process.memory.usage` should simply be
referenced by any language runtime that might produce those metrics. However, as
a working group we believe it is important that `process` namespace and runtime
namespace metrics remain separate, because `process` metrics are meant to
represent an **OS-level process as the instrumentation source**, whereas runtime
metrics represent **the language runtime as the instrumentation source**.
In some cases this is simply a matter of making the instrumentation's purpose as
clear as possible, but there are cases where attempts to share definitions
across distinct instrumentation sources pose the potential for a clash. The
concrete example of a time we accepted this consequence is with `cpu.mode`; the
decision was to
[unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attribute](https://github.com/open-telemetry/semantic-conventions/issues/1139).
The consequence of this is that `cpu.mode` needs to have a broad enum in its
root definition, with special exemptions in each different `ref` of `cpu.mode`,
since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs
`system.cpu.time` etc. has different subsets of the overall enum values. We
decided as a group to accept the consequence in this case; however, it isn't
something we're keen on dealing with all over system semconv, as the
instrumentation would end up polluted with so many edge cases in each namespace
that it defeats the purpose of sharing the attribute in the first place.
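The enum-subset bookkeeping described above can be sketched in a few lines. This is an illustrative sketch only: the mode lists below are examples, not the actual semconv definitions.

```python
# Illustrative sketch of the `cpu.mode` situation: one broad root enum,
# with each referencing metric allowing only a subset of it.
# (The mode lists here are examples, not the real semconv definitions.)
CPU_MODE_ROOT_ENUM = {
    "user", "system", "nice", "idle", "iowait", "interrupt", "steal", "kernel",
}

# Per-metric exemptions: each `ref` of cpu.mode narrows the enum differently.
ALLOWED_MODES = {
    "process.cpu.time": {"user", "system"},
    "system.cpu.time": {"user", "system", "nice", "idle", "iowait", "steal"},
    "container.cpu.time": {"user", "system", "kernel"},
}

def check_refs() -> None:
    """Every per-metric subset must stay within the root enum -- this is the
    kind of per-namespace bookkeeping the working group wants to avoid
    repeating for other shared attributes."""
    for metric, modes in ALLOWED_MODES.items():
        extra = modes - CPU_MODE_ROOT_ENUM
        assert not extra, f"{metric} uses modes outside the root enum: {extra}"
```

Each new `ref` of the shared attribute adds another entry to maintain, which is why sharing was accepted for `cpu.mode` but is not a pattern the group wants everywhere.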
## Two Class Design Strategy

Relevant discussions:
[\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634)

We are considering two personas for system semconv instrumentation. For each
piece of instrumentation, we decide which persona it is meant for and use that
to decide how we should name and treat it.

### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access

When instrumentation is meant for the General Class, we will strive to make the
names and examples as prescriptive as possible. This instrumentation is what
will drive the most important use cases we really want to cover with the system
semantic conventions. Things like dashboards, alerts, and broader observability
setup tutorials will largely feature General Class instrumentation covering the
[basic use cases][use cases doc] we have laid out as a group. We want it to be
very clear exactly how and when this instrumentation should be used. General
Class instrumentation will be recommended as **on by default**.
### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use

When instrumentation falls into the Specialist Class, we are assuming the target
audience is already familiar with the concept and knows exactly what they are
looking for and why. The goal for Specialist Class instrumentation is to ensure
that users who have very specific and detailed needs are still covered by our
semantic conventions, so they don't need to go out of their way coming up with
their own, risking the same kind of disparate instrumentation problem that
semantic conventions are intended to solve. The main differences in how we
handle Specialist Class instrumentation are:

1. The names and resulting values will map directly to what a user would expect
   hunting down the information themselves. We will rarely be prescriptive in
   how the information should be used or how it should be broken down. For
   example, a metric to represent a process's cgroup would have the resulting
   value match exactly what the user would get by calling
   `cat /proc/PID/cgroup`.
2. If a piece of instrumentation is specific to a particular operating system,
   the name of the operating system will be in the instrumentation name. See
   [Operating System in names](#operating-system-in-names) for more information.
   For example, a metric for a process's cgroup would be `process.linux.cgroup`,
   given that cgroups are a specific Linux kernel feature.
### Examples

Some General Class examples:

- Memory/CPU usage and utilization metrics
- General disk and network metrics
- Universal system/process information (names, identifiers, basic specs)

Some Specialist Class examples:

- Particular Linux features like special process/system information in procfs
  (see things like
  [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or
  [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html))
- Particular Windows features like special process information (see things like
  [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects),
  [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set))
- Niche process information like open file descriptors, page faults, etc.
## Instrumentation Design Guide

When designing new instrumentation we will follow these steps as closely as
possible:

### Choosing Instrumentation Class

In System Semantic Conventions, the most important questions when deciding
whether a piece of instrumentation is General or Specialist are:

- Is it cross-platform?
- Does it support our [most important use cases][use cases doc]?

The answer to both of these questions will likely need to be "Yes" for the
instrumentation to be considered General Class. Since General Class
instrumentation is what we expect the widest audience to use, we will scrutinize
it more closely to ensure all of it is as necessary and useful as possible.

If the answer to either one of these is "No", then we will likely consider it
Specialist Class.
### Naming

For General Class, choose a name that most accurately describes the general
concept without biasing to a platform. Lean towards simplicity where possible,
as this is the instrumentation that will be used by the widest audience; we want
it to be as easy to understand and as ergonomic to use as possible.

For Specialist Class, choose a name that most directly matches the words
generally used to describe the concept in context. Since this instrumentation
will be optional, and likely sought out by people who already know exactly what
they want out of it, we can prioritize matching the names as closely to their
definition as possible. For Specialist Class metrics that are platform
exclusive, we will include the OS in the namespace as a sub-namespace (not the
root namespace) if it is unlikely that the same metric name could ever be
applied in a cross-platform manner. See
[this section](#operating-system-in-names) for more details.
### Value

For General Class, we can be prescriptive with the value of the
instrumentation. We want to ensure General Class instrumentation most closely
matches our vision for our general use cases, and we want to ensure that users
who are not specialists and just want the most important basic information can
acquire it as easily as possible using out-of-the-box semconv instrumentation.
This means that within General Class instrumentation we are more likely to make
judgements about exactly what the value should be, and whether the value should
be reshaped by instrumentation when pulling it from its source if that serves
general-purpose use cases.

For Specialist Class, we should strive not to be prescriptive and instead match
the concept being modeled as closely as possible. We expect Specialist Class
instrumentation to be enabled by the people who already understand it. In a
System Semconv context, these may be things a user previously gathered manually
or through existing OS tools that they want to model as OTLP.
### Case study: `process.cgroup`

Relevant discussions:
[\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357),
[\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509)

In the `hostmetricsreceiver`, there is a Resource Attribute called
`process.cgroup`. How should this attribute be adopted in System Semantic
Conventions?

Based on our definitions, this attribute would fall under Specialist Class:

- `cgroups` are a Linux-specific feature
- It is not directly part of any of the default out-of-the-box use cases we
  want to cover

In this attribute's case, there are two important considerations when deciding
on the name:

- The attribute is Specialist Class
- It is Linux exclusive, and is unlikely to ever be introduced in other
  operating systems, since the other major platforms have their own versions of
  it (Windows Job Objects, BSD Jails, etc.)

This means we should pick a name that matches the verbiage used by specialists
in context when referring to this concept. The way you would refer to this would
be "a process's cgroup, collected from `/proc/<pid>/cgroup`". So we would start
with the name `process.cgroup`. We also determined that this attribute is
Linux-exclusive and are confident it will remain as such, so we land on the name
`process.linux.cgroup`.
Since this attribute falls under Specialist Class, we don't want to be too
prescriptive about the value. A user who needs to know the cgroup of a process
likely already has a pretty good idea of how to interpret and use it, and it
would not be worth it for this Working Group to try to come up with every
possible edge case for how it might be used. It is much simpler for this
attribute, insofar as it falls under our purview, to simply reflect the value
from the OS, i.e. the direct value from `cat /proc/<pid>/cgroup`. With cgroups
in particular, there is a high likelihood that more specialized semconv
instrumentation could be developed, particularly in support of more specialized
container runtime or systemd instrumentation. It is more useful for a working
group developing special instrumentation that leverages cgroups to be
prescriptive about how the cgroup information should be interpreted and broken
down with more specificity.
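A minimal sketch of the "reflect the OS value verbatim" approach, assuming the `process.linux.cgroup` name discussed above (the function name and `procfs` override parameter are ours for illustration, not part of any spec):

```python
from pathlib import Path

def process_linux_cgroup(pid: int, procfs: str = "/proc") -> str:
    """Value for the (hypothetical) `process.linux.cgroup` resource attribute:
    the verbatim contents of /proc/<pid>/cgroup -- exactly what
    `cat /proc/<pid>/cgroup` prints, with no reshaping or interpretation.

    Specialist Class: we pass the OS-provided value through unchanged and
    leave interpretation to consumers who already understand cgroups."""
    return Path(procfs, str(pid), "cgroup").read_text().rstrip("\n")
```

The only transformation is trimming the trailing newline; any further parsing (hierarchy IDs, controller lists, paths) is deliberately left to more specialized instrumentation.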
## Operating System in names

Relevant discussions:
[\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255),
[\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994)

Monitoring operating systems is an old practice, and there are numerous heavily
differing approaches within different platforms. There are lots of metrics, even
considering common stats like memory usage, where there are platform-exclusive
pieces of information that are only valuable to those who specialize in that
platform.

Thus we have decided that any instrumentation that is:

1. Specific to a particular operating system
2. Not meant to be part of what we consider our most important general use cases

will have the Operating System name as part of the namespace.

For example, there may be `process.linux`, `process.windows`, or `process.posix`
names for metrics and attributes. We will not have root `linux.*`, `windows.*`,
or `posix.*` namespaces. This is because of the principle we're trying to uphold
from the [Namespaces section](#namespaces); we still want the instrumentation
source to be represented by the root namespace of the attribute/metric. If we
had OS root namespaces, different sources like `system`, `process`, etc. could
get very tangled within each OS namespace, defeating the intended design
philosophy.
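The naming rule above can be expressed as a tiny check; this is an illustrative sketch only, and the token set below is an example, not an official semconv registry:

```python
from typing import Optional

# Example OS tokens; illustrative, not an official list.
OS_TOKENS = {"linux", "windows", "posix"}

def os_sub_namespace(name: str) -> Optional[str]:
    """Apply the rule above: an OS name may appear as a sub-namespace
    (`process.linux.cgroup`), never as the root namespace (`linux.*`)."""
    parts = name.split(".")
    if parts[0] in OS_TOKENS:
        raise ValueError(f"root OS namespace not allowed: {name}")
    # Return the OS sub-namespace token, if any.
    return next((p for p in parts[1:] if p in OS_TOKENS), None)
```

So `process.linux.cgroup` passes with `linux` as the sub-namespace, while a hypothetical `linux.process.cgroup` would be rejected because the instrumentation source is no longer the root namespace.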
[use cases doc]: ./use-cases.md
use-cases.md (+115)

@@ -0,0 +1,115 @@
# **System Semantic Conventions: General Use Cases**

This document is a collection of the use cases that we want to cover with the
System Semantic Conventions. The use cases outlined here inform the working
group's decisions around what instrumentation is considered **required**. Use
cases in this document are stated in a generic way that does not refer to any
instrumentation that may already exist in semconv as of writing, so that when we
do dig into specific instrumentation, we understand its importance based on our
holistic view of expected use cases.
## _Legend_

`General Information` = The information that should be discoverable through the
entity, metrics, or metric attributes.

`Dashboard` = The information that should be attainable through metrics to
create a comprehensive dashboard.

`Alerts` = Some examples of common alerts that should be creatable with the
available information.
## **Host**

A user should be able to monitor the health of a host, including monitoring
resource consumption and unexpected errors due to resource exhaustion or
malfunction of core components of a host or fleet of hosts (network stack,
memory, CPU, etc.).

### General Information

- Machine name
- ID (relevant to its context; could be a cloud provider ID or just a base
  machine ID)
- OS information (platform, version, architecture, etc.)
- CPU information
- Memory capacity
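As a sketch, the general information above maps naturally onto flat resource-style key-value pairs. The attribute names below follow semconv-style naming (`host.name`, `os.type`, etc.) but the exact set shown here is illustrative, not authoritative:

```python
import platform

def host_general_info() -> dict:
    """Collect the 'General Information' items above as flat key-value pairs,
    using semconv-style names (illustrative, not authoritative)."""
    return {
        "host.name": platform.node(),          # machine name
        "host.arch": platform.machine(),       # architecture
        "os.type": platform.system().lower(),  # platform
        "os.version": platform.release(),      # version
    }
```

Cloud-specific IDs, CPU details, and memory capacity would come from platform-specific sources, which is exactly where the General/Specialist split starts to matter.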
### Dashboard

- Memory utilization
- CPU utilization
- Disk utilization
- Disk throughput
- Network traffic

### Alerts

- VM is down unexpectedly
- Network activity spikes unexpectedly
- Memory/CPU/Disk utilization goes above a % threshold
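One way to picture the value of cross-platform instrumentation here: a single alert rule evaluated uniformly over a mixed fleet. This is an illustrative sketch, not shipped tooling; it assumes each host reports a utilization value in [0, 1], whatever its OS:

```python
def utilization_alerts(latest: dict, threshold: float = 0.9) -> list:
    """Return hosts whose latest utilization breaches the threshold.

    `latest` maps host name -> most recent utilization value in [0, 1].
    The rule never branches on operating system; that uniformity is what
    cross-platform General Class instrumentation buys for a heterogeneous
    fleet."""
    return sorted(host for host, value in latest.items() if value > threshold)
```

For example, `utilization_alerts({"linux-1": 0.95, "win-1": 0.42, "bsd-1": 0.91})` flags `bsd-1` and `linux-1` with no per-platform logic.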
### Notes

The alerts in particular should be capable of being uniformly applied to a
heterogeneous fleet of hosts. We value cross-platform instrumentation here
because it allows for effective alerting across a fleet regardless of the
potential mixture of operating system platforms within it.

The term `host` can mean different things in other contexts:

- The term `host` in a network context: a central machine that many others are
  networked to
- The term `host` in a virtualization context: something that is hosting
  virtual guests such as VMs or containers

In this context, a host is generally considered to be some individual machine,
physical or virtual. This can be extra confusing, because a unique machine
`host` can also be a network `host` or virtualization `host` at the same time.
This is a complexity we will have to accept due to the fact that the `host`
namespace is deeply embedded in existing OpenTelemetry instrumentation and
general verbiage. To the best of our ability, network and virtualization `host`
instrumentation will be kept distinct by being placed within other namespaces
that clearly denote which version of the term `host` is being referred to, while
the root `host` namespace will refer to an individual machine.
## **Process**

A user should be able to monitor the health of an arbitrary process using data
provided by the OS. Reasons a user may want this:

1. The process they want to monitor doesn't have in-process runtime-specific
   instrumentation enabled, or is not instrumentable at all, such as an
   antivirus or another background process.
2. They are monitoring lots of processes and want to have a set of uniform
   instrumentation for all of them.
3. Personal preference/legacy reasons; they might already be using OS signals
   for monitoring, and it's an easier lift for them to move to basic process
   instrumentation first, then move to other specific semconv over time.
### General Information

- Process name
- PID
- User/owner

### Dashboard

- Physical memory usage and/or utilization
- Virtual memory usage
- CPU usage and/or utilization
- Disk throughput
- Network throughput

### Alerts

- Process stops unexpectedly
- Memory/CPU usage/utilization goes above a threshold
- Memory exclusively rises over a period of time (memory leak detection)
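The memory-leak alert above can be approximated with a naive heuristic. This is an illustrative sketch only; the window size and strict monotonicity are arbitrary choices, and real detectors are usually more tolerant of noise:

```python
def looks_like_leak(samples: list, min_samples: int = 10) -> bool:
    """Flag a possible leak when memory usage rises monotonically across
    the whole observation window of recent samples (oldest first)."""
    if len(samples) < min_samples:
        return False  # not enough history to judge
    # "Exclusively rises": every sample strictly exceeds the previous one.
    return all(b > a for a, b in zip(samples, samples[1:]))
```

A rule like this only works uniformly across processes if the underlying memory metric is defined consistently, which is why memory usage sits in the General Class.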
### Notes

On top of alerts and dashboards, we will also consider the basic benchmarking of
a process to be a general use case. The basic stats that can be provided in a
cross-platform manner can also be effectively used for this, and we will
consider that when making decisions about process instrumentation.
