| 1 | +# System Semantic Conventions: Instrumentation Design Philosophy |
| 2 | + |
| 3 | +The System Semantic Conventions are caught in a strange dichotomy that is unique |
| 4 | +among semconv groups. While we want to make sure we cover obvious generic |
| 5 | +use cases, monitoring system health is a very old practice with lots of |
| 6 | +different existing strategies. While we can cover the basic use cases in cross |
| 7 | +platform ways, we want to make sure that users who specialize in certain |
| 8 | +platforms aren't left in the lurch; if users aren't given recommendations for |
| 9 | +particular types of data that aren't cross-platform and universal, they may come |
| 10 | +up with their own disparate ideas for how that instrumentation should look, |
| 11 | +leading to the kind of fracturing that the semantic conventions should be in |
| 12 | +place to avoid. |
| 13 | + |
| 14 | +The following sections address some of the most common instrumentation design |
| 15 | +questions, and how we as a working group have opted to address them. In some |
| 16 | +cases our answers diverge from the common semantic conventions guidance due to |
| 17 | +our unique circumstances, and those cases will be called out specifically. |
| 18 | + |
| 19 | +## Namespaces |
| 20 | + |
| 21 | +Relevant discussions: |
| 22 | +[\#1161](https://github.com/open-telemetry/semantic-conventions/issues/1161) |
| 23 | + |
| 24 | +The System Semantic Conventions generally cover the following namespaces: |
| 25 | + |
| 26 | +- `system` |
| 27 | +- `process` |
| 28 | +- `host` |
| 29 | +- `memory` |
| 30 | +- `network` |
| 31 | +- `disk` |
| 33 | +- `os` |
| 34 | + |
| 35 | +Deciding on the namespace of a metric/attribute is generally informed by the |
| 36 | +following belief: |
| 37 | + |
| 38 | +**The namespace of a metric/attribute should logically map to the Operating |
| 39 | +System concept being considered as the instrumentation source.** |
| 40 | + |
| 41 | +The most obvious example of this is with language runtime metrics and `process` |
| 42 | +namespace metrics. Many of these metrics are very similar; most language |
| 43 | +runtimes provide some manner of `cpu.time`, `memory.usage` and similar metrics. |
| 44 | +If we were considering de-duplication as the top value in our design, it would |
| 45 | +follow that `process.cpu.time` and `process.memory.usage` should simply be |
| 46 | +referenced by any language runtime that might produce those metrics. However, as |
| 47 | +a working group we believe it is important that `process` namespace and runtime |
| 48 | +namespace metrics remain separate, because `process` metrics are meant to |
| 49 | +represent an **OS-level process as the instrumentation source**, whereas runtime |
| 50 | +metrics represent **the language runtime as the instrumentation source**. |
| 51 | + |
| 52 | +In some cases this is simply a matter of making the instrumentation's purpose as |
| 53 | +clear as possible, but there are cases where attempts to share definitions |
| 54 | +across distinct instrumentation sources pose the potential for a clash. The |
| 55 | +concrete example of a time we accepted this consequence is with `cpu.mode`; the |
| 56 | +decision was to |
| 57 | +[unify all separate instances of `*.cpu.state` attributes into one shared `cpu.mode` attribute](https://github.com/open-telemetry/semantic-conventions/issues/1139). |
| 58 | +The consequence of this is that `cpu.mode` needs to have a broad enum in its |
| 59 | +root definition, with special exemptions in each different `ref` of `cpu.mode`, |
| 60 | +since `cpu.mode` used in `process.cpu.time` vs `container.cpu.time` vs |
| 61 | +`system.cpu.time` etc. has different subsets of the overall enum values. We |
| 62 | +decided as a group to accept the consequence in this case; however, it isn't |
| 63 | +something we're keen on dealing with all over system semconv, as the |
| 64 | +instrumentation ends up polluted with so many edge cases in each namespace that |
| 65 | +it defeats the purpose of sharing the attribute in the first place. |
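The cost described above can be sketched as a broad root enum plus a per-metric allow-list. This is only an illustration: the enum members and the subsets assigned to each metric are placeholder assumptions, not the values defined in the semantic conventions registry, and `is_valid_mode` is a hypothetical helper.

```python
from enum import Enum


class CpuMode(Enum):
    """Broad root enum for the shared `cpu.mode` attribute (illustrative members)."""

    USER = "user"
    SYSTEM = "system"
    IDLE = "idle"
    IOWAIT = "iowait"


# Hypothetical exemptions: each `ref` of `cpu.mode` narrows the root enum to
# the subset that makes sense for its own instrumentation source.
ALLOWED_MODES = {
    "system.cpu.time": {CpuMode.USER, CpuMode.SYSTEM, CpuMode.IDLE, CpuMode.IOWAIT},
    "process.cpu.time": {CpuMode.USER, CpuMode.SYSTEM},
    "container.cpu.time": {CpuMode.USER, CpuMode.SYSTEM},
}


def is_valid_mode(metric: str, mode: CpuMode) -> bool:
    """Return True if `mode` is permitted for `metric` in the table above."""
    return mode in ALLOWED_MODES.get(metric, set())
```

Every additional shared attribute would need its own table like `ALLOWED_MODES`, which is the edge-case growth the working group wants to avoid.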
| 66 | + |
| 67 | +## Two Class Design Strategy |
| 68 | + |
| 69 | +Relevant discussions: |
| 70 | +[\#1403 (particular comment)](https://github.com/open-telemetry/semantic-conventions/issues/1403#issuecomment-2368815634) |
| 71 | + |
| 72 | +We are considering two personas for system semconv instrumentation. If we have a |
| 73 | +piece of instrumentation, we decide which persona it is meant for and use that |
| 74 | +to make the decision for how we should name/treat that piece of instrumentation. |
| 75 | + |
| 76 | +### General Class: A generalized cross-platform use case we want any user of instrumentation to be able to easily access |
| 77 | + |
| 78 | +When instrumentation is meant for the General Class, we will strive to make the |
| 79 | +names and examples as prescriptive as possible. This instrumentation is what |
| 80 | +will drive the most important use cases we really want to cover with the system |
| 81 | +semantic conventions. Things like dashboards, alerts, and broader o11y setup |
| 82 | +tutorials will largely feature General Class instrumentation covering the [basic |
| 83 | +use cases][use cases doc] we have laid out as a group. We want it to be very |
| 84 | +clear exactly how and when this instrumentation should be used. |
| 85 | +General Class instrumentation will be recommended as **on by default**. |
| 86 | + |
| 87 | +### Specialist Class: A more specific use case that specialists could enable to get more in-depth information that they already understand how to use |
| 88 | + |
| 89 | +When instrumentation falls into the Specialist Class, we are assuming the target |
| 90 | +audience is already familiar with the concept and knows exactly what they are |
| 91 | +looking for and why. The goal for Specialist Class instrumentation is to ensure |
| 92 | +that users who have very specific and detailed needs are still covered by our |
| 93 | +semantic conventions so they don't need to go out of their way coming up with |
| 94 | +their own, risking the same kind of disparate instrumentation problem that |
| 95 | +semantic conventions are intended to solve. The main differences in how we |
| 96 | +handle Specialist Class instrumentation are: |
| 97 | + |
| 98 | +1. The names and resulting values will map directly to what a user would expect |
| 99 | + hunting down the information themselves. We will rarely be prescriptive in |
| 100 | + how the information should be used or how it should be broken down. For |
| 101 | + example, a metric to represent a process's cgroup would have the resulting |
| 102 | + value match exactly to what the result would be if the user called |
| 103 | + `cat /proc/PID/cgroup`. |
| 104 | +2. If a piece of instrumentation is specific to a particular operating system, |
| 105 | + the name of the operating system will be in the instrumentation name. See |
| 106 | + [Operating System in names](#operating-system-in-names) for more information. |
| 107 | + For example, a metric for a process's cgroup would be `process.linux.cgroup`, |
| 108 | + given that cgroups are a specific Linux kernel feature. |
| 109 | + |
| 110 | +### Examples |
| 111 | + |
| 112 | +Some General Class examples: |
| 113 | + |
| 114 | +- Memory/CPU usage and utilization metrics |
| 115 | +- General disk and network metrics |
| 116 | +- Universal system/process information (names, identifiers, basic specs) |
| 117 | + |
| 118 | +Some Specialist Class examples: |
| 119 | + |
| 120 | +- Particular Linux features like special process/system information in procfs |
| 121 | + (see things like |
| 122 | + [/proc/meminfo](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html) or |
| 123 | + [cgroups](https://man7.org/linux/man-pages/man7/cgroups.7.html)) |
| 124 | +- Particular Windows features like special process information (see things like |
| 125 | + [Windows Handles](https://learn.microsoft.com/windows/win32/sysinfo/about-handles-and-objects), |
| 126 | + [Process Working Set](https://learn.microsoft.com/windows/win32/procthread/process-working-set)) |
| 127 | +- Niche process information like open file descriptors, page faults, etc. |
| 128 | + |
| 129 | +## Instrumentation Design Guide |
| 130 | + |
| 131 | +When designing new instrumentation we will follow these steps as closely as |
| 132 | +possible: |
| 133 | + |
| 134 | +### Choosing Instrumentation Class |
| 135 | + |
| 136 | +In System Semantic Conventions, the most important questions when deciding |
| 137 | +whether a piece of instrumentation is General or Specialist would be: |
| 138 | + |
| 139 | +- Is it cross-platform? |
| 140 | +- Does it support our [most important use cases][use cases doc]? |
| 142 | + |
| 143 | +The answer to both these questions will likely need to be "Yes" for the |
| 144 | +instrumentation to be considered General Class. Since the General Class |
| 145 | +instrumentation is what we expect the widest audience to use, we will need to |
| 146 | +scrutinize it more closely to ensure all of it is as necessary and useful as |
| 147 | +possible. |
| 148 | + |
| 149 | +If the answer to either one of these is "No", then we will likely consider it |
| 150 | +Specialist Class. |
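The two questions above reduce to a simple conjunction. The sketch below is a starting point for discussion rather than a binding rule, since the text says the answers "likely" determine the class; `Candidate` and `choose_class` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A proposed piece of instrumentation and the answers to the two questions."""

    name: str
    cross_platform: bool
    supports_core_use_cases: bool


def choose_class(candidate: Candidate) -> str:
    """Suggest 'general' only when both answers are yes, otherwise 'specialist'."""
    if candidate.cross_platform and candidate.supports_core_use_cases:
        return "general"
    return "specialist"
```

For example, `choose_class(Candidate("process.cgroup", False, False))` suggests `"specialist"`, matching the case study later in this document.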
| 151 | + |
| 152 | +### Naming |
| 153 | + |
| 154 | +For General Class, choose a name that most accurately describes the general |
| 155 | +concept without biasing to a platform. Lean towards simplicity where possible, |
| 156 | +as this is the instrumentation that will be used by the widest audience; we want |
| 157 | +it to be as clear to understand and ergonomic to use as possible. |
| 158 | + |
| 159 | +For Specialist Class, choose a name that most directly matches the words |
| 160 | +generally used to describe the concept in context. Since this instrumentation |
| 161 | +will be optional, and likely sought out by the people who already know exactly |
| 162 | +what they want out of it, we can prioritize matching the names as closely to |
| 163 | +their definition as possible. For specialist class metrics that are platform |
| 164 | +exclusive, we will include the OS in the namespace as a sub-namespace (not the |
| 165 | +root namespace) if it is unlikely that the same metric name could ever be |
| 166 | +applied in a cross-platform manner. See |
| 167 | +[this section](#operating-system-in-names) for more details. |
| 168 | + |
| 169 | +### Value |
| 170 | + |
| 171 | +For General Class, we can be prescriptive with the value of the |
| 172 | +instrumentation. We want to ensure General Class instrumentation most closely |
| 173 | +matches our vision for our general use cases, and we want to ensure that users |
| 174 | +who are not specialists and just want the most important basic information can |
| 175 | +acquire it as easily as possible using out-of-the-box semconv instrumentation. |
| 176 | +This means we are more likely within General Class instrumentation to make |
| 177 | +judgements about exactly what the value should be, and whether the value should |
| 178 | +be reshaped by instrumentation in any case when pulling the values from sources |
| 179 | +if it serves general purpose use cases. |
| 180 | + |
| 181 | +For Specialist Class, we should strive not to be prescriptive and instead match |
| 182 | +the concept being modeled as closely as possible. We expect specialist class |
| 183 | +instrumentation to be enabled by the people who already understand it. In a |
| 184 | +System Semconv context, these may be things a user previously gathered manually |
| 185 | +or through existing OS tools that they want to model as OTLP. |
| 186 | + |
| 187 | +### Case study: `process.cgroup` |
| 188 | + |
| 189 | +Relevant discussions: |
| 190 | +[\#1357](https://github.com/open-telemetry/semantic-conventions/issues/1357), |
| 191 | +[\#1364 (particular thread)](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1730743509) |
| 192 | + |
| 193 | +In the `hostmetricsreceiver`, there is a Resource Attribute called |
| 194 | +`process.cgroup`. How should this attribute be adopted in System Semantic |
| 195 | +Conventions? |
| 196 | + |
| 197 | +Based on our definitions, this attribute would fall under Specialist Class: |
| 198 | + |
| 199 | +- `cgroups` are a Linux-specific feature |
| 200 | +- It is not directly part of any of the default out-of-the-box use cases we want |
| 201 | + to cover |
| 202 | + |
| 203 | +In this attribute's case, there are two important considerations when deciding |
| 204 | +on the name: |
| 205 | + |
| 206 | +- The attribute is specialist class |
| 207 | +- It is Linux exclusive, and is unlikely to ever be introduced in other |
| 208 | + operating systems since the other major platforms have their own versions of |
| 209 | + it (Windows Job Objects, BSD Jails, etc) |
| 210 | + |
| 211 | +This means we should pick a name that matches the verbiage used by specialists |
| 212 | +in context when referring to this concept. The way you would refer to this would |
| 213 | +be "a process's cgroup, collected from `/proc/<pid>/cgroup`". So we would start |
| 214 | +with the name `process.cgroup`. We also determined that this attribute is |
| 215 | +Linux-exclusive and are confident it will remain as such, so we land on the name |
| 216 | +`process.linux.cgroup`. |
| 217 | + |
| 218 | +Since this attribute falls under Specialist Class, we don't want to be too |
| 219 | +prescriptive about the value. A user who needs to know the `cgroup` of a process |
| 220 | +likely already has a pretty good idea of how to interpret it and use it further, |
| 221 | +and it would not be worth it for this Working Group to try and come up with |
| 222 | +every possible edge case for how it might be used. It is much simpler for this |
| 223 | +attribute, insofar as it falls under our purview, to simply reflect the value |
| 224 | +from the OS, i.e. the direct value from `cat /proc/<pid>/cgroup`. With cgroups |
| 225 | +in particular, there is high likelihood that more specialized semconv |
| 226 | +instrumentation could be developed, particularly in support of more specialized |
| 227 | +container runtime or systemd instrumentation. It's more useful for a working |
| 228 | +group developing special instrumentation that leverages cgroups to be more |
| 229 | +prescriptive about how the cgroup information should be interpreted and broken |
| 230 | +down with more specificity. |
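A minimal sketch of what "simply reflect the value from the OS" could look like for this attribute. It is not how `hostmetricsreceiver` collects the value. The function names and the `proc_root` parameter are assumptions added so the example can be exercised outside a live `/proc`.

```python
from pathlib import Path


def read_process_cgroup(pid: int, proc_root: str = "/proc") -> str:
    """Return the contents of <proc_root>/<pid>/cgroup exactly as the OS reports them."""
    return Path(proc_root, str(pid), "cgroup").read_text()


def cgroup_attribute(pid: int, proc_root: str = "/proc") -> dict[str, str]:
    """Wrap the raw value under the attribute name chosen in this case study."""
    # No parsing or splitting: interpreting the hierarchy is left to the user
    # or to more specialized conventions built on top of this one.
    return {"process.linux.cgroup": read_process_cgroup(pid, proc_root)}
```

On a Linux host, `cgroup_attribute(os.getpid())` would carry the same text that `cat /proc/<pid>/cgroup` prints for the current process.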
| 231 | + |
| 232 | +## Operating System in names |
| 233 | + |
| 234 | +Relevant discussions: |
| 235 | +[\#1255](https://github.com/open-telemetry/semantic-conventions/issues/1255), |
| 236 | +[\#1364](https://github.com/open-telemetry/semantic-conventions/pull/1364#discussion_r1852465994) |
| 237 | + |
| 238 | +Monitoring operating systems is an old practice, and there are numerous heavily |
| 239 | +differing approaches within different platforms. There are lots of metrics, even |
| 240 | +considering common stats like memory usage, where there are platform-exclusive |
| 241 | +pieces of information that are only valuable to those who specialize in that |
| 242 | +platform. |
| 243 | + |
| 244 | +Thus we have decided that any instrumentation that is: |
| 245 | + |
| 246 | +1. Specific to a particular operating system |
| 247 | +2. Not meant to be part of what we consider our most important general use cases |
| 248 | + |
| 249 | +will have the Operating System name as part of the namespace. |
| 250 | + |
| 251 | +For example, there may be `process.linux`, `process.windows`, or `process.posix` |
| 252 | +names for metrics and attributes. We will not have root `linux.*`, `windows.*`, |
| 253 | +or `posix.*` namespaces. This is because of the principle we’re trying to uphold |
| 254 | +from the [Namespaces section](#namespaces); we still want the instrumentation |
| 255 | +source to be represented by the root namespace of the attribute/metric. If we |
| 256 | +had OS root namespaces, different sources like `system`, `process`, etc. could |
| 257 | +get very tangled within each OS namespace, defeating the intended design |
| 258 | +philosophy. |
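The naming rule above can be sketched as a small helper that keeps the instrumentation source as the root namespace and places the OS name directly beneath it. `build_name` and the two constant sets are hypothetical; the sets are populated only from the namespaces and OS names mentioned in this document, and the real registry may recognize others.

```python
from typing import Optional

# Root namespaces listed in the Namespaces section of this document.
ROOT_NAMESPACES = {"system", "process", "host", "memory", "network", "disk", "os"}
# OS names used as examples in this section.
OS_NAMES = {"linux", "windows", "posix"}


def build_name(root: str, leaf: str, os_name: Optional[str] = None) -> str:
    """Build a metric/attribute name, putting any OS name below the root namespace."""
    if root in OS_NAMES:
        raise ValueError(f"'{root}' is an OS name and cannot be a root namespace")
    if root not in ROOT_NAMESPACES:
        raise ValueError(f"unknown root namespace: '{root}'")
    if os_name is None:
        return f"{root}.{leaf}"
    if os_name not in OS_NAMES:
        raise ValueError(f"unknown operating system: '{os_name}'")
    return f"{root}.{os_name}.{leaf}"
```

With this helper, `build_name("process", "cgroup", "linux")` yields `process.linux.cgroup`, while `build_name("linux", "cgroup")` is rejected because it would create an OS root namespace.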
| 259 | + |
| 260 | +[use cases doc]: ./use-cases.md |