:showtitle:
:numbered:

= TUF Artifact Replication (a.k.a. TUF Repo Depot)

The final output of our release process is a TUF repo consisting of all
of the artifacts the product requires to run. For the update system to
work, it needs access to those artifacts. There are some constraining
factors:

* Nexus is the only way into the system for these artifacts (either
  through direct upload from an operator, or a download initiated by
  Nexus to a service outside of the system).
* Nexus has no persistent local storage, nor can it directly use the
  artifacts (OS and zone images, firmware, etc.) even if it did store
  them.
* Sled Agent is generally what will directly use the artifacts (except
  for SP and ROT images, which MGS needs), and it can also manage its
  own local storage.

Thus Nexus needs to accept artifacts from outside of the system and
immediately offload them to individual Sled Agents for persistent
storage and later use.

We have chosen (see <<rfd424>>) the simplest possible implementation:
every Sled Agent stores a copy of every artifact on each of its M.2
devices. This is storage-inefficient but means that a Sled Agent can
directly use those resources to create zones from updated images,
install an updated OS, or manage the installation of updates on other
components, without Nexus having to ensure that it distributed an
artifact to a sled _before_ telling it to use it. A Nexus background
task periodically ensures that all sleds have all artifacts.

== Sled Agent implementation

Sled Agent stores artifacts as a content-addressed store on an *update*
dataset on each M.2 device: the file name of each stored artifact is its
SHA-256 hash.
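
As a sketch of that layout (the helper name and validation are
assumptions, not Sled Agent's actual code), resolving an artifact inside
an *update* dataset is a simple path join on the hex-encoded digest:

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper illustrating the content-addressed layout: an
// artifact's file name within the *update* dataset is its SHA-256 hash.
fn artifact_path(update_dataset: &Path, sha256_hex: &str) -> Option<PathBuf> {
    // A SHA-256 digest is 32 bytes, i.e. 64 lowercase hex characters.
    let valid = sha256_hex.len() == 64
        && sha256_hex
            .bytes()
            .all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b));
    valid.then(|| update_dataset.join(sha256_hex))
}
```

Content addressing also makes verification cheap: a stored file is valid
exactly when its hash matches its own name.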

It also stores an _artifact configuration_ in memory: a list of all
artifact hashes that the sled should store, and a generation number.
The generation number is owned by Nexus, which increments it when the
set of TUF repos on the system changes. Sled Agent rejects any modified
configuration that does not increase the generation number.
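
A minimal sketch of that acceptance rule (type and field names are
assumptions for illustration): a put of an identical configuration is
idempotent, while any change must carry a strictly newer generation.

```rust
use std::collections::BTreeSet;

// Sketch of the in-memory artifact configuration (field types assumed).
struct ArtifactConfig {
    generation: u64,
    artifacts: BTreeSet<String>, // hex-encoded SHA-256 hashes
}

// Accept a new configuration only if it is byte-for-byte the same
// (idempotent re-put) or carries a strictly newer generation.
fn put_config(current: &mut ArtifactConfig, new: ArtifactConfig) -> Result<(), String> {
    if new.generation == current.generation {
        if new.artifacts == current.artifacts {
            return Ok(()); // idempotent re-put of the same configuration
        }
        return Err("modified configuration must increase the generation".into());
    }
    if new.generation < current.generation {
        return Err("stale generation".into());
    }
    *current = new;
    Ok(())
}
```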

Sled Agent offers the following APIs on the underlay network, intended
for Nexus:

* `artifact_config_get`: Get the current artifact configuration.
* `artifact_config_put`: Put the artifact configuration that should be
  in effect. This API is idempotent (putting the same configuration does
  not change anything). Modified configurations must also increase the
  generation number.
* `artifact_list`: List the artifacts present in the artifact
  configuration along with the count of available copies of each
  artifact across the *update* datasets. Also includes the current
  generation number.
* `artifact_put`: Put the request body into the artifact store.
  Rejects the request if the artifact does not belong to the current
  configuration.
* `artifact_copy_from_depot`: Sends a request to another Sled Agent (via
  the *TUF Repo Depot API*; see below) to fetch an artifact. The base
  URL for the source sled is chosen by the requester. This API responds
  after a successful HTTP response from the source sled, and the copy
  proceeds asynchronously. Rejects the request if the artifact does not
  belong to the current configuration.
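
The fire-and-forget shape of `artifact_copy_from_depot` can be sketched
as follows (names and the thread-based concurrency are assumptions;
validation happens before the response, the transfer after it):

```rust
use std::collections::BTreeSet;
use std::thread;

// Sketch of the `artifact_copy_from_depot` control flow: validate the
// request against the current configuration, kick off the fetch in the
// background, and return before the copy finishes.
fn copy_from_depot(
    config_artifacts: &BTreeSet<String>,
    sha256: String,
    fetch: impl FnOnce(&str) + Send + 'static,
) -> Result<(), String> {
    if !config_artifacts.contains(&sha256) {
        return Err("artifact not in current configuration".into());
    }
    // The HTTP response is sent as soon as this function returns; the
    // copy from the chosen depot base URL proceeds asynchronously.
    thread::spawn(move || fetch(&sha256));
    Ok(())
}
```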

Sled Agent also spawns another Dropshot API server called the *TUF Repo
Depot API* which offers one API on the underlay network, intended for
other Sled Agents:

* `artifact_get_by_sha256`: Get the content of an artifact.

In an asynchronous task called the _delete reconciler_, Sled Agent
periodically scans the *update* datasets for artifacts that are not
part of the present configuration and deletes them. Immediately before
each filesystem operation, the task re-checks the current configuration
for the presence of that artifact hash. After a pass completes, the
delete reconciler waits for an artifact configuration change before
running again.
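
One pass of the reconciler might look like this sketch (function names
are assumptions): membership in the current configuration is checked
per file, immediately before the unlink, rather than once up front.

```rust
use std::fs;
use std::path::Path;

// Delete-reconciler sketch: scan one *update* dataset and remove files
// whose names (artifact hashes) are not in the current configuration.
// The membership check runs immediately before each unlink, so a file
// re-added by a newer configuration is not deleted from a stale scan.
fn reconcile_deletes(
    dataset: &Path,
    in_current_config: impl Fn(&str) -> bool,
) -> std::io::Result<usize> {
    let mut deleted = 0;
    for entry in fs::read_dir(dataset)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        // Check the *current* configuration right before unlinking.
        if !in_current_config(&name) {
            fs::remove_file(entry.path())?;
            deleted += 1;
        }
    }
    Ok(deleted)
}
```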

== Nexus implementation

Nexus has a `tuf_artifact_replication` background task which runs this
reliable persistent workflow:

1. Collect the artifact configuration (the list of artifact hashes, and
   the current generation number) from the database.
2. Call `artifact_config_put` on all sleds. Stop if any sled rejects the
   configuration (our information is already out of date).
3. Call `artifact_list` on all sleds. Stop if any sled informs us of a
   newer generation number.
4. Delete any local copies of repositories where all artifacts are
   sufficiently replicated across sleds. ("Sufficiently replicated"
   currently means that at least 3 sleds each have at least one copy.)
5. For any artifacts this Nexus has a local copy of, send `artifact_put`
   requests to N random sleds, where N is the number of puts required to
   sufficiently replicate the artifact.
6. Send `artifact_copy_from_depot` requests to all remaining sleds
   missing copies of an artifact. Nexus chooses the source sled randomly
   out of the list of sleds that have a copy of the artifact.
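
The split between steps 5 and 6 can be sketched as a small planning
function (the `REDUNDANCY` constant reflects the "at least 3 sleds"
rule above; the names are assumptions). Puts from Nexus cover only the
gap up to the redundancy target; every other missing sled is served by
sled-to-sled copies:

```rust
// "Sufficiently replicated" currently means this many sleds each hold
// at least one copy (constant name is an assumption).
const REDUNDANCY: usize = 3;

struct Plan {
    put_targets: usize,             // direct `artifact_put` requests
    copy_from_depot_targets: usize, // `artifact_copy_from_depot` requests
}

// For one artifact: compute how many sleds get a direct put from the
// Nexus-local copy, and how many are told to copy from another sled.
fn plan(total_sleds: usize, sleds_with_copy: usize, nexus_has_local: bool) -> Plan {
    let missing = total_sleds - sleds_with_copy;
    let puts = if nexus_has_local {
        REDUNDANCY.saturating_sub(sleds_with_copy).min(missing)
    } else {
        0
    };
    Plan {
        put_targets: puts,
        copy_from_depot_targets: missing - puts,
    }
}
```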

In each task execution, Nexus attempts to do all possible work that
leads to every sled having a copy of every artifact. In the absence
of random I/O errors, a repository will be fully replicated across
all sleds in the system in the first execution, and the Nexus-local
copy of the repository will be deleted in the second execution.
`artifact_copy_from_depot` requests that require the presence of an
artifact on a sled that does not yet have it are scheduled after all
`artifact_put` requests complete.

== Preventing conflicts and loss of artifacts

The artifact configuration is used to prevent conflicts that may be
caused by two Nexus instances running the `tuf_artifact_replication`
background task simultaneously with different information. The worst
case scenario for a conflict is the total loss of an artifact across the
system, although there are lesser evils as well. This section describes
a number of possible faults and the mitigations taken.

=== Recently-uploaded repositories and artifact deletion

When Sled Agent receives an artifact configuration change, the delete
reconciler task begins scanning the *update* datasets for artifacts that
are no longer required and deletes them.

Nexus maintains its local copy of recently-uploaded repositories
until it confirms (via the `artifact_list` operation) that all of the
artifacts in the repository are sufficiently replicated (currently, at
least 3 sleds each have at least 1 copy).

If the `artifact_list` operation listed any artifacts that could be
deleted asynchronously, Nexus could incorrectly assume that an artifact
is sufficiently replicated when it is not. This could happen if a
repository is deleted, and another repository containing the same
artifact is uploaded while another Nexus is running the background task.

The artifact configuration is designed to mitigate this. The
`artifact_list` operation filters the list of artifacts to contain
only artifacts present in the current configuration. The delete
reconciler decides whether to delete a file by re-checking the current
configuration.

When Nexus receives the `artifact_list` response, it verifies that
the generation number reported is the same as the configuration it put
earlier in the same task execution. Because the response only contains
artifacts belonging to the current configuration, and that list of
artifacts is based on the same configuration Nexus believes is current,
it can trust that none of those artifacts are about to be deleted and
safely delete local copies of sufficiently-replicated artifacts.
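
That safety check reduces to a small guard (a sketch; the function name
and error handling are assumptions): replication counts from an
`artifact_list` response are only trusted when its generation matches
the one Nexus put earlier in the same execution.

```rust
// Only trust replication counts from an `artifact_list` response whose
// generation matches the configuration put earlier in this execution.
fn sufficiently_replicated(
    put_generation: u64,
    list_generation: u64,
    sleds_with_copy: usize,
) -> Result<bool, String> {
    if list_generation != put_generation {
        return Err("sled reports a different generation; our plan is stale".into());
    }
    // "Sufficiently replicated" currently means >= 3 sleds with >= 1 copy.
    Ok(sleds_with_copy >= 3)
}
```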

=== Loss of all sleds with the only copy

There are two potential situations where we could lose the only copy of
an artifact. The first is a Nexus instance crashing or being replaced
before a local artifact can be put to any sleds. Crashes are difficult
to mitigate, as artifacts are currently stored in randomly-named
temporary directories that are non-trivial to recover on startup;
consequently there is no mitigation for this problem today. During
graceful removal of Nexus zones, a quiesced Nexus (see <<rfd459>> and
<<omicron5677>>) should remain alive until all local artifacts are
sufficiently replicated.

The second potential situation is a loss of all sleds that an artifact
is copied to after Nexus deletes its local copy. This is mostly
mitigated by Nexus attempting to fully replicate all artifacts onto
all sleds in every execution of the background task; if there are no
I/O errors, it only takes one task execution to ensure a repository is
present across the entire system.

=== Unnecessary work

`artifact_put` and `artifact_copy_from_depot` requests include the
current generation as a query string parameter. If the generation does
not match the current configuration, or the artifact is not present in
the configuration, Sled Agent rejects the request.

[bibliography]
== References

* [[[rfd424]]] Oxide Computer Company.
  https://rfd.shared.oxide.computer/rfd/424[TUF Repo Depot].
* [[[rfd459]]] Oxide Computer Company.
  https://rfd.shared.oxide.computer/rfd/459[Control plane component lifecycle].
* [[[omicron5677]]] oxidecomputer/omicron.
  https://github.com/oxidecomputer/omicron/issues/5677[nexus 'quiesce' support].