Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial studio agent, client awareness and operation reporting #309

Merged
merged 105 commits into from
Feb 8, 2022
Merged
Changes from 1 commit
Commits
Show all changes
105 commits
Select commit Hold shift + click to select a range
f4d71ae
Start exploring options around a studio agent
garypen Dec 14, 2021
b03ead8
Merge branch 'main' into usage-agent
garypen Dec 14, 2021
a1b6439
Experiment some more with various tracing options
garypen Dec 15, 2021
50a24d9
Merge branch 'main' into usage-agent
garypen Dec 15, 2021
083ca02
Some hackery to figure out what is happening
garypen Dec 16, 2021
4822655
Start integrating apollo telemetry client with router
garypen Dec 17, 2021
e5b8f2a
Add enough functionality to automate sending some trace data
garypen Dec 18, 2021
6ff030d
Focus efforts on the OpenTelemetry approach
garypen Dec 20, 2021
faa3272
Merge branch 'main' into usage-agent
garypen Dec 20, 2021
a63748f
Clean up integration of Apollo OT tracing
garypen Dec 20, 2021
85da9eb
Add usage_agent dependency to github repo
garypen Dec 20, 2021
ceff8dc
Revise git dependency for usage_agent
garypen Dec 20, 2021
5c03ffc
Make cargo deny happy
garypen Dec 20, 2021
06a6117
Add a usage server to the router
garypen Dec 20, 2021
f84cdb1
Update versions to pickup usage-agent changes
garypen Jan 4, 2022
c839ec9
Migrate usage-agent to router repo
garypen Jan 4, 2022
9280c35
Try to fix licensing complaints
garypen Jan 4, 2022
8def90a
Merge branch 'main' into usage-agent
garypen Jan 4, 2022
d60aaea
Pin encoding_rs to 0.8.29 and add usage_agent to deny.toml
garypen Jan 4, 2022
de28a3b
Further investigation of work for studio agent functionality
garypen Jan 6, 2022
8abaebc
Fix some complaints about licensing and clippy
garypen Jan 7, 2022
de78fa0
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 7, 2022
24836db
Move telemetry from prepare to execute
garypen Jan 7, 2022
de66553
Throw away the custom logging investigation
garypen Jan 7, 2022
1e374b3
Start thinking about how to propagate client information
garypen Jan 7, 2022
9715096
Clean up some of the logging and fake statistics
garypen Jan 7, 2022
d0b0d45
Replace use of libflate with flate2
garypen Jan 10, 2022
96cea73
Make sure to seek before truncating
garypen Jan 10, 2022
31e2520
Add some sleeping to see if that helps the linux tests to pass
garypen Jan 10, 2022
1a339dc
Make clear what the call to set_global_tracer() is returning
garypen Jan 10, 2022
29a8f42
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 10, 2022
4c370f3
Add support for configurable collectors
garypen Jan 11, 2022
348cb81
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 11, 2022
b33fb06
Do some snapshot updates and tweaking to help cargo xtask all to pass
garypen Jan 11, 2022
4548890
Enable client detail recording
garypen Jan 12, 2022
084f55c
More cleaning up of code and make agent work with jaeger
garypen Jan 13, 2022
63391e4
Re-structure code in bid to figure out what is causing hangs
garypen Jan 17, 2022
8a049fa
resolve issues with configuration change responses
garypen Jan 18, 2022
b00bdda
some polishing
garypen Jan 18, 2022
7e76cf0
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 19, 2022
91d6231
Improve internal server robustness
garypen Jan 19, 2022
f8bcdc9
Add "uname" support for windows
garypen Jan 19, 2022
52c7e09
pull in code review changes for 331/337
garypen Jan 19, 2022
fd82b60
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 19, 2022
ca8f210
Remove hard-coded addresses and improve configurability
garypen Jan 19, 2022
eaa97d8
Add support for studio integration into router configuration
garypen Jan 20, 2022
297d8d1
flush opentelemtry events and clean up some code names
garypen Jan 20, 2022
346cc77
relayer not required if we don't have graph configuration
garypen Jan 20, 2022
244f7a9
add support for retries and triggering transfers
garypen Jan 21, 2022
663ce7b
rename usage-agent to apollo-relay
garypen Jan 21, 2022
e208be1
format the changes I made
garypen Jan 21, 2022
5cefecf
Fix enough stuff to keep xtask all happy
garypen Jan 21, 2022
8cd1ee4
rename server to relay
garypen Jan 21, 2022
ad289e3
Commit to renaming agent/server to relay
garypen Jan 21, 2022
9819840
Improve performance by lowering synchronisation
garypen Jan 24, 2022
d8d630b
Cleanup reqwest interaction a little
garypen Jan 24, 2022
e705d6f
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 24, 2022
c0fafa9
Add support for overriding apollo ingress
garypen Jan 25, 2022
b99d796
code review feedback on backoff Duration
garypen Jan 25, 2022
eee04d9
Introduce DurationHistory struct to help with bucketing
garypen Jan 25, 2022
8cd56ec
port existing TS tests for DurationHistogram
garypen Jan 26, 2022
f0e4bd8
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 26, 2022
99526d0
Improve normalization error reporting
garypen Jan 26, 2022
87314b7
fix bug where usage data is double transferred
garypen Jan 26, 2022
b2a5c28
rename relay to spaceport
garypen Jan 26, 2022
949bf03
improve comments and clean up some redundant cloning
garypen Jan 26, 2022
aaf8ba0
Improve performance and scalability of telemetry
garypen Jan 27, 2022
9e73d28
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 27, 2022
3ffff59
remove comment and stop producing json for tracing subscribers
garypen Jan 27, 2022
8a86a88
move many spans from info to trace
garypen Jan 28, 2022
b87ff86
Move the telemetry recording a bit further up the stack
garypen Jan 28, 2022
3f873b6
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 28, 2022
b2afdc0
Odd, merged with main but Cargo.lock hasn't been updated
garypen Jan 28, 2022
0d86b50
Odd, why is licenses.html not updated
garypen Jan 28, 2022
92988c6
don't report studio keys on debug output
garypen Jan 28, 2022
a2bb359
Final round of cleanup before starting first code review
garypen Jan 31, 2022
7026a33
Merge branch 'main' into garypen/66-usage-agent
garypen Jan 31, 2022
27caf08
some small changes to fit in with code review comments
garypen Jan 31, 2022
25ad732
review comment
garypen Feb 2, 2022
6c5be65
more review comments
garypen Feb 2, 2022
6ff86de
review comments
garypen Feb 2, 2022
a7d7a59
code review comments lead to spaceport re-structure
garypen Feb 2, 2022
220ccaa
address snapshot review comment
garypen Feb 2, 2022
9f2e421
Merge branch 'main' into garypen/66-usage-agent
garypen Feb 2, 2022
fd7f130
reverse out snapshot level change (back to DEBUG)
garypen Feb 2, 2022
6da5f92
use the same cargo versioning approach as other components
garypen Feb 2, 2022
7da9dcc
remember to fix licenses.html after changing cargo.toml
garypen Feb 2, 2022
78d2004
bryn's code review comment about coud location
garypen Feb 3, 2022
07001d2
improve transfer triggering
garypen Feb 3, 2022
1b15dad
Merge branch 'main' into garypen/66-usage-agent
garypen Feb 3, 2022
f1a78a9
remove snapshots which aren't on main
garypen Feb 3, 2022
13565b6
remove all snapshots not on main
garypen Feb 3, 2022
1f6ef44
merged with main so update licenses.html
garypen Feb 3, 2022
03d809b
review comments
garypen Feb 4, 2022
d5b2d7f
more review comments
garypen Feb 4, 2022
2b01014
code review comments
garypen Feb 4, 2022
2b78f00
add support for apollo telemetry when OTLP tracing enabled
garypen Feb 4, 2022
af58cb9
fix various strange errors relating to insta snapshots
garypen Feb 4, 2022
92b6df9
review comments
garypen Feb 4, 2022
d6fb85e
review comments
garypen Feb 4, 2022
2e3efba
remove op_name from normalized query cache
garypen Feb 4, 2022
adb2e7b
review comment to remove Ingress from a comment
garypen Feb 7, 2022
042f837
review comments
garypen Feb 8, 2022
b9ed1b8
address review comments
garypen Feb 8, 2022
2c83393
update comments around set global tracer
garypen Feb 8, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Improve normalization error reporting
Use the approved stats_report_key from the TS implementation, although I
may not be using them entirely correctly just now...

Also: fix the format error that David noted and remove the "space"
  • Loading branch information
garypen committed Jan 26, 2022
commit 99526d03a836508059669c742caefe020da622ac
49 changes: 28 additions & 21 deletions apollo-router/src/apollo_telemetry.rs
Original file line number Diff line number Diff line change
@@ -232,7 +232,7 @@ impl SpanExporter for Exporter {
tracing::debug!(index, %span.name, ?span.start_time, ?span.end_time);
if span.name == "graphql_request" {
tracing::debug!("span: {:?}", span);
if let Some(q) = span
if let Some(query) = span
.attributes
.get(&opentelemetry::Key::from_static_str("query"))
{
@@ -241,7 +241,7 @@ impl SpanExporter for Exporter {
span.end_time.duration_since(span.start_time).unwrap(),
1,
);
tracing::debug!("query: {}", q);
tracing::debug!("query: {}", query);

let not_found = Value::String(Cow::from("not found"));
let stats = ContextualizedStats {
@@ -268,7 +268,7 @@ impl SpanExporter for Exporter {
.attributes
.get(&opentelemetry::Key::from_static_str("operation_name"));
// XXX NEED TO NORMALISE THE QUERY
let key = normalize(operation_name, &q.as_str());
let key = normalize(operation_name, &query.as_str());

let msg = reporter
.submit_stats(graph.clone(), key, stats)
@@ -285,22 +285,31 @@ impl SpanExporter for Exporter {
}
}

fn normalize(op: Option<&opentelemetry::Value>, q: &str) -> String {
// If we don't have an operation name, no point normalizing
// it. Just return the unprocessed input.
// Taken from TS implementation
static GRAPHQL_PARSE_FAILURE: &str = "## GraphQLParseFailure\n";
static GRAPHQL_VALIDATION_FAILURE: &str = "## GraphQLValidationFailure\n";
static GRAPHQL_UNKNOWN_OPERATION_NAME: &str = "## GraphQLUnknownOperationName\n";

fn normalize(op: Option<&opentelemetry::Value>, query: &str) -> String {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd highly encourage you to name this function something like "statsReportKey" as that's the name we use for this in our protocol. "normalize" is a very ambiguous name as there are many different operation normalization algorithms in our platform (and basically every time we use it it includes stripping comments, so none of them would yield a stats report key with the comment-ish first line).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename it to "stats_report_key". Rust prefers snake case for function names:
https://doc.rust-lang.org/1.0.0/style/style/naming/README.html

// If we don't have an operation name, we can't do anything useful
// with this query. Just return the appropriate error.
let op_name: String = match op {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand where this operation name comes from? It looks like it comes from an otel attribute which I think comes from

operation_name = %request.operation_name.clone().unwrap_or_else(|| "-".to_string()),

My naive interpretation of that code is that this is specifically the out of band operationName explicitly specified in a GraphQL request, ie like {"query": "query Foo { __typename }", "operationName": "Foo"}.

But in the very common case where a named query is provided without an explicit out of band operationName, ie a request like {"query": "query Foo { __typename }"}, my guess is that you're going to hit the "unknown operation name" case here. That's not correct for our reporting protocol: this should be treated as an operation named Foo (ie, the first line of the key should be # Foo).

I think maybe this was something that you were waiting on more apollo-rs support for before you can implement it properly? In that case this should probably be clearly called out via a comment as not currently implementing the right logic.

(If my cursory read of the code is wrong and this will properly get Foo for {"query": "query Foo { __typename }"}, my apologies!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. I'll raise an issue to add a comment for this. Is this a problem that could be solved by enhancing the router_bridge interface as your previously described? (I think not, but I'm asking anyway).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could — IMO it doesn't really make sense to "resolve" the operation name for a request until you've validated the request (because then you can rely on uniqueness of operation names, etc).

Some(v) => v.as_str().into_owned(),
None => return q.to_string(),
None => {
tracing::warn!("Could not identify operation name: {}", query);
return GRAPHQL_UNKNOWN_OPERATION_NAME.to_string();
}
};

let parser = Parser::new(q);
let parser = Parser::new(query);
// compress *before* parsing to modify whitespaces/comments
let ast = parser.compress().parse();
tracing::debug!("ast:\n {:?}", ast);
// If we can't parse the query, we definitely can't normalize it, so
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to do that but that led to (a) tons of cardinality coming from broken input and (b) unsanitized operations getting sent to us. Instead we have special stats report keys for the error cases of "parse error" (invalid GraphQL syntax), "validation error" (operation doesn't work with the schema), and "unknown operation name" (the document is valid but the separately-provided operation name doesn't match an operation in the doc, or isn't provided and there are multiple ops there).

https://github.com/apollographql/apollo-server/blob/d1b19d07af6d660e18021d7c14b91e7c555b4d73/packages/apollo-server-core/src/plugin/usageReporting/plugin.ts#L639-L645

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is WIP until the normalisation support in apollo-rs is available. I'll look at this again then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way I thought it might be helpful for me to be more explicit about what the three kinds of top-level errors mean, since it's a bit subtle.

A GraphQL request (over HTTP/JSON) consists of up to three fields:

  • query (poorly named as it can also contain mutations etc): a required GraphQL document
  • operationName: an optional operation name, used to select between multiple operations in query (and can also be specified even if there's only one, but it must match if provided)
  • variables: the data that varies between different executions of the same operation

(we also have added extensions to that; the spec only specifies this for responses but we use it for stuff like automatic persisted queries.)

GraphQL requests are executed in the context of a schema.

  • Whether or not something is a "parse error" is a function purely of query (and, I mean, the version of the graphql library/language in use of course), not of schema, operationName, or variables. It happens when the actual language in query fails to parse. (Servers can cache the result of parsing indefinitely.)
  • Whether or not something is a "validation error" is a function of query and schema, but not of operationName or variables. It happens when the properly-parsing document tries to do things that don't work with the given schema. It consists of a bunch of rules defined in the validation section of the spec. (Servers can cache the result of validation for as long as the underlying schema does not change.)
  • Whether or not something is an "unknown operation name" error is a function of query and operationName. According to how the spec is laid out (and how the reference implementation graphql-js works) checking for this is actually the first step of the execution phase, but we find it helpful to effectively separate it into its own page (for various reasons but largely because our usage reporting is based on the specific executed operation so we want to treat this kind of error as distinct from other execution errors).

The other kind of errors that can arise happen at execution time. Most of them are "field errors"; the one other kind of "request error" is if a variable value cannot be coerced to its declared type.

// just return the un-processed input
// just return the appropriate error.
if ast.errors().len() > 0 {
return q.to_string();
tracing::warn!("Could not parse query: {}", query);
return GRAPHQL_PARSE_FAILURE.to_string();
}
let doc = ast.document();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a step here that applies the standard GraphQL Validation algorithm? ie, the thing that checks that the query actually works with the graph, and that it has all sorts of necessary internal consistencies? Does this happen elsewhere in the router, or is it something that's waiting on apollo-rs support? ie, that means implementing all the logic in http://spec.graphql.org/draft/#sec-Validation

Anyway, this algorithm requires us to know whether the operation passes validation before checking for operation names. (Otherwise you can't assume things like "there's at most one operation for a given name" or "if there's an anonymous operation then it's the only one".) So that should go here. Though presumably this is less a matter of "we should run the validation operation in the middle of normalize and more "whatever code in the router parses and validates operations should put the status of those checks somewhere that the normalize code can read".

...

...

ok, I'm now realizing that the way that validation works in router today (confirmed with @abernix) is that we run the graphql validation in graphql-js over the bridge. (or actually... we might not actually run validation at all right now? but if so, that's a bug and the plan I believe is to do it in graphql-js for now. validate runs on version-0.x but not main.)

So I think what you need to do is extend the router bridge API so that the status of GraphQL validation comes back explicitly from the JS query planner, and that that boolean is made available to the telemetry code.

That said... if you're already making this normalize function dependent on something returned from JS... maybe the best short term plan is to just do the entire stats report key calculation in JS. I'd be happy to collaborate on making the stats report key code in apollo-server a bit more accessible to router-bridge code, or honestly copy-pasting it there would be fine too. This would mean you'd be able to have gateway-compatible stats report keys right now before apollo-rs gives you all the tools you need; implementing stats report key generation in Rust can be part of the broader project of porting JS stuff to Rust rather than a release blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no logic here to validate the graphQL. I think your suggestion to extend the router bridge makes most sense for this body of problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracing::debug!("{}", doc.format());
@@ -310,25 +319,24 @@ fn normalize(op: Option<&opentelemetry::Value>, q: &str) -> String {
.into_iter()
.filter(|x| {
if let ast::Definition::OperationDefinition(op_def) = x {
match op_def.name() {
Some(v) => {
return v.text() == op_name;
}
None => {
return op_name == "-";
}
}
return match op_def.name() {
Some(v) => v.text() == op_name,
None => op_name == "-",
};
}
false
})
.collect();
tracing::debug!("required definitions: {:?}", required_definitions);
assert_eq!(required_definitions.len(), 1);
if required_definitions.len() != 1 {
tracing::warn!("Could not find required single definition: {}", query);
return GRAPHQL_VALIDATION_FAILURE.to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this should be a validation failure or an unknown operation name failure depends on exactly what's going on here, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. If we extend the router-bridge as described I presume we'll have the information we need and won't need to do this checking.

}
let required_definition = required_definitions.pop().unwrap();
tracing::debug!("required_definition: {:?}", required_definition);
// XXX Somehow find fragments...
let def = required_definition.format();
format!("# {} \n{}", op_name, def)
format!("# {}\n{}", op_name, def)
}

struct DurationHistogram {
@@ -360,7 +368,6 @@ impl DurationHistogram {
return DurationHistogram::MAXIMUM_SIZE;
}

println!("unbounded_bucket: {}", unbounded_bucket);
unbounded_bucket as usize
}

Original file line number Diff line number Diff line change
@@ -1,13 +1,8 @@
---
source: apollo-router/src/apollo_telemetry.rs
assertion_line: 279
assertion_line: 409
expression: normalized

---

{
user {
name
}
}
## GraphQLUnknownOperationName

Original file line number Diff line number Diff line change
@@ -1,13 +1,8 @@
---
source: apollo-router/src/apollo_telemetry.rs
assertion_line: 293
assertion_line: 423
expression: normalized

---

query {
user {
name
}
}
## GraphQLUnknownOperationName

Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
---
source: apollo-router/src/apollo_telemetry.rs
assertion_line: 308
assertion_line: 438
expression: normalized

---
# OpName
# OpName
query OpName { user { name } }
Original file line number Diff line number Diff line change
@@ -1,20 +1,8 @@
---
source: apollo-router/src/apollo_telemetry.rs
assertion_line: 329
assertion_line: 459
expression: normalized

---

{
user {
name
...Bar
}
}
fragment Bar on User {
asd
}
fragment Baz on User {
jkl
}
## GraphQLUnknownOperationName