Let’s start with a joke:
A man comes to the foreign embassy and says:
– I am a spy, I want to surrender.
He is asked:
– Whose spy are you?
He says:
– I am an American spy.
– Well, then, you need to go to room #5.
He goes to room #5:
– I am an American spy, I want to surrender.
– Do you have weapons?
– Yes, I do.
– Please go to room #7.
He goes to room #7:
– I’m a spy, I want to surrender, I have weapons.
– Please go to room #10.
He goes to room #10:
– I’m a spy, I want to surrender, I have weapons.
– Do you have communications equipment?
– Yes, I do.
– Please go to room #20.
He arrives at room #20:
– I’m a spy, I have weapons and communications equipment, and I want to surrender.
He is asked:
– Do you have a mission?
– Yes, I do.
– Well, then go and execute your mission. Stop distracting people from their work!
This joke is mainly about the bureaucracy of large organizations, but I find it fascinating for another reason: the same kind of bureaucracy can play out in a complex microservices architecture.
An HTTP request is passed from one service to another, then another. When the request is rejected, we often lose the entire trace and never know where it went wrong.
Let’s imagine a scenario where you work for an online shopping website. One customer bought a lamp from your site on Jan 5, 2023, but the lamp was never delivered. They reach out to customer support and ask what happened. Thus the nightmare begins.
“Buying a lamp” is an HTTP request to place an order. You reach out to the API gateway team, and they quickly find the request. They tell you the request was correctly passed down to backend service A. There are no issues at the API gateway layer.
Then you go to team A and ask for their help. They eventually find the request log by looking through all of their logs from Jan 5. Again, no failures. The request was successfully passed down to internal service B.
You have to do the same for services B, C, and D. However, team D cannot find any log related to this HTTP request. No one knows where the request went, and no one knows why it failed.
The microservices topology sometimes becomes so complex that nobody in the company has a full picture. No wonder Elon Musk wants to shut down 80% of Twitter’s microservices:
The answer to this problem is distributed tracing. It is a method of observing a request as it propagates through microservices. With distributed tracing, a request leaves a footprint on every service it touches. When you want to diagnose a request failure, you can find all the logs tagged with that request’s trace. It gives you a full picture of the request lifecycle.
OpenTelemetry is the most popular tool for distributed tracing. From the dashboard, you can see the Trace ID, the journey of a request, and the logs related to that trace. The trace also contains the latency of every service hop along the way. This makes it a lot easier to diagnose issues in a microservices system.
The essential building blocks of OpenTelemetry are the trace and the span: a trace represents the end-to-end journey of a request, and a span represents a single unit of work within that journey.
A span has attributes. You can attach custom info to a span, such as http.method, user_id, env, etc. Spans can also have a parent-child relationship: if span A is a child of span B, it means function A was invoked inside function B.
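To make this concrete, here is a minimal sketch of parent and child spans in Go. The service name example-service and the functions handleOrder and chargeCustomer are hypothetical, not from the Nylas codebase; the point is that the child span attaches to the parent because the parent’s ctx is passed down.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// Tracer named after the service. Without a configured tracer provider this is
// a no-op tracer, but the parent-child structure works the same way.
var tracer = otel.Tracer("example-service")

func main() {
	handleOrder(context.Background())
}

func handleOrder(ctx context.Context) {
	// Parent span, with custom attributes attached.
	ctx, parent := tracer.Start(ctx, "handleOrder")
	defer parent.End()
	parent.SetAttributes(
		attribute.String("http.method", "POST"),
		attribute.String("user_id", "user-123"), // illustrative values
	)

	// The child inherits the parent span through ctx.
	chargeCustomer(ctx)
}

func chargeCustomer(ctx context.Context) {
	// Because ctx carries the "handleOrder" span, this span becomes its child.
	_, child := tracer.Start(ctx, "chargeCustomer")
	defer child.End()
}
```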
This is a typical trace for an HTTP request:
The API gateway (blue) span is the parent (or grandparent) of all other spans. It starts when the HTTP request is received from the user and ends when the HTTP response is sent back. The span latency is 1000 milliseconds. Let’s break this down further:
Below is a trace of a Nylas request; look how similar it is to the previous example:
If an HTTP request can be traced, can we also trace a PubSub flow?
The answer is yes. Let’s take a look at another example. This is a trace for a flow connected by message queues (Google PubSub):
Let’s break down the above diagram further:
Let’s see the Nylas trace of a PubSub flow:
The first flow is synchronous. The API gateway span is the ancestor of all other spans. Meanwhile, the second flow is asynchronous. There is no common ancestor. The entire trace looks like a relay race.
Inside the same Golang service, traces are passed via the ctx object. Between services, traces are passed via the traceparent HTTP header. We will talk more about the traceparent header shortly.
Now we know the basic concept of OpenTelemetry. Let’s look at how to set up distributed tracing from scratch.
The first step is deploying the OpenTelemetry collector. The OpenTelemetry collector is a service that gathers traces from your microservices and sends them to a monitoring backend (such as New Relic, Jaeger, Prometheus, or Coralogix).
This is how distributed tracing works in a microservice architecture:
In the diagram above, the OpenTelemetry collector gathers traces from microservices A, B, and C, and then sends them to three monitoring services. OpenTelemetry itself does not store data, nor does it provide a trace monitoring UI. You will need an observability provider such as New Relic, Jaeger, or Coralogix for that.
Let’s look at the OpenTelemetry collector under the hood. It has three kinds of components: receivers, processors, and exporters.
Let’s update our diagram with more details:
The collector has one receiver, three processors, and three exporters.
OpenTelemetry collector can be deployed with Helm:
```bash
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install my-opentelemetry-collector open-telemetry/opentelemetry-collector -f values.yaml
```
The collector can be configured with a Helm values.yaml file. Here is an example:
```yaml
config:
  processors:
    # Tail sampling processor only samples error trace or slow trace
    tail_sampling:
      decision_wait: 10s
      policies:
        [
          {
            name: errors-only-policy,
            type: status_code,
            status_code: { status_codes: [ ERROR ] }
          },
          {
            name: latency-only-policy,
            type: latency,
            latency: { threshold_ms: 5000 }
          }
        ]
    # Limit memory to 75%
    memory_limiter:
      check_interval: 1s
      limit_percentage: 75
      spike_limit_percentage: 25
  # Export traces to Coralogix
  exporters:
    coralogix:
      application_name: "us-central1-prod"
      subsystem_name: "us-central1-prod"
  # Register receivers, processors, and exporters
  service:
    pipelines:
      traces:
        receivers: [ otlp ]
        processors: [ memory_limiter, tail_sampling, k8sattributes, batch ]
        exporters: [ coralogix, otlp ]

# Other Helm values
resources:
  requests:
    cpu: 2
    memory: 8Gi
  limits:
    cpu: 2
    memory: 8Gi
```
After you run the helm install command, the OpenTelemetry collector will be available at my-opentelemetry-collector.default.svc.cluster.local:4317. With the collector in place, we can start sending traces from microservices.
In a Go microservice, you need to initialize tracing in the main function:
```go
package main

// Use service name as tracer name
var tracer = otel.Tracer("my-server")

func main() {
	// Connect to collector
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	conn, err := grpc.DialContext(ctx, "my-opentelemetry-collector.default.svc.cluster.local:4317",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		panic(err)
	}

	// Set up a trace exporter
	exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithGRPCConn(conn))

	// Handle error and create tracer provider
	if err != nil {
		panic(err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-server"),
		)),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

	// Shut down tracing gracefully
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("Error shutting down tracer provider: %v", err)
		}
	}()

	// Start server
	// ...
}
```
Now the server can send traces to the OpenTelemetry collector. Let’s look at how to pass traces between services.
Per the W3C standard, trace data is passed from client to server using the traceparent HTTP header:

```
Traceparent: <version>-<trace-id>-<parent-id>-<trace-flags>
```
Let’s break down an example header:
```
Traceparent: 00-351772fff98e2108663b0d55f8eb5466-46158e065b2f87cb-01
```
- Version: 00. As of this blog, the latest version is still 00.
- Trace ID: 351772fff98e2108663b0d55f8eb5466. This does not change for the lifecycle of a request.
- Parent ID: 46158e065b2f87cb. This is the span ID of the HTTP client that sent the request.
- Trace flags: 01. As of this blog, the only valid flag is 01, which means the trace is sampled.

The full specification of HTTP tracing is written here: https://www.w3.org/TR/trace-context/. The good news is that we don’t have to manually write or parse this header; it is already handled by the OpenTelemetry library.
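For the curious, here is a rough sketch of what the library does on both sides of the hand-off, using the W3C TraceContext propagator from go.opentelemetry.io/otel/propagation. The request and variable names are illustrative, and in real code the context would already carry an active span:

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel/propagation"
)

func main() {
	prop := propagation.TraceContext{}

	// In real code this ctx would already carry an active, sampled span
	// (for example, one started with tracer.Start). With a plain background
	// context there is nothing to inject, so the header stays empty.
	ctx := context.Background()

	// Client side: serialize the span in ctx into the traceparent header.
	req, _ := http.NewRequest("GET", "http://some-url.com", nil)
	prop.Inject(ctx, propagation.HeaderCarrier(req.Header))
	fmt.Println(req.Header.Get("traceparent"))

	// Server side: parse the traceparent header back into a Go context.
	// Child spans started from serverCtx join the same trace.
	serverCtx := prop.Extract(context.Background(), propagation.HeaderCarrier(req.Header))
	_ = serverCtx
}
```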
At Nylas, we use Go Fiber for all public APIs. Fiber has an OpenTelemetry middleware called otelfiber. It automatically parses the traceparent header and creates a new span named after the HTTP endpoint.
The parsed trace and the new span are put in the Fiber context (c *fiber.Ctx). You can call ctx := c.UserContext() to translate the Fiber context c into a Go context ctx. Then you can create child spans from ctx.
Here is an example Fiber server with tracing enabled:
```go
package main

func main() {
	// Create fiber server
	app := fiber.New()

	// otel middleware
	app.Use(otelfiber.Middleware("my-server"))

	// Define routes
	app.Get("/hello", helloWorld)

	// Listen and serve
	err := app.Listen(":3000")
	if err != nil {
		panic(err)
	}
}

func helloWorld(c *fiber.Ctx) error {
	// Translate fiber context into Go context
	ctx := c.UserContext()

	// Create span for handler (the child context is not needed here)
	_, span := tracer.Start(
		ctx,
		"helloWorld",
		oteltrace.WithAttributes(
			attribute.String("some-attribute-key", "some-attribute-value"),
		),
	)
	defer span.End()

	// Return response
	return c.JSON(fiber.Map{"hello": "world"})
}
```
You will see a span created with the name /hello. In the helloWorld function, we created a child span named helloWorld.
The HTTP client can serialize a Go context object into the traceparent header automatically. When making a request, the HTTP client creates a new span with the URL as the span name. Here is an example:
```go
// Create client with otel
client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

// Create a request with Go context
req, err := http.NewRequestWithContext(ctx, "GET", "http://some-url.com", nil)
if err != nil {
	return "", err
}

// Send request
res, err := client.Do(req)
if err != nil {
	return "", err
}
defer res.Body.Close()

// Read response
body, err := io.ReadAll(res.Body)
if err != nil {
	return "", err
}
fmt.Println(string(body))
```
A span named http://some-url.com will be created. Its parent span is the one carried by the Go context object.
In gRPC, the trace is passed in the traceparent metadata. Again, this is automatically handled by the OpenTelemetry library.
```go
// Create grpc listener
lis, err := net.Listen("tcp", fmt.Sprintf("0.0.0.0:%d", port))
if err != nil {
	panic(err)
}

// Append OpenTelemetry Interceptor
grpcServer := grpc.NewServer(
	grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
	grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
)

// Register grpc server
rpc.RegisterMyGrpcServer(grpcServer, myGrpcServer)

// Start server
err = grpcServer.Serve(lis)
if err != nil {
	panic(err)
}
```
Upon receiving a gRPC request, the traceparent metadata is deserialized. The server then automatically creates a span named <grpc-service-name>/<grpc-action-name> and puts it in the Go context object. On the client side, you add the OpenTelemetry interceptors when dialing the connection:
```go
conn, err := grpc.Dial(
	"["+grpcHostName+"]:"+grpcPort,
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
	grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
)
if err != nil {
	panic(err)
}
```
Similar to the HTTP 1.1 client, the gRPC client also creates a span named <grpc-service-name>/<grpc-action-name> and serializes it into the traceparent metadata.
Google PubSub does not offer a standard tracing library, so we have to write our own. We manually create a PubSub attribute called traceparent in each message, and then handle serialization and deserialization with our own library.
This is an example library for publishing a message:
```go
func BeforePublishMessage(ctx context.Context, tracer trace.Tracer, topicID string, msg *pubsub.Message) (context.Context, trace.Span) {
	opts := []trace.SpanStartOption{
		trace.WithSpanKind(trace.SpanKindProducer),
		trace.WithAttributes(
			// customizable span attributes
			semconv.MessagingSystemKey.String("pubsub"),
			semconv.MessagingDestinationKey.String(topicID),
			semconv.MessagingDestinationKindTopic,
		),
	}
	ctx, span := tracer.Start(ctx, fmt.Sprintf("%s send", topicID), opts...)

	if msg.Attributes == nil {
		msg.Attributes = make(map[string]string)
	}

	// propagate Span across process boundaries
	otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(msg.Attributes))

	return ctx, span
}

func AfterPublishMessage(span trace.Span, messageID string, err error) {
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	} else {
		span.SetAttributes(semconv.MessagingMessageIDKey.String(messageID))
	}
}
```
This is how you can use it to publish a message with tracing:
```go
// Create a new span
ctx, span := telemetry.BeforePublishMessage(ctx, tracer, topicName, &msg)
defer span.End()

// Publish
messageID, err := topicObj.Publish(ctx, &msg).Get(ctx)
telemetry.AfterPublishMessage(span, messageID, err)
if err != nil {
	return err
}
```
This is an example library to receive a PubSub message:
```go
type PubSubHandler = func(context.Context, *pubsub.Message)

func WrapPubSubHandlerWithTelemetry(tracer trace.Tracer, topicID string, handler PubSubHandler) PubSubHandler {
	return func(ctx context.Context, msg *pubsub.Message) {
		// create span
		ctx, span := beforePubSubHandlerInvoke(ctx, tracer, topicID, msg)
		defer span.End()

		// call actual handler function
		handler(ctx, msg)
	}
}

func beforePubSubHandlerInvoke(ctx context.Context, tracer trace.Tracer, topicID string, msg *pubsub.Message) (context.Context, trace.Span) {
	if msg.Attributes != nil {
		// extract propagated span
		propagator := otel.GetTextMapPropagator()
		log.Info().Msg("Extracting traceparent from message attribute")
		ctx = propagator.Extract(ctx, propagation.MapCarrier(msg.Attributes))
	}
	opts := []trace.SpanStartOption{
		trace.WithSpanKind(trace.SpanKindConsumer),
		trace.WithAttributes(
			// customizable attributes
			semconv.FaaSTriggerPubsub,
			semconv.MessagingSystemKey.String("pubsub"),
			semconv.MessagingDestinationKey.String(topicID),
			semconv.MessagingDestinationKindTopic,
			semconv.MessagingOperationProcess,
			semconv.MessagingMessageIDKey.String(msg.ID),
		),
	}
	return tracer.Start(ctx, fmt.Sprintf("%s process", topicID), opts...)
}
```
And this is how you can use it:
```go
handlerWithTracing := telemetry.WrapPubSubHandlerWithTelemetry(tracer, topicName, handler)
sub.Receive(ctx, handlerWithTracing)
```
I wrote a proof of concept for OpenTelemetry integration: https://github.com/quzhi1/OpenTelemetryPlayground. It contains two services: service A and service B. Service A can call service B in both synchronous and asynchronous ways. You can see how tracing works in both cases.
Diagnosing request failures in a microservices architecture is hard. We solved this problem with distributed tracing using OpenTelemetry. The OpenTelemetry collector gathers traces from all Nylas services and sends them to our observability provider (Coralogix). We figured out how to propagate traces over HTTP, gRPC, and Google PubSub. With OpenTelemetry, we have a complete picture of a request’s lifecycle.
Special thanks to all Nylanauts who helped with the OpenTelemetry integration:
Zhi is a staff engineer who leads the Developer Velocity team at Nylas. In 2021, he left Stripe and became a Nylanaut. Zhi is also a history buff who loves traveling.