Finding memory leaks in Go services

We have a process for diagnosing memory leaks in Go services. Tools such as pprof and Minikube can help us find the root cause.

Intro

Memory leaks can occur in any programming language.

At Nylas, we build microservices with Golang. We love Golang’s lightweight runtime and concurrency. Our Golang services run in Kubernetes, which makes it easy to scale up and down. We monitor the CPU and memory usage of each Kubernetes pod to make sure the services are in good health.

Sometimes, we find pods restarting several times a day without any error. The memory consumption keeps going up, until it reaches the memory limit. Here is a visualization of the memory consumption and pod restarts:

Memory allocation graph for Kubernetes pods
Pod restart graph

The pod restarts when the memory limit is reached. This can impact service availability, since our API can be temporarily unavailable during a restart.

A “wavy” memory consumption graph is not necessarily a memory leak. It can be caused by Golang’s built-in garbage collection: memory grows as the program allocates and drops each time the collector runs, so consumption swings periodically. This is a typical memory consumption graph for a Go program:

Memory allocation for a typical Go program

Golang’s garbage collection strategy is the result of a design trade-off between CPU and memory usage. One round of garbage collection can take many CPU cycles to free up memory. If garbage collection runs too frequently, a program can become slow and unresponsive. Go deliberately delays garbage collection until it is necessary.

By default, a garbage collection cycle starts once the newly allocated heap reaches 100% of the live heap size. If the live heap is 20 MiB, the next collection only starts after another 20 MiB has been allocated. This behavior is configurable through the GOGC environment variable. Check out https://tip.golang.org/doc/gc-guide for more on Golang’s garbage collection.
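As a rough illustration of the pacing rule described above, the sketch below computes the total heap size at which the next collection would start for a given live heap and GOGC value. The helper nextGCTarget is hypothetical (it is not part of the Go runtime), and the arithmetic deliberately ignores refinements covered in the GC guide, such as GC roots and memory limits.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// nextGCTarget is a hypothetical helper for illustration only. It approximates
// the total heap size (in MiB) at which the next collection starts, given the
// live heap size and a GOGC percentage.
func nextGCTarget(liveHeapMiB, gogcPercent int) int {
	return liveHeapMiB + liveHeapMiB*gogcPercent/100
}

func main() {
	// With the default GOGC=100, a 20 MiB live heap is collected again
	// once roughly another 20 MiB has been allocated.
	fmt.Println(nextGCTarget(20, 100)) // 40

	// The same setting can be changed at run time instead of through the
	// environment variable. SetGCPercent returns the previous value.
	previous := debug.SetGCPercent(50)
	fmt.Println("previous GOGC percent:", previous)
	fmt.Println(nextGCTarget(20, 50)) // 30
}
```

Lowering GOGC trades CPU time for a smaller heap, and raising it does the opposite.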

We quickly ruled out garbage collection as the cause of our memory issue. Our active heap was very small (around 20 MiB), but memory consumption seemed to grow without bound (past 1 GiB). In addition, normal garbage collection behavior does not lead to pod restarts; the restarts were clearly the result of exceeding the Kubernetes memory limit. We even tried running garbage collection manually (via runtime.GC()) at regular intervals, but memory usage kept growing.

This is not normal garbage collection behavior. We were quite certain a memory leak was taking place.
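One way to compare the Go runtime’s own view of memory with the pod-level graph is runtime.ReadMemStats. The sketch below is an assumption about how such a check could look, not the code we ran: it forces a collection, reads HeapAlloc, and shows that memory which is still referenced is not reclaimed by runtime.GC(). The names heapAllocAfterGC and retained are made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// retained is a global reference that keeps the buffers below reachable.
var retained [][]byte

// heapAllocAfterGC forces a collection and returns the number of bytes of
// heap objects that are still allocated afterwards.
func heapAllocAfterGC() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	before := heapAllocAfterGC()

	// Allocate 64 buffers of 1 MiB each and keep them reachable.
	for i := 0; i < 64; i++ {
		retained = append(retained, make([]byte, 1<<20))
	}

	after := heapAllocAfterGC()
	fmt.Printf("live heap grew by about %d MiB despite runtime.GC()\n", (after-before)>>20)
}
```

If HeapAlloc stays small while the pod’s memory keeps climbing, the growth is likely happening somewhere the heap numbers alone do not explain, which is a hint to keep digging with a profiler.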

Profiling memory

Many of our API services use the Go Fiber framework. Fiber has a built-in middleware called pprof that exposes profiling data, including memory profiles. Here is how you can install pprof in your API service:

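import (
    // Import paths assume Fiber v2
    "log"

    "github.com/gofiber/fiber/v2"
    "github.com/gofiber/fiber/v2/middleware/pprof"
)

func main() {
    // Create fiber server
    app := fiber.New()

    // Use pprof to profile memory usage
    app.Use(pprof.New())

    // Start
    log.Fatal(app.Listen(":8080"))
}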

If your service uses gorilla/mux, this is how to install pprof:

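import (
    "log"
    "net/http"
    "net/http/pprof"

    "github.com/gorilla/mux"
)

func main() {
    r := mux.NewRouter()
    AttachProfiler(r)
    log.Fatal(http.ListenAndServe(":8080", r))
}

// AttachProfiler registers the net/http/pprof handlers on the router.
func AttachProfiler(router *mux.Router) {
    router.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    router.HandleFunc("/debug/pprof/profile", pprof.Profile)
    router.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    // pprof.Index also serves the named profiles (heap, allocs, goroutine, ...),
    // so it must match every path under /debug/pprof/, not only the index page.
    router.PathPrefix("/debug/pprof/").HandlerFunc(pprof.Index)
}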

If your service uses neither of these, you can start a Fiber or gorilla/mux server on another port just for profiling.

Once pprof is installed, run the server and go to localhost:8080/debug/pprof/ where you will see a UI like this:

UI of /debug/pprof endpoint

Let’s dig a little deeper and understand the different profiles available:

  • “Allocs” shows a sampling of all past memory allocations, including which line in which function allocated the memory.
  • “Goroutine” shows the stack traces of all current goroutines.
  • “Heap” shows a sampling of the memory allocations of live objects.

One way of finding a memory leak is to read the heap sample and look for patterns. If a function keeps putting new objects on the heap and they are never released, it is a likely source of a memory leak.

See this example heap:

Heap sample

The function appendSlice is repeatedly allocating memory on the heap. It is the likely cause of the memory leak.

However, raw heap samples are not easy to read. It is difficult to spot a memory pattern in a huge text blob. Fortunately, we have a way to visualize memory usage:

# One-time install of the standalone pprof tool (optional: "go tool pprof" ships with Go)
go install github.com/google/pprof@latest
# Dump heap to a file
curl http://<HOSTNAME>:<PORT>/debug/pprof/heap > heap.out
# Use pprof to interact with heap
go tool pprof heap.out
# Inside the new command prompt (the png command requires Graphviz)
png

After taking the steps above, pprof will generate a memory allocation diagram named profile001.png:

Memory allocation diagram

In the diagram, each box is a function, and each arrow is a function call. The bigger the box, the higher the memory usage. In the graph above, the blame goes to the runtime function “allocm” (the largest box near the bottom).

Once we have found which function is causing the problem, we can check how the memory is leaked. Let’s look at some typical cases.

Typical memory leaks

Growing global variable

Let’s take a look at the following program:

var globalSlice = make([]int64, 0)

func appendSlice(c *fiber.Ctx) error {
    globalSlice = append(globalSlice, time.Now().Unix())
    return c.JSON(map[string]int{
        "sliceSize": len(globalSlice),
    })
}

When the server starts, the size of globalSlice is 0. Every time we call the appendSlice function, it appends a number to the global slice. Since globalSlice is a global variable, the slice lives on the heap for the lifetime of the process. It keeps growing until memory is exhausted.

It is not common to declare a global slice directly. Unbounded slices more often hide inside global structs. We encourage developers to audit all global variables and make sure each of them has a bounded memory footprint.
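One possible way to bound a global collection like the one above is a fixed-size ring buffer that overwrites its oldest entry. This is a sketch of the idea rather than the fix we shipped; the type ring, its methods, and the capacity of 1000 are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ring is a fixed-capacity buffer: once full, each Add overwrites the
// oldest value, so memory use stays constant.
type ring struct {
	mu   sync.Mutex
	buf  []int64
	next int
	full bool
}

func newRing(capacity int) *ring {
	return &ring{buf: make([]int64, capacity)}
}

// Add stores v, replacing the oldest entry when the buffer is full.
func (r *ring) Add(v int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.buf[r.next] = v
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// Len reports how many entries are currently stored.
func (r *ring) Len() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.full {
		return len(r.buf)
	}
	return r.next
}

// recent replaces the unbounded global slice; 1000 is an arbitrary cap.
var recent = newRing(1000)

func main() {
	for i := 0; i < 5000; i++ {
		recent.Add(time.Now().Unix())
	}
	fmt.Println(recent.Len()) // 1000
}
```

The mutex is there because HTTP handlers run concurrently, so a shared global needs synchronization regardless of how it is bounded.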

Hanging goroutine

Take a look at the function below:

func hangingGoRoutine() {
    go time.Sleep(time.Hour * 24)
}

Every time the function hangingGoRoutine is called, a goroutine is created. The goroutine remains alive for 24 hours. If we call this function 1000 times, there will be 1000 goroutines. A growing number of goroutines means unbounded memory, since each goroutine holds its own stack. If a goroutine never exits, it results in a memory leak.

Usually, a hanging goroutine is not as simple as the sleeper example above. It can be an HTTP client that keeps a connection alive. It can also be an infinite loop. Long-polling and WebSocket clients both keep a connection open indefinitely. If you are going to use a never-ending goroutine, make sure there is only one such goroutine.

Open streams

Let’s consider the code below:

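func openFile() {
    file, err := os.Open("/path/to/file.txt")
    if err != nil {
        log.Fatal(err)
    }
    // Bug: the file is never closed. The fix is to add:
    // defer file.Close()
    _ = file // placeholder use so the example compiles
}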

This code will result in a memory leak, because os.Open opens a file but the function never closes it. If openFile is called repeatedly, it keeps allocating memory (and file descriptors) for the new file streams. Make sure you always call defer file.Close() after opening a file and checking the error.

Let’s look at another example:

func makeHttpCall(method, url string) {
    client := &http.Client{}
    req, err := http.NewRequest(method, url, nil)
    if err != nil {
        panic(err)
    }

    res, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    // Bug: the response body is never closed. The fix is to add:
    // defer res.Body.Close()

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}

The code above also causes a memory leak. The function ioutil.ReadAll reads the response body but does not close it. makeHttpCall needs to call defer res.Body.Close() once client.Do has returned without an error.

It turns out that this is exactly what happened in our system. Our service made an HTTP call but forgot to close the response body after reading it.

Reproduce memory leak locally

We do not suggest using Fiber’s pprof in production:

  1. Fiber’s pprof tool exposes an endpoint /debug/pprof. This endpoint is not protected by authentication. If your API is public, anyone in the world can see your memory allocation.
  2. The heap in production will be very big. It is not easy to analyze such a large body of text.

We encourage profiling memory locally. You can deploy your microservice locally with Minikube and Tilt. For more on deploying services locally, you can check my previous article here: https://www.nylas.com/blog/how-we-test-microservices-locally-at-nylas/

Minikube does not show CPU and memory usage graphs by default. You will need to enable the metrics-server addon:

minikube addons enable metrics-server
minikube start
minikube dashboard

Wait for some time until there is enough data. Then you will see a graph like this:

CPU and memory graph in Minikube

To simulate API traffic, you can write a simple script like this:

#!/bin/bash

for CALL_I in {1..10000}
do
  curl --location --request GET 'http://localhost:8080/append-slice'
  echo "Calling /append-slice $CALL_I/10000"
done

Watch the memory graph while running this script. If there is a memory leak, the memory usage graph will rise and never come back down.

Build time!

I made a proof of concept repository for investigating Go memory leaks: https://github.com/quzhi1/GoMemoryLeak. You can clone this repo, see an example of a memory-leaking service, and investigate it using the steps above.

Conclusion

If a service has growing memory usage, check whether it is a memory leak. Consider profiling tools such as pprof to find which function is causing the leak. Audit your code base to find how the memory was leaked. Lastly, try to reproduce the problem and the fix locally using Minikube.

Special thanks to all Nylanauts who helped build and deploy Kong plugins:

  • Pouya Sanooei
  • Prem Keshari

You can sign up for Nylas for free and start building!
