Why Our Pods Were Breaking Bad (and How We Fixed Them)

Kshitij Nawandar
Dec 11, 2024 · 5 min read


Introduction

Memory leaks and performance bottlenecks are common challenges for applications, particularly those handling high traffic or large-scale workloads. They lead to increased resource consumption, slower response times, and degraded user experiences. Scaling up resources like CPU and memory may provide temporary relief, but the underlying problem persists and resurfaces as performance degrades over time. Identifying and resolving memory leaks is therefore crucial to maintain optimal performance, ensure scalability, and improve overall user satisfaction.

In this article, we’ll walk through the process of diagnosing a memory leak, analyzing the root cause, and implementing effective solutions to mitigate its impact. We’ll explore practical steps that any application, regardless of the underlying stack or architecture, can follow to troubleshoot and optimize performance.

Background

In large-scale applications such as Razorpay's UPI Switch, which processes almost 100 transactions per second, managing memory efficiently is crucial for maintaining optimal performance. UPI Switch supports high-traffic workloads across several deployments: the API service, and workers that process requests from NPCI, manage webhooks, and run background tasks such as analytics.

Over time, we observed that the workers consistently consumed excessive CPU and memory, leading to degraded API latencies, particularly on critical endpoints like Collect and Verify. Despite scaling efforts and increasing CPU and memory limits, the issue persisted. The problem was only temporarily alleviated by restarting the pods, indicating a deeper issue rather than a simple resource constraint.

Symptoms and Metrics

To understand the scope of the problem, we analyzed key metrics: worker pods showed sustained high CPU and memory usage, and latencies on critical endpoints such as Collect and Verify degraded alongside it.

Despite our efforts to scale the pods and increase resource limits, the issue persisted, and pod restarts provided only temporary relief. This indicated that the root cause wasn't simply resource exhaustion but rather a deeper, underlying problem.

Investigation

We were confident that the high API latency was tied to pod degradation, since all other infrastructure components had been verified to be healthy. Increasing the pod count and adjusting CPU and memory limits had not helped.

The degradation was only mitigated temporarily by rotating the pods, a stopgap solution that highlighted the need for deeper investigation and a long-term fix.

Heap memory analysis

We monitored one particular pod whose CPU utilization had reached 5 cores and decided to observe its heap memory metrics (a short sampling sketch in Go follows the list below):

  • Heap memory — memory obtained from the operating system for the heap.
  • Heap memory allocated — bytes currently allocated and actively used by the heap.
  • Heap memory idle — heap memory held by the Go runtime but not currently backing live objects.
  • Heap memory inuse — heap memory actively backing live objects.
  • Heap memory released — heap memory that has been returned to the operating system.
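These metrics map onto fields of Go's runtime.MemStats (HeapSys, HeapAlloc, HeapIdle, HeapInuse, HeapReleased). As a minimal sketch, assuming they are not already exported through your metrics pipeline, they can be sampled like this:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logHeapStats prints the heap metrics described above.
func logHeapStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	log.Printf(
		"heap_sys=%d heap_alloc=%d heap_idle=%d heap_inuse=%d heap_released=%d num_gc=%d",
		m.HeapSys,      // memory obtained from the OS for the heap
		m.HeapAlloc,    // bytes currently allocated and in use
		m.HeapIdle,     // heap memory not currently backing live objects
		m.HeapInuse,    // heap memory actively backing live objects
		m.HeapReleased, // heap memory returned to the OS
		m.NumGC,        // completed GC cycles so far
	)
}

func main() {
	// Sample every 30 seconds; in practice these values would be pushed
	// to a metrics backend rather than logged.
	for range time.Tick(30 * time.Second) {
		logHeapStats()
	}
}
```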

Inference — Heap memory is available and is being actively released.

Garbage Collector (GC) analysis

We decided to observe GC metrics: the number of GC cycles per second and the duration of each cycle.
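As a minimal sketch (making no assumptions about how these are exported in UPI Switch), the GC cycle rate can be derived from the runtime/metrics package by sampling the cumulative cycle count and taking deltas:

```go
package main

import (
	"log"
	"runtime/metrics"
	"time"
)

func main() {
	// Cumulative number of completed GC cycles since process start.
	samples := []metrics.Sample{
		{Name: "/gc/cycles/total:gc-cycles"},
	}

	const interval = 10 * time.Second
	var lastCycles uint64
	for range time.Tick(interval) {
		metrics.Read(samples)
		cycles := samples[0].Value.Uint64()
		// Delta between samples gives the GC cycles/s rate we graphed.
		// (The very first reported rate includes all cycles since startup.)
		log.Printf("gc_cycles_per_sec=%.2f", float64(cycles-lastCycles)/interval.Seconds())
		lastCycles = cycles
	}
}
```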

Inference — GC is being actively triggered even though heap memory is available.

To identify trends between the different metrics and CPU utilization, we plotted normalized values of each metric against CPU utilization.

We observed that the rate of GC cycles was highly correlated with CPU utilization. This gave us a strong hunch that GC itself was consuming the CPU and degrading the pods.

Optimizing Garbage Collection (GOGC)

One key lever was Go's GOGC setting (the GOGC environment variable), which controls how frequently the garbage collector runs. By default GOGC is 100, meaning a GC cycle is triggered once the heap has grown by 100% (roughly doubled) relative to the live heap left after the previous cycle. We considered raising it from 100 to 150 to reduce the frequency of GC cycles and alleviate the excessive CPU usage.
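A minimal sketch of that tuning, assuming the change is made in code rather than through the deployment's environment; debug.SetGCPercent(150) is equivalent to running the process with GOGC=150:

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to starting the process with GOGC=150: the next GC cycle
	// is triggered only once the heap has grown ~150% beyond the live heap
	// left after the previous cycle, trading a larger heap for fewer GC
	// cycles and therefore less GC CPU.
	debug.SetGCPercent(150)

	// ... application startup continues here ...
}
```

Worth noting: since Go 1.19, a memory ceiling can be set alongside this via GOMEMLIMIT (or debug.SetMemoryLimit) to bound the extra heap this trade-off allows.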

Pprof Analysis

To gain deeper insights, we used Go’s pprof tool, which provides detailed profiling of application behavior. We collected profiles at multiple time instances to track resource usage over time.

  • Heap Profile: Exposed via /debug/pprof/heap to capture memory usage.
  • CPU Profile: Exposed via /debug/pprof/profile to monitor CPU usage.
  • Goroutine Profile: Exposed via /debug/pprof/goroutine to track goroutine activity.
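Exposing these endpoints in a Go service is essentially a one-import change; a minimal sketch follows (port 6060 is just an assumption for illustration):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a separate, non-public port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the worker's normal processing loop would run here ...
	select {}
}
```

Profiles can then be pulled with, for example, `go tool pprof http://localhost:6060/debug/pprof/heap` for memory or `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` for CPU, and snapshots taken at different times can be compared to spot allocations that only ever grow.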

By analyzing these profiles, we identified a global array that kept growing. Because it was always reachable, its live memory increased with every unit of work, so each GC cycle had more to mark and scan, driving up CPU usage and impacting performance.

We located the issue in the codebase and fixed it with a one-line code change!
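The exact variable and fix are internal, but as a purely hypothetical illustration (the names below are invented), leaks of this shape usually come down to a package-level slice or map that only ever grows:

```go
// Hypothetical illustration only — not the actual Razorpay code.
// A package-level slice is appended to on every message and never trimmed,
// so the live heap keeps growing and every GC cycle has more to scan.
package main

var processedIDs []string // global: stays reachable for the life of the process

func handleMessage(id string) {
	processedIDs = append(processedIDs, id) // leak: entries are never removed
}

func main() {
	for i := 0; i < 1_000_000; i++ {
		handleMessage("msg") // heap grows without bound
	}
}
```

In an illustration like this, the one-line fix is simply to stop accumulating into the global: scope the data to the request, bound its size, or clear it once it has been consumed.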

Results

CPU and Memory Utilization

After deploying the code fix, we observed significant improvements:

  • Pod CPU usage stabilized at around 150m throughout the day, without any further degradation.
  • Peak memory usage for the workers decreased from 700 MiB to around 50 MiB.

API Latency Improvement

  • Collect API: Latency dropped from 445ms to 223ms.
  • Latency spikes due to resource exhaustion were no longer observed.

Lessons learned

  • Monitor GC and Heap Metrics Regularly: GC behavior and heap memory usage are closely tied to performance. Keeping track of these metrics helps identify memory leaks and inefficiencies.
  • Tuning GOGC: Adjusting GOGC settings based on the specific needs of your application can lead to significant performance gains.
  • Profiling with Pprof: Tools like pprof provide valuable insights into resource consumption, helping pinpoint the root causes of performance degradation.
  • Code Review and Cleanup: Identifying and removing long-living global variables or inefficient data structures can prevent memory leaks and improve performance.
