During routine operations on a self-hosted GitLab instance running on Amazon EKS, a common challenge emerges: unexpected memory spikes in the Gitaly component. The memory, which climbs most noticeably during daily backups, is not automatically released afterwards, leading to performance degradation and instability. This article delves into the root cause of this behavior and presents Cgroup v2 as the definitive solution for GitLab deployments on Kubernetes.
Understanding Gitaly and GitLab Toolbox Backup
Gitaly is the backbone of Git operations within GitLab. It’s responsible for managing all Git-related tasks—like cloning, pushing, pulling, and merging—by isolating repository storage from the main web application. It communicates with other GitLab services using gRPC, enhancing performance and concurrency control.
The GitLab Toolbox Backup component is critical for data integrity in Kubernetes environments, especially when deployed via Helm charts. This pod orchestrates backup and restore operations. Its interaction with Gitaly is crucial: during a backup, the toolbox connects to Gitaly via gRPC, requests repository backups, receives Git bundles, processes and compresses the data, and finally sends it to object storage like S3.
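To make this flow concrete, a backup can be triggered by hand from the toolbox pod. The sketch below assumes a typical Helm install with the release in a `gitlab` namespace and the standard `app=toolbox` label; adjust both to your deployment. `backup-utility` is the script the chart's scheduled backup job runs.

```shell
# Assumed names: namespace "gitlab", label "app=toolbox" (typical Helm defaults)
if command -v kubectl >/dev/null 2>&1; then
  # Locate the toolbox pod
  TOOLBOX=$(kubectl get pods -n gitlab -l app=toolbox \
    -o jsonpath='{.items[0].metadata.name}')
  # backup-utility requests Git bundles from Gitaly over gRPC, packages them,
  # and uploads the resulting tarball to the configured object storage
  kubectl exec -n gitlab "$TOOLBOX" -- backup-utility --skip registry
else
  echo "kubectl not available in this environment"
fi
```

It is during the "requests repository backups" step that Gitaly reads every repository from disk, which is what fills the page cache discussed next.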
The Memory Conundrum: Linux Kernel Page Cache
The core of the memory issue lies in the Linux kernel’s default behavior. When Gitaly reads hundreds of Git repositories during a backup, the kernel actively caches these files in RAM as “page cache.” While this caching is generally beneficial for frequently accessed data, it becomes problematic for daily backups.
Even after the backup completes and the Gitaly process returns to its normal memory consumption (e.g., 195MB), the kernel retains a substantial portion of this page cache, often classified as “active_file.” Because that cache is charged to the Gitaly cgroup, the pod can report wildly inflated memory usage (e.g., 37GB), creating the appearance of an imminent Out-Of-Memory (OOM) condition even though the process itself is idle.
Key Implications on Kubernetes with Cgroup v1:
- Shared Kernel: The Linux kernel is a single entity across the entire node.
- Global Page Cache: The page cache is shared among all pods on the node.
- Cgroup v1 Limitations: The v1 memory controller charges page cache against the container’s limit but reclaims “active” cache reluctantly, so in practice it cannot cleanly separate real process memory from reclaimable cache.
- Kernel’s Blind Spot: The kernel doesn’t inherently understand Kubernetes concepts like “pods” or “containers.” If a node has ample free memory, the kernel might hold onto the page cache, even if a specific pod is exceeding its defined memory limits due to this cache, putting the pod at risk of being terminated by the OOM killer.
This scenario creates a critical disconnect: the Gitaly process itself is not consuming excessive memory, but the kernel’s persistent page cache, attributed to the Gitaly cgroup, pushes the pod past its allocated limits.
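The disconnect is visible from inside the container on a cgroup v1 node: the charged total (`memory.usage_in_bytes`) dwarfs the process's resident memory (`total_rss`), with the gap accounted for by cache. A read-guarded sketch:

```shell
# Inside a container on a cgroup v1 node. On the host, the equivalent files
# live under the pod's subtree, e.g. (illustrative path):
#   /sys/fs/cgroup/memory/kubepods/burstable/pod<uid>/<container-id>/
CG=/sys/fs/cgroup/memory
if [ -r "$CG/memory.usage_in_bytes" ]; then
  echo "charged to this cgroup: $(cat "$CG/memory.usage_in_bytes") bytes"
  # total_rss = real process memory; total_cache/total_active_file = page cache
  grep -E '^total_(rss|cache|active_file) ' "$CG/memory.stat"
else
  echo "cgroup v1 memory controller not mounted (likely a cgroup v2 node)"
fi
```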
Investigated Solutions and the Cgroup v2 Advantage
Several solutions were considered to mitigate this issue:
- Migrate to Cgroup v2: A more involved change requiring node replacement or reboots, but the definitive, long-term solution.
- Privileged CronJob: A quick fix that manually clears caches, but a workaround rather than a fundamental solution.
- DaemonSet Monitor: An automated approach to monitor and clear caches, but still a mitigation.
- Increase Memory Limits: A temporary palliative that only postpones the problem and wastes resources.
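For context, the privileged CronJob and DaemonSet options both boil down to running the following on each node. It is a blunt, node-wide action that evicts cache for every pod, which is exactly why it remains a stopgap:

```shell
# What the cache-clearing workarounds run on each node (root required).
# Writing 1 frees page cache only; 2 frees dentries/inodes; 3 frees both.
sync  # flush dirty pages first so dropping the cache loses nothing
if [ -w /proc/sys/vm/drop_caches ]; then
  echo 1 > /proc/sys/vm/drop_caches 2>/dev/null \
    && echo "page cache dropped" \
    || echo "write failed (container sandbox restrictions?)"
else
  echo "not permitted: run as root on the node (or via a privileged pod)"
fi
```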
While temporary fixes exist, Cgroup v2 emerges as the superior and permanent solution.
How Cgroup v2 Revolutionizes Memory Management:
Cgroup v2 represents the second generation of Linux’s control group system, offering significant improvements over v1. Unlike Cgroup v1’s multiple, independent hierarchies (memory, CPU, I/O), Cgroup v2 uses a single, unified hierarchy.
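Which generation a node runs can be checked by looking at the filesystem type mounted at `/sys/fs/cgroup`:

```shell
# cgroup2fs at /sys/fs/cgroup means the unified (v2) hierarchy;
# tmpfs means the legacy v1 per-controller mounts.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
case "$fstype" in
  cgroup2fs) echo "cgroup v2 (unified hierarchy)" ;;
  tmpfs)     echo "cgroup v1 (split hierarchies)" ;;
  *)         echo "undetermined: $fstype" ;;
esac
```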
Crucially, Cgroup v2 exposes PSI (Pressure Stall Information) on a per-cgroup basis. PSI lets the kernel measure when tasks in a cgroup are stalling while waiting for memory (“memory pressure”). When pressure builds inside a cgroup, the kernel reclaims that cgroup’s page cache, even cache marked as “active.” This addresses the core problem by dynamically freeing memory that is no longer truly needed by the Gitaly process post-backup.
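PSI figures can be inspected directly. The system-wide file exists on kernels 4.20+ with PSI enabled; the per-cgroup `memory.pressure` file is what Cgroup v2 adds. A read-guarded sketch:

```shell
# avg10/avg60/avg300 are the percentage of time tasks stalled on memory over
# the last 10/60/300 seconds; "some" = at least one task stalled, "full" = all.
found=0
for f in /proc/pressure/memory /sys/fs/cgroup/memory.pressure; do
  if [ -r "$f" ]; then echo "== $f"; cat "$f"; found=1; fi
done
[ "$found" -eq 1 ] || echo "PSI not enabled on this kernel (needs CONFIG_PSI or psi=1)"
```

Sustained non-zero `avg10` values in Gitaly's cgroup are the signal that triggers the reclaim described above.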
By upgrading your EKS nodes to support Cgroup v2, you empower the kernel to intelligently manage memory resources, preventing Gitaly from appearing to consume excessive memory due to stale page cache and ensuring a more stable and efficient GitLab environment on Kubernetes. This fundamental shift in resource management is essential for robust, cloud-native deployments.