Rust: Investigating a Strange Out-of-Memory Error
TL;DR: We discovered the hard way that when backtraces are enabled, the anyhow library captures a backtrace on every error. Only when printing such an error in debug format, format!("err: {:?}", error), are the backtrace's symbols resolved to human-friendly names, leading to a significant memory increase in your application. If you need backtraces only on panic, you must set two environment variables, RUST_BACKTRACE=1 and RUST_LIB_BACKTRACE=0, to avoid the issue.
Romain Gérard
January 14, 2025 · 7 min read
At Qovery, we have a fairly small service called engine-gateway that handles connections and data transmission with our clients' deployers. This service has a relatively low memory footprint, as its responsibilities are mainly authentication/authorization and forwarding bytes into a data store for processing.
This engine-gateway had been running fine in production for months without any intervention, at a memory usage below 50 MiB. Until one day…
That day, we were automatically alerted on Slack that this service had restarted abruptly and that someone needed to investigate why.
We went to the service dashboard, but nothing stood out. CPU, memory, network, threads, file descriptors, number of requests… every resource looked normal before the crash. There was also nothing in the application logs just before; it was as if the app had been restarted for no reason.
This service runs on Kubernetes, so we checked the pod's status and found that the reason for the last restart was an out-of-memory kill (OOM), meaning the application had used more memory than the limit we set.
But it felt strange, because in our dashboard the memory consumption of the engine-gateway was pretty stable before the crash, and there was no spike in general. Nor was there the slow, steady climb that would indicate a memory leak.
So, to convince ourselves, we dug deeper and connected directly to the node to retrieve the kernel messages about the OOM. And indeed, the dmesg command greeted us with this:
tokio-runtime-w invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
CPU: 6 PID: 1974311 Comm: tokio-runtime-w Not tainted 5.10.210-201.852.amzn2.x86_64 #1
Hardware name: Amazon EC2 t3a.2xlarge/, BIOS 1.0 10/16/2017
Call Trace:
dump_stack+0x57/0x70
dump_header+0x4a/0x1f4
oom_kill_process.cold+0xb/0x10
out_of_memory+0xed/0x2d0
mem_cgroup_out_of_memory+0x138/0x150
try_charge+0x65d/0x6e0
mem_cgroup_charge+0x7f/0x240
do_anonymous_page+0xf0/0x520
__handle_mm_fault+0x499/0x640
handle_mm_fault+0xbe/0x2a0
do_user_addr_fault+0x1b3/0x3f0
exc_page_fault+0x68/0x130
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7f6eabf85bd8
Code: ae c0 13 00 31 d2 48 8d 34 29 48 39 fb 48 89 73 60 0f 95 c2 48 29 e8 48 c1 e2 02 48 83 c8 01 48 09 ea 48 83 ca 01 48 89 51 08 <48> 89 46 08 48 8d 41 10 48 8b 54 24 28 64
RSP: 002b:00007f6eabedd110 EFLAGS: 00010202
RAX: 0000000000000241 RBX: 00007f6ea4000030 RCX: 00007f6e8032cdb0
RDX: 0000000000007015 RSI: 00007f6e80333dc0 RDI: 00007f6eac0c1c60
RBP: 0000000000007010 R08: 0000000000007251 R09: 00007f6e80000000
R10: 0000000000000000 R11: 000000000032f000 R12: 0000000000002250
R13: 00007f6e8032cdb0 R14: 0000000000001000 R15: 0000000000007030
memory: usage 262144kB, limit 262144kB, failcnt 57
memory+swap: usage 262144kB, limit 9007199254740988kB, failcnt 0
kmem: usage 1272kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods.slice/kubepods-pod7aef5bea_2046_431a_9237_7aa567fe652f.slice:
anon 266821632
file 0
kernel_stack 98304
percpu 0
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 266551296
active_anon 0
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 131272
slab_unreclaimable 0
slab 131272
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 128568
pgmajfault 0
pgrefill 0
pgscan 35
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[1974229] 65535 1974229 243 1 28672 0 -998 pause
[1974298] 1000 1974298 583 225 36864 0 -997 dumb-init
[1974310] 1000 1974310 154306 76896 741376 0 -997 engine-gateway
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-ce4373c476b64aa312287f39ace9fdc4720c0ab83c4db76c035cb05ea7f32f7f.scope,mems_allowed=0,oom_memcg=/kub
Memory cgroup out of memory: Killed process 1974310 (engine-gateway) total-vm:617224kB, anon-rss:260492kB, file-rss:47092kB, shmem-rss:0kB, UID:1000 pgtables:724kB oom_score_a
Let’s unpack it a bit. The stack trace is reasonably normal for an OOM, reading from bottom to top.
This page explains page faults and how they are linked to virtual memory in Linux.
The OOM trace gives some helpful information to narrow down the culprit.
- If we look at the breakdown by process, we see that the engine-gateway process is clearly the one using too much memory. pause is an internal Kubernetes process, and dumb-init, which helps manage signals, is not at fault either.
- The second clue is that the memory allocation is for anon(ymous) pages, which is the kind of memory handed out by the standard libc malloc/free allocator.
This OOM trace proves that it is our engine-gateway code that is allocating too much memory for some unknown reason. More importantly, our monitoring is hiding the truth: the memory surge is so quick that the process gets killed instantly, before our monitoring can report the increase. Our monitoring frequency is set to 10 seconds, so everything happens within those 10 seconds.
At this point, we were stuck and couldn't investigate further. We run this service with multiple instances, so the crash of one instance does not cause an outage. We control both ends of the service, so we didn't expect any big payload that would cause such an abrupt spike, and clients are able to resume even after an error.
So, without any traces to dig deeper into the code with, we decided to let the problem sit, as we were busy with other matters at the time. We only doubled the memory limit of the service in order to catch the issue live if it happened again.
Several weeks later, it happened again. This time, despite the doubled memory limit, the OOM hit all running instances of the engine-gateway at the same time during the night, creating a real outage… To make things worse, we were left with no more information than the first time.
This time, we tackled the problem head-on. We increased the engine-gateway's memory limit to 4 GiB and changed our code to investigate the memory allocation issue.
By default, the libc memory allocator does not provide profiling information that is easily exploitable, so we switched to the venerable jemalloc so that, when we catch the issue live, we can investigate where the allocations are coming from.
For that, we used the rust-jemalloc-pprof crate to retrieve profiling information in pprof format and generate a flame graph of allocations.
The change is pretty straightforward in Rust and well explained in the article: adding a crate and a few lines, as sketched below, is enough to use jemalloc in place of the default allocator.
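The exact snippet isn't reproduced here, but it looks roughly like the following. This is a minimal sketch assuming the tikv-jemallocator and jemalloc_pprof crates; feature flags and exact API names may differ slightly between versions.

// Cargo.toml (sketch):
// tikv-jemallocator = { version = "0.6", features = ["profiling", "unprefixed_malloc_on_supported_platforms"] }
// jemalloc_pprof = "0.6"

// Swap the default allocator for jemalloc.
#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

// Enable jemalloc heap profiling at startup, sampling roughly every 2^19 bytes allocated.
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"prof:true,prof_active:true,lg_prof_sample:19\0";

// Exposed behind an internal HTTP endpoint: dump the current heap profile in
// pprof format, to be turned into a flame graph offline.
async fn dump_heap_profile() -> anyhow::Result<Vec<u8>> {
    let prof_ctl = jemalloc_pprof::PROF_CTL
        .as_ref()
        .ok_or_else(|| anyhow::anyhow!("jemalloc profiling is not available"))?;
    let mut prof_ctl = prof_ctl.lock().await;
    prof_ctl.dump_pprof().map_err(Into::into)
}

With something like this in place, a heap profile can be pulled from a live instance and fed to the usual pprof tooling to render a flame graph.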
We released everything, and one more time, we waited for blood to be spilled…
This time, we got lucky. We only had to wait several days for the issue to arise again. We saw in our dashboard that memory increased by more than 400 MiB, only to discover that the root cause was a line of code that did nothing but log an error.
Looking at the allocation profile collected by jemalloc, we can now clearly see what is happening.
We use the anyhow crate to manage errors in our application, and we can see that calling the Debug implementation of anyhow::Error triggers the symbolization of a backtrace and caches it. The implementation resolves the symbols of a backtrace into human-readable names so that they can be printed.
But why? How? Our first surprise was to discover that when backtraces are enabled, the anyhow library captures one on every error created. The second surprise was that we use anyhow pretty much everywhere in this service, and this is not the only path in the code where we log an error. So why does this particular code path trigger a symbolization and lead to an OOM? Because it is the only path where we print the error with Debug instead of Display, so only this code path triggers the issue…
This line would not have caused the issue:
error!("Error while getting git token: {}", err);
But this line does:
error!("Error while getting git token: {:?}", err);
See the issue? Pretty subtle.
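To make the difference concrete, here is a minimal, illustrative sketch (the error message is made up): with backtraces enabled, only the Debug form walks and symbolizes the captured backtrace.

use anyhow::anyhow;

fn main() {
    // Run with RUST_BACKTRACE=1 (or RUST_LIB_BACKTRACE=1) so that anyhow
    // captures a backtrace when the error is created.
    let err = anyhow!("token endpoint returned 403")
        .context("Error while getting git token");

    // Display ({}): prints only the outermost message; the backtrace stays untouched.
    println!("{}", err);

    // Debug ({:?}): prints the whole error chain *and* resolves every backtrace
    // frame to a human-readable symbol, which is where the memory goes.
    println!("{:?}", err);
}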
After looking at the anyhow issue tracker, we discovered that this is a common pitfall, and that it is even mentioned in the documentation, though with consequences that are unclear to the uninitiated:
If using Rust ≥ 1.65, a backtrace is captured and printed with the error if the underlying error type does not already provide its own. In order to see backtraces, they must be enabled through the environment variables described in std::backtrace:
If you want panics and errors to both have backtraces, set RUST_BACKTRACE=1;
If you want only errors to have backtraces, set RUST_LIB_BACKTRACE=1;
If you want only panics to have backtraces, set RUST_BACKTRACE=1 and RUST_LIB_BACKTRACE=0.
So the fix was easy and did not even involve changing code. As we are only interested in backtraces during panics, we restarted the engine-gateway with one extra environment variable: RUST_LIB_BACKTRACE=0.
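Since the fix lives entirely in the environment, it could silently disappear in a later deployment change. As a small safeguard (this is our own illustrative sketch, not part of the original fix), std::backtrace follows the same RUST_BACKTRACE / RUST_LIB_BACKTRACE rules and can be used to log the effective configuration at startup:

use std::backtrace::{Backtrace, BacktraceStatus};

fn log_backtrace_config() {
    // Backtrace::capture() honors RUST_LIB_BACKTRACE first, then RUST_BACKTRACE,
    // the same convention anyhow documents for its error backtraces.
    match Backtrace::capture().status() {
        BacktraceStatus::Captured => eprintln!(
            "library backtraces are enabled: every anyhow::Error will capture one"
        ),
        _ => eprintln!(
            "library backtraces are disabled: only panics get a backtrace (if RUST_BACKTRACE=1)"
        ),
    }
}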
Conclusion
To conclude:
- Your monitoring can lie to you if events happen between its sampling intervals
- The documentation you don’t understand is going to bite you in the future
- Surprisingly, the anyhow/eyre libraries default to capturing a backtrace on every error when backtraces are not explicitly disabled
Voilà! I hope it helps
Your Favorite DevOps Automation Platform
Qovery is a DevOps Automation Platform Helping 200+ Organizations To Ship Faster and Eliminate DevOps Hiring Needs
Try it out now!