API response time: from 500ms → under 200ms
Story
Once upon a time in 2023, I was stress testing our API endpoints with my friend oozzal to prepare for the peak season traffic. We started stress testing endpoints like login and health_check with grafana/k6 and tsenart/vegeta.
A sample vegeta run (not an actual run):
echo 'GET https://my.api.thapakazi.com/login' \
| vegeta attack -duration=10s -rate=380 \
| tee /tmp/results.bin \
| vegeta report
Requests [total, rate, throughput] 3800, 380.10, 76.08
Duration [total, attack, wait] 39.998s, 9.997s, 30s
Latencies [min, mean, 50, 90, 95, 99, max] 129.394ms, 6.558s, 305.681ms, 30s, 30.001s, 30.001s, 30.009s
Bytes In [total, mean] 286042, 75.27
Bytes Out [total, mean] 0, 0.00
Success [ratio] 80.08%
Status Codes [code:count] 0:757 200:3043

But to my dismay, our servers started to choke under a very low volume of requests, and ironically the number of requests the production environment could handle was far lower than staging's.
Our first instinct was to blame our API server/Puma config, i.e., the number of child processes (and their associated worker threads), so we tried tweaking those values. While looking for best practices we found a similar thread from dhh in rails#54050, and we tried minimizing the thread count per worker and hammering again.
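For reference, the kind of Puma change we were experimenting with looked roughly like this (a minimal sketch; the worker and thread counts are illustrative, not our actual production values):

# config/puma.rb
# Fewer threads per worker; scale out with workers instead.
workers Integer(ENV.fetch("WEB_CONCURRENCY") { 4 })
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS") { 3 })
threads threads_count, threads_count
preload_app!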
With these config changes we noticed slightly better performance in the number of requests the servers could handle versus resource usage (e.g. CPU spikes), but the API response times were nowhere near satisfying. We also noticed, on a different test app, that even a tiny EC2 instance with a similar Puma spec/version was doing far better in requests served per second.
🤔: something fishy
At that moment we had a suspicion that something was really wrong with our API/data; it wasn't just the Puma or nginx config or the hardware spec.
So we started digging deeper for answers. We tried multiple debugging tools/libraries to understand why the servers were choking so easily at just a few hundred requests (I can't recall the exact numbers, but it was tiny).
Debugging 💪
We had many observability tools in place, but none was handy enough to inspect the full context of how the server was actually serving a single HTTP request.
Hence we decided to capture a heap dump and analyze it in slow motion, hoping to understand all the middleware shenanigans contributing to the lag in production.
We started with:
Rbtrace
- a tool that lets you attach to a running Ruby process (our Rails server, in this case) and execute arbitrary Ruby code within it.
- heapy, a heap dump inspection tool.
- And a little bit of encouragement from:
Our approach was something like: detach one of the production instances from the load balancer so it only serves our traffic, turn off all the background jobs, and make some code changes. Then send a single request, take a heap dump, and analyze it.
I know, I know what you're thinking: yes, it's called testing in production, but in a controlled way, alright. We made a couple of file changes here and there.
# add this to your Gemfile
gem 'rbtrace'

Bundle install the gem, then add the following at the head of config.ru to start tracing before Rails boots up:

...
require 'rbtrace'
require 'objspace'
ObjectSpace.trace_object_allocations_start
...
run Rails.application

→ Start the app with a single worker and one thread.
→ And send a single request:
echo "GET http://localhost/healthz" \
| vegeta attack -duration=1s -rate=1 | vegeta encode | vegeta report

Then we took a heap dump from the Puma process with that PID:
PID_OF_PUMA=23456
bundle exec rbtrace -p $PID_OF_PUMA --heapdump

which gives something like:
*** run `sudo sysctl kernel.msgmnb=1048576` to prevent losing events (currently: 12345 bytes)
*** attached to process 23456
Heapdump being written to /tmp/dump20251118-23456-XXXXXX
*** detached from process 23456

Then we read the results with heapy:
heapy read /tmp/dump20251118-23456-XXXXXX

Generation: nil object count: 209191
Generation: 4 object count: 407
Generation: 10 object count: 55156 <--- sus line 🧐
Generation: 15 object count: 638
Generation: 16 object count: 748
Generation: 17 object count: 1023
Generation: 18 object count: 805
...

And drilling down further into the sus 10th generation (strangely large, at 55k objects):
heapy read /tmp/dump20251118-23456-XXXXXX 10
output:
Analyzing Heap (Generation: 10)
-------------------------------
allocated by memory (xxx) (in bytes)
==============================
xxx /opt/rubies/ruby-3.x.y/lib/ruby/3.x.0/set.rb:522
218 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXYY
230 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache/memory_store.rb:XX
180 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:1010
50 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:19
50 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:126
50 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:125
40 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:86
object count (55156)
==============================
55156 /opt/rubies/ruby-3.x.y/lib/ruby/3.x.0/set.rb:XXX
5 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXYY
2 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache/memory_store.rb:XX
1 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXXX
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:19
2 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:126
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:125
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:86
High Ref Counts
==============================
3 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXYY
1 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXXX
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:19
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:126
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:125
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:86
Duplicate string

It was pointing us to flippercloud/flipper.
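For context, our Flipper setup was roughly of this shape (a sketch, not our exact initializer; the cache store and TTL are assumptions), which explains why both the ActiveSupport cache and the Flipper adapter frames show up in the dump:

# config/initializers/flipper.rb (illustrative)
require 'flipper/adapters/redis'
require 'flipper/adapters/active_support_cache_store'

Flipper.configure do |config|
  config.adapter do
    redis_adapter = Flipper::Adapters::Redis.new(Redis.new)
    # Wrap the Redis adapter in an in-memory cache; gate reads go through
    # ActiveSupport::Cache, which marshals/unmarshals the whole gate value.
    Flipper::Adapters::ActiveSupportCacheStore.new(
      redis_adapter,
      ActiveSupport::Cache::MemoryStore.new,
      expires_in: 5.minutes
    )
  end
end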
Stackprof
We also tried running the test with stackprof:
# add to Gemfile
gem 'stackprof'

# add to config/application.rb
config.middleware.use StackProf::Middleware, enabled: true,
mode: :object,
interval: 1000,
save_every: 5

→ Then we start the app/Puma server, send some requests, and wait a couple of seconds. We should see files like tmp/stackprof-object-xxx.dump.
→ Next we can analyze the dump:
bundle exec stackprof tmp/stackprof-object-*.dump
==================================
Mode: object(1000)
Samples: [SAMPLES] ([MISS_RATE]% miss rate)
GC: [GC] ([GC_PERCENT]%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
[TOTAL_81.8] (81.8%) [SAMPLES_81.8] (81.8%) Marshal.load
[TOTAL_5.0] (5.0%) [SAMPLES_5.0] (5.0%) String#split
[TOTAL_4.4] (4.4%) [SAMPLES_4.4] (4.4%) Hiredis::Ext::Connection#read
[TOTAL_7.9] (7.9%) [SAMPLES_2.5] (2.5%) Flipper::Adapters::Redis#fields_to_gate_value
[TOTAL_3.1] (3.1%) [SAMPLES_2.2] (2.2%) Enumerable.each_slice
[TOTAL_3.0] (3.0%) [SAMPLES_0.9] (0.9%) Enumerable.to_h
[TOTAL_0.4] (0.4%) [SAMPLES_0.4] (0.4%) Set#add
[TOTAL_0.3] (0.3%) [SAMPLES_0.3] (0.3%) Marshal.dump
[TOTAL_1.0] (1.0%) [SAMPLES_0.3] (0.3%) Class#new
[TOTAL_0.2] (0.2%) [SAMPLES_0.2] (0.2%) Flipper::Adapters::ActiveSupportCacheStore.key_for
...

We reached the same conclusion:
The stackprof analysis indicates that Marshal.load is the top performance bottleneck, consuming 81.8% of samples.
The high cost of Marshal.load is heavily tied to ActiveSupport::Cache::Store#read_multi_entries, which in turn is related to how the Flipper gem is interacting with Redis.
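To build an intuition for that cost, here is a rough standalone illustration (the set size and string shapes are assumptions mirroring our data, not measurements from the app): unmarshalling a cached set of tens of thousands of long actor strings on every request adds up quickly.

require 'benchmark'
require 'set'

# Hypothetical actor set roughly the size of ours: ~52k long email-like strings.
actors = Set.new((1..52_000).map { |i| "very-long-email-address-#{i}@example.com" })
payload = Marshal.dump(actors)

# This is the kind of work a cached gate read pays via Marshal.load.
puts Benchmark.measure { Marshal.load(payload) }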
Then we jumped into the Flipper UI and noticed the culprit: it was indeed a skill issue in how the feature flag was used. Devs had used this handy gem to build/expose features behind controlled flags, but never cleaned them up 🤦.
One of the feature flags was built to enable the app's new UI for selected users. To make it easy, devs decided to use the user's email ID as the actor, i.e., just add a new email ID to the feature flag and you can see the new UI.
Logic was then introduced to do that automatically for every new user signup. Over the months, 52k new signups accumulated on a single Redis key, and the Flipper flag check (i.e., reading that data from Redis) was happening on each request 🤦🤦🤦 (a sketch of the pattern follows the list below).
- Every new signup was making the app slower
- Every single request was unnecessarily loading 52k user emails
- Some email IDs were veeeeeerrrryyyy veeeeerrrryyyy looooooong
- sigh…
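A minimal sketch of the pattern, assuming a hypothetical flag named :new_ui and the user's email as the Flipper actor ID (the real names differ):

# On signup, each new user was enabled individually on the same flag,
# appending one more actor entry to the single Redis key:
class User < ApplicationRecord
  after_create { Flipper.enable_actor(:new_ui, self) }

  # Flipper identifies actors by this value; here it was the (sometimes very long) email.
  def flipper_id
    email
  end
end

# And on every request, checking the flag pulled the whole actor set back out:
Flipper.enabled?(:new_ui, current_user)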
Summary:
Issue:
- Production servers were slow
- Instrumentation tools didn't help much
Finding:
- Found out a feature flag was used the wrong way
- 55k+ records on a single feature flag
- Feature flag values being loaded on every single request
- Resulting in slower API responses
Fix:
- Logic around the feature flag was cleaned up, since it was effectively enabled for all users
- All the stale keys on Redis were purged (roughly as sketched below)
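A minimal sketch of that cleanup, reusing the hypothetical :new_ui flag from above (the actual change also removed the signup hook):

# Drop the feature and its tens of thousands of per-actor entries from Redis,
# then turn it fully on via the cheap boolean gate instead.
Flipper.remove(:new_ui)
Flipper.enable(:new_ui)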
Result:
- Servers chilled out, with response times within 120ms
- Throughput rose from ~100 rps to 1,300+ rps
- Less middleware bloat
- Additionally: revised the nginx/Puma config for better throughput
Learnings:
- About profiling and traces
- Servers do better with fewer threads per worker
Todos:
- Notify devs to work on feature flag cleanup tickets
- R&D to improve detailed metrics collection for the API server
That is all. If you made it this far, thank you 🙇🏻‍♂️.
And please let me know if you have any stories I can learn from. Cheers 🙏🏼✨