API response time: from 500ms → under 200ms
Story
Once upon a time in 2023, I was stress testing our API endpoints with my friend oozzal to prepare for the peak season traffic. We started stress testing endpoints like login and health_check with grafana/k6 and tsenart/vegeta.
A sample vegeta run (not an actual run):
echo 'GET https://my.api.thapakazi.com/login' \
| vegeta attack -duration=10s -rate=380 \
| tee /tmp/results.bin \
| vegeta report
Requests [total, rate, throughput] 3800, 380.10, 76.08
Duration [total, attack, wait] 39.998s, 9.997s, 30s
Latencies [min, mean, 50, 90, 95, 99, max] 129.394ms, 6.558s, 305.681ms, 30s, 30.001s, 30.001s, 30.009s
Bytes In [total, mean] 286042, 75.27
Bytes Out [total, mean] 0, 0.00
Success [ratio] 80.08%
Status Codes [code:count] 0:757 200:3043

But to my dismay, our servers started to choke under a very low volume of requests, and ironically the number of requests the production environment could handle was far lower than staging's.
Our first instinct was to blame our API server/Puma config, i.e., the number of child processes (and their associated worker threads), so we tried tweaking those values. While looking for best practices we found a similar thread from dhh in rails#54050, and we tried minimizing the thread count per worker and hammering again.
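For reference, the kind of Puma change we were experimenting with looked roughly like this (a minimal sketch; the worker and thread counts are illustrative, not our actual production values):

# config/puma.rb
# Fewer threads per worker; scale out with workers instead.
workers Integer(ENV.fetch("WEB_CONCURRENCY") { 4 })
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS") { 3 })
threads threads_count, threads_count
preload_app!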
With these config changes we noticed slightly better performance in the number of requests the servers could handle versus resource usage (e.g. CPU spikes), but the API response times were nowhere near satisfying. We also noticed, on a different test app, that even a tiny EC2 instance with a similar Puma spec/version was doing far better in requests served per second.
🤔: something fishy
At that moment we had a suspicion that something was really wrong with our API/data; it wasn't just the Puma or nginx config or the hardware spec.
So we started digging deeper for answers. We tried multiple debugging tools/libraries to understand why the servers were choking so easily at just a few hundred requests (I can't recall the exact numbers, but it was tiny).
Debugging 💪
We had many observability tools in place, but none was handy enough to inspect the full context of how the server was actually serving a single HTTP request.
Hence we decided to capture a heap dump and analyze it in slow motion, hoping to understand all the middleware shenanigans contributing to the lag in production.
We started with:
Rbtrace
- a tool that lets you attach to a running Ruby process (our Rails server, in this case) and execute arbitrary Ruby code within it.
- heapy, a heap dump inspection tool.
- And a little bit of encouragement from:
Our approach was something like: detach one of the production instances from the load balancer so it only serves our traffic, turn off all the background jobs, and make some code changes. Then send a single request, take a heap dump, and analyze it.
I know, I know what you're thinking: yes, it's called testing in production, but in a controlled way, alright. We made a couple of file changes here and there.
# add this to your Gemfile
gem 'rbtrace'

Bundle install the gem, then add the following at the head of config.ru to start tracing before Rails boots up:

...
require 'rbtrace'
require 'objspace'
ObjectSpace.trace_object_allocations_start
...
run Rails.application

→ Start the app with a single worker and one thread.
→ And send a single request:
echo "GET http://localhost/healthz" \
| vegeta attack -duration=1s -rate=1 | vegeta encode | vegeta report

Then we took a heap dump from the Puma process with that PID:
PID_OF_PUMA=23456
bundle exec rbtrace -p $PID_OF_PUMA --heapdump

which gives something like:
*** run `sudo sysctl kernel.msgmnb=1048576` to prevent losing events (currently: 12345 bytes)
*** attached to process 23456
Heapdump being written to /tmp/dump20251118-23456-XXXXXX
*** detached from process 23456

Then we read the results with heapy:
heapy read /tmp/dump20251118-23456-XXXXXX

Generation: nil object count: 209191
Generation: 4 object count: 407
Generation: 10 object count: 55156 <--- sus line 🧐
Generation: 15 object count: 638
Generation: 16 object count: 748
Generation: 17 object count: 1023
Generation: 18 object count: 805
...

And drilling down further into the sus 10th generation (strangely large, at 55k objects):
heapy read /tmp/dump20251118-23456-XXXXXX 10
output:
Analyzing Heap (Generation: 10)
-------------------------------
allocated by memory (xxx) (in bytes)
==============================
xxx /opt/rubies/ruby-3.x.y/lib/ruby/3.x.0/set.rb:522
218 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXYY
230 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache/memory_store.rb:XX
180 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:1010
50 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:19
50 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:126
50 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:125
40 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:86
object count (55156)
==============================
55156 /opt/rubies/ruby-3.x.y/lib/ruby/3.x.0/set.rb:XXX
5 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXYY
2 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache/memory_store.rb:XX
1 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXXX
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:19
2 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:126
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:125
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:86
High Ref Counts
==============================
3 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXYY
1 api/shared/bundle/ruby/3.x.0/gems/activesupport-x.x.x.x/lib/active_support/cache.rb:XXXX
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:19
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:126
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:125
1 api/shared/bundle/ruby/3.x.0/gems/flipper-active_support_cache_store-x.x.x/lib/flipper/adapters/active_support_cache_store.rb:86
Duplicate string

It was pointing us to flippercloud/flipper.
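For context, our Flipper setup was roughly of this shape (a sketch, not our exact initializer; the cache store and TTL are assumptions), which explains why both the ActiveSupport cache and the Flipper adapter frames show up in the dump:

# config/initializers/flipper.rb (illustrative)
require 'flipper/adapters/redis'
require 'flipper/adapters/active_support_cache_store'

Flipper.configure do |config|
  config.adapter do
    redis_adapter = Flipper::Adapters::Redis.new(Redis.new)
    # Wrap the Redis adapter in an in-memory cache; gate reads go through
    # ActiveSupport::Cache, which marshals/unmarshals the whole gate value.
    Flipper::Adapters::ActiveSupportCacheStore.new(
      redis_adapter,
      ActiveSupport::Cache::MemoryStore.new,
      expires_in: 5.minutes
    )
  end
end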
Stackprof
We also tried running the test with stackprof:
# add to Gemfile
gem 'stackprof'

# add to config/application.rb
config.middleware.use StackProf::Middleware, enabled: true,
mode: :object,
interval: 1000,
save_every: 5

→ Then we start the app/Puma server, send some requests, and wait a couple of seconds. We should see files like tmp/stackprof-object-xxx.dump.
→ Next we can analyze the dump:
bundle exec stackprof tmp/stackprof-object-*.dump
==================================
Mode: object(1000)
Samples: [SAMPLES] ([MISS_RATE]% miss rate)
GC: [GC] ([GC_PERCENT]%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
[TOTAL_81.8] (81.8%) [SAMPLES_81.8] (81.8%) Marshal.load
[TOTAL_5.0] (5.0%) [SAMPLES_5.0] (5.0%) String#split
[TOTAL_4.4] (4.4%) [SAMPLES_4.4] (4.4%) Hiredis::Ext::Connection#read
[TOTAL_7.9] (7.9%) [SAMPLES_2.5] (2.5%) Flipper::Adapters::Redis#fields_to_gate_value
[TOTAL_3.1] (3.1%) [SAMPLES_2.2] (2.2%) Enumerable.each_slice
[TOTAL_3.0] (3.0%) [SAMPLES_0.9] (0.9%) Enumerable.to_h
[TOTAL_0.4] (0.4%) [SAMPLES_0.4] (0.4%) Set#add
[TOTAL_0.3] (0.3%) [SAMPLES_0.3] (0.3%) Marshal.dump
[TOTAL_1.0] (1.0%) [SAMPLES_0.3] (0.3%) Class#new
[TOTAL_0.2] (0.2%) [SAMPLES_0.2] (0.2%) Flipper::Adapters::ActiveSupportCacheStore.key_for
...

We reached the same conclusion:
The stackprof analysis indicates that Marshal.load is the top performance bottleneck, consuming 81.8% of samples.
The high cost of Marshal.load is heavily tied to ActiveSupport::Cache::Store#read_multi_entries, which in turn is related to how the Flipper gem is interacting with Redis.
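To build an intuition for that cost, here is a rough standalone illustration (the set size and string shapes are assumptions mirroring our data, not measurements from the app): unmarshalling a cached set of tens of thousands of long actor strings on every request adds up quickly.

require 'benchmark'
require 'set'

# Hypothetical actor set roughly the size of ours: ~52k long email-like strings.
actors = Set.new((1..52_000).map { |i| "very-long-email-address-#{i}@example.com" })
payload = Marshal.dump(actors)

# This is the kind of work a cached gate read pays via Marshal.load.
puts Benchmark.measure { Marshal.load(payload) }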
Then we jumped into the Flipper UI and noticed the culprit: it was indeed a skill issue in how the feature flag was used. Devs had used this handy gem to build/expose features behind controlled flags, but never cleaned them up 🤦.
One of the feature flags was built to enable the app's new UI for selected users. To make it easy, devs decided to use the user's email ID as the actor, i.e., just add a new email ID to the feature flag and you can see the new UI.
Logic was then introduced to do that automatically for every new user signup. Over the months, 52k new signups accumulated on a single Redis key, and the Flipper flag check (i.e., reading that data from Redis) was happening on each request 🤦🤦🤦 (a sketch of the pattern follows the list below).
- Every new signup was making the app slower
- Every single request was unnecessarily loading 52k user emails
- Some email IDs were veeeeeerrrryyyy veeeeerrrryyyy looooooong
- sigh…
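A minimal sketch of the pattern, assuming a hypothetical flag named :new_ui and the user's email as the Flipper actor ID (the real names differ):

# On signup, each new user was enabled individually on the same flag,
# appending one more actor entry to the single Redis key:
class User < ApplicationRecord
  after_create { Flipper.enable_actor(:new_ui, self) }

  # Flipper identifies actors by this value; here it was the (sometimes very long) email.
  def flipper_id
    email
  end
end

# And on every request, checking the flag pulled the whole actor set back out:
Flipper.enabled?(:new_ui, current_user)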
Summary:
Issue:
- Production servers were slow
- Instrumentation tools didn't help much
Finding:
- Found out a feature flag was used the wrong way
- 55k+ records on a single feature flag
- Feature flag values being loaded on every single request
- Resulting in slower API responses
Fix:
- Logic around the feature flag was cleaned up, since it was effectively enabled for all users
- All the stale keys on Redis were purged (roughly as sketched below)
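A minimal sketch of that cleanup, reusing the hypothetical :new_ui flag from above (the actual change also removed the signup hook):

# Drop the feature and its tens of thousands of per-actor entries from Redis,
# then turn it fully on via the cheap boolean gate instead.
Flipper.remove(:new_ui)
Flipper.enable(:new_ui)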
Result:
- Servers chilled out, with response times within 120ms
- Throughput rose from ~100 rps to 1,300+ rps
- Less middleware bloat
- Additionally: revised the nginx/Puma config for better throughput
Learnings:
- About profiling and traces
- Servers do better with fewer threads per worker
Todos:
- Notify devs to work on feature flag cleanup tickets
- R&D to improve detailed metrics collection for the API server
That is all. If you made it this far, thank you 🙇🏻‍♂️.
And please let me know if you have any stories I can learn from. Cheers 🙏🏼✨