Rate Limiting: Why Modern Systems Must Learn to Say No
How distributed systems protect themselves from overload, abusive traffic, retry storms, and infrastructure collapse using rate limiting strategies.
Senior Developer

The System Did Not Crash Because Of Traffic
The infrastructure had survived large launches before.
Load balancers distributed requests correctly. Auto-scaling added more servers during traffic spikes. Redis caches absorbed repeated reads efficiently. Database replicas handled queries comfortably.
Then one API endpoint suddenly received millions of requests within minutes.
CPU usage exploded.
Redis memory pressure increased sharply. Database connections became exhausted. Retry storms started spreading across services. Queue workers backed up behind overloaded APIs.
The strange part was that most requests were not even legitimate users.
Some came from bots.
Some from crawlers.
Some from buggy clients retrying endlessly.
And some simply came from users refreshing aggressively during peak demand.
The system did not fail because infrastructure was too small.
It failed because the infrastructure allowed unlimited pressure to enter simultaneously.
And this is one of the most important realizations modern distributed systems eventually make:
scalability is not only about handling more traffic.
It is also about controlling traffic.
That is where rate limiting enters the architecture.
Unlimited Access Sounds Fair Until Production Exists
At small scale, APIs often behave openly.
Request arrives.
Server processes request.
Simple.
Example:
Client → API → Response
No restrictions.
No limits.
No throttling.
This feels reasonable while traffic remains predictable.
Then infrastructure becomes public.
Now systems encounter:
bots,
scrapers,
abuse,
retry storms,
accidental floods,
malicious traffic,
viral traffic spikes.
And suddenly unrestricted access starts becoming operationally dangerous.
Because one user, one script, or one broken client can destabilize infrastructure shared by everyone else.
Rate Limiting Is Really About Protecting Shared Resources
Beginners often misunderstand what rate limiting is for.
Rate limiting is not mainly about blocking users.
It is about protecting infrastructure stability.
Because backend systems contain finite resources:
CPU,
memory,
database connections,
queue throughput,
network bandwidth.
Without limits, a small percentage of traffic can consume disproportionate infrastructure capacity.
And eventually systems begin asking questions like:
How many requests should one user make safely?
Or:
How much traffic can one IP generate before harming the platform?
These are fundamentally resource management problems.
The Simplest Limit Usually Looks Like This
Example:
100 Requests Per Minute
Simple.
If a client exceeds the limit:
429 Too Many Requests
The server rejects additional traffic temporarily.
This immediately improves survivability during overload situations.
Because infrastructure starts controlling pressure instead of absorbing everything blindly.
And honestly, many systems become dramatically more stable after introducing even basic rate limiting.
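Here is a minimal sketch of what that rejection might look like at the HTTP layer. The helper name `build_response` and its two inputs are hypothetical; the assumption is that some limiter in front of the handler has already decided whether the request is allowed and how long the client should wait. The `Retry-After` header is optional on a 429, but it gives well-behaved clients a hint about when to come back.

```python
from http import HTTPStatus


def build_response(allowed: bool, retry_after_s: int) -> tuple[int, dict[str, str], str]:
    """Return (status, headers, body) for a rate limit decision.

    `allowed` and `retry_after_s` are assumed to come from whatever
    limiter sits in front of the handler; both names are illustrative.
    """
    if allowed:
        return HTTPStatus.OK.value, {}, "ok"
    status = HTTPStatus.TOO_MANY_REQUESTS  # 429
    headers = {"Retry-After": str(retry_after_s)}
    return status.value, headers, status.phrase
```

The rejection path does almost no work, which is the point: saying no has to be much cheaper than serving the request.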
Rate Limiting Quietly Protects More Than APIs
This pattern appears everywhere in infrastructure.
Examples:
login attempts,
OTP verification,
payment APIs,
search endpoints,
file uploads,
password resets,
AI inference systems.
Without limits:
brute force attacks become easier,
abuse becomes cheaper,
infrastructure costs explode,
retry storms amplify failures.
Rate limiting became foundational because distributed systems must defend themselves against both malicious and accidental overload.
Fixed Window Limiting Feels Simple Initially
One of the earliest approaches looks like this:
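Count Requests Per Minute
Example:
Minute 10:00 → 98 Requests
Minute 10:01 → Counter Resets
Implementation often looks straightforward:
INCR user:1001
EXPIRE user:1001 60
The expiry should be set only when INCR returns 1, that is, on the first request of a window. Calling EXPIRE on every request keeps pushing the reset forward, so a steadily active client's counter would never reset.
Redis became extremely popular for this pattern because counters are: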
fast,
simple,
memory efficient.
And for many workloads, fixed windows work reasonably well.
Until edge cases appear.
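The fixed window idea can be sketched in a few lines without Redis. This is an in-memory illustration for a single process, not a production limiter; the class name `FixedWindowLimiter` and the injectable `clock` parameter are assumptions made so the example can be run and tested deterministically.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed window."""

    def __init__(self, limit: int, window_s: int = 60, clock=time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the example can be tested
        self.counts = defaultdict(int)  # (client, window index) -> count

    def allow(self, client: str) -> bool:
        # All timestamps inside the same window share one integer index.
        window = int(self.clock() // self.window_s)
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

Note that this sketch never evicts old windows, so memory grows over time. A Redis-backed version avoids that because expired keys disappear on their own.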
Fixed Windows Create Strange Traffic Bursts
Imagine:
100 requests allowed per minute
A client sends:
100 requests at 10:00:59
another 100 requests at 10:01:00
Now:
200 requests happened almost instantly,
but technically the limit was never violated.
This creates burstiness problems.
And eventually systems start requiring smoother traffic control.
That led to more advanced algorithms.
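The boundary burst is easy to verify with a little arithmetic. The sketch below assumes timestamps expressed as seconds since midnight and replays the scenario above: 100 requests at 10:00:59 followed by 100 more at 10:01:00.

```python
from collections import Counter

LIMIT = 100
WINDOW_S = 60

# Timestamps in seconds since midnight: 100 requests at 10:00:59,
# then 100 more at 10:01:00, one second later.
t_first = 10 * 3600 + 0 * 60 + 59
t_second = 10 * 3600 + 1 * 60 + 0
timestamps = [t_first] * 100 + [t_second] * 100

# A fixed window limiter only looks at the count inside each window.
per_window = Counter(ts // WINDOW_S for ts in timestamps)
admitted = sum(min(count, LIMIT) for count in per_window.values())

span_s = max(timestamps) - min(timestamps)
print(admitted, "requests admitted within", span_s, "second(s)")
```

Each window sees exactly 100 requests, so all 200 are admitted even though they arrive one second apart.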
Sliding Windows Improved Fairness
Sliding window systems track requests continuously instead of resetting abruptly.
Instead of:
Current Minute
they evaluate rolling time ranges:
Last 60 Seconds
Now burst behavior becomes smoother.
Traffic distributes more predictably.
And infrastructure becomes more stable under spikes.
But sliding windows also become more expensive operationally because systems track:
timestamps,
rolling histories,
moving request windows.
Again, distributed systems trade simplicity for precision continuously.
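One common way to build this is a sliding window log, which keeps a timestamp for every recent request. The sketch below is a single-process illustration; the class name `SlidingWindowLog` and the injectable `clock` are assumptions, and the per-request timestamps are exactly the extra cost described above.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Allow at most `limit` requests per client in any rolling window."""

    def __init__(self, limit: int, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the example can be tested
        self.log = defaultdict(deque)  # client -> timestamps, oldest first

    def allow(self, client: str) -> bool:
        now = self.clock()
        stamps = self.log[client]
        # Drop timestamps that have fallen out of the rolling window.
        while stamps and stamps[0] <= now - self.window_s:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True
```

With a limit of 100, a burst of 100 requests at second 59 followed by 100 more at second 60 now admits only the first 100, because the second burst is judged against the last 60 seconds rather than a fresh counter.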
Token Buckets Quietly Became Extremely Popular
One of the most elegant rate limiting strategies is the token bucket.
Imagine a bucket filling gradually:
+1 token every second
Each request consumes one token.
If tokens remain:
Request Allowed
If empty:
Request Rejected
This creates an important balance:
small bursts remain possible,
sustained abuse gets limited.
Which matches real-world traffic behavior surprisingly well.
Because users naturally generate uneven bursts occasionally.
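A token bucket can be sketched with two numbers: the current token count and the time of the last refill. The version below refills lazily on each call instead of running a timer. The names `TokenBucket`, `rate_per_s`, `capacity`, and the injectable `clock` are assumptions for illustration.

```python
import time


class TokenBucket:
    """Refill at `rate_per_s` tokens per second, holding at most `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock  # injectable so the example can be tested
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity sets how large a burst can be, while the refill rate sets the sustained throughput. Tuning the two independently is what makes the algorithm flexible.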
Leaky Buckets Think Differently
Another common approach is the leaky bucket algorithm.
Imagine traffic entering a bucket:
Requests Enter Quickly
But leaving at fixed speed:
Process At Controlled Rate
Excess traffic overflows and gets rejected.
This creates smoother output traffic.
And interestingly, leaky buckets behave much more like traffic shaping systems than simple counters.
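The queue form of a leaky bucket can be sketched as a bounded queue plus a release schedule. The class name `LeakyBucketQueue`, its `offer` and `drain` methods, and the injectable `clock` are all assumptions for illustration. The sketch also assumes a worker loop calls `drain` frequently; if it is called rarely, several due requests are released together.

```python
import time
from collections import deque


class LeakyBucketQueue:
    """Bounded queue that releases requests at a fixed rate."""

    def __init__(self, leak_rate_per_s: float, capacity: int, clock=time.monotonic):
        self.interval = 1.0 / leak_rate_per_s  # seconds between releases
        self.capacity = capacity
        self.clock = clock  # injectable so the example can be tested
        self.queue = deque()
        self.next_release = clock()

    def offer(self, request) -> bool:
        """Enqueue a request, or return False if the bucket overflows."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # An idle bucket must not bank release slots for a later burst.
            self.next_release = max(self.next_release, self.clock())
        self.queue.append(request)
        return True

    def drain(self) -> list:
        """Return the requests that are due for processing by now."""
        now = self.clock()
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released
```

Unlike the token bucket, a burst here is not passed through. It is either queued and released at the fixed rate or dropped on overflow, which is why the output looks like shaped traffic.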
Large infrastructures often combine:
token buckets,
sliding windows,
leaky buckets
depending on workload characteristics.
Distributed Rate Limiting Becomes Hard Very Quickly
At small scale, one server can track request counts locally.
Then infrastructure scales horizontally.
Now:
multiple API servers,
multiple regions,
multiple edge nodes
all process requests simultaneously.
Suddenly local counters stop working correctly.
Because:
Server A sees 40 requests,
Server B sees 50 requests,
Server C sees 30 requests.
Globally the client made 120 requests, over the 100-per-minute limit from the earlier example.
Individually no server realizes it.
And this is where distributed coordination reappears.
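The blind spot can be shown with the numbers above. This sketch assumes the 100-requests-per-minute limit used earlier in the article and three hypothetical server names.

```python
LIMIT = 100

# Requests one client sent in the current window, as seen by each server.
seen_by_server = {"server_a": 40, "server_b": 50, "server_c": 30}

# Each server compares only its own count against the limit.
local_verdicts = {name: count <= LIMIT for name, count in seen_by_server.items()}
global_total = sum(seen_by_server.values())

print(local_verdicts)        # every server thinks the client is within limits
print(global_total > LIMIT)  # but the combined total is over the limit
```

Dividing the limit evenly between servers is one workaround, but it breaks down as soon as traffic is unevenly balanced, which is why shared state is the more common answer.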
Redis Quietly Became The Backbone Of Rate Limiting
One reason Redis dominates distributed rate limiting is shared state.
All servers can update centralized counters:
API Servers
↓
Redis Counter
Now limits remain globally consistent across infrastructure.
Redis operations like:
INCR
EXPIRE
made distributed counters operationally practical at scale.
And honestly, a great many production systems quietly depend on Redis-backed rate limiting underneath.
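A shared counter might be used from application code roughly like this. The `allow` function is a hypothetical helper that works with any client object exposing `incr` and `expire`, which the redis-py client does. `FakeRedis` below is only an in-memory stand-in so the sketch runs without a server; it records TTLs but does not actually expire keys.

```python
class FakeRedis:
    """In-memory stand-in exposing only incr/expire, for illustration."""

    def __init__(self):
        self.values = {}
        self.ttls = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds  # recorded only; nothing is evicted here
        return True


def allow(client, user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Count one request against a shared per-user counter."""
    key = f"user:{user_id}"
    count = client.incr(key)
    if count == 1:
        # First request of the window: start the countdown once.
        client.expire(key, window_s)
    return count <= limit
```

Because the increment and the expiry are two separate commands, a crash between them could leave a counter without a TTL. Real deployments often wrap both in a Lua script or a transaction to close that gap.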
Edge Rate Limiting Changed Internet Infrastructure
As systems grew globally, rate limiting moved closer to users themselves.
Instead of protecting only origin servers:
User → Origin
CDNs and API gateways started limiting traffic at the edge:
User → Edge → Origin
This dramatically reduced:
origin overload,
malicious traffic pressure,
unnecessary infrastructure cost.
And this became foundational for:
Cloudflare,
API gateways,
DDoS mitigation systems,
modern edge platforms.
Because rejecting harmful traffic earlier is much cheaper than processing it centrally.
Retry Storms Quietly Destroy Systems
One of the most dangerous overload patterns modern systems encounter is retries during partial failures.
Example:
API Slows Down
↓
Clients Retry Aggressively
↓
Traffic Increases Further
↓
Infrastructure Slows More
This feedback loop can destroy healthy systems extremely quickly.
Rate limiting helps break this cycle.
By rejecting excess requests early:
429 Too Many Requests
systems preserve stability instead of collapsing entirely.
This is one reason resilience engineering increasingly focuses on controlled degradation rather than unlimited acceptance.
Different Users Often Need Different Limits
Large systems rarely apply identical rules globally.
Examples:
anonymous users,
authenticated users,
premium customers,
internal services,
admin APIs
may all receive different quotas.
Example:
Free Tier → 100 requests/min
Premium → 10,000 requests/min
Internal Services → Unlimited
This transforms rate limiting into infrastructure policy management rather than simple request counting.
And eventually rate limiting becomes deeply connected to:
billing,
abuse prevention,
infrastructure economics.
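Tiered quotas like the ones above often end up as a small policy table that the limiter consults. The sketch below uses the example numbers from this section; the tier keys, the `within_quota` helper, and the choice to treat unknown tiers as free tier are assumptions for illustration.

```python
from typing import Optional

# Requests per minute by tier; None means no limit is enforced.
QUOTAS: dict[str, Optional[int]] = {
    "free": 100,
    "premium": 10_000,
    "internal": None,
}


def within_quota(tier: str, used_this_minute: int) -> bool:
    """Return True if one more request fits inside the tier's quota."""
    limit = QUOTAS.get(tier, QUOTAS["free"])  # unknown tiers fall back to free
    return limit is None or used_this_minute < limit
```

Keeping the policy in data rather than in code means quotas can change with pricing or abuse patterns without touching the limiter itself.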
Rate Limiting Quietly Shapes User Experience
One subtle thing many engineers underestimate:
rate limiting affects product behavior directly.
Too strict:
users become frustrated,
APIs feel unreliable.
Too loose:
infrastructure becomes vulnerable.
Good rate limiting feels almost invisible.
The best systems:
absorb normal usage smoothly,
stop abusive behavior gracefully,
protect infrastructure quietly.
That balance is much harder operationally than it initially appears.
Modern Infrastructure Increasingly Assumes Traffic Is Untrustworthy
One of the deepest shifts large-scale systems eventually make is philosophical.
Early systems often assume:
users behave normally,
traffic remains predictable,
clients retry responsibly.
Large-scale infrastructure assumes the opposite.
Traffic may become:
malicious,
buggy,
explosive,
automated,
unpredictable.
And architecture evolves accordingly.
Rate limiting became foundational because modern internet systems must survive hostile and chaotic traffic conditions continuously.
One Of The Most Important Infrastructure Lessons
Rate limiting teaches something fundamental:
infrastructure survivability depends as much on rejecting work as processing work.
This is a difficult mindset shift initially.
Because engineers naturally focus on:
scaling servers,
improving throughput,
handling more traffic.
But resilient systems also know when to say:
No
That ability often determines whether systems degrade gracefully or collapse catastrophically under pressure.
Final Thoughts
At small scale, unrestricted traffic feels harmless.
Then infrastructure grows.
Bots appear.
Retries amplify failures.
Traffic spikes become unpredictable.
And eventually systems realize they cannot safely process unlimited requests continuously.
That is where rate limiting enters the architecture.
It protects infrastructure.
Controls pressure.
Absorbs abuse.
And allows distributed systems to remain stable under chaotic real-world traffic conditions.
Because modern internet infrastructure survives not only by scaling aggressively, but also by controlling how pressure enters the system in the first place.
Up Next In This Series
API Gateways
Including:
centralized API management
authentication and authorization
routing and aggregation
rate limiting at gateways
service discovery
API composition
and why microservice architectures increasingly rely on gateway layers