by Arpit Kumar
26 Mar, 2024
10 minute read
Building a Rate Limiter or RPM based Throttler for your API/worker

Building a simple rate limiter/throttler based on the GCRA algorithm and a Redis Lua script


When it comes to managing the flow of events or API requests, a rate limiting or throttling system is a crucial control mechanism. It works as a safety harness for your workloads and helps keep your system running smoothly. There are multiple ways in which a rate limiter/throttler can help regulate workloads; the points below illustrate why it is worth building this layer from day one.

  • Protecting Resources: Every system has its limits. If left unchecked, a flood of data can overwhelm it, causing crashes, slowdowns, and overall poor performance. Rate limiting and throttling step in as safeguards, ensuring that resource usage stays within manageable bounds.
  • Ensuring Fairness: Without limits in place, some users or applications could monopolize the system’s resources with excessive requests, leaving others in the dust. Rate limiting levels the playing field, ensuring fair access for everyone.
  • Preventing Abuse: There are always those who might try to exploit systems for malicious purposes, launching attacks that flood the system and disrupt operations. Rate limiting is a key defense, identifying and blocking such attacks before they can cause harm.
  • Managing Data Flow: Certain tasks, like handling large volumes of data through APIs, require careful management. Throttling enables controlled data delivery, preventing sudden spikes that could overwhelm downstream systems or exceed bandwidth limits.
  • Cost Optimization: In the world of cloud services, unchecked resource usage can quickly lead to ballooning costs. Rate limiting keeps spending in check by preventing excessive resource consumption.
  • Improving User Experience: By avoiding overload situations, rate limiting and throttling contribute to a smoother user experience. Users won’t have to deal with sluggish performance or sudden system outages caused by too much traffic.
  • Security Measures: Rate limiting isn’t just about performance—it’s also a valuable security tool. By monitoring traffic patterns, it can help detect and mitigate potential attacks or unauthorized access attempts.

There are various approaches to implementing a rate limiting system. In this post, we will explore a simplified rate limiting system using Redis and the Generic Cell Rate Algorithm (GCRA). Depending on your specific workload, a different algorithm might be more suitable, so it is worth reviewing the list of alternatives at the bottom of this post to gain a broader understanding of the options.

Rate Limiting/Throttling using Generic Cell Rate Algorithm

Personally, I favor the Generic Cell Rate Algorithm (GCRA) for its low memory consumption, comprehensibility, and fairness. GCRA operates much like a leaky bucket:

  • The Bucket: Picture a bucket with a set size, symbolizing the permitted data rate (e.g., 100 requests per second).
  • The Leak: At the bottom of the bucket, there’s a hole allowing water (data) to trickle out at a steady pace (e.g., 1 request every 10 milliseconds).
  • Incoming Traffic: Data packets (cells) enter akin to water being poured into the bucket.

How GCRA functions in practice

  • Cells Within Limits: If incoming cells arrive at a rate slower than the leak, they flow through seamlessly.

  • Cell Bursts: GCRA accommodates some “burstiness.” The bucket can absorb short bursts of traffic above the average rate, up to its capacity.

  • Throttling Activation: Once cells start arriving faster than the leak can handle, the bucket overflows. This is where throttling intervenes. Excessive cells are either:

    • Delayed: Held back until space becomes available in the bucket.
    • Dropped: Discarded if the queue becomes too lengthy.
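
To make the mechanics concrete, here is a minimal in-memory sketch of GCRA. This is a single-process illustration only, not the Redis-backed version we build later in this post; the burst parameter is an assumption for the sake of the example.

import time

class SimpleGCRA:
    # Minimal in-memory GCRA sketch; illustrative only, not the Redis version below.
    def __init__(self, rpm, burst=1):
        self.interval = 60.0 / rpm                          # emission interval in seconds
        self.burst_allowance = (burst - 1) * self.interval  # how far ahead the TAT may run
        self.tat = 0.0                                      # theoretical arrival time of the next request

    def allow(self):
        now = time.monotonic()
        tat = max(self.tat, now)            # never schedule the next slot in the past
        if tat - now > self.burst_allowance:
            return False                    # bucket is full: drop (or delay) this request
        self.tat = tat + self.interval      # the "leak": push the next slot one interval out
        return True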

Dropping the request is the simpler option: if the bucket is full, we drop the request and wait for the leak to free up space in the bucket. The same behavior can also be simulated in an alternative way. Let’s assume each request passes through our rate limiter system first.

Learn using an easy example

The foundation of this implementation revolves around determining the optimal interval between two requests, guided by the system’s permissible Requests Per Minute (RPM).

So, for example, if a system is allowed to process 60 RPM, each request should be spaced 1s apart. The rate limiter essentially figures out, based on the previous request, the theoretical time of arrival (TAT) of the next one. For the first request, it returns a wait time of zero and sets the TAT to current time + (60000/rpm) milliseconds. If a TAT is already stored on the Redis server, it checks whether the current time has passed it. If the TAT is ahead of the server time, it returns TAT - server time as the wait time for the request.
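
A quick sanity check of the arithmetic (values are illustrative):

rpm = 60
interval_ms = 60_000 / rpm   # 1000 ms: one request per second
rpm = 600
interval_ms = 60_000 / rpm   # 100 ms: ten requests per second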

Based on the wait time, the client can decide whether to drop the request or wait for the returned duration before processing it. When delaying execution, there are two options: the client can retry after the prescribed interval, or it can wait until that time and complete the processing without calling the rate limiter again.

Show me the code

Let’s look at two versions of the algorithm. We will implement the first version, which calculates the Theoretical Arrival Time (TAT) and returns the time to wait before processing. We will use a Redis Lua script so that reading and updating the TAT happens atomically.

Implementation 1

In this implementation, we return the wait time until this request can be processed and update the TAT. It assumes requests will not be retried but will be processed once the wait time is over.

    
import hashlib
import random


class DynamicThrottleCache(object):

    def __init__(self):
        # RedisConnectionFactory is assumed to be a project helper returning a redis-py client
        self.redis_connection = RedisConnectionFactory().get_connection()
        self.script = """
            local key = KEYS[1]
            local server_time = redis.call('TIME')
            -- TIME returns {seconds, microseconds}; combine into one microsecond timestamp
            local arrival_time = tonumber((server_time[1] * 1000000) + server_time[2])
            local increment = tonumber(ARGV[1]) -- gap between requests, in microseconds

            local stored_tat = redis.call('GET', key) -- Get the current value for the key
            local updated_tat -- Declare updated_tat as local variable

            -- If the key doesn't exist or is nil, set the TAT to current time + increment and return 0.
            -- If the key exists, then get the TAT in number format.
            if stored_tat == false or stored_tat == nil then
                updated_tat = arrival_time + increment
                redis.call('SET', key, updated_tat) -- Update the value for the key
                return 0
            else
                stored_tat = tonumber(stored_tat)
            end

            if arrival_time <= stored_tat then
                updated_tat = stored_tat + increment
                redis.call('SET', key, updated_tat) -- Update the value for the key
                return stored_tat - arrival_time
            else
                -- in case stored_tat is older than 60 secs then reset the tat by arrival time + increment
                -- this is done to avoid any bucket filling which is older than a min, buckets older than a min are discarded
                if arrival_time - stored_tat >= 60000000 then
                    updated_tat = arrival_time + increment
                else
                    updated_tat = stored_tat + increment
                end
                redis.call('SET', key, updated_tat) -- Update the value for the key
                return 0
            end
            """

        # Check if the script is already loaded in Redis; if not, load it.
        # Note: SCRIPT EXISTS takes the SHA1 digest of the script, not its body.
        local_sha = hashlib.sha1(self.script.encode()).hexdigest()
        if not self.redis_connection.script_exists(local_sha)[0]:
            # The script is not loaded in Redis yet; load it and keep the returned sha
            self.script_sha = self.redis_connection.script_load(self.script)
        else:
            # The script is already loaded; use the locally computed sha
            self.script_sha = local_sha

    def get_waittime_in_millis(self, key, rpm):
        wait_time_in_millis = self.get_waittime_in_micros(key, rpm) / 1000
        return max(wait_time_in_millis, 0)

    def get_waittime_in_micros(self, key, rpm):
        increment = 60 * 1_000_000 / rpm  # gap between requests, in microseconds
        try:
            wait_time = self.redis_connection.evalsha(self.script_sha, 1, key, increment)
        except Exception as e:
            if 'NOSCRIPT' in str(e):
                self.script_sha = self.redis_connection.script_load(self.script)
                wait_time = self.redis_connection.evalsha(self.script_sha, 1, key, increment)
            else:
                wait_time = 60.0 * 1_000_000 / rpm  # Default wait time in case of failure
                return wait_time
        wait_time_with_jitter = wait_time * random.uniform(0.9, 1.1)
        return max(wait_time_with_jitter, 0)
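
A usage sketch for the “wait” strategy described above (the key name, the 300 RPM budget, and the handle function are placeholders, not part of the implementation): ask for the wait time once, sleep if needed, then process the request without calling the limiter again.

import time

throttler = DynamicThrottleCache()

def process_with_throttle(request):
    # 300 RPM budget for a hypothetical downstream API
    wait_ms = throttler.get_waittime_in_millis("downstream-api", 300)
    if wait_ms > 0:
        time.sleep(wait_ms / 1000.0)  # blocking wait; see the async note further below
    handle(request)  # placeholder for the actual request handler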

Implementation 2

Another implementation requires the request to be retried after the wait time, checking whether it is allowed at that point. I will leave the full implementation to the reader; it can be done by incrementing the TAT not on every request, but only on a request that arrives after the stored TAT. A sketch of the core branch follows.
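
One possible shape of that branch, sketched in Lua under the same key/TAT layout as Implementation 1 (a sketch, not a drop-in script): the TAT advances only when a request is admitted, so a rejected caller retries later without consuming budget.

-- Sketch only: admit and advance the TAT only once the arrival time
-- has passed the stored TAT; otherwise report the wait time without
-- touching the stored value.
if stored_tat == false then
    redis.call('SET', key, arrival_time + increment)
    return 0
end
stored_tat = tonumber(stored_tat)
if arrival_time >= stored_tat then
    redis.call('SET', key, arrival_time + increment)
    return 0 -- admitted
else
    return stored_tat - arrival_time -- retry after this many microseconds
end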

These two approaches should be enough to implement a decent rate limiter/throttler for processing requests. Since request processing is delayed, you also need a way to handle the delay and process or retry once the wait time ends: either the thread waits, or the retry is scheduled on some external trigger for a later time. If a thread is put to sleep, do it in an async or non-blocking way so the CPU stays available to other threads in the process.
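
For example, with asyncio the sleep yields the event loop instead of blocking a thread (a sketch reusing the throttler and the placeholder names from the usage example above; note the Redis call itself is still synchronous here):

import asyncio

async def process_with_throttle_async(request):
    wait_ms = throttler.get_waittime_in_millis("downstream-api", 300)
    if wait_ms > 0:
        await asyncio.sleep(wait_ms / 1000.0)  # non-blocking: other tasks keep running
    await handle_async(request)  # placeholder for an async request handler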

List of alternative Algorithms which are generally used for similar use cases

Token Bucket Algorithm

Maintains a fixed-size bucket of tokens that are replenished at a constant rate. Each request consumes a token, and requests are rejected if the bucket is empty. Simple to implement and provides a smooth rate-limiting experience.
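
A minimal single-process sketch of the idea (parameters are illustrative):

import time

class TokenBucket:
    # Minimal token bucket sketch; illustrative only.
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum bucket size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False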

Fixed Window Counter Algorithm

Counts the number of requests received in a fixed time window. If the count exceeds a specified threshold, requests are rejected. Provides precise rate limiting but can be sensitive to bursts of traffic.
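
A minimal sketch (illustrative):

import time

class FixedWindowCounter:
    # Minimal fixed window counter sketch; illustrative only.
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now  # start a new window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False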

Leaky Bucket Algorithm

Similar to the token bucket algorithm, but incoming requests fill a bucket that drains at a constant rate. Requests are rejected if the bucket is full when they arrive. Offers a smoother rate-limiting experience than the fixed window counter algorithm.

Sliding Window Algorithm

Maintains a sliding window of time and counts the number of requests received in that window. Requests are rejected if the count exceeds a specified threshold. Provides a flexible and adaptable rate-limiting mechanism.
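
A minimal sketch of the sliding-window-log variant (illustrative):

import time
from collections import deque

class SlidingWindowLog:
    # Minimal sliding window sketch: keep timestamps of recent requests.
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self):
        now = time.monotonic()
        # Evict timestamps that have slid out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False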

Hierarchical Rate Limiting

A multi-level rate-limiting approach where different rate limits are applied to different levels of the system. Enables fine-grained control over rate limiting and can be used to prioritize certain types of requests.
