Explores use cases, principles, and common implementation strategies, including Redis, ZooKeeper, and etcd for distributed locks
Most modern services and products are backed by digital infrastructure, where workloads are divided into smaller tasks to ensure correctness, reliability, availability, and durability.
A variety of these workloads run behind the scenes, and the majority are deployed in a distributed manner: a given task can be performed by any machine in a group of workers. Whether it’s API services, streaming jobs, or workers consuming queued tasks, distributed systems are everywhere.
From the perspective of reliability and correctness, many tasks performed by these systems must be executed only once or processed by a single worker/process at a time.
This is where distributed locks become essential. Distributed locks are a critical component in distributed systems, allowing multiple processes or nodes to coordinate access to shared resources. In this post, we’ll explore the key use cases for distributed locks and examine some common implementation approaches.
Why Distributed Locks Matter
In a distributed system, resources might be accessed and modified by multiple nodes. Without proper coordination, concurrent operations can lead to unpredictable results. Distributed locks are crucial for:
- Consistency: Ensuring that only one process modifies a resource at a time to maintain data integrity.
- Coordination: Synchronizing actions across different nodes that need to work together.
- Fault Tolerance: Handling failures gracefully and ensuring that locks are properly released or reassigned if a node crashes.
Core Principles of Distributed Locks
Distributed locks rely on a few key principles to function correctly:
- Exclusivity: At any moment, only one node should hold the lock and have access to the shared resource.
- Timeout: Locks should have a timeout to prevent indefinite holding if a node fails or becomes unresponsive.
- Fairness: Ideally, locks should be granted in the order requests are received to avoid starvation.
- Atomicity: Lock acquisition and release operations should be atomic to prevent race conditions.
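To make these properties concrete, here is a deliberately simplified single-process sketch of the lock contract; real distributed locks move this state into a shared store, and fairness is omitted for brevity (all names are illustrative):
import threading
import time
import uuid

class TtlLock:
    def __init__(self):
        self._mutex = threading.Lock()  # makes acquire/release atomic
        self._holder = None             # token of the current holder, if any
        self._expires_at = 0.0          # deadline enforcing the timeout rule

    def acquire(self, ttl):
        with self._mutex:
            now = time.monotonic()
            if self._holder is None or now >= self._expires_at:
                self._holder = uuid.uuid4().hex  # exclusivity: one holder at a time
                self._expires_at = now + ttl
                return self._holder
            return None  # a live lock is held by someone else

    def release(self, token):
        with self._mutex:
            if self._holder == token:  # only the holder may release
                self._holder = None
                return True
            return False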
Use Cases for Distributed Locks
- Resource Contention Management: Distributed locks help manage access to shared resources like databases, files, or API rate limits. They ensure that only one process at a time can modify a resource, preventing data corruption or inconsistencies.
- Leader Election: In distributed systems, it’s often necessary to designate a leader node for coordination tasks. Distributed locks can facilitate leader election by allowing nodes to compete for a lock, with the winner becoming the leader.
- Distributed Cron Jobs: When running scheduled tasks across multiple nodes, distributed locks prevent duplicate job execution; only the node that acquires the lock will run the task (see the sketch after this list).
- Concurrency Control in Microservices: Microservices often need to coordinate operations across multiple services. Distributed locks can ensure that critical sections of code are executed by only one service instance at a time.
- Distributed Transactions: In systems without built-in distributed transaction support, locks can be used to implement basic transactional semantics across multiple resources.
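For instance, a nightly job deployed on several nodes can guard itself with a lock so that only one replica runs it per window. A minimal sketch using the redis-py client; the key name, TTL, and generate_report function are illustrative assumptions:
import redis

r = redis.Redis(host='localhost', port=6379)

def generate_report():
    print("building report...")  # stand-in for the real scheduled work

def run_nightly_report():
    # NX + EX: only the first node's SET succeeds within the 1-hour window
    if not r.set('cron:nightly_report', 'running', nx=True, ex=3600):
        return  # another node already claimed this run
    try:
        generate_report()
    finally:
        r.delete('cron:nightly_report')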
Distributed Lock Implementations
Let’s examine some common approaches to implementing distributed locks:
1. Database-Level Locks
Many relational databases provide features that can be used to implement distributed locks:
BEGIN;
-- Try to insert the lock row; RETURNING yields a row only if we inserted it
INSERT INTO distributed_locks (key, locked_at)
VALUES ('my_key', NOW())
ON CONFLICT (key) DO NOTHING
RETURNING key;
-- If a row came back, the lock is ours: perform the critical operation
COMMIT;
However, this approach adds load on the database and increases the risk of deadlocks if not implemented carefully.
Another way to implement locking in databases is with SELECT … FOR UPDATE:
BEGIN;
-- Ensure the lock exists and acquire it
INSERT INTO distributed_locks (lock_key)
VALUES ('my_lock_key')
ON CONFLICT (lock_key) DO NOTHING;
-- Lock the row
SELECT *
FROM distributed_locks
WHERE lock_key = 'my_lock_key'
FOR UPDATE;
-- Critical section: perform your operations here
COMMIT;
The lock is automatically released when the transaction ends (COMMIT or ROLLBACK).
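PostgreSQL also offers advisory locks, which avoid a dedicated lock table entirely; the lock lives in the server itself and is keyed by an integer. A hedged sketch using psycopg2 (the connection string, key, and function names are assumptions):
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed connection details

def try_with_advisory_lock(lock_key, critical_section):
    with conn.cursor() as cur:
        # pg_try_advisory_lock returns true only for the first session to grab this key
        cur.execute("SELECT pg_try_advisory_lock(%s)", (lock_key,))
        if not cur.fetchone()[0]:
            return False  # someone else holds the lock
        try:
            critical_section()
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s)", (lock_key,))
        return True
Session-level advisory locks are also released automatically if the connection drops, which gives crash safety similar to a TTL.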
2. Redis-based Locks (Redlock algorithm)
Redis provides atomic operations that make it suitable for implementing distributed locks:
import redis

r = redis.Redis(host='localhost', port=6379)  # deployment-specific address

def acquire_lock(resource, ttl):
    # NX: set only if the key is absent; EX: expire after ttl seconds
    return r.set(resource, 'locked', nx=True, ex=ttl)

def release_lock(resource):
    r.delete(resource)
The Redlock algorithm extends this concept to multiple Redis instances for increased reliability.
For more information, the Redis documentation provides a comprehensive guide: https://redis.io/docs/latest/develop/use/patterns/distributed-locks/
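One caveat with the simple release above: if the lock expires while the first client is paused, that client’s DEL can remove a lock now held by someone else. The pattern in the linked Redis guide stores a random token and releases via an atomic compare-and-delete script; a sketch along those lines:
import uuid
import redis

r = redis.Redis(host='localhost', port=6379)

# Delete the key only if its value is still our token (runs atomically inside Redis)
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def acquire_lock(resource, ttl):
    token = uuid.uuid4().hex
    if r.set(resource, token, nx=True, ex=ttl):
        return token  # keep this to release later
    return None

def release_lock(resource, token):
    return r.eval(RELEASE_SCRIPT, 1, resource, token) == 1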
3. ZooKeeper
Apache ZooKeeper, a distributed coordination service, provides robust primitives for implementing distributed locks.
ZooKeeper is fast and offers strong ordering guarantees.
From the official ZooKeeper docs:
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode’s data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
ZooKeeper’s recipes page documents a lock recipe: https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_recipes_Locks
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DistributedLock {
    private final ZooKeeper zk;
    private final String lockPath;

    public DistributedLock(ZooKeeper zk, String lockPath) {
        this.zk = zk;
        this.lockPath = lockPath;
    }

    public boolean acquireLock() throws KeeperException, InterruptedException {
        try {
            // An ephemeral znode is removed automatically if our session dies,
            // so a crashed holder cannot leave the lock stuck
            zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            // Another client already holds the lock
            return false;
        }
    }

    public void releaseLock() throws KeeperException, InterruptedException {
        zk.delete(lockPath, -1); // -1 matches any znode version
    }
}
This is one of the most correct and performant implementations, but it requires running a ZooKeeper cluster, which can be challenging if the team is not experienced in operating one.
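In practice, most teams reach for a client library’s lock recipe instead of hand-rolling the znode handling above (in Java, Apache Curator’s InterProcessMutex is the usual choice). A sketch of the equivalent with the Python kazoo client; the ensemble address, path, and identifier are assumptions:
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')  # assumed ensemble address
zk.start()

# kazoo's recipe uses ephemeral sequential znodes, so waiters queue up fairly
lock = zk.Lock('/locks/my_resource', identifier='worker-1')
with lock:  # blocks until the lock is acquired
    pass    # critical section

zk.stop()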
4. etcd
etcd, a distributed key-value store, offers a lease mechanism that’s well-suited for implementing distributed locks:
import (
    "context"
    "errors"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func acquireLock(ctx context.Context, client *clientv3.Client, key string) error {
    // Tie the lock key to a lease so it expires if the holder dies
    lease, err := client.Grant(ctx, 10) // 10-second TTL
    if err != nil {
        return err
    }
    // Create the key only if it does not exist yet (CreateRevision == 0)
    resp, err := client.Txn(ctx).
        If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)).
        Then(clientv3.OpPut(key, "locked", clientv3.WithLease(lease.ID))).
        Commit()
    if err != nil {
        return err
    }
    if !resp.Succeeded {
        return errors.New("lock already held by another client")
    }
    return nil
}
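etcd’s Go client also ships an official concurrency package (clientv3/concurrency) with a ready-made mutex, so the raw transaction above is rarely needed. On the Python side, the community python-etcd3 client exposes a similar lock helper; a hedged sketch, assuming that library and its endpoint defaults:
import etcd3

client = etcd3.client(host='localhost', port=2379)  # assumed endpoint

# The lock key is attached to a lease, so it expires if this process dies
with client.lock('jobs/nightly-rollup', ttl=10):
    pass  # critical section: run the protected work here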
Best Practices and Considerations
When implementing distributed locks, keep these factors in mind:
- Fencing Tokens: Use monotonically increasing values with locks to prevent issues with stale lock holders (see the sketch after this list).
- Timeouts and TTLs: Always use timeouts when acquiring locks and TTLs on the locks themselves to prevent indefinite blocking.
- Reentrancy: Consider whether your locks need to be reentrant (allowing the same process to acquire the lock multiple times).
- Lock Granularity: Balance between fine-grained locks for maximum concurrency and coarse-grained locks for simplicity.
- Failure Handling: Design your system to handle cases where lock acquisition fails or where the lock-holding process crashes.
- Performance Impact: Monitor the performance impact of your locking system, especially under high concurrency.
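To make the fencing-token bullet concrete: the lock service hands out a monotonically increasing number with each grant, and the protected resource rejects requests carrying a number older than the newest it has seen. A minimal, storage-agnostic sketch (all names are illustrative):
class FencedStore:
    """A resource that refuses writes from stale lock holders."""

    def __init__(self):
        self.data = {}
        self.highest_token_seen = 0

    def write(self, token, key, value):
        # A holder that was paused past its TTL shows up with an old token
        if token < self.highest_token_seen:
            raise PermissionError(f"stale fencing token: {token}")
        self.highest_token_seen = token
        self.data[key] = value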
Conclusion
Distributed locks are a powerful tool for coordinating activities in distributed systems. By understanding the use cases and carefully considering implementation options, you can choose the right approach for your specific needs. Remember that distributed locks add complexity to your system, so use them judiciously and always consider simpler alternatives where appropriate.