Facebook’s TAO: Distributed Social Graph
Reading Facebook’s paper to understand how they handle the data store for their social graph
TAO (The Associations and Objects) is Facebook’s geographically distributed datastore designed to provide efficient access to its social graph. Optimized for the read-heavy workload of Facebook’s usage patterns, TAO’s strength lies in a minimal API that serves over a billion reads and millions of writes per second.
The Problem
When loading a user’s content feed, it is not feasible to pre-compute and cache the entire page: the feed must reflect the most recent posts and updates at render time. The content is therefore generated on the fly, which requires performing data-dependency and privacy checks every time it is loaded. This process involves a large number of reads from the social graph to gather the necessary data.
Facebook initially relied on a distributed MySQL deployment to serve the social graph, coupled with Memcache for caching. This setup used a look-aside caching strategy (sketched after this list), where the client nodes determined when to invalidate caches. However, this architecture encountered several issues:
- Inefficient Edge Lists: Memcache’s key-value model wasn’t ideal for edge lists. Each query fetched the entire edge list, and any change to a single edge required reloading the whole list. Supporting concurrent incremental updates to a cached edge list would have required a complex coordination system.
- Distributed Control Logic: As mentioned, cache-control logic was spread across the clients, which multiplied the potential failure points. This setup also made it difficult to prevent situations where incorrect cache invalidation or client failures could inadvertently put excessive strain on the database.
- Read-after-Write Expense: Facebook used asynchronous replication between leader and follower databases. Writes were forwarded to the leader, but the local replica might not reflect them immediately. Because the social graph’s edges weren’t flattened and clients talked directly to the database, providing read-after-write consistency was expensive. TAO mitigates this by caching the result of a write in the local region as part of the write itself.
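To make the problem concrete, here is a minimal sketch of the look-aside pattern described above, with `cache` and `db` as hypothetical stand-ins for Memcache and MySQL clients:

```python
# Look-aside caching: the client, not the cache, decides when to
# fill and invalidate entries.

def read(key, cache, db):
    value = cache.get(key)
    if value is None:          # cache miss: client fetches and fills
        value = db.query(key)
        cache.set(key, value)
    return value

def write(key, value, cache, db):
    db.update(key, value)      # write goes to the database first
    cache.delete(key)          # client is responsible for invalidation;
                               # a missed delete leaves stale data behind
```

Every client carries this logic, which is exactly why a bug or crash in any one of them could leave stale entries or stampede the database.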
TAO was created to address these issues. It is designed to handle a constantly changing, geographically distributed social graph, prioritizing efficiency and availability over strict consistency. TAO assumes that applications rarely see stale data under normal conditions, but can tolerate it when they do.
Object Model
In Facebook’s graph, objects can be thought of as the nodes, and associations as the edges. Objects are identified by a globally unique ID, while associations are identified by the triple of source object ID (`id1`), association type (`atype`), and destination object ID (`id2`). Each pair of objects can have at most one association of a given type between them, and both objects and associations can carry additional data as key-value pairs. The following representation shows the two data types (a Python rendering follows the list):
- Object: `(id) → (otype, (key → value)*)`
- Association: `(id1, atype, id2) → (time, (key → value)*)`
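To make the model concrete, here is a small Python rendering of the two types; the field names follow the paper’s notation, and everything else (the example type strings, the string-valued data) is illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Object:
    id: int                     # globally unique id (embeds a shard id)
    otype: str                  # e.g. "USER", "POST", "COMMENT"
    data: Dict[str, str] = field(default_factory=dict)   # (key -> value)*

@dataclass
class Association:
    id1: int                    # source object id
    atype: str                  # e.g. "FRIEND", "AUTHORED", "LIKES"
    id2: int                    # destination object id
    time: int                   # timestamp used for ordering
    data: Dict[str, str] = field(default_factory=dict)   # (key -> value)*
```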
Users are represented as objects, while relationships between them are modeled as associations. Actions taken by users, such as posting or commenting, can be represented as either objects or associations. Associations can be bidirectional, like `FRIEND` or `AUTHORED`/`AUTHORED_BY`, where operations on one side automatically trigger corresponding operations on the inverse association. You can see an example of this in the following image:

The social graph includes a mix of old and new data, but most queries focus on recent items. For example, if Alice in the picture above has a large following, recent comments on her posts will dominate queries. TAO therefore maintains association lists: all associations for a given `id1` and `atype`, sorted in descending order by time. Range queries over these lists (sketched after the list below) make it cheap to render the most recent content for a user.
- Association List: `(id1, atype) → [a_new ... a_old]`
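The paper’s API exposes range queries of the form `assoc_range(id1, atype, pos, limit)`; here is a toy in-memory rendering of that idea, reusing the `Association` type from the earlier sketch (the dictionary store is, of course, a stand-in for TAO’s cache):

```python
from collections import defaultdict

# assoc_lists[(id1, atype)] holds Association objects sorted newest-first.
assoc_lists = defaultdict(list)

def assoc_add(assoc):
    lst = assoc_lists[(assoc.id1, assoc.atype)]
    lst.append(assoc)
    lst.sort(key=lambda a: a.time, reverse=True)   # keep newest-first order

def assoc_range(id1, atype, pos, limit):
    """Return up to `limit` associations starting at offset `pos`,
    newest first -- e.g. the 50 most recent comments on a post."""
    return assoc_lists[(id1, atype)][pos:pos + limit]
```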
Architecture
TAO exposes a small API of graph operations rather than raw SQL. The backing store is a sharded MySQL deployment, where each database server holds multiple shards to spread load across nodes. By default, objects are stored in one table and associations in another. Every object ID embeds a shard ID indicating its database location, and an object’s associations are sharded by `id1`, so they live alongside their source object.
The caching layer of TAO acts as an intermediary between the clients and the database, with multiple cache servers forming what the paper calls a tier. Each request is routed to the appropriate cache server, following the same sharding scheme as the database. Clients interact only with cache servers, which handle both reads and writes; on a cache miss or a write, a cache server contacts the database or other cache servers. Because TAO’s caching system understands the graph’s structure, it can efficiently answer queries it has never seen before. For example, a cached count of zero can suffice to answer a range query.
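To make the routing concrete, here is a toy sketch of the two mapping steps: a shard ID embedded in the object ID (the modulo encoding is purely illustrative) and a small consistent-hash ring mapping shards to cache servers. The shard count and server names are invented for the example:

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4096  # hypothetical shard count

def shard_of(object_id):
    # TAO embeds the shard id in the object id; taking the low bits
    # is just an illustrative encoding.
    return object_id % NUM_SHARDS

class HashRing:
    """Toy consistent-hash ring mapping shards to cache servers."""
    def __init__(self, servers, replicas=100):
        self.ring = sorted(
            (self._hash(f"{s}:{i}"), s)
            for s in servers for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def server_for(self, shard):
        # First ring position at or after the shard's hash, wrapping around.
        i = bisect_right(self.keys, self._hash(str(shard))) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["cache1", "cache2", "cache3"])
print(ring.server_for(shard_of(123456789)))   # which server owns this object
```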
Writes to associations with an inverse require the cache server for `id1` to notify the server responsible for `id2` via an RPC call. TAO does not provide atomicity between these two updates; if a failure occurs, asynchronous repair jobs later fix the hanging association.
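A rough sketch of such a bidirectional write, reusing `shard_of` and `assoc_add` from the sketches above; `rpc` and `repair_queue` are hypothetical stand-ins for the RPC client and the repair job’s work queue:

```python
def add_bidirectional(assoc, inverse_atype, rpc, repair_queue):
    """Write an association and its inverse. The two updates are not
    atomic: if the inverse write fails, a repair job fixes it up later."""
    assoc_add(assoc)  # forward edge, written on id1's shard
    try:
        # Ask the server that owns id2's shard to write the inverse edge.
        rpc.call(shard_of(assoc.id2), "assoc_add",
                 id1=assoc.id2, atype=inverse_atype,
                 id2=assoc.id1, time=assoc.time)
    except Exception:
        # Leave the forward edge in place; an asynchronous job retries
        # the inverse write and repairs the hanging association.
        repair_queue.put((assoc, inverse_atype))
```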
TAO’s Caching and Storage Architecture

To optimize performance and data integrity, TAO’s caches are organized into a leader tier and multiple follower tiers, and clients communicate only with followers. Shards are assigned to cache servers using consistent hashing. Because every request for a given shard funnels through a single leader, cache-control logic stays simple and read-after-write consistency improves.
Follower cache misses are forwarded to the leader, and leader misses to the local database copy. Write operations are forwarded from the follower that received them to the leader of the local cache tier, and from there to the leader database. Once the write succeeds, the leader database acknowledges the cache leader, which synchronously updates the follower that accepted the request. The other followers are updated asynchronously by the leader.
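A minimal sketch of this write path, with `follower`, `leader_cache`, and `leader_db` as hypothetical stand-ins for the real servers:

```python
def write_through_tier(follower, leader_cache, leader_db, key, value):
    """Heavily simplified TAO-style write path:
    follower -> tier leader -> leader database, then cache updates."""
    # The follower that received the request has forwarded it here,
    # to the leader of its tier, which issues the database write.
    leader_db.update(key, value)
    # On success, the originating follower is updated synchronously,
    # giving the writing client read-after-write consistency.
    follower.set(key, value)
    # Other followers learn of the change via asynchronous invalidations.
    leader_cache.broadcast_invalidate(key, exclude=follower)
```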
The MySQL shards are geographically replicated, and each replica is an exact copy of the leader database. As mentioned earlier, a single server can host multiple shards, so it may simultaneously act as leader for one shard and follower for another. While the design isn’t fully consistent across all scenarios, it does provide read-after-write consistency to a user who stays within the same region, since the local cache is updated as part of the write under normal circumstances.
TAO’s cache servers partition their RAM into arenas, with each arena handling specific object and association types. This allows cache lifetimes to be tuned per type, and isolates problematic data (poorly behaved types) so it cannot evict everything else.
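As a hypothetical illustration of arenas (the types, sizes, and TTLs below are invented for the example):

```python
# Each type gets its own slab of cache memory with its own lifetime,
# so a fast-churning type cannot evict long-lived data.
ARENA_CONFIG = {
    "USER":    {"size_mb": 512, "ttl_s": 3600},
    "POST":    {"size_mb": 256, "ttl_s": 600},
    "COMMENT": {"size_mb": 128, "ttl_s": 120},   # churns fast, short TTL
}

def arena_for(otype):
    # Unknown types fall back to a small default arena.
    return ARENA_CONFIG.get(otype, {"size_mb": 64, "ttl_s": 300})
```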
SQL Mapping
Internally, an object’s fields are serialized into a single `data` column, which holds all the data associated with that object. This simplifies storage, as it doesn’t require creating special tables for different object types; associations are stored similarly. To avoid expensive `SELECT COUNT(*)` queries (such as when calculating comment or like counts), association counts are maintained in a separate table.
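A sketch of both ideas, assuming JSON stands in for whatever serialization is actually used, and with an illustrative `assoc_count` table and a hypothetical `db.query_one` helper:

```python
import json

def serialize_object(obj):
    """All per-type fields go into one opaque `data` column, so every
    object type shares the same table schema."""
    return (obj.id, obj.otype, json.dumps(obj.data))

def assoc_count(db, id1, atype):
    # Read from a separately maintained count table instead of running
    # SELECT COUNT(*) over the association rows themselves.
    return db.query_one(
        "SELECT count FROM assoc_count WHERE id1 = %s AND atype = %s",
        (id1, atype),
    )
```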
High Degree Objects
Certain objects, such as posts by celebrities, may have a high degree (i.e., a large number of associations). When an object has many comments or likes, queries over it may frequently hit the database, especially for time-sorted data. TAO addresses this by comparing the degrees of the two endpoints (`id1` and `id2`) and choosing the query direction accordingly: if `id2` is a low-degree object (e.g., a user with few followers), starting the query from the inverse side is often more efficient.
In cases where both endpoints of an edge have a high degree, TAO leverages domain-specific knowledge. For example, since the time of an association is often close to the creation time of the object it points at, the system can limit the search to associations created after that object. If those newer associations are cached, the query can be answered directly by a TAO follower.
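A sketch of the degree comparison, with `store` and `degree` as stand-ins for the cached association lists and counts:

```python
def assoc_exists(store, degree, id1, atype, id2, inverse_atype):
    """Check whether the edge (id1, atype, id2) exists, scanning from the
    lower-degree endpoint. `store[(id, atype)]` is an association list and
    `degree(id, atype)` returns its cached count."""
    if degree(id2, inverse_atype) < degree(id1, atype):
        # Fewer edges on id2's side: scan its inverse list instead.
        return any(a.id2 == id1 for a in store[(id2, inverse_atype)])
    return any(a.id2 == id2 for a in store[(id1, atype)])
```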
Consistency
TAO is designed to be eventually consistent under normal operation. Replication lag is typically less than one second, and read-after-write consistency is provided within a single tier: writes are synchronously applied to the leader and the originating follower caches after being committed to the database. For inverse relationships, the write is forwarded synchronously to the cache server responsible for `id2` within the same tier.
To manage potential inconsistencies between cache followers, TAO uses version numbers associated with each write. This allows cache servers to determine if they have outdated data that needs to be invalidated.
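A minimal sketch of version-checked updates, where `cache` is a hypothetical client storing `(version, value)` pairs:

```python
def apply_update(cache, key, value, version):
    """Install an update only if it is newer than what the cache holds,
    so a late-arriving, older message cannot clobber fresher data."""
    entry = cache.get(key)              # entry is a (version, value) pair
    if entry is None or version > entry[0]:
        cache.set(key, (version, value))
    # else: the message is stale; drop it
```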
There is a potential race condition in TAO’s design. A follower region’s storage server might hold an older version of the data than the one cached by a follower tier. If the freshly written entry is evicted from the cache due to an invalidation and then reloaded from the stale replica database, a client can observe the data roll back to an earlier state within a single follower tier. This can only happen if the replica takes longer to receive the update than the cache takes to evict the item, which is rare in practice. While TAO doesn’t guarantee full consistency, the leader database is always the authoritative source of truth, and reads can be directed to it when necessary.
Quick Note on Failures
- Database Failures: A database that fails health checks is marked as down in a global configuration. When a leader database fails, a follower is promoted to leader and operation continues; when the old leader recovers, it catches up on any missed replication messages.
- Leader Failures: If a leader cache fails, read misses are routed directly to the database, and writes are rerouted to a randomly chosen member of the leader tier, which handles them on the failed server’s behalf. When the failed leader comes back online, the replacement sends asynchronous invalidation messages to restore its consistency.
Conclusion
Facebook’s TAO paper was a fascinating read for me. Overlaying the social graph on top of a sharded MySQL deployment while still providing read-after-write consistency in most cases is a neat trick, and the extensive caching, done in a way that flattens the graph for quick queries, is equally impressive. The paper describes how TAO became a rapid enabler for product teams, since new applications can build on top of existing data while keeping consistent semantics across all of them. It also explains how separating the storage and caching layers allowed the two to evolve independently to suit their respective needs.
Sources:
- Bronson et al., “TAO: Facebook’s Distributed Data Store for the Social Graph,” USENIX ATC 2013.