If you don’t have a Flickr account, go sign up now. Dathan Pattishall gave a talk at the MySQL Conference and Expo about how Flickr federated their database to handle their growing site’s needs.
He began by discussing some of the problems they encountered with a master-slave topology: slave lag, multiple single points of failure, an inability to handle the load, and an inability to search content.

Since the site is write-intensive, they needed more than one master so that they could get more write points. This also helped with their single-point-of-failure issues.
They designed their database to be federated across several MySQL servers. They set up a master-master replication “global ring” that keeps track of globally unique IDs and contains data that can’t be replicated. It also keeps track of the shards and which “shard” each user’s data is on. Shard information is cached in memcached on the PHP side to reduce database load for such a frequently accessed piece of information. They created two tables to keep track of the globally unique IDs: one for 32-bit integers and one for 64-bit integers.
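The two ID tables essentially behave like global counters. As a rough sketch of the idea (not Flickr’s actual implementation — in their setup these are MySQL tables in the global ring, and all names here are hypothetical), an in-memory counter can stand in for each ticket table:

```python
# Minimal sketch of global unique-ID allocation (all names hypothetical).
# In Flickr's architecture these would be tables in the global ring;
# here a simple counter stands in for each one.

class TicketTable:
    """Hands out monotonically increasing IDs, capped at the bit width."""

    def __init__(self, bits):
        self.max_id = 2 ** bits - 1
        self.current = 0

    def next_id(self):
        if self.current >= self.max_id:
            raise OverflowError("ticket table exhausted")
        self.current += 1
        return self.current

tickets32 = TicketTable(32)  # smaller keys
tickets64 = TicketTable(64)  # keys expected to grow very large

photo_id = tickets64.next_id()
```

Because the IDs come from one authoritative place in the global ring, every shard can create rows without ever colliding with another shard.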
Then they set up a bunch of shards, each consisting of two servers, to store the user’s data. When a new user is created, they are tied to a random shard. When viewing a user’s photos, the PHP code looks up which shard the user is on, then connects to the shard using the information stored in memcached. Each shard is assigned a weight that allows them to assign the appropriate load for the hardware’s capabilities. It is technically possible for the shards to have more than two servers, but I believe they said they are only using two.
Since users are distributed across all of the shards, it is possible for a particular shard to have more load than other shards. They developed an external process to re-balance the shards. I imagine they use this process to re-balance the shards when new shards are added.
One of the big problems was replication lag. To cope with it, the user is assigned to a bucket on every page load, which determines the server within the shard they connect to. If that MySQL server is down, the PHP code tries the next server in the shard. If all of the shard’s servers are down, an error message is displayed.
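The failover behavior is straightforward to sketch (again hypothetical, not their actual code): start at the user’s bucketed server and walk through the rest of the shard until a connection succeeds.

```python
# Hypothetical sketch of per-page-load failover: try the user's bucketed
# server first, then the remaining servers in the shard in order.

class AllServersDown(Exception):
    pass

def connect_with_failover(user_id, shard_servers, try_connect):
    """try_connect(server) returns a connection or raises ConnectionError."""
    start = user_id % len(shard_servers)  # bucket: a stable starting server
    for offset in range(len(shard_servers)):
        server = shard_servers[(start + offset) % len(shard_servers)]
        try:
            return try_connect(server)
        except ConnectionError:
            continue  # this server is down: try the next one in the shard
    raise AllServersDown("all servers in the shard are down")
```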
It was essential for them to have a maintenance process that didn’t interfere with uptime. Each server in the shard carries no more than 50% of the shard’s load. When they need to perform maintenance, they shut down half of the servers in each shard, administer the updates, restart those servers, then repeat on the remaining servers.
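The procedure amounts to a rolling update in two halves per shard, which can be sketched as follows (hypothetical; `apply_update` stands in for the shutdown/update/restart steps):

```python
# Hypothetical sketch of zero-downtime maintenance: update half of each
# shard while the other half keeps serving, then swap.

def rolling_maintenance(shards, apply_update):
    """shards maps a shard name to its list of servers."""
    for shard_name, servers in shards.items():
        half = len(servers) // 2
        for group in (servers[:half], servers[half:]):
            for server in group:       # the other group stays online
                apply_update(server)   # shut down, update, restart
```

This only works because each server runs at under 50% load; otherwise the surviving half could not absorb the shard’s full traffic.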
When Flickr started out, they had 12 shards. Now they have many more, distributed across two data centers. They process over 36,000 SQL transactions per second across 800 connections per shard, which is about half their capacity. They set MySQL to use 32 concurrent threads for InnoDB tables.
Each shard is composed of several servers, which can be of any class. They favor 2U rackmount 64-bit servers running Red Hat Enterprise Linux 4 with 16GB of RAM and 6 x 15K RPM drives in RAID 10. They can balance users between the servers to adjust for each server’s processing power. Just by adding more machines, they can handle the several terabytes of data they store. Of the 16GB of RAM, about 13GB is used for InnoDB’s buffer pool. They said they have about 120GB to 500GB of data per shard.
Flickr uses two search back-ends: the first is the shards themselves, and the second is a proprietary Yahoo! search system. Single-tag searches are handled by the shards due to the real-time requirements. All other searches are handled by Yahoo!’s search back-end.
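The routing rule as described is simple enough to express directly (a hypothetical sketch of the decision, not their code):

```python
# Hypothetical sketch of the search routing rule described in the talk:
# single-tag searches go to the shards (real-time), everything else goes
# to the Yahoo! search back-end.

def route_search(query_tags, full_text=False):
    if not full_text and len(query_tags) == 1:
        return "shards"   # real-time single-tag lookup
    return "yahoo"        # multi-tag and full-text searches
```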
A lot of work goes into building a federated database, but the ability to handle an extremely large amount of load is well worth it. I found their architecture very interesting, and I thank them for being so open about their system.