Scaling out MySQL: Hardware Today and Tomorrow
For starters, Jeremy said to go with 64-bit hardware and a 64-bit operating system. For CPU, faster is better. The current versions of MySQL and InnoDB don’t take full advantage of 8-core servers, so unless you have the budget, Jeremy recommended a single quad-core or a dual dual-core setup. He also recommended getting as much RAM as possible; RAM is cheap, so go for 32GB, or at least 16GB.
For storage, Jeremy discussed the many options, including direct attached storage (DAS), SAN, NAS, and the various hard drive interfaces. From what I gathered, they prefer configuring each DB server with RAID 10. If the RAID controller has a battery-backed cache, you should use “write back”; otherwise use “write through”. Write back offers faster performance since the controller caches the data and doesn’t make the system wait for it to be written to disk, and the battery-backed cache means you won’t lose pending writes if the system loses power. There was a brief discussion of SATA vs. SAS. SAS drives are faster (15,000 RPM) and have onboard processors to handle commands, just as SCSI does, which improves performance. SAS has another interesting feature: a single drive can be hooked up to two separate SAS controllers in case one controller becomes unavailable.
They buy all of their gear from Dell, but HP, Sun, and IBM are good too; Dell just happens to be significantly cheaper, especially when you go through a sales rep. They also mentioned some of the smaller vendors, including SuperMicro and Silicon Mechanics. I personally really like SuperMicro’s 6015T server because it packs 2 nodes into a 1U chassis, which is denser than any blade server solution I’ve ever seen. Each node can hold two quad-core processors and 32GB of RAM. The only downsides are that each node can only have 2 hard drives and both nodes share a single non-redundant power supply. So it would make a decent slave, but you would need to architect your application so it could quickly pick another slave if/when one goes down, or use MySQL Proxy.
For databases using InnoDB, they said the InnoDB buffer pool should be 2GB less than the total system memory, so 14GB on a 16GB system. Jeremy also mentioned special hardware to speed things up, specifically Kickfire and Violin Memory. Kickfire is a SQL appliance that includes a special SQL chip to speed up operations significantly. Violin Memory’s 1010 memory appliance is sweet: for only $170k you can add 512GB of DRAM in 2U to your database server over a PCI-Express bus. It holds 84 x 6GB modules that can be hot swapped, and you can lose 2 of them before you’re screwed.
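As a concrete illustration of that rule of thumb (a sketch, not a tuned recommendation), the my.cnf for a dedicated 16GB box would carry something like:

```ini
[mysqld]
# 16GB total minus ~2GB headroom for the OS, per-connection buffers, etc.
innodb_buffer_pool_size = 14G
```

On a shared box, or one running more than MySQL, you would leave proportionally more headroom.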
Jeremy concluded with high-speed interconnects, including InfiniBand and Dolphin Interconnect. InfiniBand is fast, and you can hook multiple machines into a switch. Dolphin’s interconnect is also fast, but the devices are chained together in a loop similar to external SCSI devices, and you need to make sure they have a driver for your hardware.
I talked to Jeremy after his talk and asked him about diskless slaves, which would basically keep the data on a RAM drive. While it would be fast, it would take memory that could otherwise be used by MySQL, and the slaves would be a pain to manage when they come back online. So scratch that idea.
Helping InnoDB Scale on Servers with Many CPU Cores and Disks
One of the more popular talks was by Mark Callaghan of Google, who talked about how they managed to get InnoDB to take advantage of systems with more than 4 cores and many disks. The primary change they made was to InnoDB’s mutex code, which is used to control concurrent reads and writes to pages.
They replaced the existing pthreads mutex code with a more efficient, platform-specific compare-and-swap (CAS) CPU instruction and got much better performance. He said they are hoping to get a patch out with their changes by the end of the year; they don’t want to release it until they know it is rock solid.
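To make the idea concrete, here is a conceptual sketch of CAS semantics and the spinlock pattern built on top of it. This is illustrative Python, not Google’s actual patch: real CAS is a single atomic CPU instruction, and the `threading.Lock` here only stands in for that atomicity.

```python
import threading

# Stand-in for the hardware's atomicity guarantee; on a real CPU,
# compare-and-swap is one atomic instruction, not a locked section.
_guard = threading.Lock()

def compare_and_swap(cell, expected, new):
    """Store `new` in cell[0] only if it currently equals `expected`.

    Returns True if the swap happened, False otherwise.
    """
    with _guard:
        if cell[0] == expected:
            cell[0] = new
            return True
        return False

def spin_acquire(cell):
    """A spinlock: retry CAS until the lock word flips from 0 to 1."""
    while not compare_and_swap(cell, 0, 1):
        pass  # another thread holds the lock; spin and retry

def spin_release(cell):
    cell[0] = 0  # a plain store is enough to release
```

The win over a pthreads mutex is that an uncontended acquire is a single instruction with no kernel involvement, which matters when many cores hammer the same hot page latches.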
Scaling Heavy Concurrent Writes In Real Time
Dathan Pattishall, formerly of Flickr and now with RockYou.com, talked about an analytics system he helped build for Flickr. Flickr keeps track of each photo’s stats, including external links. Whenever someone directly embeds a photo from a Flickr Pro user, they record that information and then make those stats available in near real time.
The old design basically inserted records as they came in, but it was killing the servers, especially since those servers were also handling reads for people viewing the stats. Their solution was a separate Java daemon that queues up pending inserts. That way only a single thread writes to the MySQL server, and the inserts don’t block the web servers from serving up the information.
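The core idea of that daemon can be sketched as a single consumer thread draining a shared queue and flushing in batches. The actual Flickr daemon was written in Java and its internals weren’t shown; the names, batch size, and list-of-batches stand-in below are all illustrative.

```python
import queue
import threading

pending = queue.Queue()   # web threads enqueue stat hits here
flushed_batches = []      # stand-in for executing a multi-row INSERT
STOP = object()           # sentinel to shut the writer down

def writer_loop(batch_size=100, poll_timeout=0.05):
    """Single DB-writer thread: drain the queue and flush in batches."""
    batch = []
    while True:
        try:
            item = pending.get(timeout=poll_timeout)
        except queue.Empty:
            item = None
        if item is STOP:
            break
        if item is not None:
            batch.append(item)
        # Flush on a full batch, or when the queue goes idle
        if batch and (len(batch) >= batch_size or item is None):
            flushed_batches.append(batch)  # real code: one multi-row INSERT
            batch = []
    if batch:  # flush whatever is left at shutdown
        flushed_batches.append(batch)

# Demo: 250 queued hits end up as a handful of batched writes
writer = threading.Thread(target=writer_loop)
writer.start()
for i in range(250):
    pending.put(("http://example.com/photo.jpg", i))
pending.put(STOP)
writer.join()
```

Because only this one thread touches the stats tables, the web tier never blocks on insert latency, and the database sees a few large writes instead of a flood of tiny ones.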
They insert the stats into three tables: daily, weekly, and monthly. To keep things in order, they first tried a VARCHAR of the URL as the primary key but ran into major performance issues. Instead they decided to store a hash of the URL as a BIGINT:
// PHP
$id = hexdec(substr(md5($url), 0, 16));
This generates the 32-character MD5 hex digest of the URL, takes the first 16 characters, and converts that string of hex digits into a base-10 number. The resulting 64-bit value fits perfectly in an unsigned BIGINT.
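For comparison, the same trick in Python (hashlib standing in for PHP’s md5(); the function name is my own):

```python
import hashlib

def url_to_id(url):
    """First 64 bits of the URL's MD5 digest, for a BIGINT UNSIGNED column."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)  # 16 hex digits == 64 bits

photo_id = url_to_id("http://example.com/photo.jpg")
assert photo_id < 2 ** 64  # always fits an unsigned BIGINT
```

A fixed-width integer key keeps the index compact and the comparisons cheap, which is exactly what the VARCHAR primary key was failing at.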
He also mentioned using ibbackup for backing up the databases, but it is not a free solution.
Geo Distance Search with MySQL
Ever since the Google Maps API was released, I’ve been interested in playing around with it. Alexander Rubin of MySQL talked about ways of querying for locations within a given distance of a latitude/longitude point. He first abstracts the distance math into a user-defined function (UDF), then calls the UDF from within the query.
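The talk didn’t show the UDF’s internals, but the usual formula behind this kind of distance function is the haversine great-circle distance. Here is a sketch in Python; a SQL UDF would do the same math:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lng points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# e.g., San Francisco to Mountain View comes out to roughly 50 km
sf_to_mv = haversine_km(37.7749, -122.4194, 37.3861, -122.0839)
```

In practice you would also pre-filter with a cheap bounding-box WHERE clause on lat/lng columns so the expensive trig only runs on nearby candidates.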
I’ve already played with geo searching before, so it was mostly review. He didn’t go into much depth on topics such as MySQL’s spatial extensions.
Dinner at the Tied House
We had a great turnout of around 18 people at the Tied House in Mountain View, including folks from MySQL, PrimeBase, and Facebook.
Thanks to the PrimeBase guys! They have a neat transactional storage engine that supports streaming BLOB data. Normal BLOBs in MySQL are held in memory during a transaction; the PBXT storage engine is designed to stream BLOB data in and out very efficiently.
I’d like to give a special thanks to Jay Pipes for getting me to come to the conference this year. I truly had a great time. Thanks!