MySQL’s Falcon Storage Engine

Apr 30, 2007

Another session I attended at the MySQL Conference and Expo was about the Falcon Storage Engine, presented by Jim Starkey and Ann Harrison.

Falcon is a new transactional storage engine that is still in alpha and will be one of the huge features in the MySQL 6.0 release. Since they are wiping the slate clean, they are building the engine from the ground up to take full advantage of today's 64-bit, multi-core computers.

One of the big features is the very cool true Multi Version Concurrency Control (MVCC) system, which allows records to be updated without the overhead of locking rows or tables.

The engine supports a number of modes, including consistent read (I'm assuming this is read uncommitted), InnoDB emulation, and read committed. There are a ton of other features that you should check out.

They hope the engine will be ready for public consumption by Q1 2008, but that is certainly subject to change.


A PFX (Personal inFormation eXchange) file is a PKCS #12 archive that can contain both certificates and private keys. Windows loves this format, but if we want to use one from a Java program, we need to convert it to a Java keystore.

Before we begin, make sure you have the Java runtime installed. I’m using Java 1.5.0_07, but 1.6 may work. I doubt 1.4 or below will work.

To make this easy, we are going to leverage a utility class that is bundled with Jetty, a free Java web server. Download the latest stable version from their site, which at the time of writing is 6.1.1.

Extract the zip file to a folder, then open a terminal, change into that folder, and execute the following:

$ java -classpath lib/jetty-6.1.1.jar org.mortbay.jetty.security.PKCS12Import

You should get a message that looks like this:

usage: java PKCS12Import {pkcs12file} [newjksfile]

If you don't see this exact message, make sure you are using a compatible Java and Jetty version and that you are in the Jetty folder. Moving on.

Now that things work, you can pass the path of the PFX file and the path of the keystore to create. You will be prompted for the passwords for both files.

$ java -classpath lib/jetty-6.1.1.jar org.mortbay.jetty.security.PKCS12Import \
         MyCert.pfx MyCert.jks
Enter input keystore passphrase: ******
Enter output keystore passphrase: ******

When all is said and done, you should be looking at a shiny new Java keystore file. Verify the keystore by executing the following and entering the password you chose above:

$ keytool -list -keystore MyCert.jks -v

Enter keystore password:  ******

Keystore type: jks
Keystore provider: SUN

Your keystore contains 1 entry

Alias name: 29d1a9e13ca529ef1a32b1ea135b713_5a537e12-9c8e-833f-bb76-30ab870dd21
Creation date: Jan 1, 2007
Entry type: keyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=XXXX, OU=IT, O=XXXX, L=XXXX, ST=XXXX, C=XXXX, EMAILADDRESS=XXXX
Issuer: CN=XXXX, O=IT, L=XXXX, ST=XXXX, C=XXXX, EMAILADDRESS=XXXX
Serial number: a0032c317ba3200000e1
Valid from: Mon Jan 1 00:00:00 CDT 2007 until: Tue Jan 1 00:00:00 CDT 2008
Certificate fingerprints:
         MD5:  B1:63:6A:2C:2E:97:A4:33:E9:61:98:01:CA:0B:74:91
         SHA1: 61:98:01:04:7D:33:6C:2E:97:A4:D2:C7:61:61:B1:63:6A:2C:2E:97

Your output may vary, but you should have a valid Java keystore in the end. You can use the keytool utility to merge this keystore into another existing keystore, but I'll leave that for another day.
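
As an aside, if you do end up on Java 6, its keytool can reportedly do this conversion itself via the -importkeystore option, so the Jetty class isn't needed at all. I haven't tried this on 1.5, and the file names below are just the ones from the example above:

$ keytool -importkeystore -srckeystore MyCert.pfx -srcstoretype PKCS12 \
          -destkeystore MyCert.jks -deststoretype JKS

You'll be prompted for the source and destination passphrases just like with the Jetty utility.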


Federation at Flickr

Apr 28, 2007

If you don't have a Flickr account by now, go sign up. Dathan Pattishall gave a talk at the MySQL Conference and Expo about how Flickr federated their database to handle the site's growing needs.

He began by discussing some of the problems they encountered with a master-slave topology: slave lag, multiple single points of failure, an inability to handle the load, and an inability to search content.

Since their workload is write-intensive, they needed more than one master so that they could have more write points. This also helped with their single-point-of-failure issues.

They designed their database to be federated across several MySQL servers. They set up a master-master replication "global ring" that keeps track of globally unique IDs and contains data that can't be federated. It also keeps track of the shards and which shard each user's data is on. Shard information is cached in memcached on the PHP side to reduce database load for such a frequently used piece of information. They created two tables to keep track of the globally unique IDs: one for 32-bit integers and one for 64-bit integers.
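
To give a rough idea of what an ID table like that can look like, here is a minimal sketch (my own guess at the approach, not Flickr's actual schema; the table name is hypothetical). A single-row MyISAM table hands out IDs purely through its auto-increment counter, which MyISAM keeps across restarts:

-- Hypothetical 64-bit ticket table living on the global ring
CREATE TABLE Tickets64 (
  id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  stub CHAR(1) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  UNIQUE KEY (stub)
) ENGINE=MyISAM;

-- Grab the next globally unique ID; REPLACE keeps the table at one row
REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();

A 32-bit version would be the same table with an INT UNSIGNED column.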

Then they set up a bunch of shards, each consisting of two servers, to store the users' data. When a new user is created, they are tied to a random shard. When viewing a user's photos, the PHP code looks up which shard the user is on, then connects to that shard using the information stored in memcached. Each shard is assigned a weight that allows them to assign the appropriate load for the hardware's capabilities. It is technically possible for a shard to have more than two servers, but I believe they said they are only using two.

Since users are distributed across all of the shards, it is possible for a particular shard to carry more load than the others. They developed an external process to re-balance the shards, which I imagine they also use when new shards are added.

One of the big problems was replication lag. On every page load, the user is assigned to a bucket. If the MySQL server goes down, then the PHP code tries the next MySQL server. If all servers in the shard are down, then an error message is displayed.

It was essential for them to have a maintenance process that didn't interfere with uptime. Each server in a shard handles no more than 50% of the shard's load. When they need to perform maintenance, they shut down half of the servers in each shard, apply the updates, restart those servers, and then repeat on the remaining servers.

When Flickr started out, they had 12 shards. Now they have many more and are distributed in two different data centers. They process over 36,000 SQL transactions across 800 connections per second per shard which is about half their capacity. They set MySQL to use 32 concurrent threads for InnoDB tables.

Each shard is composed of several servers and can be of any class. They favor 2U rackmount 64-bit servers running Red Hat Enterprise Linux 4 with 16GB RAM and 6 x 15K RPM drives in RAID 10. They can balance the users between the servers to adjust for each server's processing power. Just by adding more machines, they can handle the several terabytes of data they store. Since the servers have 16GB of RAM, about 13GB is used for InnoDB's buffer pool. They said they have about 120GB to 500GB of data per shard.

Flickr uses two search back-ends: the shards themselves and a proprietary Yahoo! search system. Single-tag searches are handled by the shards due to the real-time requirements. All other searches are handled by Yahoo!'s search back-end.

There is a lot of work involved in building a federated database, but the ability to handle an extremely large amount of load is well worth it. I found their architecture very interesting and thank them for being so open about their system.


Here at the MySQL Conference and Expo, Laura Thomson gave a great talk about Scalability and Performance Best Practices.

She had some interesting points about scalability. She basically said that no matter what language you write your web application in, whether it's compiled (C/C++) or interpreted (PHP, Java), you are subject to scalability issues. Another potential problem is optimizing before you know exactly what to optimize, which can waste time that could have been spent on more important things.

Laura’s talk covers three types of best practices: general, scalability, and performance.

General Best Practices

The first tip is to profile early and profile often. The earlier you can detect poor performance, the easier it is to fix. There are a handful of tools (APD, Xdebug, Zend) that can help with profiling PHP. You can also use system profiling tools such as strace, dtrace, and ltrace to gather more information.

There are two types of effective profiling: debugging and habitual. Debugging profiling is about spotting deviations from the norm, while habitual profiling is about making the norm better. Profiling is an art and requires lots of practice to know where to look.

It is essential that the IT admins and the developers cooperate. This allows crises to be handled properly, especially in production environments where outages are time critical. Team members should alert the developers of any abnormal behavior after a new code release is pushed. Before pushing new code, schedule a launch window and define procedures for having developers fix problems and, if necessary, fall back to a previous version. Avoid pushing releases on Fridays; otherwise key team members may be unavailable or overworked on the weekend.

It is recommended to test your application with production data, since test data may not cover the scenarios where bugs are introduced. It is also advised to have a staging environment that uses production data and undergoes simulated load testing.

In order to track your application's performance, you should record it over a period of time and then analyze the data to find potential issues. There are several means of tracking performance, including access logs, system metrics, application profiling, and query profiling.

When a problem occurs, don’t make assumptions. The problem may be caused by something other than what you think it might be.

Scalability Best Practices

When the web application begins to suffer performance issues, start to decouple and isolate components to track down the source. If you need to tweak code, spend only as much time refactoring as needed. Reduce load on your web servers by moving static content onto dedicated servers.

By default, PHP stores session data on the hard drive. This can cause performance issues; you can benefit from storing the session data in a database or, better yet, in a distributed cache such as memcached.

The most important thing you can do to improve performance is to cache as much data as possible. There are many levels of caching. You can cache data sets or precomputed fragments. For things like images, you can set up dedicated services for caching and serving static content. The usual suspects are recommended for caching (APC, memcached, Squid).

PHP out of the box does not cache compiled pages, which means that on every request each page has to be parsed, compiled, and executed. Extensions such as APC and Zend can cache the compiled pages for an immediate speed improvement.

MySQL's query cache works, but isn't necessarily implemented the best way. If you query a table, the results are cached on the MySQL server, but as soon as a row in that table is inserted, updated, or deleted, the cached results for that table are invalidated. This supposedly has been improved in MySQL 5.1, but a setting has to be enabled to keep the cache from being flushed so aggressively.
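
You can watch this behavior on your own server with a few standard statements (the SELECT below is just a made-up example query):

-- Is the query cache on, and how big is it?
SHOW VARIABLES LIKE 'query_cache%';

-- Hit, insert, and prune counters
SHOW STATUS LIKE 'Qcache%';

-- Skip the cache for a query whose results change too often to be worth caching
SELECT SQL_NO_CACHE * FROM photos WHERE user_id = 42;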

To scale, your data can be federated across multiple MySQL servers. There can be complications with regard to data reliability, and table joins can suffer major speed hits.

A more reliable way to scale is to use replication, although replication does suffer from "slave lag". The lag can grow because the master applies changes using multiple threads, while the slave has to process the replicated items in a single thread to preserve the order of the transactions. You can display the status of MySQL's replication I/O and SQL threads by executing a SHOW PROCESSLIST statement on the slave.
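
Another handy check on the slave is SHOW SLAVE STATUS, which reports both replication threads plus a rough estimate of the lag:

-- Run on the slave; watch Slave_IO_Running, Slave_SQL_Running,
-- and Seconds_Behind_Master
SHOW SLAVE STATUS\G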

The more database writes, the greater the lag. Depending on your application, you may only want to use replication for failover or backups.

Sometimes you may benefit from designing your application to avoid situations where data is hard to scale and can’t easily be cached.

Performance Best Practices

You definitely want to use a compiler cache; as described above, use APC or Zend for PHP 5. If you are connecting to an external data source, perhaps a web service or data feed, minimize the number of times you request it and cache the response if possible. You may be able to load the data dynamically using Javascript and a little Ajax magic. Maybe the data isn't a must-have, or maybe you can have a page dedicated to displaying the third-party data.

When tuning your application's performance, change one thing at a time. If you change more than one thing, you won't know which change caused the improvement, or whether one of the other changes introduced new bugs. Use MySQL's EXPLAIN statement to profile your queries and enable the slow query log. Use MyTop or innotop to help profile your queries.
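
As a quick example (the table and column names are made up), prefixing a suspect query with EXPLAIN shows which indexes MySQL considers and which one it actually picks; the slow query log is turned on with the log-slow-queries and long_query_time options in my.cnf:

-- Look at the type, possible_keys, key, and rows columns in the output
EXPLAIN SELECT * FROM photos WHERE user_id = 42 ORDER BY created_at DESC;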

It is crucial that your database is properly indexed; poorly designed indexes, or too many of them, will hurt performance. Use the smallest data type possible and try to design your tables to be fixed width. That means using CHAR instead of VARCHAR and setting the length of your fields to logical lengths (i.e., use 128 characters instead of 107). De-normalize when necessary. Move static data out of the database or store it in a MEMORY table. Use the appropriate storage engine for each table.
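
As a small illustration of the fixed-width point (a made-up table, not one from the talk), every column below has a fixed size, so MyISAM can use its fixed-length row format:

CREATE TABLE user_profile (
  user_id  INT UNSIGNED NOT NULL,
  username CHAR(32) NOT NULL,
  country  CHAR(2) NOT NULL,
  created  DATETIME NOT NULL,
  PRIMARY KEY (user_id),
  KEY idx_username (username)
) ENGINE=MyISAM;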

Minimize the number of queries you issue and cache their results outside the database when possible.

She claims that deeply recursive code is expensive in PHP. Make sure you are not doing unnecessary looping. If you find that you are, chances are you are doing something wrong and that there is a better idiom for performing the task.

Don't try to work around or re-write perceived inefficiencies in PHP. Use regular expressions for intense string manipulation, and instead of writing complex serialization code, use PHP's extensions to do the heavy lifting. Before spending time writing some boilerplate function, check whether an existing extension could save you the effort.

Laura gave this talk on behalf of George Schlossnagle. George’s original presentation can be found at http://omniti.com/~george/talks. This was an excellent session and proved to be beneficial.


Technology at Digg.com

Apr 24, 2007

Digg.com has over 1 million users on their LAMP-based web application. Their architecture includes a load balancer that distributes web traffic across several web servers. Behind those sits a single master MySQL server that replicates to several slave MySQL servers. Each web server runs PHP and connects to a random slave MySQL server.

An important part of their infrastructure is memcached, which they use to cache content that is faster to fetch from the cache than to query from the database. When a PHP page is executed, it checks memcached first, then renders the page. If memcached doesn't have the requested content, PHP queries the MySQL servers and dumps the result into memcached for future lookups.

They have several servers running memcached to distribute the cache storage.

They divided their MySQL slaves into shards. One shard of slaves handles searches. They have two other shards, but the presenter moved a little too quickly for me to catch the details. There are a few types of shards:

  • Table-based – put tables on dedicated servers
  • Range-based – put a range of users or topics on dedicated servers
  • Date-based – partition the data by date across dedicated servers
  • Hashed – users or topics are stored on a particular server and referenced via a lookup
  • Partial sharding – not quite sure, but it sounds cool

Digg is not using any built-in MySQL partitioning, federated tables, or clustering. MySQL 5.1 has some mechanisms to handle this kind of partitioning, but they didn't exist at the time, so Digg had to roll their own.

Their MySQL databases run on Debian based Linux. Debian’s apt made it extremely easy to upgrade packages. They are running MySQL 5, but they didn’t notice a huge performance increase compared to MySQL 4.1. They use around 20 MySQL servers and memcached runs on about 9 of those MySQL slave servers.

They have reached a level where they can no longer solve database problems by adding more RAM, so Digg has begun re-writing queries to optimize I/O. Their database is around 30GB and constantly growing. They have used Cacti for monitoring their MySQL servers, but it can be painful to use in an ever-changing infrastructure.

For full text searching, they use Apache Lucene on a couple servers.

They store user images on XFS partitions. XFS is supposedly better at handling a large number of files and is very robust. In their testing, ext3 was slower and not as reliable.

Their presentation is available online at http://eliw.com.


PHP Performance and Security

Apr 24, 2007

Today, Rasmus Lerdorf, the man who kick started PHP, gave a great talk about PHP Performance and Security.

He began by talking about the new MySQL native driver. He showed some benchmarks in which the driver appeared to offer little performance improvement.

The first tool to use to track down performance issues is Callgrind, which runs on top of Valgrind. Callgrind dumps a profile file that can be opened with KCachegrind, which is available on the Callgrind site.

Performance can be improved by installing APC (Alternative PHP Cache). He also recommends installing the Xdebug extension to profile your PHP application and find performance issues. Caching is key.

For security, he described a scenario where you browse to a website that lists a bunch of links to various sites. With Javascript, the page can detect whether a link has been visited. If it has, the Javascript can check those sites to see if your session is still valid and then cause problems such as transferring money or opening your firewall. You can reduce the surface area of these kinds of attacks by passing all inputs through filters. Every form field, URL parameter, and cookie must go through a filter to escape potential problems.

Next he covered attacks that pass extra data in the URL. Even escaping URL parameters with htmlspecialchars() doesn't protect you from characters that are already escaped and then get evaluated in the browser to do harmful things.

He talked about a spoofing trick where older versions of the Flash plugin in Internet Explorer can add attributes, such as the domain, to the request headers. When providing links to download PDF files, the URL can include Javascript code that is executed when the PDF plugin loads. This can be prevented by setting the MIME type for .pdf to application/octet-stream, which forces PDFs to be downloaded instead of rendered in the browser.

Cross-site request forgery is another huge problem. Adding a hidden input field with some sort of session token, or "crumb" as he calls it, in combination with your session cookie, can be used to verify that the request is valid.

Rasmus has made his presentation available online at http://talks.php.net/show/mysql07.


MySQL Replication

Apr 23, 2007

This afternoon I attended the MySQL Replication: The Complete Tutorial session at the MySQL Conference. This was a two-part tutorial; I missed the first part because I was at the MySQL Scaling and High Availability Architectures tutorial in the morning.

The second part jumped right into row-based replication. Row-based replication writes the actual changed rows to the binary log, whereas regular statement-based replication records the SQL statement itself.

MySQL’s binary log is designed to handle both statement and row-based replication. MySQL’s replication can also replicate tables, stored procedures, stored functions, triggers, and events. MySQL will automatically add a “definer” to the statement so that privileges are applied correctly on the slave server.

There was some discussion about the technical side of replication and cluster replication. They touched on backup tools, and I learned that there are no free tools for performing a hot backup of an InnoDB table. You can get around this by replicating the table to a slave server. The slave's copy could then be taken offline for a full backup, or the table could be MyISAM, in which case the database could be backed up while online. They also recommended using a tool that calculates table checksums to make sure your replicated databases are all in sync.
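
They didn't name a specific tool (or I missed it), but MySQL itself ships with a CHECKSUM TABLE statement that you can run on the master and on each slave and compare by hand. The table names below are just examples, and the comparison is only meaningful if the slave is fully caught up:

-- Run on the master and on each slave, then compare the values
CHECKSUM TABLE photos, comments;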

Other interesting links include Heartbeat for Linux, DRBD (distributed replicated block device), and stunnel (secure tunnel).

The future of replication includes better conflict resolution for multi-master scenarios where the same row is being modified by both servers before they have had a chance to sync. The plan is to fire a special trigger when a conflict occurs so you can define what happens.

MySQL can replicate blob data, but it can be extremely slow if the blob is of a significant size. As I understand it, the blob data is sent over the wire regardless of whether it was modified or not. It is recommended to create a second table to hold the blob, with a one-to-one relation between the two tables.
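
A sketch of that layout might look like the following (my own example tables, not something shown in the tutorial): the frequently updated metadata lives in one table, and the blob sits in a second table keyed by the same ID so it is only written when the blob itself actually changes.

-- Metadata that changes often
CREATE TABLE documents (
  doc_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title      VARCHAR(255) NOT NULL,
  updated_at DATETIME NOT NULL,
  PRIMARY KEY (doc_id)
) ENGINE=InnoDB;

-- The large, rarely changing payload in a one-to-one table
CREATE TABLE document_blobs (
  doc_id  INT UNSIGNED NOT NULL,
  payload LONGBLOB NOT NULL,
  PRIMARY KEY (doc_id)
) ENGINE=InnoDB;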

All in all, this was an informative tutorial, but I’m left wondering if I would have been better off attending the Wikipedia internals discussion.


This morning I attended the Scaling and High Availability Architectures tutorial at the MySQL Conference. MySQL databases can be designed to be both scalable and highly available, but how you achieve that really depends on your application.

The first technique described for making your database scale is to divide it into partitions. How you should partition your data depends on how much data you have. A good approach is to divide your database per user across several MySQL servers. When you need a particular user's data, you query a directory that maps the user's ID to the server that stores their information. This is simple and requires no clustering or replication, but it doesn't offer high availability.
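
The directory itself can be as simple as a small lookup table that the application consults before opening a connection to the right partition (the names below are hypothetical):

-- Maps each user to the MySQL server that holds their data
CREATE TABLE user_directory (
  user_id     INT UNSIGNED NOT NULL,
  server_host VARCHAR(64) NOT NULL,
  PRIMARY KEY (user_id)
) ENGINE=InnoDB;

-- Find out where a given user's data lives
SELECT server_host FROM user_directory WHERE user_id = 42;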

Another issue is making sure each MySQL server is balanced. There is a framework called HiveDB that can help with partitioning the database.

To make your database highly available, you need to cluster or replicate. For high transaction web applications, replication is recommended over clustering. There are several strategies for replicating your database. The easiest way is to set up two servers: one master and one slave. This works great, but you need to make sure your application framework will connect to the slave when the master goes down. Then there are issues with restoring the missing data back to the master.
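
For reference, once binary logging and a replication user are set up on the master, pointing the slave at it is just a couple of statements run on the slave (the host, credentials, and log coordinates below are placeholders):

CHANGE MASTER TO
  MASTER_HOST     = 'master.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 98;
START SLAVE;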

You can set up multiple slave servers, which is great for spreading out reads, but it does nothing for write throughput. If you have multiple slaves, you can reduce load on the master by replicating to a single intermediate slave that in turn replicates to the other slaves, but that intermediate slave becomes another single point of failure, which is not ideal.

The best solution is a master-master replication setup, where each master replicates to the other. There is potential for synchronization issues, so they recommend assigning a virtual IP to the pair and letting only one of the servers hold the IP at a given time. A heartbeat between the servers lets each one monitor the other's status; when one server stops responding to the heartbeat, the other assumes the IP. You can then set up slaves off those masters, which adds another level of latency, but it shouldn't be that bad.

MySQL does allow you to set up a replication ring where transactions are passed around in a circle from one master to the next and so on, but they highly recommend NOT doing this.

So, a master can replicate to one or more slaves. Slaves can themselves be masters and replicate to other slaves. A slave cannot have two masters, which would be nice, but I can appreciate the technical issues.

Master-master replication is pretty simple to configure and offers great availability. In the end, if you have a huge database, it's best to partition it by user into a bunch of master-master server pairs. Each master can have one or more slaves for extra read performance, though that can also create latency issues. A slave would be very handy for doing hot backups.

Anyone have any other recommendations for making MySQL scalable and highly available?


On Thursday, April 19th, the latest rev of the Linux distro Ubuntu, 7.04 (Feisty Fawn), was released. I wasted no time downloading it and installing it in one of my trusty virtual machines.

I have a MacBook Pro and I use Parallels Desktop for Mac as my VM software. Installing previous Ubuntu releases was a piece of cake, but Ubuntu 7.04 was being a little "feisty" to install.

After trying the DVD, Desktop, and Alternative versions, I still had no luck. It was hard to tell what the problem was on the DVD and Desktop editions since they were live CDs and the screen just went blank. The Alternative version gave me a clue that the installer couldn’t load the CD/DVD drivers.

After doing some digging, I discovered there is a problem between Parallels' CD/DVD driver and the latest Linux kernel that ships with 7.04.

To get around the problem, I changed the OS Type to “Solaris” and OS Version to “Other Solaris” and then tried again to install Ubuntu.

Ubuntu’s installer found the drive without a problem and successfully installed the operating system. After the installation was complete, I shutdown the VM and changed the OS Type back to “Linux” and OS Version to “Other Linux kernel 2.6”. I have no idea if the CD/DVD drive works after changing these back, nor do I know if there are any issues with just leaving the OS Type as Solaris.

My hope is that either a new kernel or a Parallels update will fix this problem. Honestly though, I have never used the CD/DVD drive in a Linux VM since everything I install is downloaded.

Has anybody else had this issue or come up with a better workaround?


MinneBar is Tomorrow!

Apr 20, 2007

Tomorrow, Saturday April 21st, is MinneBar, Minnesota’s BarCamp. MinneBar is described as an “(un)Conference” which means it’s a free, ad-hoc gathering of technology folks where everyone is encouraged to contribute.

I will be presenting on the Dojo Javascript Toolkit in the last time slot. Since I only have 50 minutes, I'm sticking to the essentials of why I use Dojo and why you should too. I'll be posting my presentation online under the projects section when it's ready.

As of this moment, there are 350 people signed up. This year there are 7 tracks across 6 time slots, plus they left time for lunch and some demos. There are some great sessions this year, so if you are in the Minneapolis/St. Paul area, you should come!

To register, visit their website, click the “login” link in the top right, use the password displayed above the login, then edit the main page, and add yourself to the bottom. You can see what time things are happening and more importantly where.

Hope to see you there!