

Very quick notes thrown together for FOSSCamp talk 16 May 2008.

Note that as of mid-2010 I am reviewing a GPLed, portable solution based on inotify that will add to our toolkit. However, it has not yet been approved for general release, pending in-house testing.


Problem Definition

One of the biggest server problems to solve is making it easy to reliably replicate structured data over WANs in near real-time.

Breaking the statement down, I mean:

  • reliably replicate – making a copy which is usable as-is, not just a backup of data that requires a restore. Once this exists we get failover and load-sharing solutions using existing Unix software stacks.
  • WANs - the sorts of links found between high-speed datacentres or even on most LANs with a few switches involved; basically anything other than fast interconnects as used by cluster filesystems.
  • structured data - SQL, files, email, non-SQL databases or anything else higher than block level
  • near real-time – less than 30 seconds. This is administrator's real-time of course, not engineer's!
  • data - lots. Terabytes at least.

These are problems which have been worked on for 30 years, and there are commercial solutions that address portions of this problem space very well. It is downright embarrassing that Linux has only very spotty coverage of this problem space, although it is a natural area for Linux to address.

Problem Discussion

Ordinary administrators would like to do that extra little something to take their existing solutions and add replicants in the way that DNS does. Unfortunately there isn't any little something; to some extent this is application-specific, but not entirely. Clustering is more specific and requires relatively more skill, and doesn't do anything about the common case of reducing single points of failure by physical dispersal. When the IT director is required to deliver a robust solution without necessarily seamless failover, Microsoft often looks attractive for ease of deployment and even problem space coverage.

The file case

For replicating files, while replication is a very common dream, the solutions usually offered tend to suffer from some or all of the following: they are complex, only work on tightly-coupled clusters, do not quite work at all, or require special hardware, application development or client software. Examples are GFS, Lustre, MPI approaches, ctdb and so on. Linux LVM has some serious shortcomings with its snapshots and Zumastor is fixing this. Over on OpenSolaris (not reasonably compatible with Linux for licensing reasons) ZFS can do some replication. OpenAFS offers yet another solution, but it has some very specific requirements that don't fit what many networks are willing to deploy, and it is very slow.

As of Windows Server 2003 and even more so with Server 2008 Microsoft have a fairly solid whole-of-filesystem replication solution. It doesn't address everything in the problem statement but is much better than anything available on Linux. And, as of very recently, the full specs for this are available without copyright concerns and with a certain amount of patent protection. There is some progress towards thinking about implementing this within the Samba team... let me know if this matters to you!

Reducing the Scope

Right now (as in, 6-12 month timeframe) Linux can't offer any kind of comprehensive solution to file, database and mail clustering and replication and certainly not unified solutions. But Linux can distinguish itself by:

  • acknowledging that this is an unsolved set of problems.
  • offering solutions for specific cases that we know we can cover well using today's technologies.
  • documenting the characteristics of the major solutions to help guide people in their choices.

We can make a lot of users happy by highlighting these practical solutions:

  • SQL, as of very recently (ship Slony for PostgreSQL as a profile with instructions, and the equivalent for MySQL; and no, MySQL hasn't done replication reliably for very long at all) [2010 NOTE: Postgres 9 has Write-Ahead Log shipping capabilities, mostly addressing this problem.]
  • LDAP (need OpenLDAP version 2.4 with multimaster and syncrepl; need to ship the OpenLDAP sample configs from the OpenLDAP test suite which show how all this works; need to ship a preconfigured profile for syncrepl.)
  • Maildirs (a very narrow case of the filesystem problem that we can solve, including with indexes from Dovecot and some other maildir-using servers)
  • Filesystems (low-volume/low-latency cases only; the generic case is very difficult and unsolved on Linux)

Promoting replication also implies that we live in a world where uptime and failover matter, and where the notion of backup and restore is frequently unworkable. Again, this is not a story that Linux tells well by default.

The Maildir situation is very common, and sufficiently constrained that we can do a good job of replication. Even a moderately sized company will have a mailstore of millions of files, but because there are no copy operations in Maildir, replication across a WAN with inotify triggering is feasible. You then need to cater for various metadata databases that tend to come with Maildir-based servers, but that is solvable in most cases.
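
As a rough illustration of why the Maildir case is tractable: messages arrive as new files and are renamed rather than rewritten, so a replica can be brought up to date by copying files that have appeared and removing files that have vanished. The Python sketch below polls the directory instead of using inotify, purely to stay within the standard library. The function names snapshot and delta are made up for this example, and a real tool would feed the same logic from inotify events rather than rescanning.

```python
from pathlib import Path


def snapshot(maildir):
    """Return the set of message paths (relative to maildir) in new/ and cur/."""
    root = Path(maildir)
    found = set()
    for sub in ("new", "cur"):
        directory = root / sub
        if not directory.is_dir():
            continue
        for entry in directory.iterdir():
            if entry.is_file():
                found.add(f"{sub}/{entry.name}")
    return found


def delta(before, after):
    """Compare two snapshots.

    Returns (to_copy, to_delete): files the replica is missing and files
    it should remove. Message files are never edited in place, so there
    is no third "changed" category to handle in this sketch.
    """
    return sorted(after - before), sorted(before - after)
```

In this sketch a rename (for example a message moving from new/ to cur/ when it is read) shows up as one delete plus one copy. That is wasteful but safe; detecting renames is an optimisation this example does not attempt, and the server's index files are not covered at all.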

A Note on Consistency at Filesystem Level

Despite what it looks like at first, very few apps are actually consistent on unexpected powerfail. PostgreSQL, Berkeley DB, Maildir and bzr are, and therefore so are most things that use them. Specifically, they are designed not to suffer from performance degradation or corruption on restart after the power cord is yanked. There might be some partial data around to clean up, and a small number of in-flight transactions may have been lost depending on the protocol in use, but the databases are intact.
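
One common way applications get this property is to write new data to a temporary file, flush it to disk, and then rename it into place, which is essentially the Maildir delivery trick. The following is a minimal Python sketch of that pattern, assuming a POSIX filesystem where rename within one directory is atomic. The function name atomic_write is invented for the example, and it is not a description of how any of the applications named above are actually implemented.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data (bytes).

    A reader, or a restart after power loss, sees either the complete
    old file or the complete new file, never a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (and therefore
    # the same filesystem) so that the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # data is on disk before the rename
        os.replace(tmp_path, path)     # atomic switch-over
    except BaseException:
        # Leave no stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a power failure
    # (POSIX-specific; opening a directory like this fails on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp creates the file with owner-only permissions, so a real implementation would probably also copy the mode of the file being replaced.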

A Note on Block Level Solutions

nbd and similar solutions are no use. They just mean the inconsistent state is replicated everywhere at high speed. This can seem counter-intuitive, but consider: when a network block device is functioning correctly, half-way through a file store operation the same half file is on both ends of the link. There is no promise of atomicity. You can layer any of the solutions in the foregoing section on top of the block devices, but this hasn't answered any questions these solutions don't otherwise address. In most cases it will break replicability designed in at a higher level.

Next Steps

Some of the short-term actions discussed at FOSSCamp:

  • Ship example config files for replication setups in all apps where this is an option. Start with Postgres and OpenLDAP. Amazing but true, this isn't done and it would help many :-)
  • Create profiles for choosing replication at install time for packages using the foregoing config files.
  • Use rsync 3 everywhere; lower RAM requirements mean it can traverse large trees, and speed and asynchronicity improvements make it run faster. Doesn't necessarily solve anything (think multi-terabyte filesystems) but it can reduce problems.
  • Dan and Claudio's work on rsync to integrate inotify in a new daemon thread as per design notes.
  • Use --rsyncable (thanks Rusty) on many gzip operations by default.
  • Special case of foregoing: use it every time a .deb is created. This has been discussed forever, can we please just do it now? We need all distros to do so as well: it will very significantly reduce the mirror burden, and distro X gets a hard time on the mirror if distro Y is not --rsyncable! (Taken up with mvo; this seems to have been an oversight and will be fixed.)
  • Get --rsyncable pushed to gzip upstream (check that this hasn't already happened!!)
  • Move from direct rsync (including with --rsyncable) on Debian mirrors to publishing rsync batch-mode results (--write-batch on the master, --read-batch on the mirrors). This way the servers don't have to repeat more or less the same checksum calculations over and over again.
  • Update the inotify rsync design so that it also links to rsync batch mode, i.e. generating a patch list for all files where inotify has triggered.
  • More on batch mode: modify all rsnapshot-type systems to use rsync batch mode so the result is then suitable for multiway WAN replication in a way where multiple replications don't incur the checksum and traversal time cost more than once, which matters for big filesystems.
  • Seriously look at integrating the Zumastor snapshot stuff for the device mapper; LVM snapshots are highly suboptimal right now and of dubious use for replication as a result. Dan has a list of fail cases for LVM.
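
The batch-mode items in this list can be sketched roughly as follows. rsync has no option literally spelled --batch-mode; its batch mode is driven by --write-batch, --only-write-batch and --read-batch, and that is what this Python sketch assumes. The function names, the paths and the batch file name are placeholders invented for the example, and the commands are only built as argument lists here, not executed.

```python
def write_batch_cmd(src, dest, batch_file, apply_locally=True):
    """Build the rsync command that records the src -> dest changes.

    With apply_locally=True the changes are applied to dest and also
    recorded (--write-batch). With apply_locally=False they are only
    recorded and dest is left untouched (--only-write-batch).
    """
    flag = "--write-batch" if apply_locally else "--only-write-batch"
    return ["rsync", "-a", f"{flag}={batch_file}", src, dest]


def read_batch_cmd(batch_file, dest):
    """Build the rsync command a replica runs to apply a recorded batch.

    No checksums or tree traversal against the master are needed at
    this point; the replica just replays the recorded changes.
    """
    return ["rsync", "-a", f"--read-batch={batch_file}", dest]
```

On the master this might be run once as subprocess.run(write_batch_cmd("/srv/mirror/", "/srv/staging/", "mirror.batch"), check=True), after which the resulting batch file is published and every replica runs the read_batch_cmd equivalent. The catch is that a batch only applies cleanly to a tree that matches the destination it was generated against, so replicas that have drifted would still need an ordinary rsync pass first.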

Longer-term items:

  • Dan's design from Claudio's suggestion for modifying inotify to return a byte range by changing both the VFS and the inotify interface. Daniel Phillips is also interested in fixing the kernel inotify interface.
  • Way-out, not-short-term idea: modify the filesystem journal so that replaying the journal interprets the journal events as inotify events in a kernel ring buffer. When a machine restarts, as soon as the journal is replayed this buffer is populated. Apps can choose whether or not to look at this data when they restart. So over time we'd expect to see that wherever apps issue inotify_add_watch() they can first call a new inotify_replay() (a proposed call, which does not exist today) in whatever priming manner makes sense for them. Definitely in the mad Dan category.
  • Stream the journal from ext4 over the network in realtime: this is not a short-term solution but it is potentially the most efficient solution. From Robert Collins.
  • Get Samba to implement the CIFS replication protocol as now documented and also researched by Metze. No effort is currently being put into this (check!). Not a short-term solution.