STORAGE

I’ve worked on several storage systems during my life. Here is a summary, and a collection of a few reference documents.

1995-2000 CODA

Coda, developed at Carnegie Mellon by the group of Mahadev Satyanarayanan (Satya), was the first file system to deeply explore the concept of disconnected operation. The official Coda web page contains many resources, and my publications from these years are on the Coda research web pages.

I led the effort from 1996 to 2001 and ported Coda to Linux. I wrote an overview article in the Linux Journal in 1998, with beautiful illustrations created by Gaich Muramatsu.

Michael Callahan initiated a near-Herculean effort to port Coda to DOS (remember that?), and we wrote a paper about Coda on Windows.

Coda won an award at Linux World in 1999. Coda unquestionably influenced systems like IntelliMirror and Dropbox, but Coda was difficult to productize. Towards 1998, I became interested in simpler, lightweight solutions.

1998-2001 INTERMEZZO

InterMezzo was a new synchronizing file system. It shared the key feature of disconnected operation with Coda. Its focus was simplicity and efficiency, and indeed its core functionality was implemented in a few thousand lines of kernel and user-level code. InterMezzo had a kernel driver called Presto, which was a Linux kernel file system (until I requested that it be removed because I was unable to take it further, as Lustre had taken over my life). Presto layered itself over a disk file system, which it leveraged to maintain the cache, thereby creating an in-kernel path with little overhead for most operations. It worked with a daemon in user space named Vivace (FUSE didn’t exist yet), which was responsible for handling cache misses and change propagation to other systems. InterMezzo was used by Mountain View Data as the basis of a product, and by Tacit Systems.
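To make the division of labor described above concrete, here is a minimal user-space sketch in Python. It is not InterMezzo code: the class `ToyCacheLayer`, the `fetch` callable (standing in for an upcall to a daemon such as Vivace), and the `change_log` list are all hypothetical names chosen for illustration. It only shows the general shape: hits are served straight from a local directory, misses go through the upcall, and local changes are recorded for later propagation.

```python
import os
import tempfile
from typing import Callable, List, Tuple


class ToyCacheLayer:
    """Hypothetical stand-in for a filtering layer over a local disk cache."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.fetch = fetch  # assumption: stands in for the upcall to the daemon
        self.change_log: List[Tuple[str, str]] = []  # records awaiting propagation
        self.upcalls = 0

    def _path(self, name: str) -> str:
        return os.path.join(self.cache_dir, name)

    def read(self, name: str) -> bytes:
        path = self._path(name)
        if not os.path.exists(path):
            # Cache miss: ask the daemon for the data and store it locally.
            self.upcalls += 1
            with open(path, "wb") as f:
                f.write(self.fetch(name))
        # Cache hit (or freshly filled): served by the local file system.
        with open(path, "rb") as f:
            return f.read()

    def write(self, name: str, data: bytes) -> None:
        # Update the local copy and log the change for later propagation.
        with open(self._path(name), "wb") as f:
            f.write(data)
        self.change_log.append(("write", name))

    def drain(self) -> List[Tuple[str, str]]:
        """Hand pending records to the propagation side and clear the log."""
        pending, self.change_log = self.change_log, []
        return pending


# Usage: a dict plays the role of the remote server.
remote = {"notes.txt": b"hello"}
with tempfile.TemporaryDirectory() as d:
    layer = ToyCacheLayer(d, fetch=lambda name: remote[name])
    first = layer.read("notes.txt")    # miss, goes through the upcall
    second = layer.read("notes.txt")   # hit, no upcall
    layer.write("notes.txt", b"edited offline")
    third = layer.read("notes.txt")
    pending = layer.drain()
    upcalls = layer.upcalls
```

The point of the design is that the common path never leaves the local file system; only misses and propagation involve the daemon.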

I wrote several papers about InterMezzo.

Around this time I compared the protocols used by many network file systems. An overview of some of my thoughts at that time is contained in this presentation given at the Usenix Technical Conference.

1999-2013 LUSTRE

Lustre, a parallel file system for HPC, is the most successful project I started. Now, 20 years after I started it, it’s still very widely used, and acquisitions of the Lustre team continue to happen. The later history of Lustre is well documented on the Lustre Wikipedia page.

I wrote a lengthy document about Lustre’s architecture, dubbed the Lustre book, which described the large set of features mostly requested by the DOE, usually leveraging the lines of thought in my earlier work. Even relatively recent features such as client metadata write-back caching were described in this book. In some cases, for example that of quality of service, the design documented in the book (which I developed with Sandia National Laboratories) was not implemented. However, amusingly, my staff said that the best book written about Lustre was the NFS v4.1 spec, which does indeed overlap with Lustre (without ever mentioning it).

Several white papers contain key elements of the architecture. These white papers were widely distributed in the 2000s; we include several here:

Perhaps just a few ideas in Lustre truly stand out as original:

  • It used a very compact RPC format, called “intents”, that minimized round trips. A similar mechanism was later incorporated into NFSv4.

  • A very simple “commit callback” mechanism was used to inform cache management that persistent state had been reached. To this day I remain surprised that commit callbacks were not widely used in storage applications.

  • Lustre has sophisticated management of request sequences to reach almost exactly-once semantics for remote procedure calls. Again, similar mechanisms were later incorporated into NFSv4.

  • An “epoch recovery” model was developed and patented in the mid-2000s; the patent summarizes it. It underlies DAOS recovery.

A few other patents were awarded to me, mostly about detailed optimization and consistency mechanisms.
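Two of the ideas listed above, commit callbacks and request sequence management, can be illustrated with a toy, in-process Python sketch. This is not Lustre’s implementation or wire protocol: `ToyServer`, `ToyClient`, `transno`, `xid`, and `reply_cache` are hypothetical names, and calling the client directly from the server’s commit step is a simplification made for brevity. The sketch assumes that the server numbers each update, remembers replies so that a resent request is answered without being executed twice, and announces when updates have become persistent so that clients can drop the state they kept for replay.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyServer:
    """Hypothetical in-memory server; 'stable' plays the role of the disk."""
    next_transno: int = 1
    committed_transno: int = 0
    volatile: Dict[str, str] = field(default_factory=dict)
    stable: Dict[str, str] = field(default_factory=dict)
    # (client_id, xid) -> reply, so a resent request is not executed again
    reply_cache: Dict[Tuple[str, int], Tuple[int, str]] = field(default_factory=dict)
    commit_callbacks: List[Callable[[int], None]] = field(default_factory=list)

    def handle(self, client_id: str, xid: int, key: str, value: str) -> Tuple[int, str]:
        cached = self.reply_cache.get((client_id, xid))
        if cached is not None:
            return cached  # duplicate request: return the original reply
        transno = self.next_transno
        self.next_transno += 1
        self.volatile[key] = value  # applied, but not yet persistent
        reply = (transno, f"ok:{key}")
        self.reply_cache[(client_id, xid)] = reply
        return reply

    def commit(self) -> None:
        # Make all applied updates persistent, then notify interested parties.
        self.stable = dict(self.volatile)
        self.committed_transno = self.next_transno - 1
        for callback in self.commit_callbacks:
            callback(self.committed_transno)


@dataclass
class ToyClient:
    client_id: str
    server: ToyServer
    next_xid: int = 1
    # transno -> (xid, key, value), retained for replay until committed
    replay_queue: Dict[int, Tuple[int, str, str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.server.commit_callbacks.append(self.on_commit)

    def write(self, key: str, value: str) -> int:
        xid = self.next_xid
        self.next_xid += 1
        transno, _ = self.server.handle(self.client_id, xid, key, value)
        self.replay_queue[transno] = (xid, key, value)
        return xid

    def on_commit(self, committed_transno: int) -> None:
        # Commit callback: requests up to this number no longer need replay.
        for transno in [t for t in self.replay_queue if t <= committed_transno]:
            del self.replay_queue[transno]


server = ToyServer()
client = ToyClient("c1", server)
first_xid = client.write("a", "1")
client.write("b", "2")
resent_reply = server.handle("c1", first_xid, "a", "1")  # simulated resend
queued_before_commit = len(client.replay_queue)
server.commit()
queued_after_commit = len(client.replay_queue)
```

In this sketch the resend does not consume a new transaction number, and the client’s replay queue empties only once the server reports the updates as committed.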

2009-2013 COLIBRI, LATER MERO

Colibri was a design created in 2009 for an “exa-scale”, container-based storage system. It was developed in my startup ClusterStor, which was acquired by Xyratex (which in turn became part of Seagate, after which Seagate sold other assets of ClusterStor to Cray). Colibri was renamed Mero and briefly became a product for Seagate.

Colibri could not pursue an open source model, but a few presentations describe its key thoughts, and we attach one here, delivered in 2010 at the TeraTec Conference: Exascale File Systems - Scalability in ClusterStor’s Colibri System. Many years later a paper about its architecture was published.

2010-2013 EXASCALE IO EFFORT

The ExaScale IO effort was a discussion group I led from 2011 to 2013 in which many experts participated. When I left Xyratex, others took over the effort. It influenced other projects, and a European research project, SAGE, resulted from several more years of exploration. Meghan McLelland summarized the efforts in a nice presentation at the 2013 LUG.

White paper and 4 progress reports on EIO.

2012 SECURE LUSTRE

The Lustre book described an architecture for security in Lustre. In 2013 Lockheed Martin awarded a contract to Xyratex to implement this design, and a secure Lustre file system became a product. A beautiful fact sheet about the secure product is available. My suspicion is that, under the influence of cloud infrastructure, the security considerations have significantly changed since these approaches were explored.

2016-2018 CAMPAIGN STORAGE

Campaign Storage is an archival storage system designed and implemented at Los Alamos National Laboratory (LANL). The effort took place in Gary Grider’s group. Nathan Thompson and I explored this commercially, but thought its target market was too small.

Gary Grider and I discussed this from an open source perspective. During this process, I organized a Campaign Storage discussion group which attracted reasonable attendance. LANL and I prepared a response to an official RFI regarding these matters. Attached are a few papers and presentations regarding this.