Introduction to Freeze Frame File System over RDMA


    This post introduces the Freeze Frame File System (FFFS) over RDMA v1.0, released by Cornell University.

     


     

    Overview

    Many applications perform real-time analysis on data streams. We argue that existing solutions are poorly matched to this need, and introduce our new Freeze-Frame File System, recently released by Cornell University. Freeze-Frame FS accepts streams of updates while satisfying “temporal reads” on demand. The system is fast and accurate: it keeps all update history in a memory-mapped log, caches recently retrieved data for repeat reads, and uses a hybrid of a real-time and a logical clock to respond to read requests in a manner that is both temporally precise and causally consistent. When RDMA hardware is available, a single client reaches 2.6 GB/s for writes and 5 GB/s for reads, the latter close to the limit (about 6 GB/s) of the RDMA hardware used in our experiments.  Cornell's FFFS software is free and open source, and can be used in any setting where a standard file system is used, including from the Spark/DataBricks platform, where it would be used just like the built-in HDFS.
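
    To make the "hybrid of a real-time and a logical clock" idea concrete, here is a minimal Python sketch of a hybrid logical clock in its general textbook form. It is not taken from the FFFS source: the class name HybridClock, the methods tick and merge, and the integer time units are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# A timestamp is (wall-clock component, logical counter); tuples compare
# lexicographically, which gives the ordering we want.
Timestamp = Tuple[int, int]


class HybridClock:
    """Illustrative hybrid logical clock (hypothetical, not FFFS code)."""

    def __init__(self, physical_time: Callable[[], int]):
        self._now = physical_time  # injected so the clock can be simulated
        self.l = 0                 # largest wall-clock value seen so far
        self.c = 0                 # counter that breaks ties within one l

    def tick(self) -> Timestamp:
        """Timestamp a local event, such as applying an update."""
        prev = self.l
        self.l = max(prev, self._now())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def merge(self, remote: Timestamp) -> Timestamp:
        """Timestamp the receipt of a message that carries `remote`."""
        rl, rc = remote
        prev = self.l
        self.l = max(prev, rl, self._now())
        if self.l == prev and self.l == rl:
            self.c = max(self.c, rc) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        return (self.l, self.c)


# The receiver's wall clock (90) lags the sender's (100), yet the receive
# event is still ordered after the send event.
sender = HybridClock(lambda: 100)
receiver = HybridClock(lambda: 90)
sent = sender.tick()
received = receiver.merge(sent)
print(sent, received, received > sent)
```

    The point of the sketch is that timestamps stay close to real time while never ordering an effect before its cause, even when the two machines' clocks disagree.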

     

    FFFS implements a new kind of temporal storage system in which each file system update is represented separately and its time of update is tracked.  When a temporal read occurs, FFFS recovers the exact data that was current at that time for that file.  These data structures are highly efficient: FFFS outperforms HDFS even in standard non-temporal uses. We provide extensive details in our paper, “The Freeze-Frame File System”.
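
    As a rough illustration of "each update is represented separately and its time is tracked", the following Python sketch keeps an append-only log of timestamped byte-range updates and answers a temporal read by replaying the updates up to the requested time. It is an in-memory toy under stated assumptions: the class TemporalFile and its write and read_at methods are hypothetical names, and the real FFFS data structures are described in the paper, not here.

```python
import bisect


class TemporalFile:
    """Toy temporal file: an append-only log of timestamped updates."""

    def __init__(self):
        self._times = []    # update times, kept in non-decreasing order
        self._updates = []  # parallel list of (offset, data) pairs

    def write(self, t, offset, data):
        """Record an update of `data` at byte `offset`, applied at time t."""
        if self._times and t < self._times[-1]:
            raise ValueError("this sketch expects updates in time order")
        self._times.append(t)
        self._updates.append((offset, bytes(data)))

    def read_at(self, t):
        """Return the file contents as they stood at time t."""
        # Number of updates whose time is <= t.
        n = bisect.bisect_right(self._times, t)
        buf = bytearray()
        for offset, data in self._updates[:n]:
            end = offset + len(data)
            if end > len(buf):
                buf.extend(b"\0" * (end - len(buf)))  # grow the file
            buf[offset:end] = data                    # overwrite in place
        return bytes(buf)


f = TemporalFile()
f.write(10, 0, b"aaaa")   # create the file
f.write(20, 2, b"BB")     # partial overwrite
f.write(30, 4, b"cc")     # append
print(f.read_at(5), f.read_at(10), f.read_at(25), f.read_at(30))
```

    Replaying the whole log on every read is deliberately naive; it only shows why keeping updates separate makes any past state recoverable.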

     

    FFFS was originally created for use in the smart electric power grid, where developers are creating machine-intelligence solutions to improve the way electric power is managed and to reduce dirty power generation. In this setting, data collectors capture data from Internet of Things (IoT) devices, such as power-line monitoring units (so-called PMU and micro-PMU sensors), in-home smart meters, solar panels, and so on. The data is relayed into a cloud setting for analysis, often by an optimization program that will then turn around and adjust operating modes to balance power generation and demand. With FFFS, grid network models and other configuration information are updated by writing files to reflect the evolving state, in effect creating an archive of past states.

     

    FFFS can support any normal file-based analytic program, so once the data is captured, existing analytic code will work as usual.

     

    The big win occurs when performing temporal analytics: analysis of the evolving state of the system over time, rather than just of the state of the system at a single instant.  By using FFFS for data capture, smart grid developers can run standard non-real-time analytic applications on a series of snapshots, and thus can study the evolution of power-grid data.  They simply ask FFFS to fetch a series of high-accuracy snapshots, at any precision desired. The programs then run multiple times, once on each snapshot in the series.  Because the input files evolve through time, we obtain a series of outputs that also evolve over time.  Thus, for many purposes, the analytic programs don’t need to be modified to understand “time” per se: the series of outputs is itself an evolving temporal analysis of the system.
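
    The pattern described above, running an unmodified analytic routine once per snapshot, can be sketched in a few lines of Python. Everything here is a stand-in: make_reader fakes temporal reads over an in-memory history, total_load plays the role of an existing time-unaware program, and none of these names come from FFFS.

```python
import bisect


def snapshot_times(start_ms, end_ms, step_ms):
    """Evenly spaced snapshot times, inclusive of both ends when they align."""
    return list(range(start_ms, end_ms + 1, step_ms))


def make_reader(history):
    """Build a fake temporal read over [(time_ms, contents), ...] sorted by time."""
    times = [t for t, _ in history]

    def read_at(t):
        i = bisect.bisect_right(times, t)
        return history[i - 1][1] if i else None  # None: file did not exist yet

    return read_at


def total_load(contents):
    """An 'existing' analytic routine that knows nothing about time."""
    return sum(float(x) for x in contents.split())


def run_over_time(read_at, analyze, times):
    """Run the same analysis once per snapshot, yielding a time series."""
    series = []
    for t in times:
        contents = read_at(t)
        if contents is not None:
            series.append((t, analyze(contents)))
    return series


read_at = make_reader([(0, "1 2"), (40, "1 5"), (90, "4 5")])
print(run_over_time(read_at, total_load, snapshot_times(0, 100, 50)))
```

    Only run_over_time is aware of time; the analysis function is called exactly as it would be on an ordinary file.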

     

    In addition, FFFS supports temporal access to a file at more than one point in time, from a single program.  As such, one can easily write programs that explicitly search data in the time dimension!  It is as if the same file could be opened again and again, with each file descriptor connecting to a different version from a different instant in time.
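
    One way to picture a program that searches in the time dimension is a binary search over read times: open the file as of the midpoint, test a condition, and narrow the interval. The Python sketch below does this against an in-memory stand-in. The helper names make_reader and earliest_time are hypothetical, and the search assumes that once the condition becomes true it stays true.

```python
import bisect


def make_reader(history):
    """Fake temporal read over [(time, contents), ...] sorted by time."""
    times = [t for t, _ in history]

    def read_at(t):
        i = bisect.bisect_right(times, t)
        return history[i - 1][1] if i else b""

    return read_at


def earliest_time(read_at, predicate, lo, hi):
    """Smallest integer t in [lo, hi] where predicate(read_at(t)) holds.

    Returns None if the predicate is false even at `hi`. Assumes the
    predicate is monotone in time (false, then true from some point on).
    """
    if not predicate(read_at(hi)):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(read_at(mid)):
            hi = mid          # condition already true: look earlier
        else:
            lo = mid + 1      # not yet true: look later
    return lo


read_at = make_reader([(0, b"ok"), (130, b"ok ok"), (137, b"ok ok FAULT")])
print(earliest_time(read_at, lambda c: b"FAULT" in c, 0, 1000))
```

    Each probe is simply a read of the same file at a different instant, which is the capability the paragraph above describes.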

     

    As noted earlier, FFFS is a free, open-source product.  Cornell’s development team (see the Team section below) continues to develop FFFS and fix bugs, welcomes contributions, and is also open to sharing FFFS applications created for temporal data analytics.

     

    Features and Benefits

    Freeze Frame offers the following benefits:

    • Dramatically better performance through utilization of Mellanox RDMA solutions. When RDMA is available, FFFS takes advantage of the hardware.  Any data read or written will be moved using zero-copy hardware rather than through standard kernel/user-space TCP.  For data-intensive uses, this offers a huge speedup compared to standard non-RDMA networking.
    • Ultra-accurate tracking of real time for Internet of Things (IoT) applications. Freeze Frame:
      • Can be configured to understand the time fields in your data streams.  If network elements contain GPS clocks, FFFS will use those clocks as its time source for the IoT data.
      • Allows files to be overwritten, in whole or in part. FFFS supports the full POSIX file system API (in HDFS, parts of the POSIX API are not supported: files can only be appended to).
      • Does not require the user to pre-plan a snapshot, unlike HDFS. FFFS allows a user to access a snapshot when needed, long after the data was captured, and with no significant overhead costs. In addition, FFFS provides a new read API that enables applications to read the state of a file at any given time without creating or referring to a ‘snapshot’ explicitly. This is superior to a snapshot directory in the HDFS style (a special directory named by a time, within which read-only versions of all the files are shown as they looked at that time) because:
        • The users may not know which snapshot will be required by a future analysis.
        • FFFS snapshots guarantee temporal precision and causal consistency.
      • Never exposes applications to “mashups” of data from multiple times.
      • Permits applications, if desired, to specify the time at which to perform each read and write.
    • Easy installation: follow the download, compilation, and install instructions on the distribution website.

     

    Freeze Frame permits greatly improved consistency and accuracy relative to HDFS, as seen in the images displayed below, which show a representative use case. We simulated the propagation of a wave in water and broke the data into 20x20 “camera feeds”, like small videos, at 100 frames per second.  Then we read the data out at 50ms (20fps) intervals and glued the frames together to make a video.  On the left, the data was loaded into HDFS and the snapshots were generated using the HDFS snapshot capability: the poor handling of time results in many distortions. In the middle and on the right, we see FFFS in action: it understands time and gives extremely accurate results.  To see an animated version of this figure, click here.

     

    To understand the advantage, imagine that you have created a machine learning algorithm to discover waves propagating through a power grid.  If you ran your algorithm on the version of the data at the left, it would have to overcome huge levels of "noise" injected by the data collection and storage solution.  Very likely your algorithm would malfunction, purely because garbage input yields garbage output.  With the data in the middle and on the right, your algorithm would perform better because, with help from FFFS, the input is more accurate and more consistent in two senses: logical consistency and temporal accuracy.

     

    In all situations, FFFS can leverage Mellanox RDMA solutions.  As a result, FFFS with Mellanox RDMA will be far faster than HDFS in all styles of use, and at the same time, will offer better consistency and temporal accuracy.

    Figure, left to right: HDFS (hdfs37.png); FFFS + server time (fffs37.png); FFFS + sensor time (user37.png).

     

     

    Performance

    Below we see one of several performance experiments; our paper reports on many more of them.  To create these graphs, we ran FFFS with Mellanox ConnectX-3 FDR 56 Gb/s adapters, and the results are shown in the figures below.

    • We configured our RDMA layer to transfer data in pages of 4KB, and used scatter/gather to batch as many as 16 pages per RDMA read/write.
    • Figure (a) (Single Client) shows that FFFS read and write throughput grows as the size of each write increases. When our clients write 4096KB per update, FFFS write throughput reaches 2.6 GB/s, a little under half of the RDMA hardware limit of 6 GB/s (measured with the ib_send_bw tool). FFFS read throughput reaches 5.0 GB/s.
    • We scaled up to a setup consisting of 1 NameNode and 16 DataNode servers. Figure (b) (Multiple Clients) shows that file system throughput grows in proportion to the number of DataNodes, confirming that FFFS scales well.
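
    The arithmetic behind these settings is easy to check. The Python sketch below splits one update into 4KB pages, groups the pages into batches of at most 16 per RDMA operation, and expresses the reported throughput as a fraction of the 6 GB/s limit. The function names plan_transfers and utilization are hypothetical helpers; only the page size, the batch limit, and the throughput figures come from the text.

```python
PAGE_SIZE = 4 * 1024     # bytes per page (from the text)
MAX_PAGES_PER_OP = 16    # scatter/gather batch limit (from the text)


def plan_transfers(nbytes, page_size=PAGE_SIZE, max_pages=MAX_PAGES_PER_OP):
    """Return the page count carried by each RDMA operation for one update."""
    pages = -(-nbytes // page_size)            # ceiling division
    full, rest = divmod(pages, max_pages)
    return [max_pages] * full + ([rest] if rest else [])


def utilization(throughput_gbs, limit_gbs=6.0):
    """Fraction of the measured hardware limit that a throughput represents."""
    return throughput_gbs / limit_gbs


ops = plan_transfers(4096 * 1024)              # one 4096KB update
print(len(ops), "operations of", ops[0], "pages each")
print(f"write: {utilization(2.6):.0%} of limit, read: {utilization(5.0):.0%}")
```

    Under these assumptions a 4096KB update is 1024 pages, which fits in 64 full batches of 16 pages; writes reach about 43% of the hardware limit and reads about 83%.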

    Figure (a): Single Client (DataNode) Throughput (rdmathp_slides.png)
    Figure (b): Multiple DataNodes (Aggregate) Throughput (rdmascale_slides.png)

     

    Team

    The lead FFFS developer was Weijia Song, a post-doc at Cornell University who works in the Department of Computer Science in a group headed by Ken Birman.  PhD student Theo Gkountouvas collaborated closely on data structure design and on extending FFFS to use an optimal consistency-preserving data representation.  In future work, Weijia is using Cornell's RDMA-based Derecho programming platform to make FFFS highly available via data replication and fault-tolerance.  This will also make it even more scalable, since we will be able to spread reads and writes over a large number of compute/storage nodes.  Theo is working to use Pig Latin, a scripting language, to extract and format the data from files into matrix or tensor representations, which can then support direct computing from MPI, Matlab, Linpack, or other packages.  A deeper integration into Derecho is also in the planning stages.  Contact Ken Birman (ken@cs.cornell.edu) with questions about the Derecho and FFFS plans.