Several features give Bε-trees and BetrFS a performance edge. Some of the most significant:
LSM trees can be tuned to have the same insertion complexity as a Bε-tree, but queries in a natıvely implemented LSM tree can be slow, as shown in the table below. Bloom filters can improve point query performance of an LSM tree, but not range query performance.
Bε-trees are the only WOI implementation that matches the query performance of B-trees for all workloads. In short, LSMs match B-tree query times in special cases, and Bε-trees match B-tree query times in general.
Any workloads that do small, random, blind writes (i.e., writes without a preceding read) will particularly benefit on BetrFS. BetrFS is also good for sequential scans and searches within the file system. In many cases, the performance is comparable either way.
These graphs illustrate both random writes within a large file and small file creation within a directory on several general-purpose file systems:
In the case of scans, such as grep and find, BetrFS also does quite well:
Our goal is that no application should perform worse, but more research is needed for several cases.
Large, sequential I/O. In the case of writes, this is the cost of essentially full data journaling (and possibly logarithmic extra writes), compared to less aggressive schemes.
Large file deletes and trucates. These overheads are attributable to the sheer number of deletion upserts that must be inserted for a delete.
Large directory renames. BetrFS organizes data lexicographically on disk, which ensures locality, but requires copying data during renames. This currently trades search performance for rename time. We believe the costs of rename can be reduced.
As an implementation convenience, we use ext4 as a block allocator underneath our B^e tree implementation. We intend to streamline this layer in future releases.
Our design minimizes changes to the kernel. The current code requires a few kernel patches, such as enabling direct I/O from one file system to another. We expect to eliminate most of these patches in future versions.
We have currently been measuring and tuning performance on traditional hard drives (HDDs), although the code should also work on solid state drives (SSDs). Stay tuned for measurements of SSD performance.