Starburst Presto + Alluxio = better together
With more companies using Presto for reporting and analytics, we at Starburst are seeing more use cases around operational reporting. These queries need to return in under a second and usually touch only a small subset of the dataset.
Presto was designed from the ground up to offer interactive analytics using a massively parallel processing SQL engine that can combine data from multiple sources using a variety of connectors. As more and more companies discover the power of “separation of storage and compute” along with querying the data where it lies, it’s no wonder Presto is being asked to add even more functionality.
Alluxio focuses its innovation at the data layer as a key enabling technology for Presto and a wide range of analytics applications and use cases. Performance is always critical, but providing memory-speed response times is only part of the solution. If the application can’t access the data, it’s of no use. Alluxio creates a virtual data layer that aggregates data from any file or object store, providing unification across silos and allowing applications to continue using the same industry-standard interfaces to access the data.
Presto does not store or cache data itself, so for use cases where the same data is queried regularly, the two solutions complement each other extremely well. Coupled with Presto, the Alluxio platform provides an enterprise-grade, read/write, block-level caching engine that connects to a variety of storage systems, including S3 and HDFS.
The diagram below illustrates how Alluxio might be implemented on a public cloud such as AWS or Azure. Alluxio supports connectors for both storage types as well as HDFS:
As data is queried from S3 or Blob storage in the diagram, those blocks are cached in Alluxio. Another important, more technical detail is a recently added feature called Async Caching. It allows partial reads of data blocks, which speeds up the reading of data. When a slower storage medium such as S3 or Blob storage is used, this greatly increases performance.
One place Async Caching helps is reading the footers of ORC or Parquet files. Presto reads the footer to determine whether the file contains the data a query requires. If the entire block had to be read just to look at the footer, this could take anywhere from many seconds to minutes. With Async Caching, the footer read takes seconds, and the remaining data in the block is read in the background without holding up the query. If Presto decides it needs to read the entire block, that data will be cached in Alluxio, speeding up the query even more.
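To make the footer example concrete, here is a minimal Python sketch of the idea. It is not Alluxio's implementation or API: `UnderStore`, `AsyncCachingReader`, and every method name are hypothetical stand-ins, and the "object store" is just an in-memory dictionary. The sketch only assumes the behavior described above: a partial read is served right away, and the full block is fetched into the cache in the background.

```python
import threading
import time


class UnderStore:
    """Hypothetical stand-in for a slow object store such as S3."""

    def __init__(self, blocks, delay_per_byte=0.0):
        self.blocks = blocks              # block_id -> bytes
        self.delay_per_byte = delay_per_byte
        self.bytes_read = 0               # total bytes served, for inspection
        self._lock = threading.Lock()

    def size(self, block_id):
        return len(self.blocks[block_id])

    def read(self, block_id, offset, length):
        data = self.blocks[block_id][offset:offset + length]
        time.sleep(len(data) * self.delay_per_byte)  # simulate transfer time
        with self._lock:
            self.bytes_read += len(data)
        return data


class AsyncCachingReader:
    """Serves partial reads immediately and caches whole blocks in the background."""

    def __init__(self, store):
        self.store = store
        self.cache = {}     # block_id -> full block bytes
        self.pending = {}   # block_id -> background caching thread
        self.lock = threading.Lock()

    def read(self, block_id, offset, length):
        with self.lock:
            cached = self.cache.get(block_id)
            if cached is not None:
                # Cache hit: no trip to the under store at all.
                return cached[offset:offset + length]
            if block_id not in self.pending:
                # First miss on this block: start caching it asynchronously.
                worker = threading.Thread(
                    target=self._cache_block, args=(block_id,), daemon=True
                )
                self.pending[block_id] = worker
                worker.start()
        # Fetch only the requested range (e.g. a file footer) right now.
        return self.store.read(block_id, offset, length)

    def _cache_block(self, block_id):
        data = self.store.read(block_id, 0, self.store.size(block_id))
        with self.lock:
            self.cache[block_id] = data
            self.pending.pop(block_id, None)

    def wait_for_caching(self):
        """Block until all background caching threads have finished."""
        with self.lock:
            workers = list(self.pending.values())
        for worker in workers:
            worker.join()


# Usage: read a 6-byte "footer" first, then read from the cached block.
store = UnderStore({"part-0": b"x" * 1000 + b"FOOTER"})
reader = AsyncCachingReader(store)
footer = reader.read("part-0", 1000, 6)   # returns without waiting for the full block
reader.wait_for_caching()
head = reader.read("part-0", 0, 4)        # now served from the cache
```

The design choice worth noticing is that the caller never waits on the full-block fetch: the footer read and the background caching are independent requests to the under store, so the query proceeds as soon as the small range comes back.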
Alluxio also supports tiered storage, which allows data to be cached at different storage layers based on usage. The data that is used most often is cached in the fastest tier with the lowest retrieval latency, which is RAM. From there, SSDs and regular hard drives can serve as the second and third tiers. Additionally, files can be “pinned” into the cache, which allows greater flexibility for certain use cases.
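The tiering and pinning behavior can be pictured with a small Python sketch. This is a toy model, not Alluxio's eviction logic: the `TieredCache` class, its method names, and the tier names and capacities are all hypothetical, and it assumes a simple least-recently-used policy in which a block that no longer fits in one tier is demoted to the next, and a pinned block is never demoted from the tier it sits in.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier LRU cache; tiers are listed fastest first (e.g. MEM, SSD, HDD)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, block_id):
        """Mark a block so it is never chosen as a demotion victim."""
        self.pinned.add(block_id)

    def put(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)  # new data lands in the fastest tier

    def get(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote on access
                return data
        return None  # miss: the caller would fall back to the under store

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None

    def _remove(self, block_id):
        for _, _, blocks in self.tiers:
            blocks.pop(block_id, None)

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: dropped from the cache
        _, capacity, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)  # most recently used goes last
        while len(blocks) > capacity:
            # Oldest unpinned block in this tier is demoted one level down.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; tolerate going over capacity
            self._insert(level + 1, victim, blocks.pop(victim))


# Usage: two blocks fit in MEM, two more in SSD.
cache = TieredCache([("MEM", 2), ("SSD", 2)])
for block in ("a", "b", "c"):
    cache.put(block, b"data-" + block.encode())
cache.tier_of("a")   # "a" was the least recently used block, so it now sits in SSD
cache.get("a")       # reading it promotes it back to MEM and demotes "b"
```

Promotion on read is what keeps frequently used data in the fastest tier, while pinning overrides recency for blocks that must stay hot regardless of access pattern.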
Amazon S3 and Azure Blob storage, along with on-premises and private cloud object stores such as MinIO and Ceph, provide excellent, low-cost, fault-tolerant object storage for companies' historical and operational data. The use of Presto with popular BI reporting tools has skyrocketed, and Alluxio gives these companies an additional tool in their belts to increase performance when using these object stores.