For the last few years, a hot topic in many organizations has been the separation of storage and compute. With data volumes and the variety of stored data growing daily, placing this data on flexible storage such as HDFS, or on cloud object storage such as Amazon S3 and Azure Blob Storage, gives a company great flexibility in when and where it consumes this data.
Welcome back to the series of blog posts (check out our previous post!) about Presto's first Cost-Based Optimizer! Today let's focus on the challenge of choosing the optimal join order. The order in which relations are joined substantially affects the performance of a query. A poor join order might introduce unnecessary CPU and network overhead. To overcome that, the Starburst Presto release includes a state-of-the-art join enumeration algorithm that will greatly benefit its users. Let’s start with a quick introduction to how Presto's join enumerator will speed up your common queries, and then we will discuss the algorithm in more detail.
The Cost-Based Optimizer (CBO) we released recently achieves stunning results in industry-standard benchmarks (and not only in benchmarks)! The CBO makes decisions based on several factors, including the shape of the query, filters, and table statistics. I would like to tell you more about what table statistics are in Presto and what information can be derived from them.
As mentioned in our previous blog post about the Starburst Presto release and its hottest addition, the Cost-Based Optimizer for Presto, we’re happy to share the results of the benchmarks we ran for this release (195e), comparing it to the ‘vanilla’ Presto release 195. We will now continue the process of getting all those CBO-related changes merged into the ‘vanilla’ Presto repository.
The benchmarks were performed using a standard set of TPC-H and TPC-DS queries. As a side note, I would like to highlight that, thanks to our team’s contributions over the last couple of years, Presto supports 100% of the TPC benchmark queries and executes them unmodified, that is, with no prohibited query modifications. You can find the queries in our repository.
Today, I am pleased to announce the availability of Presto 195e, including Presto’s first Cost-Based Optimizer! With the new optimizer you should expect to see significant improvements in Presto’s query performance. Our team, in collaboration with Facebook, spent the last year heads down working on it, so you can understand why we are pretty excited that this day has finally come! You can read more about Starburst’s state-of-the-art optimizer here.
Next week, we will be releasing the Starburst Distribution of Presto 195e. Based on prestodb/presto 0.195, Starburst’s 195e will ship with Presto’s first cost-based optimizer! In our performance testing and in collaboration with customers in our beta program, we are measuring greater than an order of magnitude performance improvement for many analytical queries such as TPC-H and TPC-DS queries.
The Strata Data conference in San Jose last week was another great event in the Strata series. Those coming from a Hadoop background definitely noticed how little Hadoop there was. Quite a change from 2009, when I first attended the Hadoop World conference in NYC.
As more and more companies turn to low-cost object storage to store the majority of their data, providing easy access to this data has become vitally important. Transforming and loading data into other systems that provide specific features is still necessary for certain requirements, but querying an object store directly is gaining popularity.
Some great news from the AWS folks: they have increased network bandwidth for EC2 instances when communicating with S3. For those of you who use Presto on AWS, this is GREAT news!
One of the biggest complaints we hear from our customers is how hard it is to troubleshoot performance issues when using S3 as a data source for Presto. With this increase in speed, a fully queryable S3-powered data lake is even easier to achieve.
As you may have learned from our first press release, we have announced the creation of Starburst, a new independent company solely focused on Presto, an open source distributed SQL engine.
If you are new to Presto, please read more about its unique SQL-on-Anything capabilities.
In this blog post, I would like to talk a bit about how we got where we are today.