Starburst’s Presto on AWS up to 18x faster than EMR
Karol Sobczak & Anu Sudarsan, Co-Founders & Software Engineers at Starburst
Last week, we announced the availability of Starburst’s Presto on AWS Marketplace. With this offering, one can deploy a Presto cluster and begin querying S3 data in a matter of minutes! Coupled with simple deployment, Presto is automatically configured for your AWS EC2 instances, ultimately saving you both time and AWS costs. Previously, Presto was only available on AWS via EMR; in this blog post, we’ll dive into the performance benchmark comparisons between Starburst’s Presto on AWS and AWS EMR Presto.
Starburst Presto Auto Configuration
Starburst Presto is automatically configured for the selected EC2 instance type, and the default configuration is well balanced for mixed use cases. For example, the Presto JVM is given around 80% of the node's total physical memory, while query.max-memory-per-node is set to a reasonable 20% of the Presto Java heap. Such a configuration is suitable for low-to-medium concurrency while still allowing larger single queries to complete. In a future post, we will cover more detailed auto-configuration choices tailored to specific cluster use cases.
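As an illustration, the sizing rule above maps to two standard Presto configuration files. The property names below are real Presto settings; the values are a hypothetical sketch for an r4.8xlarge (244GB RAM), not the exact values our auto configuration emits:

```
# etc/jvm.config (sketch): Presto JVM heap at roughly 80% of
# the node's 244GB physical memory
-Xmx195G

# etc/config.properties (sketch): per-node query memory capped at
# roughly 20% of the Presto Java heap
query.max-memory-per-node=39GB
```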
We conducted benchmarks against TPC Benchmark™ DS (TPC-DS) schemas ranging from 10GB to 1TB. These schema sizes were chosen to keep individual query times manageable, since each query was run repeatedly in order to produce stable results. Additionally, the tables were stored in S3 in the ORC data format (Zlib compressed) and were neither bucketed nor partitioned.
Starburst Presto was deployed using the CloudFormation template and Presto AMI from Starburst’s Presto on AWS offering. We used one coordinator and ten worker nodes, all r4.8xlarge EC2 instances. EMR Presto (version 0.194), from the EMR release emr-5.13.0, used the same instance types with one Master and ten Core (worker) nodes. For cost-effectiveness, we used spot instances for both clusters. Further, both clusters shared a common external Hive metastore hosted in the same region. Each reported result is the mean duration of six query executions, taken after five warm-up runs.
Overall, Starburst Presto performs much better than EMR Presto on the same instance types, both with and without the cost-based optimizer (CBO) enabled. Additionally, EMR Presto fails to complete all TPC-DS queries out of the box.
Thus, for the entire set of benchmark queries reported here, Starburst Presto gave us around a 7x improvement in AWS cluster cost compared to EMR Presto. The above graphs clearly illustrate the performance gains for Starburst Presto, with the largest gains when the CBO is used. The following speedup charts provide a more aggregated summary of the improvements:
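The cost improvement follows directly from runtime: with identical instance types and node counts on both clusters, AWS cluster cost is proportional to total benchmark wall-clock time. A minimal sketch of that arithmetic, using hypothetical runtimes and an illustrative hourly rate (not our measured figures, which are shown in the charts):

```python
# Illustrative only: cluster cost scales linearly with runtime when
# both clusters use the same instances. Rate below is a hypothetical
# on-demand price for ten r4.8xlarge workers, in dollars per hour.
HOURLY_CLUSTER_COST = 10 * 2.128

def benchmark_cost(total_runtime_hours: float) -> float:
    """Cluster cost for a benchmark run billed per hour of runtime."""
    return HOURLY_CLUSTER_COST * total_runtime_hours

# If one engine needs 7 hours for the suite and the other needs 1 hour,
# the cost ratio is 7x -- the same shape of improvement reported above.
slow_cost = benchmark_cost(7.0)
fast_cost = benchmark_cost(1.0)
print(round(slow_cost / fast_cost, 1))  # 7.0
```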
EMR Presto Tuning
With the default EMR Presto configuration, more than 52% of the benchmark queries failed.
The majority of the queries failed with the “Queries exceeded max memory size..” error. To correct this, we raised the query.max-memory property in EMR Presto. In addition, we tuned the fs.s3.maxConnections property in EMRFS to fix query timeouts. Although queries then started passing, these efforts did not bring any significant improvement to query performance. We also switched EMR Presto away from EMRFS to Presto’s native S3 file system (setting hive.s3-file-system-type to PRESTO) to study any benefit from the file system choice, but we did not observe any significant change in query runtime. The numbers reported above are the best we obtained from EMR Presto after multiple tuning attempts.
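For reference, the tuning described above amounts to overriding a handful of properties. The property names are the real ones; the values below are illustrative, not the exact settings we used:

```
# etc/config.properties: raise the distributed query memory limit
# to avoid the max-memory failures
query.max-memory=100GB

# emrfs-site (EMRFS only): enlarge the S3 connection pool
# to avoid query timeouts
fs.s3.maxConnections=500

# etc/catalog/hive.properties: use Presto's native S3 file system
# instead of EMRFS
hive.s3-file-system-type=PRESTO
```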
Overall, because of the improvement in query execution times, our Starburst Presto release enables you to run more analytics and/or reduce your AWS costs. Additionally, our release is much easier to use, as it works seamlessly with your S3 data. Our experience with EMR Presto was on the opposite end of the spectrum: over 52% of the 99 TPC-DS queries failed out of the box, and we had to expend considerable effort debugging and researching why EMR Presto doesn’t work as well as our distribution.
In the future, we plan to publish a blog post on choosing an instance type based on cluster workload. Choosing the right, and most cost-effective, instance type is a difficult task and deserves separate consideration.