Starburst @ Strata — 3x Presto!
Kamil Bajda-Pawlikowski, Co-founder & CTO
The Strata Data conference in San Jose last week was another great event in the Strata series. Those coming from a Hadoop background definitely noticed how little Hadoop there was, quite a change from 2009, when I first attended the Hadoop World conference in NYC. This year’s favorite topic was AI and machine learning, while the expo hall featured a wide variety of vendors covering stream processing, data storage, modern DBMS, and various Business Intelligence tools and their accelerators.
While advanced analytics is surely a hot topic, many hallway conversations revealed that good old SQL analytics and managing ever-growing datasets are what occupy most of the attendees’ time at work. Are there any significant trends there?
Presto is definitely one that is picking up momentum! Netflix, Uber, and Lyft shared their data platform evolution stories, and the common themes are:
- moving away from expensive proprietary data mart and data warehouse solutions to S3 or HDFS to scale economically to 10s and 100s of PBs of data
- running 100s of nodes of Presto for interactive SQL analytics with tens and hundreds of thousands of queries per day
- leveraging Hive and/or Spark for long-running batch data transformation ETL jobs
I encourage everyone to take a look at the slides and watch the recordings when they appear on the conference website. I am going to include the links to the talks at the bottom. In the meantime, I want to note a few highlights:
Netflix is obviously a well-known Presto user, having worked with it since 2014. They have talked about their Presto experiences several times in the past few years (see presentations from 2015, 2016, and 2017). The latest status is that Presto at Netflix is a primary interactive SQL engine for their S3-based data warehouse of 100 PB, while Spark is used for long-running data transformation jobs and Amazon Redshift covers some remaining niche use cases.
Uber started with Presto about 2 years ago and now has two Presto clusters (hundreds of nodes in total). Their Presto usage is growing fast (50% since last year). Today Presto runs 180K queries daily, more than Hive and Spark combined. As at Netflix, Presto is used for interactive SQL queries and scales to levels beyond what their “Commercial DB” can achieve.
Lyft joined the Presto user community last year. Their goal is to move away from Amazon Redshift, and Presto is their choice for interactive SQL analytics, while Hive is used for big batch ETL. Today Lyft stores 20 PB in S3, and that data is growing at a rate of 3B+ events per day.
Here are the links to the talks at Strata:
Enjoy the links and let us know when we can help you to get the most out of Presto!