What is Presto?

 

Presto is an open source distributed SQL query engine for running fast analytic queries against data sources ranging in size from gigabytes to petabytes. Presto was designed and built from scratch for interactive analytics. It approaches the speed of commercial data warehouses while scaling to the size of very large organizations.

Background

Presto was originally developed at Facebook to handle the data volumes and query performance the company needed. In Fall 2012, a small team of four engineers at Facebook started working on Presto. By Spring 2013, the first version was successfully rolled out within Facebook. Later that year, Facebook open sourced Presto under the Apache License. Since then, the Presto community has thrived, with many contributions from Facebook and from other organizations. Presto is used by many well-known technology companies today.

SQL-on-Anything

Presto queries data where it lives, including Hadoop, S3, Redshift, relational databases, Cassandra, Kafka, and even proprietary data stores. Data is not copied into a separate store beforehand; it is read from the data stores at query time. Within a single query you can reach multiple data stores, which lets you analyze data across your entire organization. For example, a single query can join data arriving on a Kafka topic with historical data in HDFS or S3.
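
As a rough illustration of such a cross-source query, the sketch below builds one SQL statement that joins a Kafka-backed table with a Hive-backed table. Presto addresses tables as catalog.schema.table, but the specific catalog, schema, table, and column names here (kafka.default.clicks, hive.web.users, user_id, country) are made up for the example; a real deployment defines its own.

```python
# Hypothetical names throughout: a real cluster defines its own catalogs,
# schemas, tables, and columns.
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified catalog.schema.table name."""
    return f"{catalog}.{schema}.{table}"


clicks = qualified("kafka", "default", "clicks")  # events arriving on a Kafka topic
users = qualified("hive", "web", "users")         # historical data in HDFS or S3

# One statement, two data stores: the join happens inside the query engine.
FEDERATED_QUERY = f"""
SELECT u.country, count(*) AS clicks
FROM {clicks} AS c
JOIN {users} AS u ON c.user_id = u.user_id
GROUP BY u.country
"""

if __name__ == "__main__":
    print(FEDERATED_QUERY)
```

The resulting text could be submitted through any Presto client; only the catalog prefix distinguishes the two sources.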

In addition, Presto offers flexible deployment options. It can run almost anywhere: on premises, in the cloud, on bare-metal commodity hardware, or in a virtualized environment. You may co-locate Presto with your underlying data store or run it on a dedicated cluster in order to separate compute from storage.

Architecture

Presto is a distributed system that runs on one or more machines to form a cluster. An installation will include one Presto Coordinator and any number of Presto Workers. The Presto Coordinator is the machine to which users submit their queries. The Coordinator is responsible for parsing, planning, and scheduling query execution across the Presto Workers. Adding more Presto Workers allows for more parallelism and faster query processing.
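
The division of labor between Coordinator and Workers can be pictured with a toy model. The sketch below is not Presto code: threads stand in for worker machines, a plain list stands in for a table, and the function names (plan_splits, worker_partial_sum, run_query) are invented for illustration. It only shows the shape of the idea: the coordinator partitions the work, workers compute partial results in parallel, and the partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, num_workers):
    """Coordinator step: partition the input round-robin, one chunk per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_partial_sum(chunk):
    """Worker step: compute a partial aggregate over the assigned chunk."""
    return sum(chunk)


def run_query(rows, num_workers):
    """Coordinator: plan, schedule chunks on workers, then merge partial results."""
    chunks = plan_splits(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_partial_sum, chunks))
    return sum(partials)


if __name__ == "__main__":
    # Equivalent in spirit to: SELECT sum(x) FROM t
    print(run_query(list(range(1, 101)), num_workers=4))
```

Raising num_workers gives each worker a smaller chunk, which is the intuition behind adding Workers to a cluster.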

 

Pluggable

 

Presto was architected to be pluggable. This design simplifies the development of connectors to virtually any data source. As long as the data can be translated into something that the Presto Connector APIs can consume, Presto can query it. This is done primarily by implementing three sets of APIs: the Metadata API, the Data Location API, and the Data Stream API.

The Metadata API provides Presto with catalog information about relational tables and schemas. If the underlying system has no such concept, the connector implementation decides how to map its data onto tables and schemas. For example, the Kafka Connector presents Kafka topics as tables.

The Data Location API tells Presto where the data lives. For example, it will list splits for HDFS or the hostname where an SQL Server instance is running.

The Data Stream API reads the data from the underlying data source (HDFS, SQL Server, Cassandra, etc.) and converts it into Presto's internal representation so that processing can continue inside Presto's execution engine.
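
To make the three roles concrete, here is a minimal sketch of a toy connector. It is written in Python for brevity and does not reproduce Presto's actual connector interfaces, which are Java; every class and method name below (Metadata, DataLocation, DataStream, InMemoryConnector, scan) is an assumption made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple[object, ...]


class Metadata(ABC):
    """Which tables exist and what columns they have."""
    @abstractmethod
    def list_tables(self) -> List[str]: ...
    @abstractmethod
    def columns(self, table: str) -> List[str]: ...


class DataLocation(ABC):
    """Where the data for a table lives, expressed as a list of splits."""
    @abstractmethod
    def splits(self, table: str) -> List[str]: ...


class DataStream(ABC):
    """Read the rows of one split in the engine's row format."""
    @abstractmethod
    def read(self, split: str) -> Iterator[Row]: ...


class InMemoryConnector(Metadata, DataLocation, DataStream):
    """Toy connector over nested dicts: {table: {split_id: [rows]}}."""
    def __init__(self, schema: Dict[str, List[str]],
                 data: Dict[str, Dict[str, List[Row]]]):
        self._schema = schema
        self._data = data

    def list_tables(self) -> List[str]:
        return sorted(self._schema)

    def columns(self, table: str) -> List[str]:
        return self._schema[table]

    def splits(self, table: str) -> List[str]:
        return [f"{table}/{split_id}" for split_id in sorted(self._data[table])]

    def read(self, split: str) -> Iterator[Row]:
        table, split_id = split.split("/", 1)
        yield from self._data[table][split_id]


def scan(connector, table: str) -> Iterator[Row]:
    """Engine side: ask where the data is, then stream rows from each split."""
    for split in connector.splits(table):
        yield from connector.read(split)


connector = InMemoryConnector(
    schema={"clicks": ["user_id", "url"]},
    data={"clicks": {"part-0": [(1, "/home")], "part-1": [(2, "/cart")]}},
)

if __name__ == "__main__":
    print(connector.list_tables(), list(scan(connector, "clicks")))
```

The engine-side scan function never needs to know what the underlying store is; it only relies on the three interfaces, which is the point of the pluggable design.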

Presto is ... Fast!

From the beginning, Presto was designed with SQL query performance in mind. It leverages both well-known and novel techniques for distributed query processing. These include:

  • In-memory parallel processing
  • Pipelined execution across nodes in the cluster
  • Multithreaded execution model to keep all the CPU cores busy
  • Efficient flat-memory data structures to minimize Java garbage collection
  • Java bytecode generation
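
Of the techniques listed above, pipelined execution is the easiest to show in a few lines. The following single-process sketch uses Python generators to chain a scan, a filter, and a projection so that each row flows through every operator before the next row is read. It illustrates the general idea only; Presto's engine is written in Java and pipelines across nodes, and the operator names and sample rows here are invented.

```python
def scan_rows(rows, trace):
    """Source operator: hand out rows one at a time, recording each read."""
    for row in rows:
        trace.append(("scan", row["id"]))
        yield row


def keep_large(rows, min_bytes):
    """Filter operator: pass through rows above the size threshold."""
    for row in rows:
        if row["bytes"] > min_bytes:
            yield row


def emit_urls(rows, trace):
    """Projection operator: output one column, recording each emitted row."""
    for row in rows:
        trace.append(("emit", row["id"]))
        yield row["url"]


def run_pipeline(rows, min_bytes=100):
    """Chain the operators; no intermediate result is materialized between them."""
    trace = []
    urls = list(emit_urls(keep_large(scan_rows(rows, trace), min_bytes), trace))
    return urls, trace


ROWS = [
    {"id": 1, "bytes": 200, "url": "/a"},
    {"id": 2, "bytes": 50, "url": "/b"},
    {"id": 3, "bytes": 300, "url": "/c"},
]

if __name__ == "__main__":
    urls, trace = run_pipeline(ROWS)
    print(urls)
    print(trace)
```

The trace shows scans and emits interleaved rather than all scans followed by all emits, which is what distinguishes pipelining from stage-by-stage batch processing.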

References

https://prestodb.io/
https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)


Presto Users

Presto is used in production at very large scale at many well-known companies. It has proven itself in providing interactive query capabilities on all types of disparate data, from small to extremely large. It works because many of the companies using Presto contribute back to the open source community, improving it for everyone. This continues to drive Presto as the canonical open source distributed SQL-on-Anything query engine.

The following is a list of just some of the companies using Presto today:

 

Airbnb

Atlassian

Amazon

Bloomberg

Comcast

Facebook

FINRA

LinkedIn

Lyft

NASDAQ

Netflix

Pinterest

Slack

Twitter

Uber

Walmart

Warner Brothers

Yahoo! Japan