Connectors

About Presto Connectors

Overview

Presto, an SQL-on-Anything engine, comes with a number of built-in connectors for a variety of data sources. Presto’s architecture fully abstracts the data sources it can connect to, which facilitates the separation of compute and storage. The Connector SPI allows building plugins for file systems and object stores, NoSQL stores, relational database systems, and custom services. As long as one can map the data into relational concepts such as tables, columns, and rows, it is possible to create a Presto connector. What is more, within a single Presto installation users can register multiple catalogs and run queries that access data from multiple connectors at once. There is no need to perform a lengthy ETL process since Presto can simply query data where it lives.
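For example, assuming catalogs named hive and mysql have been registered (the catalog, schema, and table names below are purely illustrative), a single query can join data across both systems:

SELECT o.order_id, c.customer_name
FROM hive.sales.orders o
JOIN mysql.crm.customers c ON o.customer_id = c.customer_id;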

Presto is available with a large number of existing connectors for the following data sources:

  • Hive Metastore plus HDFS or S3-compatible object stores
  • NoSQL engines such as Cassandra, Accumulo, MongoDB, Redis, and Kafka
  • RDBMS engines such as MySQL, PostgreSQL, SQL Server, and Redshift

Additional connectors are under development. Please contact us if you are interested in a connector not listed above. We are happy to help!

Cassandra

Brief description

Apache Cassandra is a very popular open source distributed NoSQL column-oriented storage engine known for its performance, scalability, and high availability.

The Cassandra connector allows Presto to query data from Cassandra. It can be used to provide ANSI SQL analytics over data stored in Cassandra alone or to join data between different systems such as Cassandra, MySQL, and HDFS/S3. In addition to SELECT queries, DML statements such as INSERT INTO and CREATE TABLE AS are also supported by the Cassandra connector. Finally, users can display schemas, tables, views, and columns from the Cassandra system registered in Presto.
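For example, assuming a catalog named cassandra with a keyspace analytics (names are illustrative), users can explore metadata and copy data directly from Presto:

SHOW TABLES FROM cassandra.analytics;

CREATE TABLE cassandra.analytics.events_backup AS
SELECT * FROM cassandra.analytics.events;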

Documentation link: https://docs.starburstdata.com/latest/connector/cassandra.html

Pushdown

Presto brings back from Cassandra only the values for those columns that are needed to compute a given query. In addition, Presto attempts to push down filters involving the partition key and clustering key defined for a given Cassandra table. Filters on a partition key limit the number of partitions that need to be processed by Presto, which results in significantly faster response times. Filters on a clustering key decrease the number of rows that need to be transferred from the Cassandra table partitions to Presto and thereby further speed up queries.
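As a sketch, assuming a Cassandra table events with partition key event_date and clustering key event_time (an illustrative schema), the following query lets Presto read a single partition and only the matching row range within it:

SELECT user_id, event_type
FROM cassandra.analytics.events
WHERE event_date = DATE '2019-01-15'
  AND event_time > TIMESTAMP '2019-01-15 12:00:00';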

Parallelism

Data transfer between a Presto cluster and a Cassandra cluster is fully parallelized. Cassandra table partitions are turned into Presto splits that are scheduled across multiple Presto Workers.

All Presto Workers open connections to Cassandra nodes (where the assigned partitions are stored) and fetch the columns and rows relevant for a given query.

Concurrency

Multiple concurrent Presto queries may reference Cassandra-based tables. Each query parallelizes its data transfer as explained above.

Use cases

The key use case for the Cassandra connector is to enable fully-featured ANSI SQL analytics of data stored in Apache Cassandra. Unlike the built-in Cassandra Query Language (CQL), Presto offers joins, aggregations, and subqueries.

In addition, the online data from Cassandra can be combined in a single query with the archival data stored in HDFS or S3 as well as in relational stores such as PostgreSQL.
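A sketch of such a federated query (all names are illustrative), enriching recent activity in Cassandra with reference data held in PostgreSQL:

SELECT u.region, count(*) AS signups
FROM cassandra.analytics.signups s
JOIN postgresql.public.users u ON s.user_id = u.user_id
GROUP BY u.region;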

Hive (HDFS)

Brief description

HDFS is a distributed file system available in every Hadoop distribution including the most popular ones offered by Cloudera and Hortonworks. In addition, HDFS-compatible storage is also available in the cloud as part of managed Hadoop offerings on AWS, Azure, and Google Cloud.

The Hive connector allows Presto to query data stored in HDFS-compatible filesystems and registered in a Hive Metastore. Please note that the “Hive connector” name confuses some people: the Hive connector DOES NOT actually use Hive to parse or execute the SQL query in any way. It only uses the Hive Metastore (also known as HCatalog) for metadata behind the scenes, which allows users to share the data with other engines such as Hive, Spark, and Pig.

The Hive connector can be used to provide ANSI SQL analytics of data stored in HDFS alone or to join data between different systems such as HDFS, PostgreSQL, and Cassandra. In addition to SELECT queries, DDL and DML statements such as CREATE/DROP SCHEMA/TABLE and INSERT INTO are also supported by the Hive/HDFS connector. Finally, users can display schemas, tables, views, and columns from the Hive Metastore registered in Presto.
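For example, assuming a catalog named hive (schema, table, and column names are illustrative), a table can be created and populated entirely from Presto:

CREATE SCHEMA hive.web;

CREATE TABLE hive.web.page_views (
    view_time timestamp,
    user_id bigint,
    url varchar
)
WITH (format = 'ORC');

INSERT INTO hive.web.page_views
VALUES (TIMESTAMP '2019-01-15 12:00:00', 42, 'http://example.com/');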

Documentation link: https://docs.starburstdata.com/latest/connector/hive.html

Pushdown

Presto is able to query several popular file formats such as ORC, Parquet, RCFile, Avro, SequenceFile, and Text. The ability to push down column projections and filters depends on the particular file format. For example, in the case of ORC and Parquet, Presto will read only the columns needed by the query and leverage the built-in min/max indices to skip reading files and blocks that do not contain rows with values that would satisfy the filters. The pushdown capabilities improve overall query performance by reducing network I/O between the Presto cluster and HDFS.
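Continuing the illustrative ORC table sketched above, a query such as the following reads only the two referenced columns, and files or stripes whose min/max statistics rule out the filter range are skipped entirely:

SELECT user_id, url
FROM hive.web.page_views
WHERE view_time >= TIMESTAMP '2019-01-15 00:00:00';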

Parallelism

Data transfer between a Presto cluster and HDFS is fully parallelized. Presto splits are created out of data file segments and scheduled across multiple Presto Workers. All Presto Workers open parallel connections to HDFS Data Nodes and fetch the columns and rows relevant to a given query from the assigned splits.

Concurrency

Multiple concurrent Presto queries may reference HDFS-based tables. Each query parallelizes its data transfer as explained above.

Use cases

The key use case for the Hive/HDFS connector is to enable fully-featured ANSI SQL analytics of data stored in an HDFS-compatible filesystem.

In addition, the historical data from HDFS can be combined in a single query with the online data from Cassandra and with the data from a relational DBMS such as PostgreSQL.

Hive (S3)

Brief description

Amazon S3 is an object store engine available in AWS and the most popular place to store data for Amazon cloud users. The S3 API has become so ubiquitous that S3 compatibility is now offered by many vendors of object storage engines including Ceph, Minio, OpenIO, Cloudian, and IBM Cloud Object Storage.

The Hive connector allows Presto to query data stored in S3-compatible engines and registered in a Hive Metastore. Please note that the “Hive connector” name confuses some people: the Hive connector DOES NOT actually use Hive to parse or execute the SQL query in any way. It only uses the Hive Metastore (also known as HCatalog) for metadata behind the scenes, which allows users to share the data with other engines such as Hive, Spark, and Pig.

The Hive connector can be used to provide ANSI SQL analytics of data stored in Amazon S3 alone or to join data between different systems such as S3, MySQL, and Cassandra. In addition to SELECT queries, DDL and DML statements such as CREATE/DROP SCHEMA/TABLE and INSERT INTO are also supported by the Hive/S3 connector. Finally, users can display schemas, tables, views, and columns from the Hive Metastore registered in Presto.
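For example, an external table can be declared over files that already exist in S3 (the bucket path and all names below are illustrative; format and external_location are Hive connector table properties):

CREATE TABLE hive.web.request_logs (
    request_time timestamp,
    url varchar,
    status integer
)
WITH (
    format = 'ORC',
    external_location = 's3://my-bucket/request-logs/'
);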

Documentation link: https://docs.starburstdata.com/latest/connector/hive.html

Pushdown

Presto is able to query several popular file formats such as ORC, Parquet, RCFile, Avro, SequenceFile, and Text. The ability to push down column projections and filters depends on the particular file format. For example, in the case of ORC and Parquet, Presto will read only the columns needed by the query and leverage the built-in min/max indices to skip reading files and blocks that do not contain rows with values that would satisfy the filters. The pushdown capabilities improve overall query performance by reducing network I/O between the Presto cluster and S3.

Parallelism

Data transfer between a Presto cluster and S3 is fully parallelized. Presto splits are created out of data file segments and scheduled across multiple Presto Workers. All Presto Workers open parallel connections to S3 and fetch the columns and rows relevant to a given query from the assigned splits.

Concurrency

Multiple concurrent Presto queries may reference S3-based tables. Each query parallelizes its data transfer as explained above.

Use cases

The key use case for the Hive/S3 connector is to enable fully-featured ANSI SQL analytics of data stored in an S3-compatible object store.

In addition, the historical data from S3 can be combined in a single query with the online data from Cassandra and with the data from a relational DBMS such as MySQL.

MS SQL Server

Brief description

Microsoft SQL Server is a commercial relational DBMS offered by Microsoft.

The SQL Server connector allows Presto to query MS SQL Server databases. It can be used to join data between different systems such as MS SQL Server, Cassandra, and HDFS/S3 or between several MS SQL Server instances. In addition to SELECT queries, DML statements such as INSERT INTO and CREATE TABLE AS are also supported by the MS SQL Server connector. Finally, users can display schemas, tables, views, and columns from the MS SQL Server database registered in Presto.

Documentation link: https://docs.starburstdata.com/latest/connector/sqlserver.html

Query Pushdown

Presto takes your SQL query and pushes some of the query processing related to tables stored in MS SQL Server down to the underlying DBMS via JDBC.

In particular, column projections and simple filters will be pushed down so that the amount of data transferred from MS SQL Server base tables or views is limited only to those columns and rows needed to compute the Presto query.

For example, for the following Presto query:

SELECT product_id, avg(price) AS avg_price
FROM sqlserver.merchant.sales
WHERE country_id = 1
GROUP BY product_id
ORDER BY avg_price DESC
LIMIT 10;

an MS SQL Server instance will receive:

SELECT product_id, price
FROM merchant.sales
WHERE country_id = 1;

while the rest of the processing (aggregation, sorting, and limit) will be done in Presto after fetching data from MS SQL Server.

Today Presto is not capable of pushing down aggregations and joins into MS SQL Server. In many cases a simple workaround for this limitation is the creation of views inside MS SQL Server that will be referenced by Presto queries. Such views should contain aggregations and/or joins of MS SQL Server tables. The views are processed inside MS SQL Server (along with any column projections and filters pushed down by Presto) and the resulting intermediate data is streamed back to Presto for final processing.
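A minimal sketch of this pattern (all names are illustrative; the view is created with MS SQL Server's own tools, not through Presto):

-- Executed inside MS SQL Server:
CREATE VIEW merchant.sales_summary AS
SELECT product_id, AVG(price) AS avg_price
FROM merchant.sales
GROUP BY product_id;

-- Executed in Presto; the aggregation runs inside MS SQL Server:
SELECT * FROM sqlserver.merchant.sales_summary
WHERE avg_price > 100;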

Parallelism

The MS SQL Server connector is JDBC-based. For each MS SQL Server table referenced in a given Presto query there will be one SQL statement pushed down via a JDBC connection. The resulting rows from MS SQL Server will come as a single stream via that JDBC connection.

Typically, one Presto Worker will open such a JDBC connection for each MS SQL Server table referenced by a Presto query; as a result, fetching data from a single table is not parallelized. However, when multiple MS SQL Server tables are referenced, fetching them is done in parallel across many Presto Workers.

Concurrency

Multiple concurrent Presto queries may reference MS SQL Server-based tables. All those queries will result in parallel JDBC connections from a number of Presto Workers to the MS SQL Server instance. The pushed-down SQL statements are executed inside MS SQL Server and the resulting rows are fetched by the Presto Workers all at the same time.

Use cases

A common use case is to join the most recent transactional data from MS SQL Server with the historical data stored in HDFS or S3. Data in MS SQL Server is hot and actively used operationally while the cold historical data is for analytics only. Presto makes it possible to UNION ALL the rows from both systems, enabling comprehensive analysis.

Another interesting usage pattern is to keep quickly-changing dimension tables in MS SQL Server to allow frequent INSERT, UPDATE, and DELETE operations. Presto allows joining these small MS SQL Server tables with large append-only fact tables stored in Hadoop or in object stores such as S3.

MySQL

Brief description

MySQL is a very popular open source OLTP DBMS widely used in a variety of applications for over two decades.

The MySQL connector allows Presto to query MySQL databases. It can be used to join data between different systems such as MySQL, Cassandra, and HDFS/S3 or between several MySQL instances. In addition to SELECT queries, DML statements such as INSERT INTO and CREATE TABLE AS are also supported by the MySQL connector. Finally, users can display schemas, tables, views, and columns from the MySQL database registered in Presto.

Documentation link: https://docs.starburstdata.com/latest/connector/mysql.html

Query Pushdown

Presto takes your SQL query and pushes some of the query processing related to tables stored in MySQL down to the underlying DBMS via JDBC.

In particular, column projections and simple filters will be pushed down so that the amount of data transferred from MySQL base tables or views is limited only to those columns and rows needed to compute the Presto query.

For example, for the following Presto query:

SELECT product_id, avg(price) AS avg_price
FROM mysql.merchant.sales
WHERE country_id = 1
GROUP BY product_id
ORDER BY avg_price DESC
LIMIT 10;

a MySQL instance will receive:

SELECT product_id, price
FROM merchant.sales
WHERE country_id = 1;

while the rest of the processing (aggregation, sorting, and limit) will be done in Presto after fetching data from MySQL.

Today Presto is not capable of pushing down aggregations and joins into MySQL. In many cases a simple workaround for this limitation is the creation of views inside MySQL that will be referenced by Presto queries. Such views should contain aggregations and/or joins of MySQL tables. The views are processed inside MySQL (along with any column projections and filters pushed down by Presto) and the resulting intermediate data is streamed back to Presto for final processing.

Parallelism

The MySQL connector is JDBC-based. For each MySQL table referenced in a given Presto query there will be one SQL statement pushed down via a JDBC connection. The resulting rows from MySQL will come as a single stream via that JDBC connection.

Typically, one Presto Worker will open such a JDBC connection for each MySQL table referenced by a Presto query; as a result, fetching data from a single table is not parallelized. However, when multiple MySQL tables are referenced, fetching them is done in parallel across many Presto Workers.

Concurrency

Multiple concurrent Presto queries may reference MySQL-based tables. All those queries will result in parallel JDBC connections from a number of Presto Workers to the MySQL server. The pushed-down SQL statements are executed inside MySQL and the resulting rows are fetched by the Presto Workers all at the same time.

Use cases

A common use case is to join the most recent transactional data from MySQL with the historical data stored in HDFS or S3. Data in MySQL is hot and actively used operationally while the cold historical data is for analytics only. Presto makes it possible to UNION ALL the rows from both systems, enabling comprehensive analysis.
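A sketch of such a query (catalog, schema, and table names are illustrative):

SELECT order_id, amount
FROM mysql.shop.orders            -- hot, recent rows
UNION ALL
SELECT order_id, amount
FROM hive.archive.orders_history; -- cold, historical rows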

Another interesting usage pattern is to keep quickly-changing dimension tables in MySQL to allow frequent INSERT, UPDATE, and DELETE operations. Presto allows joining these small MySQL tables with large append-only fact tables stored in Hadoop or in object stores such as S3.

PostgreSQL

Brief description

PostgreSQL (also known as Postgres) is a very popular open source OLTP DBMS widely used in a variety of applications for decades.

The PostgreSQL connector allows Presto to query PostgreSQL databases. It can be used to join data between different systems such as PostgreSQL, Cassandra, and HDFS/S3 or between several PostgreSQL instances. In addition to SELECT queries, DML statements such as INSERT INTO and CREATE TABLE AS are also supported by the PostgreSQL connector. Finally, users can display schemas, tables, views, and columns from the PostgreSQL database registered in Presto.

Documentation link: https://docs.starburstdata.com/latest/connector/postgresql.html

Query Pushdown

Presto takes your SQL query and pushes some of the query processing related to tables stored in PostgreSQL down to the underlying DBMS via JDBC.

In particular, column projections and simple filters will be pushed down so that the amount of data transferred from PostgreSQL base tables or views is limited only to those columns and rows needed to compute the Presto query.

For example, for the following Presto query:

SELECT product_id, avg(price) AS avg_price
FROM postgresql.merchant.sales
WHERE country_id = 1
GROUP BY product_id
ORDER BY avg_price DESC
LIMIT 10;

a PostgreSQL instance will receive:

SELECT product_id, price
FROM merchant.sales
WHERE country_id = 1;

while the rest of the processing (aggregation, sorting, and limit) will be done in Presto after fetching data from PostgreSQL.

Today Presto is not capable of pushing down aggregations and joins into PostgreSQL. In many cases a simple workaround for this limitation is the creation of views inside PostgreSQL that will be referenced by Presto queries. Such views should contain aggregations and/or joins of PostgreSQL tables. The views are processed inside PostgreSQL (along with any column projections and filters pushed down by Presto) and the resulting intermediate data is streamed back to Presto for final processing.

Parallelism

The PostgreSQL connector is JDBC-based. For each PostgreSQL table referenced in a given Presto query there will be one SQL statement pushed down via a JDBC connection. The resulting rows from PostgreSQL will come as a single stream via that JDBC connection.

Typically, one Presto Worker will open such a JDBC connection for each PostgreSQL table referenced by a Presto query; as a result, fetching data from a single table is not parallelized. However, when multiple PostgreSQL tables are referenced, fetching them is done in parallel across many Presto Workers.

Concurrency

Multiple concurrent Presto queries may reference PostgreSQL-based tables. All those queries will result in parallel JDBC connections from a number of Presto Workers to the PostgreSQL server. The pushed-down SQL statements are executed inside PostgreSQL and the resulting rows are fetched by the Presto Workers all at the same time.

Use cases

A common use case is to join the most recent transactional data from PostgreSQL with the historical data stored in HDFS or S3. Data in PostgreSQL is hot and actively used operationally while the cold historical data is for analytics only. Presto makes it possible to UNION ALL the rows from both systems, enabling comprehensive analysis.

Another interesting usage pattern is to keep quickly-changing dimension tables in PostgreSQL to allow frequent INSERT, UPDATE, and DELETE operations. Presto allows joining these small PostgreSQL tables with large append-only fact tables stored in Hadoop or in object stores such as S3.
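A sketch of such a join (all names are illustrative), with the small dimension table in PostgreSQL and the large fact table in an S3-backed Hive catalog:

SELECT d.product_name, sum(f.quantity) AS total_quantity
FROM hive.warehouse.sales_facts f
JOIN postgresql.public.products d ON f.product_id = d.product_id
GROUP BY d.product_name;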