PostgreSQL (known also as Postgres or PSQL) is a very popular open source OLTP DBMS widely used in a variety of applications for decades.
The PostgreSQL connector allows Presto to query PostgreSQL databases. It can be used to join data between different systems such as PostgreSQL, Cassandra and HDFS/S3 or between several PostgreSQL instances. In addition to SELECT queries, DML statements such as INSERT INTO and CREATE TABLE AS are also supported by the PostgreSQL connector. Finally, users can display schemas, tables, views, and columns from the PostgreSQL database registered in Presto.
Documentation link: https://docs.starburstdata.com/latest/connector/postgresql.html
Presto will take your SQL query and will push down some of the query processing related to tables stored in PostgreSQL to the underlying DBMS via JDBC.
In particular, column projections and simple filters will be pushed down so that the amount of data transferred from PostgreSQL base tables or views is limited only to those columns and rows needed to compute the Presto query.
For example, for the following Presto query:
SELECT product_id, avg(price) AS avg_price
WHERE country_id = 1
GROUP BY product_id
ORDER BY avg_price DESC
a PostgreSQL instance will receive:
SELECT product_id, price
WHERE country_id = 1;
while the rest of the processing (aggregation, sorting, and limit) will be done in Presto after fetching data from PostgreSQL.
Today Presto is not capable of pushing down aggregations and joins into PostgreSQL. In many cases a simple workaround for this limitation is the creation of views inside PostgreSQL that will be referenced by Presto queries. Such views should contain aggregations and/or joins of PostgreSQL tables. The views are processed inside PostgreSQL (along with any column/filters pushed down by Presto) and the resulting intermediate data is streamed back to Presto for the final processing.
The PostgreSQL connector is JDBC-based. For each PostgreSQL table referenced in a given Presto query there will be one SQL statement pushed down via a JDBC connection. The resulting rows from PostgreSQL will come as a single stream via that JDBC connection.
Typically, one Presto Worker will open such a JDBC connection for each PostgreSQL table referenced by a Presto query; as a result, fetching data from a single table is not parallelized. However, when multiple PostgreSQL tables are referenced, fetching them is done in parallel across many Presto Workers.
Multiple concurrent Presto queries may reference PostgreSQL-based tables. All those queries will result in parallel JDBC connections from a number of Presto Workers to the PostgreSQL server. The pushed-down SQL statements are executed inside PostgreSQL and the resulting rows are fetched by the Presto Workers all at the same time.
A common use case is to join the most recent transactional data from PostgreSQL with the historical data stored in HDFS or S3. Data in PostgreSQL is hot and actively used operationally while the cold historical data is for analytics only. Presto allows to UNION ALL the rows from both systems to enable comprehensive analysis.
Another interesting usage pattern is to keep quickly-changing dimension tables in PostgreSQL to allow frequent INSERT, UPDATE, and DELETE operations. Presto allows joining these small PostgreSQL tables with large append-only fact tables stored in Hadoop or in object stores such as S3.