Presto Memory Connector

Originally posted

There is a highly efficient connector for Presto! It works by storing all data in memory on Presto Worker nodes, which allow for extremely fast access times with high throughput while keeping CPU overhead at bare minimum.

The Presto Memory connector works like manually controlled cache for existing tables. It does not backup its data in any permanent storage and users have to manually recreate tables on their own after every Presto restart. Those are serious limitations, but hey… it is something to start from, right?

To use it, you first need to configure it on your cluster and then set the memory.max-data-per-node property, which limits how much data users will be allowed to save in Presto Memory per one node. After those steps the memory connector is ready and you can use it as any other connector. 
You can create a table:

CREATE TABLE memory.default.nation AS SELECT * from tpch.tiny.nation;

Insert data into an existing table:

INSERT INTO memory.default.nation SELECT * FROM tpch.tiny.nation;

Select from the Memory connector:

SELECT * FROM memory.default.nation;

Drop a table:

DROP TABLE memory.default.nation;

As mentioned earlier, any reads from and writes to memory tables are extremely fast. All of the data is stored uncompressed in Presto’s native query engine data structures. That means there is no disk/network IO overhead for accessing the data and CPU overhead is pretty much non-existing.

The Memory connector has some limitations of which you can read in the documentation. It was developed primarily for microbenchmarking Presto’s query engine, but since then it was improved to the point that it now can be used for something more. If you have some tables with hot data that do not change very often or you need to query a slow external table multiple times (say, a remote MySQL database) maybe you could give Presto Memory a try?

There are also some ideas to expand this connector in a direction of automatic tables caching, so stay tuned for more updates in this topic in the future.

Use Cases

Typical use cases for this connector are frequent joins between two different systems where the smaller data source is unreliable or the performance requirements demand data be cached in Presto. Here are a few example use cases:

Small Dimension Tables - In this use case, a RDBMS such as MySQL is used to store dimensional type data. This is typically a much smaller data set then the data that is being joined such as fact based data. (sales, usage, etc..) Sometimes it’s quicker to cache this data into Presto in order to increase performance. 

Source Availability - If one of the sources that are queried often are not available at certain times, that data could be cached in Presto to increase the availability. 

We want to hear from you! Do you like the idea of a fully featured memory connector in Presto? Our Presto support customers have shown interest in this connector already! 

Please contact us and let us know!