Starburst Presto Cloud is now available in AWS! Learn More »

Querying data in S3 using Presto and Looker

Eric Whitlow, Technical Business Development

With more and more companies using AWS for their many data processing and storage needs,  it’s never been easier to query this data with Starburst Presto on AWS and Looker, the quickly growing data analytics platform suite.

Looker already provides a Presto connector which allows you to connect to a variety of data sources within AWS. This blog post will describe how easy it is to query and analyze your data in sources such as S3 and MySQL. Using Looker’s easy to use interface, your company can take advantage of a true separation of compute and storage query engine such as Presto.

With the recent release of our new AWS Marketplace offering, we’ll show you how easy it is to deploy a Presto cluster powered by Starburst and immediately begin querying data from S3 using Looker.

Overview of the steps involved:

Step 1 – Deploy a Starburst Presto cluster using our new AWS Marketplace offering

Step 2 – Connect Looker to Presto and start querying and building visualizations

——————————

For this example, we’ve already populated a S3 bucket with Walmart sales data from Kaggle: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data and created our Hive metastore tables that define how the files are organized.

Since this is a simple demonstration, we chose the train.csv and stores.csv files which have store information along with sales.

stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of store.

  • Store – the store number
  • Type – type of store
  • Size – size of the store

train.csv

This is the historical training data, which covers from 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  • Store – the store number
  • Dept – the department number
  • Date – the week
  • Weekly_Sales –  sales for the given department in the given store
  • IsHoliday – whether the week is a special holiday week

We placed these files into two different folders on S3 named sales and stores:

Lastly, Hive tables were created in the metastore to match these files. They are named “sales” and “stores”.

Let’s get started with Step 1: Deploy a Starburst Presto cluster using our new AWS Marketplace offering

Deploying a Presto cluster very easy. First, visit the AWS Marketplace at https://aws.amazon.com/marketplace and type in “Presto, by Starburst” and our easy to use offering will be returned in the search results.

After clicking the subscribe button and agreeing to the terms, we choose the CloudFormation option and the region:

There are 3 CloudFormation screens with the first one already being filled in with the Starburst Presto URL:

The next screen is where we name the Presto cluster, choose the EC2 server size and the number of Presto workers and other options that are specific to your AWS environment. If you have any questions, please feel free to reach out to us directly at aws@starburstdata.com

The 3rd and last screen allow tags, additional security information as well as monitoring features:

Once enter this information, the CloudFormation template gets to work on creating our cluster. This usually only takes a few minutes and we’re ready to go. This shows a small 2 node cluster with a Coordinator and the Hive Metastore we provisioned:

Step 2 – Connect Looker to Presto and start querying and building visualizations

Now, we are ready to connect Looker up to our Presto cluster and start querying data in S3.

To enable access from the Looker application into our Starburst Presto cluster, we must “whitelist” the Lookers IP addresses. These can be found here: https://docs.looker.com/setup-and-management/enabling-secure-db

We add these IP addresses into our AWS Security group which is attached to our coordinator EC2 instance:

Now, we’re ready to connect Looker to our new Presto cluster.

 

Under the Admin/General Settings/Database/Connections section, create a new connection and choose PrestoDB as the dialect. Enter the public IP address of your Presto Coordinator along with 8080 for the port. That is it for the setup:

Now, if you go to Develop/SQL Runner in Looker, your Presto connection is available to query:

Next, you can create a new LookML Model and create some nice graphs and reports based on the Walmart sales data we placed in S3:

From the time of creating the Presto cluster, configuring Looker and querying data on S3 was less than 10 minutes. With the separation of storage and compute becoming popular on public clouds such as AWS, companies are now empowered to offer self-service “data as a service” to their users and customers easier than ever.

Do you need help setting up and configuring Presto or want to learn more about our new AWS Marketplace offering? Contact us at aws@starburstdata.com.

Useful links:

Presto on AWS documentation: https://docs.starburstdata.com/latest/aws.html

Looker documentation for Presto: https://discourse.looker.com/t/connecting-looker-to-a-presto-instance/1870

LookML Documentation: https://docs.looker.com/data-modeling/learning-lookml/what-is-lookml