
Shaun Bruno

Marketing

Starburst

Data Lakes without Hadoop

Last Updated: January 29, 2024

Data Lake Trino

It seems like migrating to the cloud has dominated the news, and many companies are shuttering their data centers and letting cloud providers handle infrastructure for them. Elasticity, simplicity, and infrastructure agility are all good reasons to move, but many companies continue to host their own infrastructure. Their reasons may be security, or a belief that the cloud doesn’t provide cost benefits in their scenario.

For these companies, building a data lake usually means setting up a Hadoop cluster and choosing a vendor to support it (although this is less of a need than it used to be). Organizations like the idea of a company-wide object store that can hold a variety of data, both structured and unstructured. A variety of companies offer S3-compatible object storage software that can be installed anywhere.

One of the advantages of deploying your own object store is that you get to use your own storage. This could be storage you already own, or a new cluster built from commodity servers that combine into a large storage pool. Since most of these storage engines support Amazon’s S3 protocol, they work seamlessly with Starburst and allow you to query data directly in your data lake.
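As a rough illustration of what pointing a query engine at a self-hosted, S3-compatible store can involve, the sketch below renders a catalog properties file from a Python dictionary. Everything in it is an assumption rather than something taken from this post: the hostnames, ports, and region are placeholders, and the property names follow open source Trino's Hive connector and native S3 file system settings as we understand them. Check them against the documentation for the version you run. Credentials are deliberately left out.

```python
# Hypothetical values throughout: the hostnames, port numbers, and region
# are placeholders, and the property names should be verified against the
# documentation for the engine version you run.
S3_COMPATIBLE_CATALOG = {
    "connector.name": "hive",
    "hive.metastore.uri": "thrift://metastore.example.internal:9083",
    "fs.native-s3.enabled": "true",
    "s3.endpoint": "https://objects.example.internal:9000",
    "s3.region": "us-east-1",
    # Many self-hosted stores expect path-style URLs rather than
    # virtual-hosted bucket names.
    "s3.path-style-access": "true",
}


def render_properties(settings):
    """Render a dict as key=value lines, the layout of a .properties file."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


if __name__ == "__main__":
    print(render_properties(S3_COMPATIBLE_CATALOG), end="")
```

The only parts specific to an on-premises object store are the endpoint and the path-style setting. The rest would look much the same for a cloud-hosted bucket.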

In this blog post, we’ll aim to understand foundational data lake solutions and how Starburst can help.

How the rise of cloud computing disrupted Hadoop’s dominance with object storage

Hadoop made it possible to process large amounts of raw data using distributed systems. At an architectural level, these early systems stored their data in the Hadoop Distributed File System (HDFS), in large on-premises installations.

Over time, the rise of cloud computing disrupted Hadoop’s dominance, replacing it with object storage. Cloud object storage allowed compute and storage to be separated at a scale that was not practical before. At the same time, cloud object storage cost much less.

This began a shift in data lakes from exclusive use of HDFS towards the predominant use of distributed object storage. The shift in turn sparked further developments in adjoining technologies to make better use of object storage. Query engines are a clear example, because object storage only stores data and needs a separate engine to query it. Starburst is designed to use both object storage and HDFS as needed.

Currently, the three largest cloud data lake storage services are Amazon S3 (AWS), Microsoft Azure Blob Storage/Azure Data Lake, and Google Cloud Storage.

The Hadoop framework brought the ability to distribute large computing jobs using parallel processing. With the advent of cloud-based object storage, a technological revolution was under way.

The emergence of Hive

But there was a problem. Hadoop was complex, especially for analytical tasks. Creating MapReduce jobs required an intricate knowledge of Java that many users lacked. This gap gave birth to a new technology, Hive, which let users interact with Hadoop by controlling MapReduce through SQL syntax. This was a game-changing step, as it opened up data lake analytics to a new audience and helped drive its adoption.
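To make the contrast concrete, here is a minimal sketch of the same word count written twice: once as hand-rolled map, shuffle, and reduce steps, and once as a single SQL statement. It is an illustration only. It uses plain Python and an in-memory SQLite database as stand-ins for Hadoop and Hive, and the `words` table and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count_mapreduce(lines):
    return reduce_phase(shuffle(map_phase(lines)))


# The same question as one declarative statement over a hypothetical table.
WORD_COUNT_SQL = "SELECT word, COUNT(*) FROM words GROUP BY word"


def word_count_sql(lines):
    """Load the words into SQLite (a stand-in for Hive) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words VALUES (?)",
        [(word,) for line in lines for word in line.lower().split()],
    )
    rows = conn.execute(WORD_COUNT_SQL).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same mapping of word to count. The difference is how much of the execution plan the author has to spell out by hand.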

Traditionally, most data lakes were built on Hadoop, whose distributed file system (HDFS) can store vast amounts of data. Hadoop is designed to be scalable and fault-tolerant, meaning it can keep working even if some of the system’s servers fail, which made it a natural platform for data lakes.

When you build a data lake on Hadoop, you can use any number of technologies to access the data. You can use SQL-based tools like open source Trino, Hive, or Impala to run queries against the data. Or you can use Hadoop’s MapReduce framework to process and analyze the data.

An alternative to HiveQL

Hive was built on top of HDFS to provide SQL-like query functionality. This approach had many limitations owing to the compilation process needed to turn HiveQL into MapReduce. Starburst presents an alternative approach to HiveQL.

The Starburst query engine conforms to the ANSI SQL standard. It provides a platform-independent, single point of access to data from any data source. Data can be housed in data lakes, data warehouses, or databases. Queries can be federated across multiple sources, providing a best-of-all-worlds approach.

For example, transactional data is often best served in a database, as those systems are designed to act as systems of record. At the same time, structured analytical data may still be processed in a data warehouse. Data lakes excel at semi-structured and unstructured data analytics. With Starburst, all of these systems can work together in a single query engine.
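The idea of one query spanning several systems can be sketched with a toy example. The code below is not Starburst: it uses SQLite's `ATTACH` to place two separate in-memory databases behind one connection, with `main` playing the transactional database and `lake` playing the data lake. The table names and rows are made up for illustration. In a federated engine such as Trino, the equivalent query would refer to each source with a fully qualified `catalog.schema.table` name.

```python
import sqlite3


def clicks_per_customer():
    """Join a 'database' table with a 'data lake' table in one statement."""
    conn = sqlite3.connect(":memory:")
    # A second, independent in-memory database stands in for the data lake.
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
    conn.execute(
        "CREATE TABLE lake.click_events (customer_id INTEGER, page TEXT)"
    )
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO lake.click_events VALUES (?, ?)",
        [(1, "/home"), (1, "/pricing"), (2, "/home")],
    )

    # One SQL statement reads from both stores; the caller never has to
    # copy data from one system into the other first.
    rows = conn.execute(
        """
        SELECT c.name, COUNT(*) AS clicks
        FROM main.customers AS c
        JOIN lake.click_events AS e ON e.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling `clicks_per_customer()` returns one row per customer with their click count, even though the two tables live in different databases.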

Starburst also offers strong performance compared to other technologies. It achieves this with a Massively Parallel Processing (MPP) architecture, which splits work across a large cluster and combines the processing power of its nodes to return results faster.
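The core MPP pattern is to split the input, let each worker aggregate its own slice, and then merge the partial results. The sketch below shows that pattern for a simple average. It is an illustration under stated assumptions, not how Starburst is implemented: threads on one machine stand in for cluster nodes, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one worker: aggregate only its own slice of the data."""
    return sum(chunk), len(chunk)


def parallel_average(values, workers=4):
    """Split the input, aggregate each split concurrently, then merge."""
    if not values:
        raise ValueError("values must not be empty")
    if workers < 1:
        raise ValueError("workers must be at least 1")

    size = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]

    # Threads stand in for cluster nodes here; a real engine would run
    # each partial aggregation on a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))

    # The coordinator merges the partial results into the final answer.
    total = sum(subtotal for subtotal, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Each worker returns a sum and a count rather than its own average, because averages of unequal chunks cannot be merged correctly without knowing how many values each one covers.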

Finally, because each kind of data can be kept in the storage option best suited to it, costs can be lower than with other approaches. Highly structured data can be retained in data warehouses, while unstructured data can be held in a less expensive data lake without sacrificing access. At the same time, the ability to scale compute resources to match different workloads helps save costs in another way.

Don’t just take our word for it. Here’s what Starburst customer Comcast had to say: “When end users are going into on-prem or cloud environments, they will be presented with all the data sets they have access to, irrespective of where the data is located. This offered huge value to our end users.”

A single point of access to all your data


© Starburst Data, Inc. Starburst and Starburst Data are registered trademarks of Starburst Data, Inc. All rights reserved. Presto®, the Presto logo, Delta Lake, and the Delta Lake logo are trademarks of LF Projects, LLC
