4 mistakes to avoid when building lakehouse based solutions on AWS

Author: Igor Royzis

Mistake #1

Dumping all your data in S3 without a well-designed data lake partitioning scheme that supports your organization’s needs

We’ve all heard the saying “store all data, all the time, in your data lake,” or something to that effect. It’s true: this approach can give organizations immediate access to all of their data at any time. And partitioning may not matter much when dealing with small amounts of data. But as data grows in size and complexity, having the right partitioning is a must. Data partitioning is the term for creating a “folder-like” structure in Amazon S3, and it is key to faster and cheaper data access. Data lake access services on AWS are partition-aware: if a dataset is partitioned, a service such as Amazon Athena will go directly to the relevant part of your data lake instead of scanning the entire dataset.

In some cases it’s necessary to duplicate data in your data lake to provide multiple partition structures. For example, you may decide to partition orders by year, month, and day for analyzing orders over time. When you need the number of orders per month in 2021, you’d run the following Athena query, which goes straight to the S3 location containing the partition year=2021 and ignores all the other years.

SELECT year, month, count(1) FROM orders WHERE year = 2021 GROUP BY year, month
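To make the layout concrete, here is a minimal Python sketch of how Hive-style year=/month=/day= prefixes are typically constructed (the bucket and file names are hypothetical). A partition-aware engine prunes its scan to the prefixes that match the WHERE clause:

```python
from datetime import date

def orders_partition_key(bucket: str, order_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=) for an order file."""
    return (
        f"s3://{bucket}/orders/"
        f"year={order_date.year}/month={order_date.month:02d}/day={order_date.day:02d}/"
        f"{filename}"
    )

# A query filtering on year = 2021 only scans objects under the year=2021 prefix.
key = orders_partition_key("my-data-lake", date(2021, 3, 15), "orders-0001.parquet")
print(key)
```

Zero-padding the month and day keeps prefixes sorting correctly as strings, which is how S3 lists them.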

But you may also want fast access to a specific order, which calls for another copy of the data partitioned by order id. Accessing an order this way is lightning fast, because there is no data scanning: Athena goes directly to the partition order_id=10001.

SELECT * FROM orders WHERE order_id = 10001

Note that the above are SQL queries running directly over data stored in the S3 data lake, without the use of a relational database. See mistake #4 below for why this is important.

Mistake #2

Not having a concrete strategy for change data capture for each dataset

Orders are created, changed, and fulfilled, thousands or even millions per day; products are added to the catalog or discontinued occasionally; employees join companies, get promoted, and leave. Activities happen at different times, frequencies, and volumes. As such, each dataset requires an individual approach to capturing and organizing changes (additions, updates, and deletions) in order to maintain an up-to-date view of the organization’s data while still being able to see what happened and when.

There are various CDC (change data capture) approaches, but most organizations can start with just a few.

Cumulative CDC – capture and replace the entire dataset at regular intervals. This is usually done via a nightly batch process. This approach is appropriate for data like product catalogs, benefits information, etc. – in other words, smaller datasets that change infrequently.

Incremental CDC on write – capture and apply data changes to maintain an up-to-date snapshot of the data. This approach may be used with smaller datasets that don’t require realtime or near-realtime access, or with datasets that only receive new data (e.g. financial market transactions or any other time-series data).

Incremental CDC on read – append all changes to the dataset without modifying existing data. This approach should be used for fast-changing data with realtime or near-realtime access needs. Persisting these changes must be lightning fast and requires appropriate partitioning to allow for faster access. It’s the data consumer’s responsibility to apply the logic that resolves the latest version of each record.
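As a rough illustration of the “on read” approach, the consumer-side logic that collapses an append-only change log into the latest snapshot might look like the sketch below. The field names (op, ts, order_id) are assumptions for the example, not a fixed convention:

```python
from typing import Iterable

def latest_snapshot(change_log: Iterable[dict]) -> dict:
    """Reduce an append-only change log to the latest state per order_id.

    Each record carries an op ('insert', 'update', or 'delete') and a
    monotonically increasing change timestamp 'ts'. Later changes win.
    """
    latest: dict = {}
    for rec in sorted(change_log, key=lambda r: r["ts"]):
        if rec["op"] == "delete":
            latest.pop(rec["order_id"], None)  # deletion removes the record
        else:
            latest[rec["order_id"]] = rec      # insert/update overwrites
    return latest

log = [
    {"order_id": 1, "op": "insert", "ts": 1, "status": "created"},
    {"order_id": 1, "op": "update", "ts": 2, "status": "fulfilled"},
    {"order_id": 2, "op": "insert", "ts": 3, "status": "created"},
    {"order_id": 2, "op": "delete", "ts": 4},
]
snapshot = latest_snapshot(log)
```

In a real lakehouse this dedup-on-read is usually expressed in SQL (e.g. a window function picking the latest ts per key) or handled by a table format, but the logic is the same.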

Mistake #3

Deciding to use only one AWS service for all transformation and analysis needs (because it’s the path of least resistance), instead of choosing the right service for the type, volume and velocity of data

AWS DMS, AWS Glue, AWS Lambda, Amazon EMR, Amazon Kinesis. These are the most popular AWS services for working with data. When organizations first decide to build data lake based platforms on AWS, they often run a POC with one of these services. For example, if the use case requires daily data cleansing and integration of many large datasets with complex logic, an AWS Glue Spark job becomes the clear choice. The POC works well and the team starts building the platform. Then a new requirement comes in: process a similar type of data with similar logic, but for a much smaller dataset every 10 minutes. The team decides to keep using Glue Spark jobs.

This second decision was one of convenience, and it will cost the team much more money. The proper choice would’ve been a Glue Python shell job or even an AWS Lambda function, either of which would easily handle the smaller dataset at a much lower cost.

Sometimes the choice is not clear without analyzing the use case. For example, the same result can be achieved with an EMR Spark job or a Glue Spark job (or Glue streaming), on either an infrequent batch or a realtime streaming basis. Cost, complexity, frequency, and integration with other tools all have to be weighed to make the right decision.

Standardizing on a specific AWS service should be based on the type of use case, not on the path of least resistance. Being able to implement something quicker should not be the only factor; there are ways to standardize and still achieve reusability and speed of development. For example, deciding to use only Python for all data and analytics services, and building full infrastructure and CI/CD automation for whichever AWS services are part of the architecture, pays off. Writing clean, modular code also enables reuse across services: code written in Python with data manipulation in Pandas can be reused by Glue, Lambda, EMR, Kinesis, ECS, and EKS. The service that provides the runtime for your code is just the vehicle that takes you from point A to point B. You wouldn’t use an 18-wheeler to drive two people around town; you’d use a passenger car.
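One way this reuse can look in practice, as a sketch: the transform logic lives in a plain Python function with no service coupling, and each runtime wraps it in a thin entry point (a Lambda handler is shown; a Glue or EMR job would import the same function). The function and field names are illustrative:

```python
def cleanse_orders(rows):
    """Service-agnostic transform: drop rows missing an order_id and
    normalize the status field to lowercase. Pure Python, no runtime coupling."""
    return [
        {**row, "status": row.get("status", "").lower()}
        for row in rows
        if row.get("order_id") is not None
    ]

# Lambda entry point: a thin wrapper around the shared transform.
def lambda_handler(event, context):
    return cleanse_orders(event.get("records", []))

# A Glue Python shell or EMR job would import and call cleanse_orders()
# the same way, e.g. on rows read from S3, keeping the logic in one place.

sample = [{"order_id": 1, "status": "NEW"}, {"order_id": None, "status": "SHIPPED"}]
cleaned = cleanse_orders(sample)
```

The runtime-specific code stays at the edges; only the wrappers change when the vehicle does.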

Mistake #4

Storing too much data in a database or data warehouse, instead of doing most of data exploration in the data lake  

Cloud-native architecture newbies tend to apply their on-premises experience way too liberally. Specifically, they put as much data as possible into a relational database or data warehouse, because that lets them use views, indexes, and SQL queries to explore data with ease. It’s true that relational databases traditionally make data exploration easy, but at a high cost: you spend time and resources both on loading the data and on keeping the database up and running.

A better, faster to implement, and more cost-effective approach is to use your S3 data lake for data exploration with tools that provide SQL access to S3. Amazon Athena, Spark, Presto, Hive, and S3 Select (via the AWS SDK or CLI) all let you use SQL as if your data were stored in a relational database.

The most popular service for data exploration in S3 is Amazon Athena. It’s a serverless, interactive query service that uses the Glue Data Catalog and supports views and standard SQL queries. You can run Athena queries from the built-in console, from your favorite SQL client, or directly from your code using the AWS SDK.
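For example, here is a minimal sketch of submitting the earlier orders query through the AWS SDK for Python (boto3). The database name and results location are placeholders, and actually running the query requires AWS credentials and an existing Glue catalog table:

```python
QUERY = "SELECT year, month, count(1) FROM orders WHERE year = 2021 GROUP BY year, month"

def run_athena_query(query: str, database: str, output_s3: str) -> str:
    """Submit a query to Athena via the AWS SDK and return its execution id.

    database and output_s3 are placeholders; real use needs AWS credentials.
    """
    import boto3  # imported lazily so the module loads even without boto3 installed

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]

# Example call (not executed here):
# execution_id = run_athena_query(QUERY, "lakehouse_db", "s3://my-athena-results/")
```

Athena queries are asynchronous: you would poll get_query_execution with the returned id, then fetch rows with get_query_results once the query succeeds.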