Geekflare is supported by our audience. We may earn affiliate commissions from buying links on this site.
Share on:

Building Data Warehouse and Data Lake in AWS

aws-data-lake-and-data-warehouse
Invicti Web Application Security Scanner – the only solution that delivers automatic verification of vulnerabilities with Proof-Based Scanning™.

Data warehouse, Data lake, Lakehouse. If none of these words resonate with you at least a little, then your job is clearly not related to data. 👨‍💻

However, that would be quite an unrealistic premise since today, everything is related to data, seems like. Or how the corporate leaders like to describe it:

  • Data-centric and Data-driven business.
  • Data anywhere, anytime, anyhow.

The Most Important Asset

It seems like data has become the most valuable asset of more and more companies. I remember big corporates always generated a lot of data, think terabytes of new data each month. That was still 10-15 years ago. But now, you can easily generate that amount of data within a few days. One would ask if it is really necessary, even if it is some content anybody will use. And yes, it is definitely not 😃.

Not all of the content will be of any use, and some parts not even a single time. Often I witnessed on the front line how companies generated an enormous amount of data only to become useless after a successful load.

data-the-valuable-asset-1

But that is not relevant anymore. Data storage – now being in the cloud – is cheap, the sources of data grow exponentially, and today no one can predict what they will need one year later once new services are onboarded to the system. At that point, even the old data can become valuable.

Therefore, the strategy is to store as much data as possible. But also in as effective form as possible. So that data can be not only saved effectively but also queried, reused, or transformed and distributed further.

Let’s have a look at three native ways how to accomplish this inside AWS:

  • Athena Database – cheap and effective, although simple way to create a data lake in the cloud.
  • Redshift Database – a serious cloud version of a data warehouse that has the potential to replace the majority of the current on-premise solutions, unable to catch up with the exponential growth of data.
  • Databricks – a combination of a data lake and data warehouse into one single solution, with some bonus on top of that all.

Data Lake by AWS Athena

AWS-Athena
Source: aws.amazon.com

The data lake is a place where you can store incoming data in unstructured, semi-structured, or structured form in a fast way. At the same time, you don’t expect this data to be modified once they are stored. Instead, you want them to be as atomic and immutable as possible. Only this will ensure the greatest potential for reuse in later stages. If you would lose this atomic property of the data right after the first load into a data lake, there is no way how to get this lost information back again.

AWS Athena is a database with storage directly on S3 buckets and without server clusters running in the background. That means it is a really cheap data lake service. Structured file formats such as parquet or comma-separated value (CSV) files maintain the data organization. The S3 bucket holds the files, and Athena refers to them whenever processes select the data from the database.

Athena doesn’t support various functionalities otherwise deemed standard, such as update statements. This is why you need to look at Athena as a very simple option. On the other hand, it helps you to prevent modification of your atomic data lake simply because you can’t 😐.

It supports indexing and partitioning, which makes it usable for effective select statements execution and creating logically separate chunks of data (for example, separated by date or key columns). It can also scale horizontally very easily, as this is as complex as adding new buckets to the infrastructure.

YouTube video

Pros and Cons

The benefits to consider:

  • The fact that Athena is cheap (consisting only of S3 buckets and per-usage SQL usage costs) makes for the most significant advantage. If you want to build an affordable data lake in AWS, this is it.
  • As a native service, Athena can easily integrate with other useful AWS services like Amazon QuickSight for data visualization or AWS Glue Data Catalog to create persistent structured metadata.
  • Best for running ad hoc queries over a large amount of structured or unstructured data without maintaining a whole infrastructure around it.

The disadvantages to consider:

  • Athena is not particularly effective in returning complex select queries fast, especially if the queries do not follow the data model assumptions of how you designed to request the data from the data lake.
  • This also makes it less flexible with regard to the potential future changes in the data model.
  • Athena does not support any additional advanced functionalities out of the box, and if you want something specific to be part of the service, you need to implement it on top.
  • If you expect the data lake data usage in some more advanced presentation layer, often the only choice is to combine it with another database service more suitable for that purpose, like AWS Aurora or AWS Dynamo DB.

Purpose and Real-World Use Case

Choose Athena if the target is the creation of a simple data lake without any advanced data warehouse-like functionalities. So, for example, if you don’t expect serious high-performing analytics queries running regularly over the data lake. Instead, having a pool of immutable data with easy data storage extension is the priority.

You no longer need to worry too much about the lack of space. Even the cost of S3 bucket storage can be further decreased by implementing a data life cycle policy. This basically means moving the data across different types of S3 buckets, targeted more towards archival purposes with slower ingestion return times but lower costs.

A great feature of Athena is that it automatically creates a file consisting of data that are part of a result of your SQL query. You can then take this file and use it for any purpose. So it is a good option if you have many lambda services further processing the data in multiple steps. Each lambda outcome will automatically be the outcome in a structured file format as input ready for the subsequent processing.

Athena is a good option in situations when a large amount of raw data is coming to your cloud infrastructure, and you don’t need to process that at the time of loading. That means all you need is fast storage in the cloud in easy to understand structure.

Another use case would be to create a dedicated space for data archival purposes for another service. In such a case, Athena DB would become a cheap backup place for all the data you don’t need right now, but it might change in the future. At this point, you will just ingest the data and send it further.

Data Warehouse by AWS Redshift

AWS-Redshift
Source: aws.amazon.com

A data warehouse is a place where data are stored in a very structured way. Easy to load and extract. The intention is to run a large number of very complex queries, joining many tables via complex joins. Various analytic functions are in place to calculate various statistics over the existing data. The ultimate goal is to extract future predictions and facts to be leveraged in the business going forward, using existing data.

Redshift is a full-fledged data warehouse system. With cluster servers to tune and scale – horizontally and vertically and a database storage system optimized for rapid complex query returns. Although today you can run Redshift in serverless mode as well. There are no files on S3 or anything similar. This is a standard database cluster server with its own storage format.

It has performance monitoring tools in place out of the box, alongside customizable dashboard metrics you can use and watch to fine-tune the performance for your use case. The administration is also accessible via separate dashboards. It takes some effort to understand all the possible features and settings and how they impact the cluster. But still, it is nowhere as complex as the administration of Oracle servers used to be in the case of the on-premise solutions.

Even though there are various AWS limits in Redshift that do set some boundaries of how to use it on day to day basis (for example, hard limits on the amount of concurrent active users or sessions in one database cluster), the fact that operations are executed really fast helps to workaround those limits to some extend.

YouTube video

Pros and Cons

The benefits to consider:

  • Native AWS cloud data warehouse service that is easy to integrate with other services.
  • A central place for storing, monitoring, and ingesting various types of data sources from very different source systems.
  • If you ever wanted to have a serverless data warehouse without the infrastructure to maintain it, now you can.
  • Optimized for high-performance analysis and reporting. Unlike a data lake solution, there is a strong relational data model for storing all incoming data.
  • Redshift database engine originates in PostgreSQL, which ensures high compatibility with other database systems.
  • Very useful COPY and UNLOAD statements for loading and unloading the data from and to S3 buckets.

The disadvantages to consider:

  • Redshift does not support a large amount of concurrent active sessions. The sessions will be put on hold and processed sequentially. While it might not be an issue in most cases, as the operations are really fast, it is a limiting factor in systems with many active users.
  • Even though Redshift supports a lot of functionalities previously known from mature Oracle systems, it still isn’t on the same level. Some of the expected features just might not be there (like DB triggers). Or Redshift does support them in quite limited form (like materialized views).
  • Whenever you need a more advanced custom data processing job, you have to create it from scratch. Most of the time, use Python or Javascript code language. It is not as natural as PL/SQL in the case of the Oracle system, where even the function and procedures use a language very similar to SQL queries.

Purpose and Real-World Use Case

Redshift can be your central store for all the various data sources previously living outside of the cloud. It is a valid replacement for previous Oracle data warehouse solutions. Since it is also a relational database, the migration from Oracle is even quite a simple operation.

If you have any existing data warehouse solutions in many places that aren’t really unified in terms of approach, structure, or a predefined set of common processes to run above the data, Redshift is a great choice.

It will just provide you with an opportunity to merge all the various data warehouse systems from different places and countries under one roof. You can still separate them per country so that the data stays secure and accessible only to those who need it. But at the same time, it will allow you to build a unified warehouse solution covering all corporate data.

Another case might be if the target is to build a data warehouse platform with the extensive support of self-services. You can understand it as a set of processing that individual system users can build. But at the same time, they are never part of the common platform solution. That means such services will remain accessible only to the creator or group of people defined by the created. They won’t affect the rest of the users in any way.

Check our comparion between Datalake and Datawarehouse.

Lakehouse by Databricks on AWS

AWS-Databricks
Source: databricks.com

Lakehouse is a term that is really bound to the Databricks service. Even if it is not a native AWS service, it lives and operates within the AWS ecosystem very nicely and provides various options for how to connect and integrate with other AWS services.

Databricks aim to connect together (previously) very distinct areas:

  • A solution for data lake storage of unstructured, semi-structured, and structured data.
  • A solution for data warehouse structured and rapidly accessible query data (also called Delta Lake).
  • A solution supporting analytics and machine learning computing over the data lake.
  • Data governance for all of the areas above with centralized administration and out-of-the-box tools to support productivity for different types of developers and users.

It’s a common platform that data engineers, SQL developers, and machine learning data scientists can use simultaneously. Each of the groups also has a set of tools that they can use to accomplish their tasks.

So Databricks aim at a jack-of-all-trades solution, trying to combine the benefits of the data lake and data warehouse into a single solution. On top of that, it provides the tools to test and run machine learning models directly over already-built data stores.

YouTube video

Pros and Cons

The benefits to consider:

  • Databricks is a highly scalable data platform. It scales depending on the workload size, and it is doing so even automatically.
  • It is a collaborative environment for data scientists, data engineers, and business analysts. Having the possibility to do all of this in the same space and together is a great benefit. Not only from an organizational perspective, but it also helps save another cost otherwise needed for separate environments.
  • AWS Databricks integrates seamlessly with other AWS services, such as Amazon S3, Amazon Redshift, and Amazon EMR. This allows users to easily transfer data between services and take advantage of the full range of AWS cloud services.

The disadvantages to consider:

  • Databricks can be complex to set up and manage, especially for users who are new to big data processing. It requires a significant level of technical expertise to get the most out of the platform.
  • While Databricks is cost-effective in terms of its pay-as-you-go pricing model, it can still be expensive for large-scale data processing projects. The cost of using the platform can quickly add up, especially if users need to scale up their resources.
  • Databricks provides a range of pre-built tools and templates, but this can also be a limitation for users who need more customization options. The platform may not be suitable for users who require more flexibility and control over their big data processing workflows.

Purpose and Real-World Use Case

AWS Databricks is best suited for large corporations with a very large amount of data. Here it can cover the requirement to load and contextualize various data sources from different external systems.

Often the requirement is to provide data in real time. This means from the time the data appear in the source system, the processes shall pick up immediately and process and store the data into Databricks instantly or with just minimal delay. If the delay is anything above a minute, it is considered near-real-time processing. In any case, both scenarios are often achievable with the Databricks platform. This is mainly due to the extensive amount of adapters and real-time interfaces connecting to various other AWS native services.

Databricks also easily integrates with Informatica ETL systems. Whenever the organization system already uses the Informatica ecosystem extensively, Databricks looks like a good compatible addition to the platform.

Final Words

As the data volume continues to grow exponentially, it is good to know there are solutions that can cope with that effectively. What once was a nightmare to administrate and maintain now requires very little administration work. The team can focus on creating value out of the data.

Depending on your needs, just choose the service that can handle it. While AWS Databricks is something that you will probably need to stick to after the decision is made, the other alternatives are quite more flexible, even if less capable, especially their serverless modes. It is quite easy to migrate to another solution later.

Thanks to our Sponsors
More great readings on Data Management
Power Your Business
Some of the tools and services to help your business grow.
  • Invicti uses the Proof-Based Scanning™ to automatically verify the identified vulnerabilities and generate actionable results within just hours.
    Try Invicti
  • Web scraping, residential proxy, proxy manager, web unlocker, search engine crawler, and all you need to collect web data.
    Try Brightdata
  • Semrush is an all-in-one digital marketing solution with more than 50 tools in SEO, social media, and content marketing.
    Try Semrush
  • Intruder is an online vulnerability scanner that finds cyber security weaknesses in your infrastructure, to avoid costly data breaches.
    Try Intruder