ETL stands for Extract, Transform, and Load. ETL tools extract data from various sources and transform it into an intermediate format that suits the target system or the data model requirements. Finally, they load the data into a target database, data warehouse, or even a data lake.
I remember times, 15 to 20 years back, when the term ETL was something only a few people really understood, and when various custom batch jobs were at their peak on on-premise hardware.
Many projects did some form of ETL, even if they did not know that was what they should call it. During that time, whenever I explained a design that involved ETL processes and described them in those terms, it sounded almost like technology from another world, something very rare.
But today, things are different. Migration to the cloud is a top priority, and ETL tools are a very strategic piece of the architecture of most projects.
In the end, migrating to the cloud means taking data from on-premise sources and transforming and loading it into cloud databases in a form that is as compatible as possible with the cloud architecture. That is exactly the job of an ETL tool.
History of ETL and How It Connects to the Present

The main functions of ETL have always been the same.
Extraction
ETL tools extract data from various sources (be it databases, flat files, web services, or, lately, cloud-based applications).
In the past, it usually meant taking files on a Unix file system as input and running them through preprocessing, processing, and postprocessing steps.
You could see the reusable pattern of folder names like:
- Input
- Output
- Error
- Archive
Under those folders, there was usually another subfolder structure, mainly based on dates.
This was just the standard way to process incoming data and prepare it for load into some kind of database.
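To make the old pattern a bit more tangible, here is a minimal sketch (folder names and the processing callback are illustrative, not taken from any real project) of how such a batch step typically shuffled files between those folders:

```python
import shutil
from datetime import date
from pathlib import Path

BASE = Path("/data/feed")  # hypothetical base directory of the batch job
TODAY = date.today().strftime("%Y%m%d")


def process_incoming(process_file):
    """Classic batch step: process everything in Input, archive successes, park failures."""
    for sub in ("Output", "Error", "Archive"):
        (BASE / sub / TODAY).mkdir(parents=True, exist_ok=True)

    for source_file in (BASE / "Input" / TODAY).glob("*"):
        try:
            process_file(source_file)  # pre-, main, and post-processing would happen here
            shutil.copy(source_file, BASE / "Output" / TODAY / source_file.name)
            shutil.move(str(source_file), str(BASE / "Archive" / TODAY / source_file.name))
        except Exception:
            shutil.move(str(source_file), str(BASE / "Error" / TODAY / source_file.name))
```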
Today, there are no Unix file systems (at least not in the same way as before), and maybe even no files. Now there are APIs – application programming interfaces. You can still have a file as the input format, but you don’t need to.
It can all be stored in cache memory. It can still be a file. Whatever it is, it must follow some structured format. In most cases, this means JSON or XML; in some cases, the good old comma-separated values (CSV) format will do as well.
You define the input format. Whether the process will also keep a history of input files is solely up to you; it is no longer a standard step.
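As a rough sketch of what such API-based extraction can look like in practice (the endpoint, field names, and the decision to keep a file history are all assumptions made for illustration), using Python and the common requests library:

```python
import json

import requests  # third-party HTTP library, assumed to be available

# Hypothetical source endpoint; a real project would take this from configuration.
SOURCE_URL = "https://api.example.com/v1/orders"


def extract_orders(api_token: str) -> list:
    """Pull raw records from the source API; the payload arrives as JSON, not as a file."""
    response = requests.get(
        SOURCE_URL,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def persist_raw(records: list, path: str = "orders_raw.json") -> None:
    """Optional step: keep a file history of the input only if your design calls for it."""
    with open(path, "w") as f:
        json.dump(records, f)
```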
Transformation
ETL tools transform the extracted data into a suitable format for analysis. This includes data cleaning, data validation, data enrichment, and data aggregation.
In the past, the data typically went through complex custom logic in Pro*C or PL/SQL procedures: staging the data, transforming it, and storing it in the target schema. It was as much a mandatory standard process as separating the incoming files into subfolders based on the stage they had reached.
Why did it feel so natural when it was also fundamentally wrong? By directly transforming the incoming data without storing it permanently, you were losing the biggest advantage of raw data – immutability. Projects simply threw it away with no chance of reconstruction.
Well, guess what. Today, the less transformation you apply to raw data, the better, at least for the first storage of the data in the system. Sure, the next step might be a serious data change and data model transformation. But you want the raw data stored in as unchanged and atomic a structure as possible. A big shift from the on-premise times, if you ask me.
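To illustrate the raw-first idea, here is a hedged sketch (the bucket name, key layout, and the transformation itself are made up for the example) that lands the unchanged payload in Amazon S3 before any reshaping happens:

```python
import json
from datetime import datetime, timezone

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")
RAW_BUCKET = "my-raw-zone"  # hypothetical landing bucket


def land_raw(records: list, source_name: str) -> str:
    """Store the payload exactly as received, so the raw data stays immutable."""
    key = f"{source_name}/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(records))
    return key


def transform(records: list) -> list:
    """Cleaning and enrichment happen only after the raw copy is safely stored."""
    return [
        {
            "order_id": r["id"],
            "amount": float(r["amount"]),
            "country": r.get("country", "N/A"),
        }
        for r in records
    ]
```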
Load
ETL tools load the transformed data into a target database or data warehouse. This includes creating tables, defining relationships, and loading data into the appropriate fields.
The loading step is probably the only one that has followed the same pattern for ages. The only difference is the target database. Whereas previously it was Oracle most of the time, now it can be whatever is available in the AWS cloud.
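As a simple sketch of that load step (the cluster, table, bucket, role, and credentials are placeholders), creating a target table and loading it from S3 into Amazon Redshift could look roughly like this:

```python
import psycopg2  # PostgreSQL driver, which also works against Redshift

# Every identifier, path, and credential below is a placeholder.
CREATE_SQL = """
    CREATE TABLE IF NOT EXISTS analytics.orders (
        order_id INT,
        amount   DECIMAL(12, 2),
        country  VARCHAR(2)
    );
"""

COPY_SQL = """
    COPY analytics.orders
    FROM 's3://my-raw-zone/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto';
"""

with psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="dwh",
    user="etl_user",
    password="change-me",
) as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_SQL)  # create the target table
        cur.execute(COPY_SQL)    # load the data into it
    conn.commit()
```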
ETL in Today’s Cloud Environment
If you plan to bring your data from on-premise into the (AWS) cloud, you need an ETL tool; it does not work without one. That is why this part of the cloud architecture has become probably the most important piece of the puzzle. If this step is done wrong, everything that follows will carry the same smell everywhere.
And while there are many competitors, I’d focus now on the three I have the most personal experience with:
- Database Migration Service (DMS) – a native service from AWS.
- Informatica ETL – probably the main commercial player in the ETL world, successfully transforming its business from on-premise to cloud.
- Matillion for AWS – a relatively new player inside cloud environments. Not native to AWS, but native to the cloud, and with nothing like a history comparable to Informatica’s.
AWS DMS as ETL

AWS Database Migration Service (DMS) is a fully managed service that enables you to migrate data from different sources to AWS. It supports multiple migration scenarios:
- Homogeneous migrations (e.g., Oracle to Amazon RDS for Oracle).
- Heterogeneous migrations (e.g., Oracle to Amazon Aurora).
DMS can migrate data from various sources, including databases, data warehouses, and SaaS applications, to various targets, including Amazon S3, Amazon Redshift, and Amazon RDS.
AWS treats the DMS service as the ultimate tool for bringing data from any database source into cloud-native targets. While the main goal of DMS is simply copying data to the cloud, it also does a good job of transforming the data along the way.
You can define DMS tasks in JSON format to automate various transformation jobs for you while copying the data from the source to the target:
- Merge several source tables or columns into a single value.
- Split source value into multiple target fields.
- Replace source data with another target value.
- Remove any unnecessary data or create completely new data based on the input context.
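For illustration, here is a hedged sketch of such a task definition (schema, table, column, and ARN values are placeholders): a table-mapping document that renames a column during the copy, passed to DMS through boto3:

```python
import json

import boto3  # AWS SDK for Python

# All names and ARNs below are placeholders for illustration only.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-orders",
            "object-locator": {"schema-name": "SALES", "table-name": "ORDERS"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "rename-customer-column",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "SALES",
                "table-name": "ORDERS",
                "column-name": "CUST_NO",
            },
            "rule-action": "rename",
            "value": "customer_id",
        },
    ]
}

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="orders-full-load",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:SRC",
    TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:TGT",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)
```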
That means – yes, you can definitely use DMS as an ETL tool for your project. Maybe it won’t be as sophisticated as the other options below, but it will do the job if you define the goal clearly upfront.
Suitability Factor
Although DMS provides some ETL capabilities, it is primarily built for data migration scenarios. Still, there are cases where it may be better to use DMS instead of ETL tools like Informatica or Matillion:
- DMS can handle homogeneous migrations where the source and target databases are the same. This can be a benefit if the goal is to migrate data between databases of the same type, such as Oracle to Oracle or MySQL to MySQL.
- DMS provides some basic data transformation and customization capabilities, but it is not particularly mature in that regard. That can still be enough if your data transformation needs are limited.
- Data quality and governance capabilities are, in general, quite limited with DMS. But those are areas that can be improved in later phases of the project with other tools better suited for that purpose. If you need the ETL part to be done as simply as possible, DMS is a perfect choice.
- DMS can be a more cost-effective option for organizations with limited budgets. DMS has a simpler pricing model than ETL tools like Informatica or Matillion, which can make it easier for organizations to predict and manage their costs.
Matillion ETL

Matillion is a cloud-native solution, and you can use it to integrate data from various sources, including databases, SaaS applications, and file systems. It offers a visual interface for building ETL pipelines and supports various AWS services, including Amazon S3, Amazon Redshift, and Amazon RDS.
Matillion is easy to use and can be a good choice for organizations new to ETL tools or with less complex data integration needs.
On the other hand, Matillion is kind of a tabula rasa. It has some predefined potential functionalities, but you must custom-code it to bring it to life. You can’t expect Matillion to do the job for you out of the box, even if the capability is there by definition.
Matillion also often describes itself as an ELT rather than an ETL tool. That means it is more natural for Matillion to do the load before the transformation.
Suitability Factor
In other words, Matillion is more effective at transforming data once it is already stored in the target database than before loading. The main reason is the custom scripting obligation already mentioned: since all the special functionality must be coded first, the overall effectiveness then heavily depends on the quality of that custom code.
It is only natural to expect that such transformations will be handled better in the target database system, leaving Matillion with only a simple 1:1 loading task. There are far fewer opportunities to break things with custom code that way.
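In code terms, the ELT split boils down to something like this hedged sketch (schema, table, and connection details are made up; inside Matillion itself this would typically be a SQL component at the end of a job): the load stays a plain 1:1 copy into staging, and the reshaping is a SQL statement executed inside the warehouse:

```python
import psycopg2  # PostgreSQL driver, also usable against Redshift; connection details are placeholders

TRANSFORM_SQL = """
    INSERT INTO analytics.orders_clean (order_id, amount_eur, country)
    SELECT r.order_id,
           r.amount * f.fx_rate,
           UPPER(r.country)
    FROM   staging.orders_raw r
    JOIN   staging.fx_rates  f ON f.currency = r.currency;
"""

with psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="dwh",
    user="etl_user",
    password="change-me",
) as conn:
    with conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL)  # the transformation runs inside the target warehouse
    conn.commit()
```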
While Matillion provides a range of features for data integration, it may not offer the same level of data quality and governance features as some other ETL tools.
Matillion can scale up or down based on the needs of the organization, but it may not be as effective for handling very large volumes of data, and its parallel processing is quite limited. In this regard, Informatica is surely a better choice, being both more advanced and more feature-rich.
However, for many organizations, Matillion for AWS may provide sufficient scalability and parallel processing capabilities to meet their needs.
Informatica ETL

Informatica for AWS is a cloud-based ETL tool designed to help integrate and manage data across various sources and targets in AWS. It is a fully managed service that provides a range of features and capabilities for data integration, including data profiling, data quality, and data governance.
Some of the main characteristics of Informatica for AWS include:
- Informatica is designed to scale up or down based on the actual needs. It can handle large volumes of data and can be used to integrate data from various sources, including databases, data warehouses, and SaaS applications.
- Informatica provides a range of security features, including encryption, access controls, and audit trails. It complies with various industry standards, including HIPAA, PCI DSS, and SOC 2.
- Informatica provides a visual interface for building ETL pipelines, which makes it easy for users to create and manage data integration workflows. It also provides a range of pre-built connectors and templates that can be used to connect the systems and enable the integration process.
- Informatica integrates with various AWS services, including Amazon S3, Amazon Redshift, and Amazon RDS. This makes it easy to integrate data across various AWS services.
Suitability Factor
Clearly, Informatica is the most feature-rich ETL tool on the list. However, it can be more expensive and complex to use than some of the other ETL tools available in AWS.
Informatica can be expensive, especially for small and medium-sized organizations. The pricing model is based on usage, meaning organizations may need to pay more as their usage increases.
It can also be complex to set up and configure, especially for those new to ETL tools. This can require a significant investment in time and resources.
That also leads us to something we can call a “complex learning curve”. This can be a disadvantage for those who need to integrate data quickly or have limited resources to devote to training and onboarding.
Also, Informatica may not be as effective for integrating data from non-AWS sources. In this regard, DMS or Matillion could be a better option.
Lastly, Informatica is very much a closed system. There is only a limited ability to customize it to the project’s specific needs; you just have to live with the setup it provides out of the box. That somewhat limits the flexibility of the solution.
Final Words
As in many other cases, there is no one-size-fits-all solution, not even for something like an ETL tool in AWS.
You might choose the most complex, feature-rich, and expensive solution with Informatica. But it makes the most sense if:
- The project is rather large, and you are sure the whole future solution and its data sources will connect to Informatica as well.
- You can afford to bring a team of skilled Informatica developers and configurators.
- You appreciate having a robust support team behind you and are fine with paying for it.
If any of the above is off, you might give Matillion a shot:
- If the project’s needs are not so complex in general.
- If you need to include some very custom steps in the processing and flexibility is a key requirement.
- If you don’t mind building most of the features from scratch with the team.
For anything even less complicated, the obvious choice is AWS DMS as a native service, which can probably serve your purpose well.
Next, check out data transformation tools to manage your data better.