A professional social network needs to load data from multiple data sources into the Amazon Redshift data warehouse.
- The ETL must be able to load data into the Redshift from SQL Server and Postgres databases, Salesforce, Marketo, Smartsheet, and some other SaaS applications.
- The ETL must support incremental change replication.
- The ETL must be able to automatically adjust the destination schema if there are new tables in the source or new fields added to the source.
- After the Flow loads data into the Redshift, it must execute the SQL to update some of the tables in the data warehouse.
- The expected volume of data: terabytes, half a billion records every day.
- The expected number of tables: a few hundred. Therefore, there must be a way to update ETL Flows using API.
- If there is an error, the Flow should continue, but it must send a notification with detailed information about the error(s).
- The ETL will be running every few hours.
Configure Redshift and authorize it to work with a specific S3 bucket.
Create a Flow to incrementally load data from the SQL Server and Postgres databases into Redshift
Use the Flow type
Database to Redshift.
Read how to create a Flow to extract, transform, and load data in Amazon Redshift.
Add each pair of source-target tables as a separate source-to-destination transformation.
Alternatively, you can load tables by a fully qualified wildcard source table name, for example,
prod.dbo.inventory*. Read about loading multiple tables by a wildcard table name.
To load data faster, enable
Parallel processing. If you are planning to load data by a wildcard table name, also enable the
Use Parallel Threads when processing sources by a wildcard.
Configure incremental change replication
The customer wanted to propagate only modified data (new and updated records) and
MERGE (insert new, update existing records) it with the existing data in the data warehouse.
Read how to MERGE data in the source with existing data in the Redshift.
Read how to configure a change replication so only the modified data will be loaded into the Redshift.
Configure Flow to handle schema changes automatically
The customer needed the Flow to handle schema changes automatically, such as:
- the new table is added to the source schema
- the new column is added to the table in the source
- the new column is added to the table in the Redshift
Create a Flow to load data from Salesforce and other SaaS application
The customer needed to load data from Salesforce, Marketo, and Smartsheet. In the Etlworks Integrator, the connectors for all these SaaS sources are actually regular database connectors, so the
Database to Redshift Flow type can be used to configure the source-to-destination transformations. Read more about connectors for SaaS data sources:
- Amazon Marketplace
- Premium connectors (more than 150 business applications)
Configure Flow to ignore errors and send notifications
Based on the requirements, the Flow must continue running even if the error occurred when loading a specific source table in the Redshift. It also must send a notification about the error(s).
Configure the ETL process that updates tables in the data warehouse
Based on the requirements, when the Flow is done loading data in Redshift, it must execute the SQL statements(s) to update some of the tables in the data warehouse.
Create a Flow that combines together all the Flows above
Create a nested Flow and add all Flows created in the previous steps to the pipeline.
In some cases, it makes sense to configure nested Flow to be executed in parallel.
The typical Etlworks Integrator environment is memory constraint. If you will be loading hundreds of millions of rows of data from hundreds of sources in parallel, it makes sense to limit the total number of threads to avoid the out for memory error.
Use API to update the Flow
One of the requirements was the ability to update the Flow (add/remove pairs of the source to destination transformations) using API.
Read about Flow management API in the Etlworks Integrator.