Overview
A major retail chain with stores across the US operates 600+ independent Microsoft SQL Server databases that run behind the firewall. The chain needed to consolidate data from all of them into a Snowflake data warehouse. The expected traffic from each location is tens of millions of records per day.
Requirements
- The solution must be able to load recent changes from a single SQL Server database behind the firewall into the Snowflake data warehouse within a 5-minute time frame.
- The solution must support the initial load of the 100+ tables in a single SQL Server database into Snowflake.
- The solution must be able to support 600+ independent loaders.
- The solution must be as fault-tolerant as possible and require minimal maintenance.
Solution
The solution developed by the team uses a hybrid data integration model:
- 600+ data integration agents deployed to Windows machines on the same network as the SQL Server databases, behind the firewall.
- A dedicated cloud instance of Etlworks used to manage the agents.
Each data integration agent works as an independent loader, extracting data from the transactional SQL Server database with a change replication technique based on change tracking (CT) and loading the data directly into the Snowflake data warehouse.
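To make the architecture concrete, here is a minimal sketch of what a single loader's cycle could look like, assuming the pyodbc and snowflake-connector-python packages; the function, table, and connection names are hypothetical and only stand in for what the Etlworks agent does internally.

```python
import csv
import os
import pyodbc
import snowflake.connector

def load_cycle(sqlserver_conn_str: str, sf_params: dict, table: str) -> None:
    # 1. Extract a batch of rows from the transactional source database.
    src = pyodbc.connect(sqlserver_conn_str)
    rows = src.cursor().execute(f"SELECT * FROM {table}").fetchall()
    src.close()

    # 2. Spool the extract to a local CSV file next to the agent.
    path = os.path.abspath(f"{table}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # 3. Stage the file and bulk-load it directly into Snowflake.
    sf = snowflake.connector.connect(**sf_params)
    cur = sf.cursor()
    cur.execute(f"PUT file://{path} @%{table}")  # upload to the table stage
    cur.execute(f"COPY INTO {table} FILE_FORMAT = (TYPE = CSV) PURGE = TRUE")
    sf.close()
```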
The actual data integration Flows are developed and tested in the development instance of Etlworks and sent to the agents from the production instance.
The agents are monitored from the production instance of Etlworks.
Challenges
The same Flow must work across all agents
The same Flow must work across all agents, even though the Connection parameters are different for each database.
This was solved by tokenizing the Connection to the source SQL Server database and deploying global properties with each agent. The properties contain the actual database URL, username, and password for each instance of the agent.
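As an illustration, the pattern might look like the sketch below, which substitutes per-agent properties into a shared connection template; the file layout, section name, and key names are hypothetical.

```python
import configparser

# Shared across all agents: every Flow references one tokenized Connection.
CONNECTION_TEMPLATE = (
    "jdbc:sqlserver://{db_host};databaseName={db_name};"
    "user={db_user};password={db_password}"
)

def resolve_connection(properties_path: str) -> str:
    # Each agent is deployed with its own properties file, e.g.:
    #   [source]
    #   db_host = store-0042.internal
    #   db_name = pos
    #   db_user = etl_reader
    #   db_password = ...
    props = configparser.ConfigParser()
    props.read(properties_path)
    return CONNECTION_TEMPLATE.format(**props["source"])
```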
Initial synchronization
Each remote SQL Server instance hosts a well-populated, multi-gigabyte transactional database with 100+ tables, indexes, constraints, etc. Multiple applications actively update the databases in real time. Extracting and loading data from all 600+ databases at the same time would present a major challenge and would probably not be physically possible, even though loading into Snowflake is extremely fast.
This was solved by a scheduled rollout (10 databases per day) and by creating Flows that are smart enough to run a full extract the first time and to switch automatically to a delta extract once the full extract/load has completed for each table.
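One way to express that per-table switch is sketched below, under the assumption that the agent keeps a small local checkpoint file; the helper names are hypothetical.

```python
import json
import os

STATE_FILE = "sync_state.json"  # per-agent checkpoint: table -> CT version

def full_extract(table: str) -> int:
    ...  # copy every row, then return the current change-tracking version

def delta_extract(table: str, since_version: int) -> int:
    ...  # copy only rows changed after since_version (see the next section)

def sync_table(table: str) -> None:
    state = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}
    if table not in state:
        # First run for this table: full extract/load.
        state[table] = full_extract(table)
    else:
        # Every subsequent run: delta extract only.
        state[table] = delta_extract(table, state[table])
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```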
Fast delta synchronization
The customer needed near real-time replication, so the built-in CDC (change data capture) solution was proposed initially. Unfortunately, at least half of the databases were running older versions of SQL Server that do not support CDC. Fortunately, Etlworks also supports change replication using change tracking (CT). This technique requires a primary key in each source table; not all of the tables had primary keys, so the customer had to modify the schema.
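Change tracking is queried with standard SQL Server syntax: CHANGETABLE(CHANGES ...) returns the primary keys of rows modified since a given version, which is exactly why every replicated table needs a primary key. A minimal sketch, assuming pyodbc and a hypothetical dbo.orders table:

```python
import pyodbc

# CHANGETABLE returns the keys of changed rows plus the operation (I/U/D);
# joining back to the base table picks up the current column values.
DELTA_SQL = """
SELECT ct.order_id, ct.SYS_CHANGE_OPERATION, t.*
FROM CHANGETABLE(CHANGES dbo.orders, ?) AS ct
LEFT JOIN dbo.orders AS t ON t.order_id = ct.order_id
"""

def delta_extract(conn: pyodbc.Connection, last_version: int):
    cur = conn.cursor()
    # Capture the current version first; it becomes the next checkpoint.
    next_version = cur.execute(
        "SELECT CHANGE_TRACKING_CURRENT_VERSION()").fetchone()[0]
    rows = cur.execute(DELTA_SQL, last_version).fetchall()
    return rows, next_version
```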
High-availability
The entire solution requires almost 100% availability. The distributed nature of the hybrid cloud implementation makes it very fault-tolerant: if any agent, or even Etlworks itself, is down, the system remains available as long as the Snowflake database is available. If a local SQL Server database or a local integration agent goes down, a simple restart fixes the problem, and the agent continues from where it left off.
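The resume-after-restart behavior falls out of the checkpoint ordering. A sketch of the invariant, reusing the hypothetical helpers from the earlier sketches:

```python
def delta_extract_for(table: str, since: int):
    ...  # as in the change-tracking sketch above

def load_into_snowflake(table: str, rows) -> None:
    ...  # PUT + COPY INTO, as in the load-cycle sketch above

def save_state(state: dict) -> None:
    ...  # persist the checkpoint file

def sync_once(table: str, state: dict) -> None:
    rows, next_version = delta_extract_for(table, state.get(table, 0))
    load_into_snowflake(table, rows)  # may raise; the checkpoint has not moved yet
    state[table] = next_version       # advance only after a successful load
    save_state(state)                 # a crash replays at most the current batch
```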
Zero-maintenance
The Integration Agent operates as a headless background service. It periodically calls home to report its status and pick up recent updates, but it does not require any maintenance beyond upgrades and an occasional restart.
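For illustration only, a call-home loop might look like the sketch below; the endpoint, payload, and interval are hypothetical and are not the actual Etlworks protocol.

```python
import json
import time
import urllib.request

def apply_updates(updates: dict) -> None:
    ...  # hypothetical: install any Flow changes pushed by the server

def call_home_forever(base_url: str, agent_id: str, interval_sec: int = 60) -> None:
    while True:
        # Report status to the managing Etlworks instance.
        payload = json.dumps({"agent": agent_id, "status": "ok"}).encode()
        req = urllib.request.Request(
            f"{base_url}/agents/{agent_id}/heartbeat",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            updates = json.load(resp)  # the response may carry recent updates
        apply_updates(updates)
        time.sleep(interval_sec)
```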