As the digital landscape continues to evolve at a rapid pace, maintaining aging on-premises data infrastructure can become a significant bottleneck for enterprise operations. In this project, we tackled the migration of a massive Data Warehouse (DWH) from an internally hosted IBM DB2 database to Google Cloud. The new architecture relies on Google BigQuery, a serverless, columnar data warehouse, and uses the Python-based data build tool (dbt) for data transformation. The primary motivation was to replace the legacy system due to ever-increasing license costs and insufficient performance, while strictly avoiding any future “vendor lock-in”.
The migration ran from June 2024 to June 2025 and involved a 14-person team working with agile Scrum methodologies. As part of the 5-person core team, my focus was on migrating the Extract-Transform-Load (ETL) logic and the downstream reporting.
The list below highlights the individual milestones of the project, which the following sections discuss in detail:
- Automating the extraction of legacy ETL logic from GUI-based tools
- Translating heavily nested, DB2-specific SQL into modular, database-agnostic dbt models
- Conducting dependency analysis via data mining to ensure a smooth migration flow
- Validating data integrity between the legacy and cloud systems using automated hash comparisons
Gathering the Code: Automating the UI
If you ask data professionals about their favorite tasks, manually clicking through legacy user interfaces to export code is surely not one of them. Our legacy ETL logic was built in IBM Design Studio and scattered across roughly 1000 data flow files. To speed up the extraction, I developed a Python script to automate the process.
The script uses the pyautogui library to programmatically execute mouse clicks and keyboard inputs, navigating the GUI. Because the legacy tool’s loading times were unpredictable, I integrated the pytesseract library for Optical Character Recognition (OCR). The script took screenshots of specific screen areas, converted them to text, and compared the result against expected values to adapt dynamically to the application’s loading state.
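The snippet below is a minimal sketch of that OCR-driven waiting logic; the coordinates, keyboard shortcut, and dialog label are hypothetical stand-ins, since the real script navigated IBM Design Studio’s actual menus.

```python
# Sketch only: coordinates, shortcuts, and labels are illustrative placeholders.
import time

import pyautogui    # programmatic mouse and keyboard control
import pytesseract  # OCR on screenshots (requires a local Tesseract install)


def wait_for_text(expected: str, region: tuple, timeout: int = 60) -> bool:
    """Poll a screen region via OCR until the expected text appears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        screenshot = pyautogui.screenshot(region=region)  # (left, top, width, height)
        text = pytesseract.image_to_string(screenshot)
        if expected.lower() in text.lower():
            return True
        time.sleep(1)  # loading times were unpredictable, so keep polling
    return False


def export_data_flow():
    """Drive one export cycle in the legacy GUI (hypothetical coordinates)."""
    pyautogui.click(x=120, y=300)           # select the data flow in the project tree
    pyautogui.hotkey("ctrl", "shift", "e")  # open the export dialog
    if wait_for_text("Export", region=(400, 200, 300, 50)):
        pyautogui.press("enter")            # confirm once the dialog has actually loaded
```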
Untangling the SQL Spaghetti
Once the code was exported, we faced another major hurdle: the SQL itself was incredibly difficult to analyze, with single data flows containing up to 20 nested subqueries.
To wrangle the exported SQL into a usable format, I built a Python-based SQL translator and linting tool using the sqlparse library. The script split the code into segments, filtered out unnecessary Data Definition Language (DDL) statements, and converted the SQL strings into dictionaries. This made it easier to identify nested subqueries and flatten them into a structured sequence of Common Table Expressions (CTEs).
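A simplified sketch of that first pass is shown below: split the export into statements, drop DDL, and collect parenthesised SELECTs as candidates for promotion to CTEs. The function and key names are illustrative, and the actual flattening logic is omitted.

```python
# Simplified sketch; the real tool implemented the full subquery-to-CTE rewrite.
import sqlparse
from sqlparse.sql import Parenthesis
from sqlparse.tokens import DML


def is_ddl(statement) -> bool:
    """Filter out CREATE/DROP/ALTER statements that dbt manages itself."""
    return statement.get_type() in {"CREATE", "DROP", "ALTER"}


def find_subqueries(token, found=None) -> list[str]:
    """Recursively collect parenthesised SELECTs (candidate CTEs)."""
    if found is None:
        found = []
    for child in getattr(token, "tokens", []):
        if isinstance(child, Parenthesis):
            if any(t.ttype is DML and t.value.upper() == "SELECT"
                   for t in child.flatten()):
                found.append(str(child))
        find_subqueries(child, found)
    return found


def preprocess(raw_sql: str) -> list[dict]:
    """Turn a raw export into one dictionary per relevant statement."""
    result = []
    for stmt in sqlparse.parse(raw_sql):
        if is_ddl(stmt):
            continue
        result.append({
            "sql": sqlparse.format(str(stmt), reindent=True, keyword_case="upper"),
            "subqueries": find_subqueries(stmt),
        })
    return result
```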
Furthermore, to achieve our goal of avoiding vendor lock-in, the code needed to be database-agnostic. I used Python’s re library (regular expressions) to find DB2-specific functions and replace them with ANSI-SQL equivalents, for instance translating NVL to COALESCE. I also developed a module featuring Jinja macros for dbt that compiled into the correct syntax depending on the target environment, ensuring compatibility across BigQuery, MS SQL Server, and MySQL.
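The following sketch illustrates the regex-based dialect translation; the mapping table shows only a handful of example replacements, not the full set the project actually needed, and the dbt macro module is not reproduced here.

```python
# Illustrative subset of the DB2-to-ANSI/BigQuery replacements.
import re

REPLACEMENTS = [
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE("),
    (re.compile(r"\bCURRENT\s+DATE\b", re.IGNORECASE), "CURRENT_DATE"),
    (re.compile(r"\bFETCH\s+FIRST\s+(\d+)\s+ROWS\s+ONLY\b", re.IGNORECASE), r"LIMIT \1"),
]


def to_ansi(sql: str) -> str:
    """Apply all dialect replacements to a single SQL string."""
    for pattern, replacement in REPLACEMENTS:
        sql = pattern.sub(replacement, sql)
    return sql


print(to_ansi("SELECT NVL(price, 0) FROM sales FETCH FIRST 10 ROWS ONLY"))
# SELECT COALESCE(price, 0) FROM sales LIMIT 10
```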
Dependency Mining and Orchestration
Migrating a DWH requires a strict order of operations: all source tables must be present before a target table can be loaded, otherwise the data flow breaks. To plan a seamless migration sequence, I applied data mining techniques to the exported SQL files.
Using regex patterns, the script extracted the names of source and target tables, ultimately identifying 5612 relations across the 1000 ETL routes. This data was saved as a CSV and loaded into our DB2 development server using DBeaver. From there, I created SQL views that checked which ETL routes had all their source tables implemented, providing a data-driven foundation for planning all subsequent sprint cycles.
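Below is a rough sketch of that dependency-mining pass. It assumes simplified INSERT INTO ... SELECT statements; the actual patterns had to cover many more DB2 syntax variants, and the directory and file names are illustrative.

```python
# Rough sketch: extract (route, source, target) relations from exported SQL files.
import csv
import re
from pathlib import Path

TARGET_RE = re.compile(r"INSERT\s+INTO\s+([\w\.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w\.]+)", re.IGNORECASE)


def extract_relations(sql_dir: str) -> list[tuple[str, str, str]]:
    """Return (etl_route, source_table, target_table) tuples for all SQL files."""
    relations = []
    for path in Path(sql_dir).glob("*.sql"):
        sql = path.read_text()
        targets = set(TARGET_RE.findall(sql))
        sources = set(SOURCE_RE.findall(sql)) - targets
        for target in targets:
            for source in sources:
                relations.append((path.stem, source, target))
    return relations


if __name__ == "__main__":
    rows = extract_relations("exported_sql")  # hypothetical export directory
    with open("table_relations.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["etl_route", "source_table", "target_table"])
        writer.writerows(rows)
```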
Validation and Developer Experience
To ensure data integrity post-migration, testing was expanded from simple row counts to hash-based row comparisons. I built a Python CLI tool using the fire library, alongside ibm-db and google-cloud-bigquery to fetch data from both environments. The tool loaded data in increments of 100,000 rows to avoid exhausting memory, read them into pandas DataFrames, and performed a full outer join on a primary-key hash and a hash over the data columns. This allowed us to instantly identify the specific rows with deviations.
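The self-contained sketch below shows only the per-chunk comparison core; in the actual tool, the two DataFrames came from ibm-db and google-cloud-bigquery readers and the function was exposed as a CLI via fire. Column names and the demo values are illustrative.

```python
# Sketch of the per-chunk hash comparison with synthetic, illustrative data.
import pandas as pd


def compare_chunk(db2_chunk: pd.DataFrame, bq_chunk: pd.DataFrame) -> pd.DataFrame:
    """Full outer join on the primary-key hash; return rows that are missing
    on either side or whose data hashes differ."""
    merged = db2_chunk.merge(
        bq_chunk,
        on="pk_hash",
        how="outer",
        suffixes=("_db2", "_bq"),
        indicator=True,
    )
    missing = merged["_merge"] != "both"
    differing = merged["data_hash_db2"] != merged["data_hash_bq"]
    return merged[missing | differing]


# Tiny demonstration: row "b" has deviating data, row "c" is missing in BigQuery.
db2 = pd.DataFrame({"pk_hash": ["a", "b", "c"], "data_hash": ["1", "2", "3"]})
bq = pd.DataFrame({"pk_hash": ["a", "b"], "data_hash": ["1", "9"]})
print(compare_chunk(db2, bq))
```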
Finally, I optimized the dbt model generation process. Originally, metadata queries against BigQuery took about 3 seconds each, bringing the total run time for generating staging models to 25 minutes. Rerouting these queries to a locally generated SQLite database slashed the run time to just 5 minutes, significantly improving the local development experience.
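A minimal sketch of that caching idea follows: fetch the required column metadata from BigQuery once, store it in a local SQLite file, and let the model generator read from there. The INFORMATION_SCHEMA query and all table and column names are assumptions for illustration; the actual metadata queries may have looked different.

```python
# Sketch of a local SQLite cache for BigQuery column metadata.
import sqlite3

from google.cloud import bigquery


def build_local_cache(project: str, dataset: str, db_path: str = "metadata.sqlite"):
    """Fetch column metadata from BigQuery once and store it locally."""
    client = bigquery.Client(project=project)
    query = f"""
        SELECT table_name, column_name, data_type
        FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS`
    """
    rows = [(r["table_name"], r["column_name"], r["data_type"])
            for r in client.query(query).result()]

    with sqlite3.connect(db_path) as conn:
        conn.execute("DROP TABLE IF EXISTS columns")
        conn.execute(
            "CREATE TABLE columns (table_name TEXT, column_name TEXT, data_type TEXT)"
        )
        conn.executemany("INSERT INTO columns VALUES (?, ?, ?)", rows)


def get_columns(table_name: str, db_path: str = "metadata.sqlite") -> list[tuple]:
    """Local lookups take milliseconds instead of ~3 seconds per BigQuery query."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT column_name, data_type FROM columns WHERE table_name = ?",
            (table_name,),
        ).fetchall()
```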
Conclusion
Overall, this project resulted in the successful implementation and testing of around 250 data flows with roughly 375 target tables. Beyond the technical achievements, this journey highlighted how crucial agile methodologies and team dynamics are in large-scale projects. It also practically demonstrated the 80/20 rule of automation: investing time upfront to build Python tools saved us hundreds of hours of manual labor. Finally, dissecting a legacy system provided a profound understanding of why modularity and database-agnostic design are essential for future-proofing data architecture.