We are using Amazon DMS to migrate over a billion rows from a production MySQL box to Redshift. If you haven't used DMS before it migrates data (either a once off or continuous) from one database vendor to another. In our case, MySQL to Redshift.
One of the replication errors we ran into is zeroed out dates. MySQL allows for dates that have a zeroed value, "0000-00-00 00:00:00". This is of course very silly and no other database vendors support this, including Redshift. These values were created a long time ago, but they must be cleaned up in the source database before we can try the replication again.
The simplest solution is to fix the bad data with a query like:
UPDATE my_table SET created_time = NULL WHERE created_time = '0000-00-00 00:00:00';
Updating large tables (especially ones that don't have very ideal indexes) on a running behemoth production box isn't always possible (or at least smart). We needed a less invasive approach. Here's the plan:
- Create a new database from a recent snapshot.
- Run the expensive search queries on the new box to find the bad data.
- Produce a patch file that fixes the specific rows on production (using the primary key, so there's no funny business).
Step 1 is pretty self explanatory. With a new database, we could hit it as hard as we want with the expensive queries to find the records.
However, there were too many values across dozens of tables and columns which meant that creating the patch by hand would be quite time consuming. Time to automate with a big ol' SQL query:
SELECTconcat("SELECT concat('UPDATE `",table_schema ,'`.`' ,table_name ,"` SET " ,column_name ," = NULL WHERE " ,pk_columns ," = \\'', " ,pk_columns ,",'\\';') as upsql" ,' from `' ,table_schema ,'`.`' ,table_name ,"` where " ,column_name ," = '0000-00-00 00:00:00';")FROM(SELECT(SELECTgroup_concat(column_name)FROMinformation_schema. COLUMNSWHEREtable_schema = c.table_schemaAND table_name = c.table_nameAND column_key = 'PRI') AS pk_columns ,table_schema ,table_name ,column_nameFROMinformation_schema. COLUMNS AS cWHEREtable_schema IN('mydb')AND table_name IN(-- a bunch of known bad tables)AND data_type IN('datetime' , 'date' , 'timestamp')) AS t;
This approach is more like SQLception. The query above generates a lot of SELECT statements that find bad rows, which in turn generate UPDATE statements for the patch. Whaaaa? I'll explain...
I'll put that big query above into a file called generate_selects.sql, then produce another file directly:
mysql --disable-pager --batch --raw --skip-column-names --unbuffered --execute 'source generate_selects.sql' > generate_updates.sql
The file generated (generate_updates.sql) looks something like:
SELECT concat('UPDATE `mydb`.`my_table` SET start_date = NULL WHERE id = \'', id,'\';') as upsql from `mydb`.`my_table` where start_date = '0000-00-00 00:00:00';SELECT concat('UPDATE `mydb`.`my_table` SET end_date = NULL WHERE id = \'', id,'\';') as upsql from `mydb`.`my_table` where end_date = '0000-00-00 00:00:00';
Now we feed that file in again:
mysql --disable-pager --batch --raw --skip-column-names --unbuffered --execute 'source generate_updates.sql' > patch.sql
This is the step that takes a significant amount of time - so go grab a coffee, then work on another task, then do your laundry... it's going to take a while.
Here's one I've prepared earlier! ...
There will be a new patch.sql that looks like:
UPDATE `mydb`.`foo` SET expires = NULL WHERE id = '111';UPDATE `mydb`.`bar` SET SaleDate = NULL WHERE ID = '6c546add-6f3c-14a8-1994-f788e20035a7';UPDATE `mydb`.`bar` SET TimeFlag = NULL WHERE ID = '8be09311-4f5f-10ed-4b49-2a4bf482165f';
Done! Well, sort of... in our case there were several tables that had composite primary keys that messed up the output a little. Nothing a bit of Sublime lovin' couldn't fix.
Now we have a patch for production that can be applied quickly and effectively. Also, we now have a log of the exact records that were changed in case something pops up in the future.