Redshift does not have indexes, and that includes primary keys. You can (and should) create them, and they are helpful to the query planner, but they do not physically exist. This also means that the constraints are not enforced. When moving a lot of data around, it sometimes happens that processes overlap, which leads to duplicate records.
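As a quick illustration of what "not enforced" means in practice (the table and column names here are hypothetical, not from our schema), Redshift will happily accept duplicate values in a column declared as a primary key:

```sql
-- A hypothetical table with a declared (but unenforced) primary key
CREATE TABLE events (
    event_id BIGINT PRIMARY KEY,
    payload  VARCHAR(256)
);

-- Both inserts succeed: Redshift records the constraint for the
-- query planner but never checks it, so the table now holds two
-- rows with the same "primary key" value.
INSERT INTO events VALUES (1, 'first copy');
INSERT INTO events VALUES (1, 'second copy');
```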
This was our problem. There are a few ways to solve this:
- Create a new table, SELECT DISTINCT into it, and do the old switcheroo (rename it into place of the original). This would be very slow on such a large dataset, but more importantly we literally didn't have enough storage space to recreate the table.
- Use some external program or processor to go through the table and delete individual or groups of records. Too hard.
- Use some crazy SQL statement with window functions to try to delete specific duplicate rows in place. Probably possible, but likely very slow.
- Surgically find and delete the duplicate records. This seems like the best solution, and the one I used.
It works like this (all wrapped in a transaction):
- Find all of the primary key values that belong to duplicate rows, and store those distinct IDs in a temporary table.
- Create another table with the same structure as the original table and load in the distinct rows for the IDs identified in the previous step.
- Delete all of the rows in the main table for the distinct IDs in the first step.
- Insert back all the single rows we extracted in step 2.
- Some cleanup (drop tables, etc).
One bonus for us (if you could call it that) is that we knew the duplicate records fell within a specific week range, so all the queries below could include an added filter on a date range (saledateid), which is the main sort key. This speeds things up considerably.
If you have an extremely large table where even this method would not be feasible it might make sense to divide your dataset into sections of your sort key and apply this technique over known sort key ranges.
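To sketch what that looks like (the slice boundaries here are hypothetical; pick ranges that keep each pass manageable), you repeat the same transaction once per sort-key slice:

```sql
-- Sketch: run the dedupe transaction one sort-key slice at a time,
-- e.g. half the week range per pass instead of the full range.
BEGIN;

CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2227  -- first slice only
GROUP BY saleid
HAVING COUNT(*) > 1;

-- ... the same extract / delete / insert-back steps shown below,
-- each filtered to saledateid BETWEEN 2224 AND 2227 ...

COMMIT;
-- Then repeat with saledateid BETWEEN 2228 AND 2231, and so on.
```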
Here we go:
```sql
BEGIN;

-- First identify all the rows that are duplicate
CREATE TEMP TABLE duplicate_saleids AS
SELECT saleid
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
GROUP BY saleid
HAVING COUNT(*) > 1;

-- Extract one copy of all the duplicate rows
CREATE TEMP TABLE new_sales (LIKE sales);

INSERT INTO new_sales
SELECT DISTINCT *
FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
  AND saleid IN (SELECT saleid FROM duplicate_saleids);

-- Remove all rows that were duplicated (all copies).
DELETE FROM sales
WHERE saledateid BETWEEN 2224 AND 2231
  AND saleid IN (SELECT saleid FROM duplicate_saleids);

-- Insert back in the single copies
INSERT INTO sales
SELECT *
FROM new_sales;

-- Cleanup
DROP TABLE duplicate_saleids;
DROP TABLE new_sales;

COMMIT;
```
When used correctly, Redshift once again proves it's an absolute beast at processing huge sets of data.