In MySQL, a UNIQUE index permits multiple NULL values for columns that can contain NULL.
There are many scenarios in which this behavior is desirable. For example, imagine you create a user table and require an email address, but you want the username to be optional. You make the username column nullable, but you still want all non-NULL values in the column to be unique, so it makes sense to add a UNIQUE index. If the UNIQUE index didn't allow multiple NULL values, you would need to move the username column to a separate table in order to both maintain uniqueness and allow users without usernames.
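Here's a minimal sketch of that pattern (the table and column names are just illustrative):

CREATE TABLE user (
  user_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  email    VARCHAR(255) NOT NULL,
  username VARCHAR(50) NULL,
  UNIQUE INDEX uniq_email (email),
  UNIQUE INDEX uniq_username (username)
);

-- Both inserts succeed because the UNIQUE index allows multiple NULLs
INSERT INTO user (email, username) VALUES ('a@example.com', NULL);
INSERT INTO user (email, username) VALUES ('b@example.com', NULL);

-- The first 'alice' succeeds; the second fails with a duplicate key error
INSERT INTO user (email, username) VALUES ('c@example.com', 'alice');
INSERT INTO user (email, username) VALUES ('d@example.com', 'alice');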
But there are also plenty of scenarios where this behavior can cause problems. For example, consider aggregated fact tables in a star schema. You typically only want one row for a given combination of dimension values, and you can enforce this using a UNIQUE INDEX. However, if one or more of the dimension columns are nullable then you are at risk of having duplicate data.
Recently this affected an ETL process at Flite that loads data into a MySQL star schema. Sometimes I need to re-run all or part of the ETL process after a failure, and I rely on unique indexes to prevent the same data from being loaded into the star schema more than once. Most of my dimension columns do not allow NULL values, but a few of them do, which led to some data duplication when I re-ran pieces of the ETL flow. To fix the problem I identified all of the affected columns, chose replacement values for missing data (the empty string '' for text columns, 0 for numeric columns), cleaned up the existing data, and made the columns NOT NULL.
Read on for details on how to find and fix this problem. For my examples I will use the Foodmart sample database from Pentaho.
Introducing the problem
I’ll use the sales_fact_1997 table for my example:
mysql> desc sales_fact_1997;
+--------------+---------------+------+-----+---------+-------+
| Field        | Type          | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| product_id   | int(11)       | NO   | MUL | NULL    |       |
| time_id      | int(11)       | NO   | MUL | NULL    |       |
| customer_id  | int(11)       | NO   | MUL | NULL    |       |
| promotion_id | int(11)       | NO   | MUL | NULL    |       |
| store_id     | int(11)       | NO   | MUL | NULL    |       |
| store_sales  | decimal(10,4) | NO   |     | NULL    |       |
| store_cost   | decimal(10,4) | NO   |     | NULL    |       |
| unit_sales   | decimal(10,4) | NO   |     | NULL    |       |
+--------------+---------------+------+-----+---------+-------+
If I make the promotion_id column nullable and replace 0 with NULL, then dump the data and try to reload it again, all of the rows with a promotion_id value of NULL are re-inserted, giving me duplicate data.
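Here's a rough sketch of how that plays out. It assumes a unique index across the dimension columns, which is what the ETL relies on; the index name and the INSERT ... SELECT used to simulate the reload are mine:

-- Assumed unique index over the dimension columns (name is illustrative)
CREATE UNIQUE INDEX uniq_dims ON sales_fact_1997
  (product_id, time_id, customer_id, promotion_id, store_id);

-- Make the dimension column nullable and replace 0 with NULL
ALTER TABLE sales_fact_1997 MODIFY COLUMN promotion_id int(11) NULL;
UPDATE sales_fact_1997 SET promotion_id = NULL WHERE promotion_id = 0;

-- Re-inserting the NULL rows succeeds, because the unique index does
-- not treat two NULLs as duplicates
INSERT INTO sales_fact_1997
SELECT * FROM sales_fact_1997 WHERE promotion_id IS NULL;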
To figure out which tables and columns are affected, I wrote my own query against information_schema:
mysql> select c.table_schema,c.table_name,c.column_name
    -> from information_schema.table_constraints tc
    -> inner join information_schema.key_column_usage kcu on kcu.table_schema = tc.table_schema and kcu.table_name = tc.table_name and kcu.constraint_name = tc.constraint_name
    -> inner join information_schema.columns c on c.table_schema = kcu.table_schema and c.table_name = kcu.table_name and c.column_name = kcu.column_name
    -> where tc.table_schema = 'foodmart'
    -> and tc.constraint_type = 'UNIQUE'
    -> and c.table_name like '%fact%'
    -> and c.is_nullable = 'YES'
    -> order by c.table_schema,c.table_name,c.column_name;
+--------------+-----------------+--------------+
| table_schema | table_name      | column_name  |
+--------------+-----------------+--------------+
| foodmart     | sales_fact_1997 | promotion_id |
+--------------+-----------------+--------------+
Now that I've found the affected tables and columns, it's time to fix them.
Fixing the problem
I can use a similar query against information_schema to generate the DDL to make those columns NOT NULL:
mysql> select concat('ALTER TABLE ',c.table_schema,'.',c.table_name,' MODIFY COLUMN ',
    ->        c.column_name,
    ->        ' ',
    ->        c.column_type,
    ->        ' ',
    ->        case when c.character_set_name is not null and c.character_set_name != '' then concat(' CHARACTER SET ',c.character_set_name) else '' end,
    ->        case when c.collation_name is not null and c.collation_name != '' then concat(' COLLATE ',c.collation_name) else '' end,
    ->        ' NOT NULL',
    ->        ';') as sql_stmt
    -> from information_schema.table_constraints tc
    -> inner join information_schema.key_column_usage kcu on kcu.table_schema = tc.table_schema and kcu.table_name = tc.table_name and kcu.constraint_name = tc.constraint_name
    -> inner join information_schema.columns c on c.table_schema = kcu.table_schema and c.table_name = kcu.table_name and c.column_name = kcu.column_name
    -> where tc.table_schema = 'foodmart'
    -> and tc.constraint_type = 'UNIQUE'
    -> and c.table_name like '%fact%'
    -> and c.is_nullable = 'YES';
+-----------------------------------------------------------------------------------+
| sql_stmt                                                                          |
+-----------------------------------------------------------------------------------+
| ALTER TABLE foodmart.sales_fact_1997 MODIFY COLUMN promotion_id int(11) NOT NULL; |
+-----------------------------------------------------------------------------------+
I can't run the DDL yet because of the duplicate data I introduced earlier, so I need to fix the data first. There are many ways to remove duplicate data, and this method will not work in all cases. But here the row count is small and I know the duplicate rows are true duplicates, so my preferred method is to dump the data to disk, truncate the table, run the DDL, then reload the data and let the unique index enforce uniqueness:
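Something along these lines (the file path is illustrative, and the IGNORE keyword on the reload is my assumption about how to let the index silently discard the duplicates):

-- Replace NULLs with the chosen default before dumping, so the reload
-- succeeds against the NOT NULL column
update sales_fact_1997 set promotion_id = 0 where promotion_id is null;

-- Dump the data to disk
select * into outfile '/tmp/sales_fact_1997.txt' from sales_fact_1997;

-- Truncate, then run the generated DDL
truncate table sales_fact_1997;
ALTER TABLE foodmart.sales_fact_1997 MODIFY COLUMN promotion_id int(11) NOT NULL;

-- Reload with IGNORE so duplicate rows are skipped with a warning
-- instead of aborting the load
load data infile '/tmp/sales_fact_1997.txt' ignore into table sales_fact_1997;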