3 Methods to Extract a Subset of Your Data Using Mysqldump
A few years ago I wrote a tool to extract a subset of data from a production database for use in QA and development environments. My goal was to have some real data for functional testing, but to make the data set small enough that a developer could easily download it and install it on their laptop in a few minutes.
I implemented the tool using mysqldump. As I have maintained the tool over the years I’ve employed a few different approaches, each of which I will describe in more detail below.
The first step was to identify the records I wanted to include in my subset. I created a couple of tables to store the ids of the records I wanted, and then some queries to populate those tables based on various criteria.
For example, say I want to dump the data for the 10 shortest PG rated movies in the sakila database. Here’s an example of populating a single ID table:
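A sketch of that setup against the sakila schema (the `subset_film_ids` table name is my own choice, not from the original tool):

```sql
-- ID table to hold the films that belong in the subset
-- (the name subset_film_ids is illustrative)
CREATE TABLE subset_film_ids (
    film_id SMALLINT UNSIGNED NOT NULL PRIMARY KEY
);

-- Populate it with the 10 shortest PG-rated films
INSERT INTO subset_film_ids (film_id)
SELECT film_id
FROM sakila.film
WHERE rating = 'PG'
ORDER BY length
LIMIT 10;
```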
It gets more complicated when we want to dump the data. The biggest question is: how should I form the WHERE clause in the mysqldump statement? Running mysqldump with --where is useful, but we’re limited in what types of WHERE clauses we can use with mysqldump.
Here’s one way to do it:
1) Use EXISTS
My earliest approach was to use an EXISTS query.
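With an ID table in place, the film rows can be dumped by pointing the --where clause at it (a sketch; `subset_film_ids` is a hypothetical ID table living in the sakila schema):

```bash
# Dump only the films whose ids appear in the ID table.
# mysqldump appends the --where text to its SELECT against sakila.film,
# so the subquery can correlate on film.film_id.
mysqldump sakila film \
  --where="EXISTS (SELECT 1 FROM subset_film_ids i
                   WHERE i.film_id = film.film_id)" \
  > film_subset.sql
```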
That works, but it’s not necessarily going to be fast. It also gets more complicated for child tables. For example, how do I get the data for only the actors who appeared in those films?
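One way to express that is an EXISTS subquery that walks through the film_actor join table (a sketch, again assuming a hypothetical `subset_film_ids` ID table):

```bash
# Dump only the actors who appeared in the subset films,
# correlating through the film_actor many-to-many table.
mysqldump sakila actor \
  --where="EXISTS (SELECT 1
                   FROM film_actor fa
                   JOIN subset_film_ids i ON i.film_id = fa.film_id
                   WHERE fa.actor_id = actor.actor_id)" \
  > actor_subset.sql
```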
Again, that works, but it’s adding complexity and has the potential to be slow for large tables.
Those commands are similar to what I ended up using in my tool, and that approach worked well for a few years. Eventually some of the EXISTS queries made the tool slower than I wanted it to be, so I revisited the implementation. What I found was that in many cases it was faster to use an IN clause with the actual IDs rather than the EXISTS queries I had been using before.
2) Use IN
Here are my previous examples re-written using an IN clause. Here I dynamically generate a mysqldump command using my id table:
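One way to generate the command is to concatenate the ids inside MySQL itself (a sketch; `subset_film_ids` is my hypothetical ID table):

```sql
-- Build a ready-to-run mysqldump command from the ids in the ID table.
-- Note: GROUP_CONCAT output is capped by group_concat_max_len
-- (1024 bytes by default), so raise it for long id lists.
SELECT CONCAT(
    'mysqldump sakila film --where="film_id IN (',
    GROUP_CONCAT(film_id ORDER BY film_id),
    ')"'
) AS dump_command
FROM subset_film_ids;
```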
Here’s the output:
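The result is a ready-to-run command along these lines (the ids shown are illustrative, not the actual sakila results):

```bash
mysqldump sakila film --where="film_id IN (14,65,109,237,355,463,577,718,811,899)"
```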
Okay, that’s pretty straightforward. How about the actors? Why not just use a separate id table for them?
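A sketch of a second ID table, populated from the first through the film_actor join table (both table names are my own invention):

```sql
-- ID table for the actors who appear in the subset films
CREATE TABLE subset_actor_ids (
    actor_id SMALLINT UNSIGNED NOT NULL PRIMARY KEY
);

-- DISTINCT because an actor can appear in several of the subset films
INSERT INTO subset_actor_ids (actor_id)
SELECT DISTINCT fa.actor_id
FROM sakila.film_actor fa
JOIN subset_film_ids i ON i.film_id = fa.film_id;
```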
Then I can generate the mysqldump statement using the same method:
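This mirrors the film example, just reading from the hypothetical actor ID table instead:

```sql
-- Build the mysqldump command for the actor table the same way.
SELECT CONCAT(
    'mysqldump sakila actor --where="actor_id IN (',
    GROUP_CONCAT(actor_id ORDER BY actor_id),
    ')"'
) AS dump_command
FROM subset_actor_ids;
```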
That gives us the following command:
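Something along these lines (the ids are illustrative, not the actual query results):

```bash
mysqldump sakila actor --where="actor_id IN (3,19,40,66,81,107,150,162,185,199)"
```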
In general I try to minimize the number of id tables. For one-to-many relationships I dump the child records using the parent’s id table. But for many-to-many relationships or multi-level relationships I consider adding a separate id table.
I did some profiling to determine which tables benefited from using IN instead of EXISTS, but in some cases I had to use EXISTS even though IN would have been faster, because mysqldump only supports IN clauses up to a certain size. I forget what the exact limit was; maybe 10,000 values?
3) Use a Separate Schema
This method frees you from one of the biggest constraints of the other methods. It allows you to efficiently use joins to select the related rows you want, and it makes the mysqldump commands trivial.
The approach is to create a copy of every table in your schema, copy over the rows you want, and then dump the entire copy schema.
Following our film and actor examples from above:
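A sketch of the copy schema for just these three tables (the `sakila_subset` name matches the dump step below; `subset_film_ids` is my hypothetical ID table):

```sql
-- Parallel schema holding only the subset rows
CREATE DATABASE sakila_subset;

-- Clone the table definitions without any data
CREATE TABLE sakila_subset.film LIKE sakila.film;
CREATE TABLE sakila_subset.film_actor LIKE sakila.film_actor;
CREATE TABLE sakila_subset.actor LIKE sakila.actor;

-- Copy the subset films
INSERT INTO sakila_subset.film
SELECT f.*
FROM sakila.film f
JOIN subset_film_ids i ON i.film_id = f.film_id;

-- Copy the join rows for those films
INSERT INTO sakila_subset.film_actor
SELECT fa.*
FROM sakila.film_actor fa
JOIN subset_film_ids i ON i.film_id = fa.film_id;

-- Copy the actors referenced by the copied join rows
INSERT INTO sakila_subset.actor
SELECT a.*
FROM sakila.actor a
WHERE a.actor_id IN (SELECT actor_id FROM sakila_subset.film_actor);
```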
Now I can dump the entire sakila_subset schema:
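No --where clauses are needed at this point; the whole schema is the subset:

```bash
mysqldump sakila_subset > sakila_subset.sql
```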
If you want your copy to maintain the original schema name you can edit the output file and replace references to sakila_subset with sakila.
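One way to do that rewrite is with sed (a sketch; adjust the pattern if your dump quotes identifiers differently):

```bash
# Rewrite the schema name throughout the dump file, in place.
sed -i 's/`sakila_subset`/`sakila`/g' sakila_subset.sql
```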
