The in Operator in Cassandra CQL
In my post about the CQL WHERE clause I didn’t fully cover when the IN operator is supported and how it differs from the = operator. Based on the comments on that post and some questions I’ve gotten from people who read the post, I think it’s worth going into some detail about the IN operator.
Here’s where the IN operator is supported:
- The last column in the partition key, assuming the = operator is used on the first N-1 columns of the partition key
- The last clustering column, assuming the = operator is used on the first N-1 clustering columns and all partition keys are restricted
- The last clustering column, assuming the = operator is used on the first N-1 clustering columns and ALLOW FILTERING is specified
It’s worth mentioning that I can use the IN operator interchangably with the = operator on a single value, for example:
1
| |
is equivalent to:
1
| |
But in that case the semantics are that of = even though I’m using IN.
The real use case of IN is with multiple distinct values. The simple case involves a single column partition key and/or clustering column. For this I’ll use the temperature table I’ve used in past examples:
1 2 3 4 5 6 | |
I can use IN on the partition key:
1 2 3 | |
Or on the clustering column (with ALLOW FILTERING):
1 2 3 4 | |
In general it’s best to avoid queries that require ALLOW FILTERING because they often require lots of data to be scanned even if only a small amount of data is returned, but I show that example because it is a supported use of the IN operator.
A more feasible example is to use both together (without ALLOW FILTERING):
1 2 3 4 | |
If I try to use the IN operator on the temperature column I get an error:
1 2 3 4 | |
Even if I add a secondary index:
1 2 3 4 5 | |
Moving on to composite keys, I’ll use the table that Vasyl Boroviak used in the comments of my previous post:
1 2 3 4 5 6 7 8 | |
This table is useful because it contains a composite partition key and and has multiple clustering columns.
Since the partition key contains two columns, I need to use = on the first column in order to use IN on the second column:
1 2 3 4 | |
Likewise there are two clustering columns, so I need to use = on the first column in order to use IN on the second column, and I also need to use ALLOW FILTERING if this is the only criteria:
1 2 3 4 5 | |
Again I can combine the those two queries together without ALLOW FILTERING:
1 2 3 4 5 6 | |
