Breaking Down the CQL Where Clause
When explaining the differences between CQL and SQL I often start by rattling off a list of things you can’t do in CQL. The list usually starts something like this:
- No joins
- No GROUP BY
- No arbitrary WHERE clauses
- No arbitrary ORDER BY clauses
The first 2 items on the list are fairly obvious. Joins and GROUP BY are not supported at all.
But what about the WHERE clause? SQL supports a whole bunch of operators and predicates in the WHERE clause, and allows filtering on non-indexed columns. The CQL WHERE clause only works for indexed columns, specifically columns in the primary key (both the partition key and the clustering columns), and columns that have secondary indexes on them. Multiple predicates can only be connected with AND, not OR, only a few operators are supported, and all of the operators have restrictions. The Datastax CQL documentation describes what the WHERE clause supports. Here’s what it says:
Filtering data using WHERE
The WHERE clause specifies which rows to query. The WHERE clause is composed of conditions on the columns that are part of the primary key or are indexed. Use of the primary key in the WHERE clause tells Cassandra to race to the specific node that has the data. Using the equals conditional operators (= or IN) is unrestricted. The term on the left of the operator must be the name of the column, and the term on the right must be the column value to filter on. There are restrictions on other conditional operators.
Cassandra supports these conditional operators: =, >, >=, <, or <=, but not all in certain situations.
- A filter based on a non-equals condition on a partition key is supported only if the partitioner is an ordered one.
- WHERE clauses can include a greater-than and less-than comparisons, but for a given partition key, the conditions on the clustering column are restricted to the filters that allow Cassandra to select a contiguous ordering of rows.
I mostly agree with that language, but I think this line is wrong:
Using the equals conditional operators (= or IN) is unrestricted.
I’d say that the = and IN operators are restricted, for example you can only use IN on the last column in the partition key. More on that later.
From my experience, here’s the botttom line of which operators are supported for which columns in the CQL WHERE clause:
- Partition key columns support the = operator
- The last column in the partition key supports the IN operator
- Clustering columns support the =, >, >=, <, and <= operators
- Secondary index columns support the = operator
Read on for examples.
1. Partition key columns support the = operator
I’ll borrow a table from Patrick McFadin’s post about time series data modeling for my examples:
1 2 3 4 5 6 7 | |
Since this table has a composite partition key on (weatherstation_id,date) I need to include both of those columns if I want to use the = operator.
So this query is allowed:
1 2 3 4 | |
But these queries are not allowed:
1 2 3 4 5 6 7 8 9 | |
2. The last column in the partition key supports the IN operator
For single column partition keys, the IN operator is allowed without restriction. I’ll use a table with a single column partition key to illustrate that:
1 2 3 4 5 6 7 8 9 10 | |
For composite partition keys, I have to use the = operator on the first N-1 columns of the partition key in order to use the IN operator on the last column. So this query is allowed:
1 2 3 4 | |
But these queries are not allowed:
1 2 3 4 5 6 7 8 9 | |
3. Clustering columns support the =, >, >=, <, and <= operators
The clustering columns support lots of operators, with the caveats mentioned in the documentation snippet I posted earlier.
The simplest case is to use the = operator for the partion key and the clustering columns, like this:
1 2 3 4 5 | |
I can also use the = or IN operator on the partition key and do a range scan on the clustering columns:
1 2 3 4 5 6 | |
Regardless of the operator, if I try to filter on the clustering columns only, I’ll get an error if I don’t use the ALLOW FILTERING command. For example:
1 2 3 4 | |
If I add the ALLOW FILTERING command then I can use any operator on the clustering columns, but the query may be very slow and could eventually time out.
So these queries are all allowed:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |
But sometimes they may time out, like this:
1 2 3 4 5 | |
4. Secondary index columns support the = operator
For the secondary index examples I’ll add a secondary index to the un-indexed temperature column:
1
| |
Now I can filter on the temperature column with the = operator:
1 2 3 | |
I can also use a secondary index in conjunction with the = operator on the partion key and any supported operator on the clustering columns. For example:
1 2 3 4 5 6 7 | |
