Breaking Down the CQL Where Clause

When explaining the differences between CQL and SQL I often start by rattling off a list of things you can’t do in CQL. The list usually starts something like this:

No joins
No GROUP BY
No arbitrary WHERE clauses
No arbitrary ORDER BY clauses

The first 2 items on the list are fairly obvious. Joins and GROUP BY are not supported at all.

But what about the WHERE clause? SQL supports a whole bunch of operators and predicates in the WHERE clause, and allows filtering on non-indexed columns. The CQL WHERE clause only works for indexed columns, specifically columns in the primary key (both the partition key and the clustering columns), and columns that have secondary indexes on them. Multiple predicates can only be connected with AND, not OR, only a few operators are supported, and all of the operators have restrictions. The Datastax CQL documentation describes what the WHERE clause supports. Here’s what it says:

Filtering data using WHERE

The WHERE clause specifies which rows to query. The WHERE clause is composed of conditions on the columns that are part of the primary key or are indexed. Use of the primary key in the WHERE clause tells Cassandra to race to the specific node that has the data. Using the equals conditional operators (= or IN) is unrestricted. The term on the left of the operator must be the name of the column, and the term on the right must be the column value to filter on. There are restrictions on other conditional operators.

Cassandra supports these conditional operators: =, >, >=, <, or <=, but not all in certain situations.

A filter based on a non-equals condition on a partition key is supported only if the partitioner is an ordered one.

WHERE clauses can include a greater-than and less-than comparisons, but for a given partition key, the conditions on the clustering column are restricted to the filters that allow Cassandra to select a contiguous ordering of rows.

I mostly agree with that language, but I think this line is wrong:

Using the equals conditional operators (= or IN) is unrestricted.

I’d say that the = and IN operators are restricted, for example you can only use IN on the last column in the partition key. More on that later.

From my experience, here’s the botttom line of which operators are supported for which columns in the CQL WHERE clause:

Partition key columns support the = operator
The last column in the partition key supports the IN operator
Clustering columns support the =, >, >=, <, and <= operators
Secondary index columns support the = operator

Read on for examples.

1. Partition key columns support the = operator

I’ll borrow a table from Patrick McFadin’s post about time series data modeling for my examples:

CREATE TABLE temperature_by_day (
   weatherstation_id text,
   date text,
   event_time timestamp,
   temperature text,
   PRIMARY KEY ((weatherstation_id,date),event_time)
);

Since this table has a composite partition key on (weatherstation_id,date) I need to include both of those columns if I want to use the = operator.

So this query is allowed:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id='1234ABCD'
AND date='2013-04-03';

But these queries are not allowed:

cqlsh:test> SELECT *
        ... FROM temperature_by_day
        ... WHERE weatherstation_id='1234ABCD';
Bad Request: Partition key part date must be restricted since preceding part is

cqlsh:test> SELECT *
        ... FROM temperature_by_day
        ... WHERE date='2013-04-03';
Bad Request: partition key part date cannot be restricted (preceding part weatherstation_id is either not restricted or by a non-EQ relation)

2. The last column in the partition key supports the IN operator

For single column partition keys, the IN operator is allowed without restriction. I’ll use a table with a single column partition key to illustrate that:

CREATE TABLE temperature (
   weatherstation_id text,
   event_time timestamp,
   temperature text,
   PRIMARY KEY (weatherstation_id,event_time)
);

SELECT *
FROM temperature
WHERE weatherstation_id in ('1234ABCD','5678EFGH');

For composite partition keys, I have to use the = operator on the first N-1 columns of the partition key in order to use the IN operator on the last column. So this query is allowed:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date in ('2013-04-03','2013-04-04');

But these queries are not allowed:

cqlsh:test> SELECT *
        ... FROM temperature_by_day
        ... WHERE weatherstation_id in ('1234ABCD','5678EFGH');
Bad Request: Partition KEY part weatherstation_id cannot be restricted by IN relation (only the last part of the partition key can)

cqlsh:test> SELECT *
        ... FROM temperature_by_day
        ... WHERE date in ('2013-04-03','2013-04-04');
Bad Request: partition key part date cannot be restricted (preceding part weatherstation_id is either not restricted or by a non-EQ relation)

3. Clustering columns support the =, >, >=, <, and <= operators

The clustering columns support lots of operators, with the caveats mentioned in the documentation snippet I posted earlier.

The simplest case is to use the = operator for the partion key and the clustering columns, like this:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date in ('2013-04-03','2013-04-04')
AND event_time = '2013-04-03 06:00:00';

I can also use the = or IN operator on the partition key and do a range scan on the clustering columns:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date in ('2013-04-03','2013-04-04')
AND event_time >= '2013-04-03 06:00:00'
AND event_time <= '2013-04-04 06:00:00';

Regardless of the operator, if I try to filter on the clustering columns only, I’ll get an error if I don’t use the ALLOW FILTERING command. For example:

cqlsh:test> SELECT *
        ... FROM temperature_by_day
        ... WHERE event_time = '2013-04-03 06:00:00';
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

If I add the ALLOW FILTERING command then I can use any operator on the clustering columns, but the query may be very slow and could eventually time out.

So these queries are all allowed:

SELECT *
FROM temperature_by_day
WHERE event_time = '2013-04-03 06:00:00'
ALLOW FILTERING;

SELECT *
FROM temperature_by_day
WHERE event_time >= '2013-04-03 06:00:00'
and event_time <= '2013-04-04 06:00:00'
ALLOW FILTERING;

SELECT *
FROM temperature_by_day
WHERE event_time >= '2013-04-03 06:00:00'
ALLOW FILTERING;

But sometimes they may time out, like this:

cqlsh:test> SELECT *
        ... FROM temperature_by_day
        ... WHERE event_time >= '2013-04-03 06:00:00'
        ... ALLOW FILTERING;
Request did not complete within rpc_timeout.

4. Secondary index columns support the = operator

For the secondary index examples I’ll add a secondary index to the un-indexed temperature column:

CREATE INDEX on temperature_by_day (temperature);

Now I can filter on the temperature column with the = operator:

SELECT *
FROM temperature_by_day
WHERE temperature='74F';

I can also use a secondary index in conjunction with the = operator on the partion key and any supported operator on the clustering columns. For example:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id = '1234ABCD'
AND date = '2013-04-03'
AND event_time >= '2013-04-03 06:00:00'
AND event_time <= '2013-04-04 06:00:00'
AND temperature = '74F';

Nov 5th, 2013

Cassandra, PlanetCassandra

Comments