The in Operator in Cassandra CQL

In my post about the CQL WHERE clause I didn’t fully cover when the IN operator is supported and how it differs from the = operator. Based on the comments on that post and some questions I’ve gotten from people who read the post, I think it’s worth going into some detail about the IN operator.

Here’s where the IN operator is supported:

The last column in the partition key, assuming the = operator is used on the first N-1 columns of the partition key
The last clustering column, assuming the = operator is used on the first N-1 clustering columns and all partition keys are restricted
The last clustering column, assuming the = operator is used on the first N-1 clustering columns and ALLOW FILTERING is specified

It’s worth mentioning that I can use the IN operator interchangably with the = operator on a single value, for example:

select * from foo where bar = 1;

is equivalent to:

select * from foo where bar in (1);

But in that case the semantics are that of = even though I’m using IN.

The real use case of IN is with multiple distinct values. The simple case involves a single column partition key and/or clustering column. For this I’ll use the temperature table I’ve used in past examples:

CREATE TABLE temperature (
   weatherstation_id text,
   event_time timestamp,
   temperature text,
   PRIMARY KEY (weatherstation_id,event_time)
);

I can use IN on the partition key:

SELECT *
FROM temperature
WHERE weatherstation_id in ('1234ABCD','5678EFGH');

Or on the clustering column (with ALLOW FILTERING):

SELECT *
FROM temperature
WHERE event_time in ('2013-04-03 06:00','2013-04-03 07:00')
ALLOW FILTERING;

In general it’s best to avoid queries that require ALLOW FILTERING because they often require lots of data to be scanned even if only a small amount of data is returned, but I show that example because it is a supported use of the IN operator.

A more feasible example is to use both together (without ALLOW FILTERING):

SELECT *
FROM temperature
WHERE weatherstation_id in ('1234ABCD','5678EFGH')
AND event_time in ('2013-04-03 06:00','2013-04-03 07:00');

If I try to use the IN operator on the temperature column I get an error:

cqlsh:test> SELECT *
        ... FROM temperature
        ... WHERE temperature in ('32F','33F');
Bad Request: Cannot use IN operator on column not part of the partition key

Even if I add a secondary index:

cqlsh:test> create index on temperature (temperature);
cqlsh:test> SELECT *
        ... FROM temperature
        ... WHERE temperature in ('32F','33F');
Bad Request: Cannot use IN operator on column not part of the partition key

Moving on to composite keys, I’ll use the table that Vasyl Boroviak used in the comments of my previous post:

CREATE TABLE ad_click (
reseller_id text,
day text,
time timestamp,
ad_id text,
amount float,
PRIMARY KEY ((reseller_id, day), time, ad_id)
);

This table is useful because it contains a composite partition key and and has multiple clustering columns.

Since the partition key contains two columns, I need to use = on the first column in order to use IN on the second column:

SELECT * 
FROM ad_click 
WHERE reseller_id='1' 
AND day IN ('1','2');

Likewise there are two clustering columns, so I need to use = on the first column in order to use IN on the second column, and I also need to use ALLOW FILTERING if this is the only criteria:

SELECT * 
FROM ad_click 
WHERE time = 123412341234 
AND ad_id IN ('1','2')
ALLOW FILTERING;

Again I can combine the those two queries together without ALLOW FILTERING:

SELECT * 
FROM ad_click 
WHERE reseller_id='1' 
AND day IN ('1','2')
AND time = 123412341234 
AND ad_id IN ('1','2');

Jan 8th, 2014

Cassandra, PlanetCassandra

Comments

The in Operator in Cassandra CQL

Posted by Ike Walker

Comments