Wednesday, March 18, 2015

Understanding SQL's Null

 

Tony Hoare invented the null reference in 1965, and he considers it his "billion-dollar mistake" for the amount of trouble it has caused. Even today, SQL's null value is the cause of several common mistakes.

Let’s go over the most egregious.

Equals Null

These two queries return the exact same result on a users table with many rows:

select * from users where deleted_at = null;
-- result: 0 rows

select * from users where deleted_at != null;
-- result: 0 rows

How can that be? It's because null represents an "unknown" type. This means it doesn't make sense to compare null to anything else with normal conditional operators. Null isn't even equal to itself:

select null > 0;
-- result: null

select null < 0;
-- result: null

select null = 0;
-- result: null

select null = null;
-- result: null

The right way to compare values with null is with the is, and is not operators:

select * from users 
where deleted_at is null;
-- result: all non-deleted users

select * from users
where deleted_at is not null;
-- result: all deleted users

If you want to check if two columns are different, you can use is distinct from:

select * from users
where has_address is distinct from has_photo
-- result: users with an address or
-- photo, but not both

Not in Null

One handy way to filter rows is with a subselect. For example, if you wanted the userswho did not have any packages, you could write a query like this:

select * from users 
where id not in (select user_id from packages)

But if one of the rows in packages has a null user_id, this query will return no results! To understand why this happens we need to factor the query like the SQL compiler does. Here's a simpler example:

select * from users 
where id not in (1, 2, null)

Which translates to:

select * from users 
where id != 1 and id != 2 and id != null

As we now know, id != null is an unknown value, null. Using and on any value withnull becomes null, so all of the other and conditions fall away. No rows match the resulting query since null is not equal to true.

If the condition is inverted, the query works fine. This time we'll look for users withpackages.

select * from users 
where id in (select user_id from packages)

Which we can simplify for the example:

select * from users 
where id in (1, 2, null)

This query translates to:

select * from users 
where id = 1 or id = 2 or id = null

Since the where clause is a list of or conditions, it doesn't matter that one of them is null. That condition is simply ignored because non-true values do not change the evaluation of the rest of the clause.

Sorting Nulls

When it comes to sorting, nulls are considered the largest possible value. This can lead to some frustrating queries when trying to sort values descending, since all the nulls will be on top.

This query is meant to show the users ranked by their points, but it's putting users without any points first!

select name, points
from users
order by 2 desc;
-- null points sort above
-- any number of points!

There are two ways to deal with this. The easiest way is to get rid of those nulls in the output or the comparison using coalesce:

-- treat nulls as 0 in output
select name, coalesce(points, 0)
from users
order by 2 desc;

-- keep nulls in output, but sort as 0
select name, points
from users
order by coalesce(points, 0) desc;

And if your database supports it, you can instead tell it where to put nulls when sorting with nulls first or nulls last:

select name, coalesce(points, 0)
from users
order by 2 desc nulls last;

Of course, nulls can also be used to prevent errors. One great use of nulls is in dealing with divide by zero errors.

Divide by Zero

Divide by zero errors are especially painful. Queries that ran fine yesterday all of a sudden are failing with divide by zero errors. One common solution is to check if the denominator is 0 before dividing with a case statement:

select case when num_users = 0 then 0 
else total_sales/num_users end;

The case statement approach is verbose and duplicates the denominator. That's OK if the denominator is simple, but if it's an expression, you're likely to get more bugs if you change the query later.

Here we can use null to our advantage. Use nullif on the denominator to make the denominator null instead of zero. Rather than failing, the query will return null on days where num_users = 0.

select total_sales/nullif(num_users, 0);

If you prefer the result to be 0 or anything else instead of null, use coalesce on the previous query:

select coalesce(total_sales/nullif(num_users, 0), 0);
-- nulls results become 0

Conclusion

Tony Hoare may regret his mistake, but at least it’s easy to work around the issues that null presents. Now go forth with your new knowledge and keep null from nullifying your future query results!