Thursday, May 1, 2014

Why I consider USING harmful

Often on the topic of the different JOIN syntaxes in SQL I mention that I consider ON the only reasonable way to specify JOIN clauses in production code, and I get asked for details.  It always takes me a while to recall the specific problem and to come up with a plausible scenario, so I figured it would be easier to document this once and for all.

Suppose you have a database containing information about cars and the parts that go into building said cars.  Your schema could look something like this:

create table cars (
carid serial primary key,
model text not null
);

create table manufacturers (
manufacturerid serial primary key,
name text not null
);

create table parts (
partid serial primary key,
manufacturerid int references manufacturers,
model text not null
);

create table car_parts (
carid int references cars,
partid int references parts,
primary key (carid, partid)
);

And then a query to get a specific car and the parts that go into building it might look like this:

select
  cars.model,
  car_parts.partid,
  manufacturers.name as manufacturer
from cars
join car_parts using (carid)
join parts using (partid)
join manufacturers using (manufacturerid)
where cars.carid = $1
;

Everything is working great, until some day someone thinks that you should also track the manufacturer for each car.  So you run the following DDL:

alter table cars
  add column manufacturerid integer
  references manufacturers;

.. and boom.  The query shown above that previously worked correctly won't work anymore.  Even worse, if it's in a view (as opposed to a function or SQL in the application), the view will continue to work, and you might not even know that it's broken until you try to restore a dump of the database.

This is a huge caveat, and also the reason I think USING does not belong to production code.  ON is far more reasonable, though it requires a bit more typing.  SQL is not for the lazy.

6 comments:

  1. In general I agree that ON is the better choice. The USING operator has one nice advantages though: If you select all columns (using *) only one of the join columns will be included with the USING operator, not both (I know SELECT * is bad in production code, but sometimes it is useful)

    ReplyDelete
  2. That doesn't sound like USING being better; it sounds like SELECT * being worse.

    I never use SELECT * in production code, either. It's too easy to break such applications by adding new columns. I've never found it useful, other than in being lazy. And that laziness comes with a price.

    ReplyDelete
    Replies
    1. In object oriented databases, SELECT * can be a good thing. SELECT * is preferred if you are going to and from the same user defined data type. This allows you to append columns to that data type without having to add columns to queries.

      Delete
    2. I don't find that convincing at all. I'm not sure what exactly you mean by that, but it seems like:

      1) This would never work, since tables aren't 1:1 mappable to "objects" you'd want to use in your application.
      2) Let's say you do want to add a column. With SELECT * you can't upgrade the database first, or the application would break. You also can't upgrade the application first, or it would break (since it's assuming a column that doesn't exist). So to do an upgrade, you need to bring all instances of the app down, add the column to your schema, and then bring up a new version of the app. That sounds like a lot of pain. If you specify the column list, you can upgrade the DB first, and then the applications one instance of a time, and then start using the new feature (i.e. put some non-NULL values into the new column). No downtime necessary. I've done this on several occations and it is a lovely feature.

      Delete
  3. To play devil's advocate, that sounds like a poorly deployed change to the data model. When making changes, there should be some impact analysis to ensure that just this type of thing doesn't happen. Perhaps on the Cars table, the field would instead be called car_manufacturerid, or some such.

    This seems more like the kind of thing that could be easily managed by carefully evolving the data model than it does a fundamental flaw in the queries which leverage the "using" keyword.

    I think a more compelling reason to avoid the use of "using" in production code is that many people are not familiar with its use, and so it could lead to confusion.

    ReplyDelete
    Replies
    1. Of course, any schema change will require careful review before deployment, but avoiding USING completely eliminates one possible failure vector. I don't see why you wouldn't want to grab that opportunity.

      Also, once you start prefixing columns like that, for consistency you'd want to call the other column part_manufacturerid, and then USING won't work anymore. Effectively, you have poorly reinvented the ON clause.

      Delete