+44 (0)20 3051 3595 | info@coeo.com | Client portal login

Watching out for SELECT * with UNION ALL

Paul Farnell

UNION ALL returning incorrect results?

I recently came across an interesting behaviour when executing queries that included a UNION ALL statement, the examples below were created using SQL Server 2016 Service Pack 1.

The statement used a UNION ALL statement to combine two SELECT queries from different tables and aggregate the data; when the order of the SELECT statements was reversed however, the data outputted by SQL Server changed. If you’re familiar with Set Theory this should not be possible, I have created example data below to show how this was happening:

First I created two tables, one to track physical purchases and another to track digital purchases of our customers, I then inserted some values into each table to use as example data:

CREATE TABLE dbo.DigitalPurchases (

    CustName varchar(20),
    Description varchar(50),
    Price decimal(5,2),
    Quantity smallint

);

CREATE TABLE dbo.PhysicalPurchases (

    CustName varchar(20),
    Description varchar(50),
    Quantity smallint,
    Price decimal(5,2)

);

 
It’s important to note here that the columns “Price” and “Quantity” have been deliberately mixed up in that in the DigitalPurchases table the column order is “Price” then “Quantity” whereas in PhysicalPurchases the order is “Quantity” then “Price”.

INSERT INTO dbo.DigitalPurchases (CustName, Description, Price, Quantity)
VALUES ('Paul','Premium CD', 10.00, 3);

INSERT INTO dbo.DigitalPurchases (CustName, Description, Price, Quantity)
VALUES ('Paul','Premium CD 2', 15.00, 2);

INSERT INTO dbo.DigitalPurchases (CustName, Description, Price, Quantity)
VALUES ('Matt','Punk CD', 12.00, 1);

INSERT INTO dbo.DigitalPurchases (CustName, Description, Price, Quantity)
VALUES ('Jon','Ska CD', 12.00, 2);

 

INSERT INTO dbo.PhysicalPurchases (CustName, Description, Quantity, Price)
VALUES('Paul', 'Stock Magazine', 2, 5.00);

INSERT INTO dbo.PhysicalPurchases(CustName, Description, Quantity, Price)
VALUES('Jon', 'Magazine Subscription', 1, 35.00);

INSERT INTO dbo.PhysicalPurchases (CustName, Description, Quantity, Price)
VALUES('Matt', 'Magazine Subscription', 1, 35.00);

INSERT INTO dbo.PhysicalPurchases (CustName, Description, Quantity, Price)
VALUES('Matt', 'Stock Magazine', 1, 5.00);

 
After inserting the values above into the two tables we can start joining this data together and find the aggregated data that we’re looking for using a derived table, for example if we want to find the total quantity of products sold: 

SELECT SUM(CombinedPurchases.quantity)
FROM
(
    SELECT * FROM dbo.PhysicalPurchases
    UNION ALL
    SELECT * FROM dbo.DigitalPurchases
) AS CombinedPurchases;
--Returns 54


Or swapping the select statements as it shouldn’t matter which order we combine the data in:

SELECT SUM(CombinedPurchases.quantity)
FROM
(
    SELECT * FROM dbo.DigitalPurchases
    UNION ALL
    SELECT * FROM dbo.PhysicalPurchases
) AS CombinedPurchases;
--Returns 88


Firstly, it’s odd that these two queries returned different results but secondly, we can see from the data that the returned results are also both incorrect as the actual total quantity of our sold items is 13.


So why is this happening? Well, when we are combining our two queries using SELECT *, SQL Server is comparing the columns in the order that we originally created the tables in. This means that the “Price” column of the first query is being matched to the “Quantity” of the second query and vice versa.

To correct this, we should always list the columns that we want to match deterministically as below:

SELECT SUM(CombinedPurchases.quantity)
FROM
(
    SELECT CustName, Description, Price, Quantity
    FROM dbo.PhysicalPurchases
    UNION ALL
    SELECT CustName, Description, Price, Quantity
    FROM dbo.DigitalPurchases
) AS CombinedPurchases;
--Returns 13


We can also verify that the correct data is being outputted by running the inner UNION ALL query which will now return the expected data:

Correct Sub Query Blog.png

This may seem like an edge case however when reference tables are being used as part of a database design to connect information together it becomes much easier for columns to be misaligned and produce incorrect results as you can see above.

 

Struggling with a tricky SQL Server issue? We can diagnose problems and optimise performance with our Data Platform Healthcheck:

Solve performance problems and improve security with Coeo’s thorough review

Subscribe to Email Updates

Back to top