The Coeo Blog

Calculating Age with an Inline Table Valued Function

Written by Simon Osborne | 28-Feb-2014 13:46:13

I was working on a data warehousing project recently where I was working with VERY large data sets. It was a customer insight warehouse and as part of that warehouse details about customers were stored including their DOB.

Part of the requirement was to keep a record of their age at the time of the transaction and also be able to track that person’s current age.

As with most things we always try to re-use something (a piece of code, a process) that we’ve used before to save time right?

So there was a UDF in the database called ufn_GetAge. This was a fairly typical scalar UDF that took two date parameters and then worked out the elapsed time between the two to calculate an age:

CREATE FUNCTION [dbo].[ufn_GetAge]
(
@DOB    DATETIME,

@Date       DATETIME
)

RETURNS INT

AS

BEGIN

DECLARE @Age         INT

IF @DOB >= @Date

RETURN 0

SET @Age = DATEDIFF(YY, @DOB, @Date)

IF MONTH(@DOB) > MONTH(@Date) OR

(MONTH(@DOB) = MONTH(@Date) AND

DAY(@DOB)   > DAY(@Date))

SET @Age = @Age - 1

RETURN @Age

END

This had always “worked fine” before and so it was re-used as part of an ETL load for a VERY large data set (300+ million rows). So what happened when we ran the ETL? Well, all I can tell you is it took a very long time…

Some debugging on the ETL was performed and this function was found to be the culprit. A new solution was called for, enter the Inline Table Valued Function.

So… we needed an accurate way of calculating the difference between two dates in order to come up with an Age (in whole years). I wrote several functions that should have done the trick but under testing showed the results generated were not always quite correct. Eventually I came up with a piece of code that does the trick:

CREATE FUNCTION [dbo].[iTVF_CalculateAge]

(

@DOB date,

@Date date

)

RETURNS TABLE

AS

RETURN

(

SELECT

CASE

WHEN (@DOB IS NULL) OR (@Date IS NULL) THEN -1

ELSE DATEDIFF(hour,@DOB, @Date)/8766

END AS Age

)

CAVEAT: This always seems to come up with the correct answer and passed all the unit tests we ran on it. If you wish to use it, ensure that you test it to make sure it generates the results you need. If you do find an inaccuracy then do please let me know in the comments below this post.

So… how does this new inline function perform when compared to the original UDF above? I’m glad you asked!

Firstly I created a Test database and two test tables. Once called Test and once called Test2. Both contain two columns, an ID (INT IDENTITY (1,1)) and a DOB (DATE). I inserted 1 million dates into Test and 10 million dates into Test2.

Testing, Testing, 123….

Firstly the 1 million row dataset:

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

SELECT t.ID, t.DOB FROM dbo.Test t;

SELECT t.ID, t.DOB, [dbo].[ufn_GetAge](t.DOB,GETDATE()) FROM dbo.Test t;

SELECT t.ID, t.DOB, age.Age

FROM dbo.Test t

CROSS APPLY [dbo].[iTVF_CalculateAge](t.DOB,GETDATE()) age;

SET STATISTICS IO OFF;

SET STATISTICS TIME OFF;

The first query is run merely to get an idea of raw performance to select out the data (no functions or calculations applied), the second query runs the UDF and the third uses the new inline function.

And so onto the results:

(1000000 row(s) affected)
Table 'Test'. Scan count 1, logical reads 1986, physical reads 3, read-ahead reads 1973, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 328 ms,  elapsed time = 5164 ms.
(1000000 row(s) affected)
Table 'Test'. Scan count 1, logical reads 1986, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 5397 ms,  elapsed time = 6567 ms.
(1000000 row(s) affected)
Table 'Test'. Scan count 1, logical reads 1986, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 874 ms,  elapsed time = 5409 ms.

As you can see from the results, the raw select of all the data (1 million rows) it took 5164ms, the UDF took 6567ms and the iTVF took 5409ms. So on time alone the iTVF wins, but the real benefit is in the CPU cycles… the iTVF took only 874ms where the UDF took 5397ms! I declare iTVF the winner.

And over 10 million rows:

(10000000 row(s) affected)
Table 'Test2'. Scan count 1, logical reads 19838, physical reads 2, read-ahead reads 19834, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 3152 ms,  elapsed time = 49175 ms.
(10000000 row(s) affected)
Table 'Test2'. Scan count 1, logical reads 19838, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 52494 ms,  elapsed time = 66646 ms.
(10000000 row(s) affected)
Table 'Test2'. Scan count 1, logical reads 19838, physical reads 2, read-ahead reads 19834, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6724 ms,  elapsed time = 51387 ms.

The results are very clear. Over the 10 million row results set, using the iTVF we have saved nearly 48,000 ms in CPU time, and 15 seconds in elapsed time! When this was used where I’ve been working recently we took an 18 minute query down to 10 seconds. That is a hell of a saving.

As the result set grows the savings grow, I encourage you to consider converting your UDF’s to iTVF’s and start boosting the performance of some of your procedures!