Some of the scripts I have inherited or had to work on have been written by people who separate every logical step into a separate SQL statement and temp table. The longest script I have had to work on was over 20 000 lines which I managed to simplify to couple hundred lines. I frequently encounter scripts several thousands lines long. In this post I share a method I follow to simplify lengthy SQL scripts.
Start with the end result
Ignore all 20 000 lines of the SQL script and look at the query result. In most cases this is a a table or result set. Get the list of columns in the result set and if result is a table, use the Information schema to get the columns. Below is an example script to get columns in PostgreSql.
SELECT
column_name
FROM
information_schema.columns
WHERE
1=1
AND table_schema = 'public'
AND table_name = 'invoiceline'
First principles
Next step is to identify the table that is the most granular and from which all other metrics are derived. I usually refer to this as the master table. There is always a master table and you need to find it. Identifying the master table will need some knowledge of the data domain and the structure of the database. You should end up with a simple founding statements for the new SQL statement you are writing. The statement will be anchored on the master table. Examples of such statements are:
- There cannot be more items sold than items in the sales detail table.
- There can not be more customers than registered customers in the customer table.
Add columns to query
Now that you have the master table and columns you need in your result, you can can start building your query. You should follow these steps:
- Identify from which tables the columns are derived
- Add columns by joining to tables one at a time
I have been through this process more times than I can remember. The incremental process of adding columns to the query played a big role in how I came to format my queries the way I do. I start with the master table and incrementally add tables and columns but at any point I am able to test the results of my query by commenting out parts of the SQL statement.
Use Common Table Expressions (CTE)
There might be good some good reasons for the temp tables. These might include:
- Performance reasons
- Joins to tables of a different grain.
- The need to perform certain calculations at a different aggregation level.
Most SQL databases now support CTE and if need be you should use CTE to create the query. It results in a much cleaner and readable query.
Conclusion
For a reporting or data warehouse project to be successful the data produced must be accurate. Having simpler scripts that are readable and founded on first principles will go a long way towards providing accurate data. . If you have confidence in your SQL you will have confidence in the data and reports you provide to the business. If you have to work through thousands of lines of code you I have serious doubts you will stand up and proclaim your report/data extract to be accurate.