A blog centred around SQL and data warehousing is going to include a significant number of SQL statements as examples to show concepts covered in the blog. Instead of making up fictional use cases in my head as I blogged I decided to look around for open data sets I could base my examples on. Data set is a broad term, in this context I mean a database back-up that can be restored to a supported database server giving you the same database I am using to try the SQL examples. I found three such data sets, Foodmart, Northwinds and Chinook. I settled on Chinook database.
The first database was the Foodmart database. I have fond memories of Foodmart from the time I spent learning to use the Mondrian OLAP engine in about 2005. I found reference to this database in the Pentaho BI platform with the only link to download the database for MySQL being here. The reason for not using the Foodmart database is that it is dated, it was released with Microsoft Analysis Services 2000 (AS) and was replaced by the AdventureWorks database when AS 2005 was released.
The second database is the Northwinds database which is another sample database from Microsoft. Northwinds database was succeeded by the AdventureWorks database. The good thing with Northwinds database is that it has been around for a long time and hence there are downloads available for different database servers. Northwinds downloads for MySQL, Postgres and SQLite are available. I did not go with Northwinds because it too is a bit dated and seems to invoke groans from developers.
The third data set I found was the Chinook database. I liked the Chinook data set because it is well documented, comes with an ERD diagram and is available for different database servers including DB2, Oracle, MySQL, SQLite and PostgresSQL. It is licensed under the Microsoft Public license, an Open Source licence”) which a allows you to create derivative works for commercial use e.g. examples in an SQL book. From the Chinook web site, “Chinook is a sample database available for SQL Server, Oracle, MySQL, etc. It can be created by running a single SQL script. Chinook database is an alternative to the Northwind database, being ideal for demos and testing ORM tools targeting single and multiple database servers.” The description of the database and support for different databaser vendors suite my needs.
Do you know of any other sample data sets? I am interested in knowing of them so please leave a comment or send me an email. A final thought on sample data sets, as much as data warehouse design is about recognising data patterns, I think it would be beneficial for each industry i.e. Health Insurance, Telecoms and Banking to have a sample data set. An important aspect of data warehousing is understanding the industry you are in and a sample data set with the right terminology would go a long way towards more relevant data warehouse training samples, ultimately better data warehouses.
Edit: Thanks r3pr0b8 over at r/sql for pointing me to the MySQL Sakila Sample Database. There are PostresSQL versions of the database available for download.