No matter how many reports and cubes you build you are always going to have users who need large extracts of raw data in CSV format. Typically this data is used for analysis using tools such as Tableau, R and MS Excel. In my experience large data extracts are done by an individuals part of the Business Intelligence (BI) teams. In some cases the DBA does these extracts. This is not the best use of personnel especially for frequently requested extracts. This post advocates using a stored procedure called from your BI tool of choice. The stored procedure writes the data straight to the file system bypassing the BI tool.
Web based BI tools such as IBM Cognos are not suitable for doing large data extracts. This is because the data goes on a two legged journey. First from the database to the BI server, then from the BI server to your computer. If a report is generating too much data, your first solution should always be to engage the user. Find out exactly what it is they would like to see from the data. In a lot of cases you will end up with a better defined report that does not need the mountain of data.
Data scientists/analysts tend to need raw extracts of the data to do their work. It is for these users that I would I recommend using a stored procedure to export the data directly from the database to file. The advantages of this solution include:
- You do not have to give every statisticians and data analyst/scientists access to the database. This keeps your database more secure
- This solution will be faster than trying to do data extracts using most if not all BI tools. Data does not go through a journey from the database to the BI web server then onto the user’s computer
- Stored procedures can be parametrized. This gives you flexibility to dynamically alter your SQL. Parameters can be used to:-Filter data-Select columns to include-Set file name of extract
- A stored procedure gives you an abstraction layer you can use to shield users from underlying database changes.
- Most databases if not all, have built in functionality to export data in CSV format. In short you are utilising a feature database are good at.
- You can use your BI application built in security to grant and deny access to the data extract. Most BI applications integrate with Active Directory or LDAP.
- BI application logging can be used to can keep track of who is performing the data extracts.
Below is a MySQL stored procedure that extracts the Invoice data from the Chinook database straight to the file system.
CREATE PROCEDURE sp_extract_invoices(
country VARCHAR(30)
)
BEGIN
SET @sql = concat("SELECT 'InvoiceId' ,'CustomerId' ,'InvoiceDate' ,'BillingCity' ,'BillingCountry'
UNION
SELECT
InvoiceId
,CustomerId
,InvoiceDate
,BillingCity
,BillingCountry
FROM
Invoice
WHERE
BillingCountry = '", country, "'
INTO OUTFILE '/tmp/", country,".csv'");PREPARE stmt
FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END
A PostgreSQL alternative is available from my Github repository.
Having a painless method to export large amounts of data can open another can of worms. I would advise you to follow these guidelines:
- Limit the number of users who can run this report. Typically these should be your data scientists, statisticians and actuaries.
- Make the output file of the stored procedure the user name of the person running the report. Overwrite existing files ensuring there is only ever one output file per user. This forces the user has to remove files from network folder to their own computer before running another extract.
Conclusion
With the explosion of data visualization tools and the high demand for data insights it is impossible for BI departments to keep up with all data requests. Overwhelmed by the number of requests data BI departments follow one of two methods:
- Give everyone access to the database
- Put in stringent requirements for end users to make data requests. The result is only the most determined of users will make requests
Using stored procedures to extract data provides a standardised interface for individuals to use third party tools to visualise the data. You give end users the freedom to user different tools to analyse data but through your stored procedure you are able to standardise and monitor extracts.