We have all seen it before. You finally get access to new data sources and want to extract insight from that data. You open your favorite reporting or data visualization tool, connect and start updating the tool’s semantic layer. The most common items that get done there are:
- Organizing attributes and measures into folders
- Renaming attributes and measures in readable business terms
- Creating hierarchies
- Defining number format and default aggregations for measures
- Creating computed fields for missing measures and attributes. For example:
- Profit := Revenue – Cost
- Region: = if State in (NY,…) then “East” elseif State in (CA,…) then “West” else “Unknown”
Some of the above items belong in the tool’s semantic layer, especially items relating to formatting, aggregation and defining drill-downs. The rest should be in your data engine of choice. I know that it’s easier just to shove everything in your tool of choice, but that method has several major drawbacks:
- Development velocity: It’s much faster to create your data insights by dragging attributes and measures versus defining them. Getting the formulas just right can be very challenging if you don’t understand the nuances in your data.
- Performance: It’s much faster to have all the measures and attributes pre-calculated and stored in the data engine. There is no math involved and retrieving data from modern data stores is very cost effective these days. Combinations of filters, joins and complex formulas can generate poor-performing queries.
- Accuracy/Consistency: No matter which tool is used, the formulas are consistent and built by the people who understand the nuances of the underlying data. Attribute data is consistent across reporting/data visualization tools. There is nothing worse than having two people for on data and presenting numbers that don’t add up with each other. You end up spending lots of time figuring out which one is correct rather than making business decisions.
- Scalability: Multiple people using multiple tools don’t have to recreate computed fields over and over again. A change to a formula doesn’t necessitate updates to every tool’s semantic layer. Having the atomic math done as part of the data pipeline allows multiple tools to be used more efficiently and is easier to govern.
We have now defined all of the formatting in the front end tools and the bulk of the math in the data store but we are not quite done yet. You can inadvertently create another problem in the data layer which involves an over-reliance on views, especially hierarchy of views. It sounds like the right thing to do – easy to change – but again, you are making the data engine work more than it really needs to. The best approach is to have the computes done within the data pipelines and have just strait tables that you point to.
The main goal, after all, is to make it easy and fast to use – not transform.