“Plan for 1000 sources of data” and “Don’t move data any more than you have to.” These were just two of the many interesting best practices shared with a room of IT executives at TDWI’s Orlando analytics leadership summit earlier this week. Analytics leaders from companies like Red Hat, Disney, Macy’s, Skullcandy and Quicken Loans shared experiences from their journeys toward becoming data-enabled organizations.
When it comes to architecting analytics platforms to make data available to users, thinking big while staying efficient was a recurring theme throughout the conference. While not all platforms bring in thousands of data sources today, the day is coming soon when they’ll have to handle that many. That means forward thinking and planning are critical when deciding today how to architect, or which platform to buy, to handle your data.
Getting ALL THE DATA in one place is also a key component of a solid data enablement strategy. Speakers explored a number of options for data platform architectures. All of them included data lakes alongside data warehouses, plus a combination of open-source technologies. For example, in one scenario Hadoop served as a company’s data lake and R handled analytics, while commercial software like Tableau and SAS handled visualizations. Most used a mix of cloud and on-premises environments. One speaker described his data lake in the cloud as “Deep Data” and his traditional on-premises data warehouse as “Shallow Data.” People spoke of the cloud as a facilitator of scale and agility, and sometimes, but not always, cost savings.
The need for access to ALL THE DATA is driven by more than the desire for a richer data set. It’s a response to the growing awareness that different user personas exist, each with different data access needs. While the majority of users will access nicely curated, governed data in the data warehouse, the growing ranks of data scientists want access to ALL THE DATA, even the messy, ungoverned data that logically sits in the data lake. And with the emergence of “citizen data scientists,” power users who fall somewhere between business users and data scientists, another class of data is emerging: a lightly governed, curated subset of the raw data, also in the data lake. Data no longer has to be fully governed to be useful. With ALL THE DATA comes choice.
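The three classes of data described above can be pictured as zones in the lake, each mapped to the personas allowed to read from it. Here is a minimal sketch of that mapping; the zone names (“raw”, “curated”, “warehouse”) and persona labels are illustrative assumptions, not a standard:

```python
# Hypothetical zone layout for a platform serving three user personas.
# Zone and persona names are illustrative, not an industry standard.

ZONES = {
    "raw": {          # ungoverned data in the lake: data scientists only
        "governance": "none",
        "personas": {"data_scientist"},
    },
    "curated": {      # lightly governed subset of the raw data: adds citizen data scientists
        "governance": "light",
        "personas": {"data_scientist", "citizen_data_scientist"},
    },
    "warehouse": {    # fully curated and governed: open to all business users
        "governance": "full",
        "personas": {"data_scientist", "citizen_data_scientist", "business_user"},
    },
}

def zones_for(persona: str) -> list:
    """Return the zones a given persona may read from, least governed first."""
    return [name for name, zone in ZONES.items() if persona in zone["personas"]]
```

For example, `zones_for("business_user")` returns only `["warehouse"]`, while a data scientist sees all three zones, which captures the idea that governance is a property of the zone, not of the user.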
Cloud was a recurring thread woven through many of the discussions around ALL THE DATA, along with the realization that most of the data being brought into analytics platforms now originates outside the enterprise, in the cloud. Moving data from a cloud source to a cloud data lake is simply more efficient than pulling it down on-premises only to push it back up, which leads to the best practice “Don’t move your data any more than you have to.” Couple this advice with the knowledge that data flows not just from the data warehouse into the data lake but also from the data lake into the data warehouse, and you can expect to see more data warehouses move to the cloud using services like Microsoft’s Azure SQL Data Warehouse, Google’s BigQuery and Amazon’s Redshift. We see this every day at Pythian, and we specifically designed our Kick Analytics as a Service offering to meet the growing demand for cloud-native analytics solutions.
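One concrete instance of “don’t move your data any more than you have to” is loading a cloud warehouse directly from cloud object storage, so the bytes never leave the provider’s network. The sketch below builds a Redshift-style COPY statement from an S3 path; the bucket, table, and IAM role names are all hypothetical, and you would execute the resulting SQL through your own client or driver:

```python
def copy_from_s3(table, s3_path, iam_role, fmt="CSV"):
    """Build a Redshift COPY statement that loads a table straight from S3,
    keeping the data movement entirely inside the cloud."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS {fmt};"
    )

# All identifiers below are hypothetical examples.
sql = copy_from_s3(
    "analytics.events",
    "s3://example-lake/events/2016/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
```

The same cloud-to-cloud pattern applies to BigQuery loading from Cloud Storage or Azure SQL Data Warehouse loading from Blob Storage: the warehouse pulls from storage in the same cloud rather than round-tripping the data through on-premises systems.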