Almost every company’s success is contingent upon its ability to turn data into knowledge effectively. Data, combined with the power of analytics, yields previously unavailable business insights that help companies truly transform the way they do business, allowing them to make more informed decisions, improve customer engagement, and predict trends.
Companies can only achieve these insights if they have a holistic view of their data. Unfortunately, most companies have hybrid data—different types of data, living in different systems that don’t talk to each other. What companies need is a way to consolidate the data they currently have while allowing for the integration of new data.
The Hybrid Data Challenge
There are two primary reasons why data starts life in separate systems. The first is team segregation: each team in an organization builds its systems independently of the others, based on its own objectives and data requirements. The second, closely related reason is specialization: teams choose a database system that excels at their specific application tasks. Sometimes they need to prioritize scalability and performance; other times, flexibility and reliability matter more.
Businesses need to evolve their data strategy toward a “data singularity”: a consolidated view of all their information assets that serves as a strategic, competitive advantage and enables real-time knowledge. A key objective of this strategy should be to reduce data movement, especially during processing and query execution, because both are very expensive operations. Data movement, and data replication in general, is also fragile, often requiring significant support resources to run smoothly.
Enter The World Of Hybrid Databases
A hybrid database is a flexible, high-performing data store that can store and access multiple different data types, including unstructured data (pictures, videos, free text) and semi-structured data (XML, JSON). It can locate individual records quickly, handle both analytical and transactional workloads simultaneously, and perform analytical queries at scale. Analytical queries can be quite resource-intensive, so a hybrid database needs to be able to scale out and scale linearly. It must also be highly available and include remote replication capabilities to ensure the data is always accessible.
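As a miniature illustration of the idea, the sketch below uses SQLite (which ships with Python) to keep structured columns and a semi-structured JSON document side by side in one table, then serves both a transactional-style point lookup and an analytical-style aggregate from the same store. The table, columns, and sample data are hypothetical, and the query assumes a SQLite build with the JSON functions enabled (standard in recent versions); a real hybrid database would do this at far greater scale.

```python
import json
import sqlite3

def build_store() -> sqlite3.Connection:
    """Create an in-memory table mixing structured and JSON data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "  id INTEGER PRIMARY KEY,"  # structured key for fast point lookups
        "  amount REAL,"             # structured measure for analytics
        "  doc TEXT)"                # semi-structured JSON payload
    )
    rows = [
        (1, 120.0, json.dumps({"region": "EMEA", "items": 3})),
        (2, 80.0, json.dumps({"region": "APAC", "items": 1})),
        (3, 200.0, json.dumps({"region": "EMEA", "items": 5})),
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn

def point_lookup(conn: sqlite3.Connection, order_id: int) -> float:
    """Transactional-style access: locate one record by its key."""
    row = conn.execute(
        "SELECT amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0]

def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """Analytical-style access: aggregate over a JSON attribute."""
    return dict(conn.execute(
        "SELECT json_extract(doc, '$.region'), SUM(amount) "
        "FROM orders GROUP BY 1"
    ).fetchall())
```

The point of the sketch is that both access patterns hit the same copy of the data; no export, reload, or replication step sits between the transactional and the analytical query.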
In the past, many organizations consolidated their data in a data warehouse. Data warehouses gave us the ability to access all of our data, and while these systems were highly optimized for long-running analytical queries, they were strictly batch-oriented, with weekly or nightly loads. Today, we demand results in real time and query times in milliseconds.
When Cassandra and Hadoop first entered the market, in 2008 and 2011, respectively, they addressed the scalability limitations of traditional relational database systems and data warehouses, but they had restricted functionality. Hadoop offered infinite linear scalability for storing data, but no support for SQL or any kind of defined data structure. Cassandra, a popular NoSQL option, supported semi-structured, distributed document formats, but couldn’t do analytics. They both required significant integration efforts to get the results organizations were looking for.
In 2008, Oracle introduced the first “mixed use” database appliance: Exadata. Oracle Exadata offered a high-performance reference hardware configuration (an engineered system) as well as unique query-performance and data-compression features, on top of an already excellent transaction-processing system with full SQL support, document-style datatypes, spatial and graph extensions, and many other features.
In recent years, vendors have started pursuing the hybrid market more aggressively, and products that cross these boundaries have started emerging. We can now run Hadoop-style MapReduce jobs on Cassandra data, and SQL on Hadoop via Impala. Microsoft SQL Server introduced columnar-format storage for analytical data sets, and In-Memory OLTP, a high-performance transactional engine with log-based disk storage. Oracle also introduced improvements to its product line with Oracle Database In-Memory, a high-performance in-memory column store that brings extreme analytical performance.
Choosing A Hybrid Database Solution
Rearchitecting your data infrastructure and choosing the right platform can be complicated and expensive. To make informed decisions, start with your end goals in mind, know what types of data you have, and ensure the solution can scale to meet your growing and changing business needs, all while maintaining security, accessibility, flexibility, and interoperability with your existing infrastructure. A key design principle is to reduce data movement and keep your architecture simple. For example, a solution that stores data in Hadoop but performs analytics in another database engine is not optimal, because large amounts of data must be copied between the two systems during query execution, a very inefficient and compute-intensive process.
Have questions? Contact Pythian. We can help analyze your business data requirements, recommend solutions, and create a roadmap for your data strategy, leveraging either hybrid or special purpose databases.