This is the first post in a series on best practices for real-time and streaming data analytics, covering terminology, system design, and examples of real-world success. When launching a multi-part series like this one, it’s always best to make sure everyone’s on the same page. So in this initial post, we’ll focus on definitions and possible use cases.
The terms “real-time” or “streaming”, for example, may seem obvious. But they can actually mean different things to different people within the context of a cloud data platform. Adding to the potential for confusion, these terms are also relevant in two different areas of a layered data platform – the ingestion layer and the processing layer:
- Real-time or streaming ingestion takes place via pipelines that stream data, one message at a time, from a source into data storage, the data warehouse, or both.
- Real-time or streaming processing typically refers to straightforward data transformations applied to streaming data, such as converting a date field from one format to another, or more complex data cleanup like enforcing a consistent address field format.
- Real-time or streaming data analytics is usually reserved for the application of complex computations on streaming data, such as calculating the probability of a certain event happening based on previous events.
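The distinction between the ingestion and processing layers can be illustrated with a minimal Python sketch. The message shape, field names, and date formats below are hypothetical, chosen only to show where ingestion ends and processing begins:

```python
from datetime import datetime

def transform(message):
    """Streaming processing: a simple per-message transformation --
    here, reformatting a date field from MM/DD/YYYY to ISO 8601."""
    parsed = datetime.strptime(message["order_date"], "%m/%d/%Y")
    return {**message, "order_date": parsed.strftime("%Y-%m-%d")}

def ingest(message, sink):
    """Streaming ingestion: land each message in storage as it arrives."""
    sink.append(message)

# Simulate a stream arriving one message at a time.
warehouse = []  # stand-in for a warehouse table
for raw in [{"order_id": 1, "order_date": "03/15/2024"},
            {"order_id": 2, "order_date": "11/02/2024"}]:
    ingest(transform(raw), warehouse)
```

Note that the transformation runs on each message individually as it flows through, rather than on a stored batch – that per-message shape is what makes it streaming processing rather than batch processing.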
While the differences we’ve described above matter, for the purposes of this blog series we’ll refer to all real-time data processing and real-time data analytics use cases as “real-time processing”. Note that this is different from real-time ingestion, which involves ingesting fast-moving streaming data from sources such as the internet of things (IoT) and edge devices.
Real-time ingestion can take place without real-time processing. Just because data is brought into your system in real time doesn’t necessarily mean it needs to be processed at the same pace, after all. But real-time processing almost always requires real-time ingestion, for the simple reason that without real-time ingestion, the processing layer won’t have access to real-time data in the first place.
Whether you need one or both, however, depends entirely on the use case. Let’s dive into a couple of examples below.
Real-time or streaming data use cases
Let’s consider two use cases whose requirements rest entirely on the needs of the data consumer. One can be satisfied by ingesting real-time data into the data warehouse without real-time processing, while the other requires a combination of real-time ingestion and processing.
Use case #1: The “real-time” sales dashboard
In our first use case, the final data consumer is a human analyst using a sales dashboard fed by data residing in a data warehouse. This analyst, like most business users, expects “real-time” insights.
That’s a fair demand, especially in today’s business environment. But let’s be honest – neither she nor her team members are going to refresh the dashboard continuously, obsessively clicking away all day so they can act on second-by-second changes. It’s more likely what they really want is to ensure the dashboard can be refreshed on demand, whenever they choose, and that when refreshed, the data is up to date and reflects the current state of things.
To meet the above requirement, we can develop and deploy data pipelines that deliver data to the data warehouse in real time. But even when streaming data arrives continuously into a data warehouse, the data warehouse itself isn’t designed to process that data in absolute real time. So while analytics dashboards can be updated as often as users desire, it usually takes a bit more than a few seconds for a typical dashboard to refresh.
As a rule, when analysis outputs are used by humans, a refresh response time of several seconds (or even minutes, in the case of very large datasets) is usually an acceptable tradeoff between performance and data platform architecture complexity.
So in this use case, data is delivered in real time (streaming) into the cloud data warehouse – but it’s not consumed in real time, or even close to real time. However, since it meets our analyst’s requirements, a refresh that takes even a few minutes is close enough (she and her team even call this a “real-time dashboard” because that’s essentially what it is, for their purposes). Delivering the data to the data warehouse in real time gives our analyst exactly what she and her team need, with no real-time processing required.
“Always explore exactly what your business users mean when they ask for real-time analytics. You can often save yourself extra work and cost by implementing real-time ingestion, while skipping real-time processing.”
This use case illustrates why you should always explore exactly what your business users mean when they demand real-time analytics. If the real-time requirement is simply to make current data available for analysis at any time – but the analysis itself is ad hoc, such as scheduled reports or intermittent dashboard refreshes by the user – you can save yourself extra work and cost by implementing real-time ingestion and skipping real-time processing.
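To sketch this pattern – real-time ingestion paired with on-demand analysis – the example below uses an in-memory SQLite database as a stand-in for the cloud data warehouse. The table and field names are hypothetical:

```python
import sqlite3

# In-memory SQLite stands in for the cloud data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

def ingest_event(event):
    """Real-time ingestion: each sale lands in the warehouse as it happens."""
    conn.execute("INSERT INTO sales VALUES (?, ?)",
                 (event["region"], event["amount"]))

def refresh_dashboard():
    """On-demand refresh: an ordinary query over whatever has landed so far.
    No streaming computation is involved -- this is the 'ad hoc' analysis."""
    return dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())

# Events stream in continuously...
for e in [{"region": "EMEA", "amount": 120.0},
          {"region": "APAC", "amount": 80.0},
          {"region": "EMEA", "amount": 50.0}]:
    ingest_event(e)

# ...but the analysis runs only when the analyst clicks refresh.
totals = refresh_dashboard()
```

The key point is that the expensive part – the aggregation – happens only when a human asks for it, while ingestion keeps the underlying table current at all times.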
Use case #2: Real-time online gaming
We’ll say goodbye to our analyst because in our second use case, human data-crunchers aren’t involved at all. Take the example of online gaming companies: their software constantly collects streaming data from players, with engagement analytics then used to change the game’s behavior in real time. This, obviously, has to happen quickly – much quicker than in Use Case 1, because gamers won’t wait seconds or minutes for the game to react (and if they do, you can pretty much guarantee they’ll be upset and stop playing).
Unlike human analysts, the game is perfectly capable of instantly reacting to real-time data. That’s why in this case, real-time ingestion must be coupled with real-time processing to ensure a strong user experience.
Indeed, if the data consumer is an application that needs to perform actions based on incoming data, it’s a good indicator that both real-time ingestion and processing should be implemented (as is the case here).
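As a minimal sketch of that coupling, the example below uses a hypothetical engagement rule – three consecutive level failures trigger a difficulty adjustment – standing in for real engagement analytics. Each event is processed the instant it arrives, and the action is returned immediately to the calling application:

```python
def process_event(event, state):
    """Real-time processing: update per-player state and decide on an
    action the instant each event arrives -- no human in the loop."""
    fails = state.get(event["player"], 0)
    fails = fails + 1 if event["type"] == "level_failed" else 0
    state[event["player"]] = fails
    # Hypothetical rule: after 3 straight failures, ease the difficulty.
    return "lower_difficulty" if fails >= 3 else "no_change"

# Simulate a stream of player events, reacting to each one as it lands.
state = {}
actions = [process_event(e, state) for e in [
    {"player": "p1", "type": "level_failed"},
    {"player": "p1", "type": "level_failed"},
    {"player": "p1", "type": "level_failed"},
    {"player": "p1", "type": "level_cleared"},
]]
```

Contrast this with Use Case #1: here there is no "refresh" step, because the consumer is the game itself and every incoming event immediately produces a decision.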
Summary: Real-time ingestion or real-time processing?
These two use cases describe two very different, yet complementary flavors of real-time data handling. Use Case #1 is an example of real-time data ingestion (sometimes called “streaming ingestion” or just “data streaming”) without real-time processing. If the only real-time requirement is that current data be available for analysis, but the analysis itself happens in an ad-hoc manner, then real-time ingestion alone is required.
If, as in Use Case #2, the requirement is to have the analysis itself done in real time so results can be passed on to another system for action, then both real-time ingestion AND real-time processing are required.
In the next blog post in this series, we’ll show how these two use cases can be architected in a cloud-native modern data platform.