Is X a Big Data Product?


Virtually everyone in the data space today claims to be a Big Data vendor selling Big Data products. Of course: if you are not in Big Data, then you are legacy. So how do you know whether a product is a Big Data product?

While there might not be fully objective criteria (mainly because the definition of Big Data is still up in the air and people interpret it as they see fit for their purpose), I think I can provide one good suggestion on how to determine when a certain product is NOT a Big Data product. Of course, it will depend on the definition of Big Data that you believe in.

I believe that Big Data is mostly about being “affordable at scale”, quoting Jeff Needham, my good friend and fellow member of the OakTable Network. In practice, that means commodity software, commodity hardware and commodity operations of the solution. I won’t define the thresholds of scale in terabytes or levels of complexity, but I can provide some guidelines.

Talking about commodity hardware, it’s generally based on the x86 architecture (some say ARM is emerging, but it’s been emerging for way too long for my liking) with reasonably priced components. That would typically be dual-socket systems with up to a few hundred GB of RAM, maybe a dozen disks or some SSDs, and cost-effective networking. If we narrow down to Hadoop-like architectures, then a cluster node would typically cost between $4,000 and $10,000. Anything significantly above that is probably overpriced or overspec’ed.

OK. Now that we are good with hardware, let’s look at software. Obviously, open-source software without any commercial support qualifies as commodity and affordable. If you are Facebook-scale (or getting relatively close), your commercial support can be your own large, capable engineering team. Otherwise, you will most likely have commercial support. Back in the Hadoop world, you should expect to pay for a commercially supported Hadoop distribution (whichever it is of the three leading distributions: Cloudera, Hortonworks or MapR) the same order of magnitude as for the hardware itself. Annually, it would be a fraction of the hardware cost; over three years, it would be about the cost of the hardware purchase or slightly above, depending on the level of support and platform features. You get the idea. Non-open-source products licensed at similar pricing levels are Big Data products too: you don’t have to be open-source to call your technology Big Data.

Let’s take an example of a supposedly Big Data product. If a product has “Big Data” in the name, it surely must be a Big Data product. Eh?

I love quite a few Oracle products, so why don’t I look at their lineup… Big Data Appliance is a prebuilt Hadoop system, or Hadoop appliance, with 18 powerful data nodes per rack and a list price of $525K per rack. That gets you to almost $30K per data node, which is quite high, and you would likely not build your own clusters like that. Add to that about $100K per year of support and maintenance for systems and OS (you can check pricing in the public engineered systems price list). Big Data Appliance does include a commercially supported Cloudera distribution, so it might not be that terrible pricing-wise. If you have experience buying Oracle products, you also know that customers don’t pay list prices. Thus, I can accept that Big Data Appliance can actually be called a Big Data product… just.
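The per-node figure is simple arithmetic, and it is easy to rerun for any other appliance. Here is a minimal sketch in Python using only the numbers quoted in this post; the variable and function names are mine, and the comparison against the top of the $4,000 to $10,000 commodity range is just one way to frame it.

```python
# Figures quoted in the post (list prices; real deals are usually discounted).
RACK_LIST_PRICE = 525_000             # Big Data Appliance, per rack
DATA_NODES_PER_RACK = 18
COMMODITY_NODE_RANGE = (4_000, 10_000)  # typical Hadoop node cost from the post


def cost_per_node(rack_price, nodes):
    """Average hardware list price per data node."""
    return rack_price / nodes


per_node = cost_per_node(RACK_LIST_PRICE, DATA_NODES_PER_RACK)
# How many times more expensive than the top of the commodity range?
multiple_of_commodity_top = per_node / COMMODITY_NODE_RANGE[1]

print(f"List price per data node: ${per_node:,.0f}")
print(f"Multiple of top commodity price: {multiple_of_commodity_top:.1f}x")
```

At list price, a node comes out at roughly three times the top of the commodity range, which is why the appliance only just qualifies, and only once discounts and the bundled Cloudera support are taken into account.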

Now let’s look at another product: Big Data SQL. It has been announced but hasn’t quite been released just yet (or did I miss it?). Awesome product, by the way. Great way to push some of the data-intensive SQL processing from Oracle Database down to Hadoop. Now, it’s probably not widely known (since it hasn’t really been publicly released and sold yet) that Big Data SQL is licensed per disk spindle, at a list price of $4,000 per spindle. Add to that the typical 22% for annual software support and maintenance from Oracle. If I were to license Big Data SQL for a 100-node Hadoop cluster with 12 disks per node, it would cost me almost $5M at list price. Don’t forget to add 22% annually. This is an order of magnitude more than I would spend on the hardware to build such a cluster. But wait, it looks like Big Data SQL only works with Big Data Appliance. Even in this case, the cost of Big Data SQL for a single-rack appliance is $864K + 22% annually, and that’s just one additional tool for your Big Data platform.
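To make the licensing math explicit, here is a rough Python sketch. The $4,000 per spindle and 22% support rate come from the figures above. The 12 disks per appliance node is my inference from the $864K figure (18 nodes × 12 disks × $4,000). The three-year horizon and the flat, non-escalating support fee are assumptions for illustration, not Oracle's actual terms. Function names are hypothetical.

```python
LIST_PRICE_PER_SPINDLE = 4_000   # USD, list price quoted in the post
SUPPORT_PERCENT = 22             # annual support as a percentage of license fee


def license_cost(nodes, disks_per_node, price_per_spindle=LIST_PRICE_PER_SPINDLE):
    """One-time license fee when licensing is per disk spindle."""
    return nodes * disks_per_node * price_per_spindle


def total_cost(license_fee, years, support_percent=SUPPORT_PERCENT):
    """License plus flat annual support over `years` (assumes no uplift)."""
    return license_fee + license_fee * support_percent * years // 100


cluster_license = license_cost(nodes=100, disks_per_node=12)  # 100-node cluster
rack_license = license_cost(nodes=18, disks_per_node=12)      # single-rack appliance

print(f"100-node cluster license: ${cluster_license:,}")
print(f"Single-rack license:      ${rack_license:,}")
print(f"100-node cluster, 3 years with support: ${total_cost(cluster_license, 3):,}")
```

For comparison, 100 commodity nodes at the $4,000 to $10,000 mentioned earlier would cost $400K to $1M in hardware, so the license alone is five to twelve times the hardware bill before any support is added.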

Based on what I know about Big Data SQL (and assuming it works as advertised when released), I love it: push code to data, scalable massively parallel processing, leveraging great features from Exadata Storage software. Great job to the folks who developed this product. Unfortunately, I cannot call it a Big Data product, because it’s not affordable at scale.

So when you look at other vendors calling their product Big Data, do this costing assessment, and if it doesn’t come out as affordable at scale, then it’s not a Big Data product. And feel free to share your assessments for the rest of us here. I’m sure not everyone will share my line of thinking here either. Fire away.



About the Author

What does it take to be chief technology officer at a company of technology experts? Experience. Imagination. Passion. Alex Gorbachev has all three. He’s played a key role in taking the company global, having set up Pythian’s Asia Pacific operations. Today, the CTO office is an incubator of new services and technologies – a mini-startup inside Pythian. Most recently, Alex built a Big Data Engineering services team and established a Data Science practice. Highly sought after for his deep expertise and interest in emerging trends, Alex routinely speaks at industry events as a member of the OakTable.
