Shortly before we all went on break for the holiday, Oracle announced the new BDA X3-2. Now, I have time to properly sit down with a glass of fine scotch and dig into the details of the release. Turns out that there are quite a few changes packed in. We are getting new hardware, new Hadoop, new Connectors, and new NoSQL. Tons of awesome features are included.
For those in a hurry, these are the best new features in the release in my opinion:
- CDH4.1 with high availability name nodes.
- Oracle SQL Connector for HDFS
- Oracle NoSQL with elastic sharding
- Faster CPUs and more memory
And now in more detail:
- New Intel processors. Each node now arrives with two 8-core Intel Xeon E5-2660 processors.
- 64G RAM per node, expandable up to 512G.
Disregard what I said in the past about this being too much memory for workloads that tend to be IO-bound. There is no such thing as too much memory, especially not for map-reduce jobs, which tend to be written in Java – not a language known for its efficient use of memory. You will want at least 1G RAM per task, with 2G being more reasonable. 64G RAM will allow you to run around 30 tasks per machine at 2G each, which is rather balanced for 16 cores. More memory will allow you to configure larger IO and network buffers, run reduce-side joins that use more memory as “scratch” space, and, of course, run Impala.
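Here is the memory arithmetic above as a back-of-the-envelope sketch. The function name and the `reserved_gb` parameter (memory held back for the OS and the Hadoop daemons) are my own illustration, not anything that ships with the BDA or CDH.

```python
def task_slots(node_ram_gb, ram_per_task_gb, reserved_gb=0):
    """Return how many concurrent tasks fit in one node's memory."""
    usable_gb = node_ram_gb - reserved_gb
    return int(usable_gb // ram_per_task_gb)


CORES_PER_NODE = 16  # two 8-core processors per node

for per_task_gb in (1, 2):
    # reserved_gb=4 is an assumed figure, chosen only for illustration
    slots = task_slots(64, per_task_gb, reserved_gb=4)
    print(f"{per_task_gb}G per task: {slots} slots, "
          f"{slots / CORES_PER_NODE:.1f} per core")
```

With nothing held back, 2G per task gives 32 slots on a 64G node. Holding back an assumed 4G brings that to 30, or roughly two tasks per core.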
Hadoop is upgraded from CDH3 to CDH4.1 with the following new features:
- High-availability name node. This means that the name node is no longer a single point of failure, which removes the biggest issue with deploying Hadoop as a production enterprise cluster.
- Federated name nodes. The amount of data stored in the cluster is limited by the amount of memory in the name node. Federated name nodes allow working around this barrier by splitting the filesystem between multiple name nodes for a higher total memory limit. I doubt anyone with BDA will require this feature.
- YARN – the new job framework with the new resource manager.
- Impala! It actually doesn’t arrive on the BDA, and I’m not even sure if it’s supported. Impala is pretty beta anyway. But it’s also pretty awesome. Queries that take 10 seconds to run on Hive take milliseconds in Impala. It does everything in-memory, which is the best excuse for upgrading to the 512G RAM version of BDA. Remember that it is not part of the BDA, but if you decide to install it, it should run fine on CDH4.1. If you try to install it and break something, don’t tell support “Gwen said it should run fine”.
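To put a rough number on the name node memory limit mentioned in the federation bullet, here is a small sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of name node heap per namespace object (file or block). That figure is an approximation, not a measurement from a BDA, and the function names are hypothetical.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value
GIB = 1024 ** 3


def namenode_heap_gb(num_files, blocks_per_file=1,
                     bytes_per_object=BYTES_PER_OBJECT):
    """Estimate the heap needed to track num_files files and their blocks."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * bytes_per_object / GIB


def files_that_fit(heap_gb, name_nodes=1, blocks_per_file=1,
                   bytes_per_object=BYTES_PER_OBJECT):
    """Estimate how many files fit; federation multiplies the namespace."""
    bytes_per_file = (1 + blocks_per_file) * bytes_per_object
    return name_nodes * int(heap_gb * GIB // bytes_per_file)


print(f"{namenode_heap_gb(10_000_000):.1f}G heap for 10M single-block files")
print(f"{files_that_fit(32):,} files fit in a 32G heap")
print(f"{files_that_fit(32, name_nodes=2):,} files with two federated name nodes")
```

Under these assumptions, a single 32G heap already tracks on the order of a hundred million single-block files, so the estimate is consistent with federation mattering only for very large namespaces.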
BDA arrives with connectors that are compatible with CDH4. The release notes were a bit useless as they listed all connector capabilities without breaking them down into old and new features. I had to dig into the docs to try to find what is actually new. Here’s what I figured out. Feel free to correct me if I got it wrong:
- Oracle SQL Connector for HDFS (OSCH) is the new name for Oracle Direct Connector for HDFS. The direct connector worked as a pre-processor for external tables, allowing us to reference a file on HDFS. This was pretty cool. The new connector runs as an MR process and creates the external table for you. If the data you want is in Hive, it will read the Hive metadata store to create the external table definition for you. Normal files get external tables in which all columns are varchar2. The deeper integration with Map Reduce seems to allow better support for parallel queries. It also looks like Avro file format and encryption codecs are now supported – which is awesome considering how much trouble the lack of support caused me in the past.
- Oracle Loader for Hadoop seems to support loading data from NoSQL 2.0 in addition to Hadoop. Support for Avro was added here as well.
- There is also a new Connector for R, but I didn’t dig into the features there yet.
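To make the "all columns are varchar2" point about plain files concrete, here is a sketch of the kind of external table definition I have in mind. This is not the DDL the connector actually generates, which contains more clauses. The helper, the table and directory names, and the VARCHAR2(4000) length are all my own assumptions for illustration.

```python
def external_table_ddl(table, columns, directory, location_file, delimiter=","):
    """Build an illustrative external table DDL with every column as VARCHAR2."""
    # Plain text files carry no type information, so every column is a string.
    cols = ",\n".join(f"  {name} VARCHAR2(4000)" for name in columns)
    return (
        f"CREATE TABLE {table} (\n{cols}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (\n"
        "    RECORDS DELIMITED BY NEWLINE\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n"
        "  )\n"
        f"  LOCATION ('{location_file}')\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Hypothetical table, directory and location file names.
print(external_table_ddl("web_logs_ext", ["ip", "ts", "url"],
                         "hdfs_ext_dir", "web_logs.loc"))
```

The practical consequence is that any typing (dates, numbers) has to happen in the queries or views you put on top of the external table.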
New management tools:
I haven’t looked into any features there either, but the new BDA includes Cloudera Manager and a Big Data plugin for OEM.
Oracle NoSQL 2.0:
The new Oracle NoSQL release is the first release that has features specifically for the Enterprise Edition.
- Hadoop integration – New classes allow Hadoop MR jobs to read data stored in Oracle NoSQL. I’ve seen a lot of requests for this feature, but I never really understood it. I’m curious to see how this will be used. If anyone uses this feature, I also want to hear why they don’t use HBase. However, MongoDB and Cassandra already have support for MR jobs, so it’s nice to see Oracle NoSQL closing this gap.
- Access from Oracle RDBMS through External Tables (Enterprise Edition only) – I think it’s implemented through the new support for MR jobs and the new Hadoop connectors.
- Avro support – It defines schema for the data contained in the record value. Schemas are defined with JSON, and there is some support for schema evolution.
- Support for different numbers of replication nodes per physical storage node – This allows heterogeneous hardware in the NoSQL cluster. Not that exciting for BDA owners.
- Elastic sharding – This is a feature that was sorely missing from the previous release. You can now add replication nodes, move them around, rebalance load between nodes, etc. The number of partitions is still static, so you still want to configure this right from the installation.
- Stream based API for storing very large values without materializing them in memory in full.
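Since the Avro bullet is the most abstract of the bunch, here is a small sketch of what JSON-defined schemas and schema evolution look like. It uses only the Python standard library, not the Avro library or the Oracle NoSQL API. The `User` schemas and the `read_with_schema` helper are hypothetical. The helper only mimics one Avro rule: a field that the reader's schema has and the stored record lacks takes its default value.

```python
import json

# Version 1 of a hypothetical record schema, written as Avro-style JSON.
SCHEMA_V1 = json.loads("""
{"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
]}
""")

# Version 2 adds a field with a default, so old records remain readable.
SCHEMA_V2 = json.loads("""
{"type": "record", "name": "User", "fields": [
    {"name": "name",  "type": "string"},
    {"name": "age",   "type": "int"},
    {"name": "email", "type": "string", "default": ""}
]}
""")


def read_with_schema(record, reader_schema):
    """Project a stored record onto reader_schema, filling in defaults."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result


old_record = {"name": "alice", "age": 30}  # written with SCHEMA_V1
print(read_with_schema(old_record, SCHEMA_V2))
```

Reading in the other direction also works in this sketch: a record written with the newer schema and read with the older one simply loses the `email` field. A new field without a default is where evolution breaks, which is the case the `ValueError` stands in for.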
Clearly an awesome release, packed not just with new features, but with features that the customers actually need. I can see existing BDA owners looking to upgrade, not for the hardware boost, but for critical features like HA namenode and elastic NoSQL.
This brings me to a painful point: Nothing in the release notes or white papers even mentions the possibility of an upgrade. Sure, owners of BDA can just start installing new Cloudera Manager, upgrade to CDH4, install new Oracle NoSQL, etc. However, if this is indeed enterprise software, the word “patch” should have been mentioned somewhere, in my opinion. To further complicate things, while there are very clear instructions for upgrading a cluster from CDH3 to CDH4 while keeping it intact, it remains unclear how to upgrade Oracle NoSQL to release 2.0 without the equivalent of exporting the data and re-importing it into a new cluster. What’s even less clear is whether Oracle will even support CDH4.1 and Oracle NoSQL 2.0 on the old appliance. If they do, it’s not really an appliance, and if they don’t, it makes the whole BDA proposition far less attractive.
The only other complaint I have about BDA X3-2 is the operating system. OEL 5.8. Meh. I guess Oracle decided that upgrading every single component in this release is too much and left the OS alone?
If you are considering buying BDA or just wondering why Oracle geeks like me dig this Hadoop stuff, I’m giving a webinar for IOUG Big Data SIG. I will explain why Hadoop is just what your enterprise data warehouse needs and what the best ways to integrate Hadoop and Oracle are. The webinar is on Jan 8, 11am PST, and I’m racing against the clock to update the presentation with all the cool BDA X3-2 features.
Register here: https://www1.gotomeeting.com/register/150023664