Oracle announced the Big Data Appliance in Monday morning’s keynote. Many people, me included, had long been waiting for this to happen. Others didn’t think it would ever happen. So naturally, there is a lot of buzz and excitement around the new device at OpenWorld. The keynote announcement was very short on details and certainly did not satisfy my technical curiosity. So I went to a few presentations to hear what exactly is included in the offering.
The Big Data Appliance (BDA) has 18 Sun x4270 M2 servers per rack. As usual, you can add racks together for larger clusters.
Each node has 48 GB of RAM, 12 Intel cores and 24 TB of storage. Less memory than in the Exadata X2-2 nodes and no SSD indicate that the plan is to hit the spinning magnetic disks a lot for data storage and processing. Not a big deal in Hadoop, where this is the design assumption, but not optimal for the NoSQL portion of the device.
In addition there is 40 Gb/s InfiniBand and 10 Gb/s Ethernet. The choice of InfiniBand for a Hadoop machine is a bit odd, since Hadoop was designed to do most of the processing on the machine that holds the data and avoid overloading the network. On the other hand, connecting the Hadoop cluster to an Exadata machine with InfiniBand will allow for fast data loading, which is exactly what Oracle is after.
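To put the per-node figures in per-rack terms, here is a quick back-of-the-envelope calculation. The hardware numbers are the ones quoted above. The replication factor of 3 is stock Hadoop's HDFS default, and it is my assumption that the appliance keeps it. The function and constant names are made up for this sketch.

```python
# Per-node figures as quoted in the presentations.
SERVERS_PER_RACK = 18
RAM_GB_PER_NODE = 48
CORES_PER_NODE = 12
STORAGE_TB_PER_NODE = 24

# Assumption: HDFS keeps its stock default of 3 copies of every block.
HDFS_REPLICATION = 3


def rack_totals(racks=1):
    """Return aggregate resources for the given number of racks."""
    servers = SERVERS_PER_RACK * racks
    raw_tb = servers * STORAGE_TB_PER_NODE
    return {
        "servers": servers,
        "ram_gb": servers * RAM_GB_PER_NODE,
        "cores": servers * CORES_PER_NODE,
        "raw_storage_tb": raw_tb,
        # Raw capacity divided by the number of copies HDFS stores.
        "hdfs_usable_tb": raw_tb / HDFS_REPLICATION,
    }


if __name__ == "__main__":
    for name, value in rack_totals().items():
        print(f"{name}: {value}")
```

For a single rack this works out to 864 GB of RAM, 216 cores and 432 TB of raw disk, or roughly 144 TB of HDFS data if everything is stored three times.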
On the software side there will be an “Apache compatible” distribution of Hadoop – it is unclear whether it’s straight-up Apache Hadoop or whether it includes any enterprise improvements. We are betting on straight Apache at this point, with improvements hopefully coming later.
There is the Hadoop Loader – if someone told you it’s just a repackaging of Cloudera’s Sqoop, they would be wrong. It was built from the ground up, and Oracle is very proud that they used their exclusive knowledge of Oracle internals to make the data loading much faster than competing software.
I chatted a bit with Greg Rahn at dinner, and he explained that the Hadoop Loader will have four loading methods. DataPump – the loader will massage the data (in Hadoop) into DataPump format that can be used for external tables. JDBC – we didn’t get into details on this, but it sounds like normal SQL inserts, maybe with direct path. OCI – that’s the fastest loading method. The loader will massage the data (again, in Hadoop) and use temporary segments to merge the data directly into Oracle data files. The fourth option is loading the data into text files. (This is mostly from a dinner chat over a beer or two, so take it with a grain of salt. Mistakes are mine and not Greg’s.)
And there is the NoSQL database. The NoSQL database is based on BerkeleyDB Java Edition. At first I thought “Meh. I know BDB, and am not impressed by it”. Once I heard the details, my opinion completely changed – Oracle’s NoSQL team did a first-rate job of renovating BerkeleyDB based on the most recent advances in the NoSQL field, and released a product with capabilities comparable to the popular NoSQL databases Cassandra and Voldemort. In fact, the NoSQL product manager said that their design was inspired by the work done on Voldemort.
Oracle’s NoSQL delivers what you would expect a reasonable NoSQL database to deliver. It is a key/value store, but it has two keys per item – a major key and a subkey. Data is sharded between nodes based on the major key, with the usual hash partitioning. Each item is replicated on three nodes by default, but this is configurable. There is no central management for nodes or locks, so no single point of failure. Writes are done to the node that is considered the “key owner” at that point in time, and you can configure how many replicas should acknowledge the write before it’s considered committed. Commits can be to memory only or to disk. There is even a limited kind of transaction – if you need to modify several items that have the same major key (so they are on the same node), you can define the modification as a single transaction. BTW, “node” in this context is not a single server (because an 18-server rack would make a small cluster indeed). You are encouraged to install one node per core on each server, for a total of 216 nodes per rack.
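To make the sharding and acknowledgment ideas concrete, here is a toy in-memory sketch. It is not Oracle’s API or implementation: the class and method names are invented, the MD5 hash and the “next nodes in ring order” replica placement are my own assumptions, and there is no networking, disk or real transaction handling. It only illustrates hashing on the major key, three replicas by default, and a configurable number of acknowledgments per write.

```python
import hashlib


class ToyKeyValueStore:
    """In-memory toy: one dict per storage node, no networking or disk."""

    def __init__(self, num_nodes, replication=3):
        if not 1 <= replication <= num_nodes:
            raise ValueError("replication must be between 1 and num_nodes")
        self.nodes = [{} for _ in range(num_nodes)]
        self.replication = replication
        self.down = set()  # ids of nodes we pretend are unreachable

    def replicas_for(self, major_key):
        """Hash only the major key, so all its subkeys share the same nodes."""
        digest = hashlib.md5(major_key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        # Assumption: replicas are simply the next nodes in ring order.
        return [(first + i) % len(self.nodes) for i in range(self.replication)]

    def put(self, major_key, sub_key, value, acks_required=1):
        """Write to every reachable replica; report whether enough acknowledged.

        A real store would also have to deal with writes that landed on
        too few replicas; this toy just returns False in that case.
        """
        acks = 0
        for node_id in self.replicas_for(major_key):
            if node_id not in self.down:
                self.nodes[node_id][(major_key, sub_key)] = value
                acks += 1
        return acks >= acks_required

    def get(self, major_key, sub_key):
        """Read from the first reachable replica that has the item."""
        for node_id in self.replicas_for(major_key):
            if node_id in self.down:
                continue
            item = self.nodes[node_id].get((major_key, sub_key))
            if item is not None:
                return item
        return None


if __name__ == "__main__":
    store = ToyKeyValueStore(num_nodes=216)  # one node per core in a rack
    print(store.replicas_for("user42"))
    print(store.put("user42", "cart", ["book"], acks_required=2))
    print(store.get("user42", "cart"))
```

Because only the major key is hashed, every subkey under "user42" lands on the same replicas, which is what makes the single-major-key transaction described above plausible without cross-node coordination.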
Of course, with 18 servers and 216 potential nodes, you need a manageability tool. Oracle added Hadoop and NoSQL management to their Enterprise Manager 12c. I haven’t seen the UI yet, but according to Greg (the standard disclaimer applies here too), you can decide how many Hadoop nodes and how many NoSQL nodes you want, and Oracle will automatically add them on the servers. The cores can be divided between the two node types in any way that matches your business.
Where does it all fit in Oracle’s technology stack?
Oracle is now talking about an acquire-organize-analyze data cycle. NoSQL will be used to quickly acquire simple key-value data (someone’s shopping cart, data collected from agents and sensors). Hadoop’s HDFS will be used to collect unstructured data (logs, social media, photos, video). OLTP systems will still collect their data in Exadata. The map-reduce portion of Hadoop will be used to organize the data and transform it into Oracle formats, and the Hadoop Loader will copy the transformed data into Exadata. The data analysis part will be done in Exalytics on top of Exadata. Exalytics will not talk directly to the BDA, but there was some discussion of another version of R integrated directly into the BDA. I didn’t get many details about it, though.
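The “organize” step is easier to picture with a small example. The sketch below is plain Python, not the Hadoop API and not the Hadoop Loader: it imitates a map step and a reduce step over a few raw log lines and emits comma-separated rows of the kind a loader could pick up. The log format (timestamp, user, action) and all function names are made up for illustration.

```python
from collections import defaultdict


def map_line(line):
    """Map step: turn one raw log line into (key, value) pairs."""
    parts = line.split()
    if len(parts) != 3:
        return []  # skip malformed lines
    _timestamp, user, action = parts
    return [((user, action), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def to_delimited_rows(totals):
    """Format the reduced output as comma-separated rows, sorted by key."""
    return [
        f"{user},{action},{count}"
        for (user, action), count in sorted(totals.items())
    ]


if __name__ == "__main__":
    raw_log = [
        "2011-10-03T09:00 alice view",
        "2011-10-03T09:01 alice view",
        "2011-10-03T09:02 bob buy",
        "garbage",
    ]
    pairs = [pair for line in raw_log for pair in map_line(line)]
    for row in to_delimited_rows(reduce_counts(pairs)):
        print(row)
```

In a real cluster the map and reduce functions would run in parallel on the nodes holding the data; here they run in one process just to show the shape of the transformation from unstructured lines to loadable rows.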
ETL is one use case for Hadoop, but not the only one. Oracle encourages this use case (for obvious reasons), but they admit – you get fast servers with Hadoop, you can do whatever you want with them.
All this is from presentations and discussions; I haven’t touched one of these machines yet.
There is already a beta customer or two for the Big Data Appliance. The Hadoop Loader is in beta. The NoSQL database will be released real soon now and will include an open-source community edition and an enterprise edition. Nothing is currently known about licensing. I’m looking forward to downloading NoSQL in a few weeks to see what it is like. I suspect that getting a full machine will take longer.