Last Friday (September 26), Paul Vallée and I were lucky enough to interview Kevin Closson about the Oracle Exadata Storage Server.
The audio quality is a little spotty here and there, so you might like to follow the transcription below.
Paul gets the interview started.
Paul Vallée (PV): Christo Kutrovsky and myself, Paul Vallée. We’re on the line with Kevin Closson of Oracle (and prior to that with Hewlett-Packard, and prior to that with Polyserve, and prior to that with Sequent). A giant of our industry, and I’m honoured to be speaking to him. Kevin, hello.
Kevin Closson (KC): Well, they always say that flattery gets you nowhere, but apparently it’ll get you on the phone.
PV: [laughs] Very nice!
KC: No seriously, it’s more than a pleasure to be here. I like what you guys do, so this is good.
PV: Thank you, Kevin. So, we are here to talk about the work that Larry Ellison announced yesterday, specifically the work around the Oracle Database Machine and the Exadata Storage Server. Kevin, can you just quickly introduce yourself and how you came to be involved in the project?
KC: Right. So, I’m a performance architect with Oracle, and the project that I’m stationed on, if you will, is the development team for Oracle Exadata Storage Server. And the way I came to Oracle is, quite a few of the folks who are involved with the very genesis of Exadata are people that I’ve known and worked with closely dating back to the early ’90s. And after a fruitful endeavour as the chief software architect for Oracle solutions at Polyserve, it became an opportunity to latch onto Oracle, because we sold our company to them. So there we are.
PV: How exciting! Congratulations! So I noticed that there’s still a little, I guess a diversion in terms of the branding. Larry definitely introduced it as the Exadata Programmable Storage Server, and I double-checked the video. But in your blog, you’re calling it, for sure, just the Exadata Storage Server. Just how recently was the marketing/messaging developed for this?
KC: You know, I’m not a part of the Go-To-Market (GTM) efforts, but, you know, honestly, the way these things are brought to market . . . They’re developed under a project name, and the project name remains the same for years. It was over the last few months that Marketing began cooking the name and what-have-you. Now, if you’re referring to something that Larry said in his keynotes, I have to admit I didn’t commit to photographic memory all the slides. And certainly, if he used the term “programmable”, I’m not going to correct Larry Ellison.
PV: [laughs] That would be risky.
KC: Having said that, I’m here to tell you that if somehow the word “programmable” sticks and becomes pervasive, it’ll be misleading. Because the connotation of “programmable” is akin to what’s possible on products like Netezza, where there are field-programmable gate arrays that on-site people can fiddle with. And that’s not the nature of the Exadata Storage Server. The Exadata Storage Server is programmable by us, the folks on the development team, and it comes to you in the form of software, and occasional lower-level firmware fixes from HP. I hope that answers the question, safely for me and for you as well.
PV: Sounds good. So, I must say—just to go on the record—that this is not at all what I thought you might be cooking up, but by the same token — and obviously the evidence for that is all over the Pythian blog, where we had a speculative blog entry about what might be in-store. But that being said, I am tremendously impressed and really excited, and already getting to work on identifying which Pythian customers have a use-case for the technology, and trying to get a good early-adopter case to co-sell with Oracle. So I’m that jazzed about it, and definitely, congratulations are in order.
KC: Well, I’m glad to hear that, and I appreciate congratulations. You know, when it comes down to it, this is a new product that solves problems that cannot be solved with any other storage technology. Of course, everybody’s already crafted their position papers and will argue the fact that perhaps the speeds and feeds are different from theirs. But what I just said, I stand by, full-stop. You cannot solve this problem with any other technology. And what I mean by that is, in order to have Oracle as your commercial store for all of EP or ERP (which, you know, nobody’s going to argue [about] our footprint in that space), in order to do DW/BI on that with someone else’s offering, you now have two vendors, you have two different types of technology. Exadata is a unifying storage platform for Oracle. And now Oracle is all.
PV: I would like to introduce Christo Kutrovsky, who is one of Pythian’s leading experts especially on how Oracle talks to storage (certainly not our only such expert, given that Alex G. [Gorbachev] presented on “Under the Hood of the Oracle Clusterware” just at Open World earlier this week). But that being said, this is a definite subject matter of interest for Christo, and he has prepared a couple of questions for you, Kevin.
KC: Oh good!
Christo Kutrovsky (CK): Hi Kevin, how are you?
KC: Hi Christo. Just fine. You and I have met in passing, so this is not a first meeting by any means.
CK: Oh yeah, absolutely. I admire your presentations — they are always to-the-point. I hope you keep up with that.
KC: Well thank you.
CK: So back to the questions and to the nitty-gritty stuff. I want to start with a couple of questions [about] the Infiniband implementation. Each Exadata instance is linked via 40Gb Infiniband link to the switch. Correct?
KC: Partially. So the numbers are off. Each Exadata cell—as well as all of the RDBMS hosts in the HP Oracle Database Machine—all of them are connected with two 20Gb paths to the switch.
CK: Okay. So does that mean that each database server can only accept 2*20 Gb of data?
KC: No. We have two paths of 20Gb of bandwidth. They are joined together in a bonded relationship, therefore only one path is active at one time. We don’t need more than one path, because arithmetically, over 2 gigabytes-per-second is not only more than a single two-socket, eight-core Xeon server can ingest on the RDBMS host, it’s most certainly more than a single Exadata storage cell can produce. So from a produce-and-consume perspective, we’re not starving anybody. So it’s there for failover if, for instance, the HCA happens to fail. The dual paths serve a redundancy purpose, but as the numbers work out, it’s over 2 gigabytes-per-second. With 20Gb paths, we’re not feeding something like a 128-core Superdome or something like that. In that particular case, the plumbing would be slightly different.
But look at what we’ve built. The HP Oracle Database Machine consists of DL360s. Each of those has the traditional Xeon processors in them, and they don’t have unlimited bandwidth, so they can’t ingest as much as even one single 20Gb path can provide.
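As a rough check on the figures Kevin quotes, the sketch below converts the link rate into gigabytes per second. It uses the raw 20Gb figure and ignores any protocol or encoding overhead, which the interview does not discuss; the function and parameter names are illustrative only.

```python
def link_gigabytes_per_second(gigabits_per_second, active_paths=1):
    """Convert a per-path rate in gigabits/s into gigabytes/s across the active paths."""
    bits_per_byte = 8
    return gigabits_per_second * active_paths / bits_per_byte


# Two 20Gb paths are cabled, but with active/passive bonding only one carries traffic.
per_host = link_gigabytes_per_second(20, active_paths=1)
print(f"Raw bandwidth per host or cell: {per_host} GB/s")
```

On those assumptions a single active path works out to 2.5 GB/s raw, which is in line with the "over 2 gigabytes-per-second" figure mentioned above.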
CK: Okay. And, just to clarify on the [subject of] the paths internals — it’s like a switch, basically. Meaning that any two pairs can talk at full bandwidth without affecting other talking pairs. Is this correct?
KC: Yes. There’s no bottlenecks in the point-to-point manner. So if a cell is delivering I/Os to three or four of the RDBMS hosts, that’s happening concurrently in-flight with, let’s say, I/O requests from RDBMS hosts 6 and 7 talking [to yet] other cells. So there’s no queuing.
CK: Basically what I was trying to clarify is this: imagine you have one database server running one query, and you have two Exadata storage systems. That will pretty much saturate your bandwidth, assuming no filtering is happening on the Exadata. That means, if you have two database servers, each database server can talk at 20 gigabits to two separate Exadata servers.
KC: Yes. Every point is 20Gb.
CK: Alright, I just wanted to clarify this. Those were my questions [about] Infiniband.
KC: Oh good, and along those lines, I should hope that that puts to rest any concerns about—as I always say—“plumbing.” We didn’t build any bottlenecks into this system.
PV: Just a comment for our listeners who aren’t familiar with Infiniband as a networking protocol (and not only as a networking protocol, but also as a disk-access protocol): Infiniband has not only very high bandwidth characteristics, but also ultra-low latency characteristics, and that’s why it’s such a good choice for cluster solutions like this.
KC: And Infiniband supports multiple communications protocols, and to that end, we’ve developed and brought to market the lightest (at least in our assessment) and most adaptable of all of them, which is Reliable Datagram Sockets [RDS]. Sure, you can do IP over Infiniband, and that starts to chew into some of the value propositions involving Infiniband, but we’ve done nothing of the sort. We are fully Remote Direct Memory Access (RDMA) from point to point over RDS.
CK: Alright. So continuing onto more details of how the system works, a question on OLTP. For OLTP systems, and considering Database Machines and Exadata, does OLTP benefit in any way other than from a best-practices-built system?
KC: That’s such a fair question, but let me answer it this way. The design center for Exadata is providing uninhibited utilization of all of the bandwidth that all of the disks are capable of delivering. Workloads that require that generally look a lot more like Business Intelligence Data Warehousing. It’s not that common to see, strictly speaking, OLTP-style workloads suffering bottlenecks like traditional storage arrays—fibre-channel SANs, iSCSI, or even high-performance NAS.
That doesn’t mean, however, that there are no benefits for OLTP using Exadata, because indeed there are. Probably the most substantial benefit that OLTP-style workloads will derive from being serviced by Exadata is the fact that I/Os from the Oracle kernel during OLTP will no longer have to interface through standard C libraries and make system calls to instantiate and report completions of I/Os. Doing that has always been entirely too costly, in processor-cycles terms. All of the requests for I/O from all Oracle processes to Exadata are done from userland using Remote Direct Memory Access (RDMA) directly into the processes of the Oracle Storage Server. So there’s no queuing up and de-queuing as far as the sending and the delivery of I/O requests.
Now, what does all that mean to us? It means that if you’re doing thousands-upon-thousands of I/Os-per-second of OLTP size (let’s say 8 kilobytes), you’ll see a substantial reduction in processor cycles lost just doing I/O. You can do well over 8,000 [or] 10,000 random I/Os per second with much less than five percent of all processor cycles spent in kernel-mode. So we start freeing-up substan... Oh, and doing so with traditional SAN host bus adapters can often cost as much as twenty percent of all processor cycles spent just doing I/O, not doing anything with the results of the I/O, which is why somebody bought the computer in the first place. [13:44]
We’re talking about relief on processor cycles lost to I/O (which I think should be substantial for a lot of people), and, as well, we have a balanced system. You don’t have to worry about collision of applications, for instance, on the same fibre channel arbitrated loop where your disks for OLTP reside. And those sorts of balance aspects, they do pay off.
CK: Excellent, very cool. Okay—moving on now, this brings me to my next question. What’s the 8 gigabyte memory on the Exadata cells? Is this useful [for] caching, or is it just working memory to manage the software running on the Exadata?
KC: Well, that’s a great question. And people keep saying “cache cache cache”, although unless you have as much cache as you have dataset, cache actually gets in the way. I’m talking specifically, in that case, about the types of work, the types of I/O profiles you see with DW/BI. To that end, we do configure 8 gigabytes of RAM per Exadata cell, and your question was, what is that for. Well, each Exadata cell has an operating system kernel, and that is Oracle Enterprise Linux. And that shouldn’t surprise anybody, because every storage device that’s out there (all SAN arrays, all NAS filers), they all have operating systems in them as well. And that takes up a portion (let’s say for instance 1GB), and that leaves us something on the order of 7 gigabytes of working memory. The primary value proposition of Exadata is to service scans, the type of I/O profiles you see with DW/BI, and we scan using 1 megabyte reads.
So, let’s say for instance, we’re trying to push through a gigabyte per second. Depending on how long the I/Os take, we need to be able to buffer on the order of 5,000 of those I/Os per second. So 5,000 1 Meg buffers is 5 gigabytes. That leaves us about 2 gigabytes of cushioning. I think you can see where I’m going with this. In order to handle thousands of I/Os per second at that size requires just buffering, and that’s different from cache, right? Buffering is the memory that you pin down so that the disks, through the drivers, can DMA to memory. And then from there, of course, Exadata will RDMA that result over Infiniband with RDS directly into the address space of the database server process. But you have to have some holding space. Buffers are reused immediately, so in that case, it’s not cache. Does that answer your question?
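The memory budget Kevin walks through can be written out as a small calculation. This is only a sketch of his back-of-the-envelope arithmetic, using his round numbers and treating 1,000 megabytes as a gigabyte; the function name and parameters are illustrative.

```python
def cell_memory_budget(total_gb=8, os_gb=1, buffers_in_flight=5000, buffer_mb=1):
    """Return (working memory, buffer memory, remaining cushion) in gigabytes."""
    working_gb = total_gb - os_gb                      # memory left after the OS
    buffer_gb = buffers_in_flight * buffer_mb / 1000   # pinned I/O buffers
    cushion_gb = working_gb - buffer_gb                # headroom left over
    return working_gb, buffer_gb, cushion_gb


working, buffers, cushion = cell_memory_budget()
print(f"working={working} GB, buffers={buffers} GB, cushion={cushion} GB")
```

With the figures from the interview this gives 7 GB of working memory, 5 GB of buffers and about 2 GB of cushion.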
CK: Absolutely. I was kind of suspecting this is the case, but I just wanted to hear it from you. And I assume this is also used to facilitate filtering and joining and extrapolation of query data?
KC: We all love the fact that Exadata is brute-force, but it’s also brainy. So, if we’re pushing through a gigabyte of I/O per second, from disk through the Exadata server out onto Infiniband, in the meantime, we’re also applying our intelligence. And our intelligence is, we perform predicate filtration. So if you’re querying for rows of HR records where “salary is greater than a million dollars” (and I’m sure there’s a lot of folks [garbled] like that), in order to filter through that data, that buffer has to be held down by the Exadata server for the amount of time it takes to rip through there looking for the rows that match. And then after that, whatever columns are cited in that query, we have to walk through each of the rows and pick out the columns and send them back. [17:43]
So the buffer stays pinned-down while we’re doing intelligent processing of the contents of the buffer. And then we wipe the buffer out and hammer it with another I/O. Did that make any sense? [17:53]
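The two steps Kevin describes, filtering rows by a predicate and then picking out only the requested columns, can be sketched in a few lines. This is a toy illustration of the idea and not Exadata's actual code; the row layout, names and sample data are made up.

```python
def scan_buffer(rows, predicate, columns):
    """Keep rows that satisfy the predicate, then return only the named columns."""
    return [{col: row[col] for col in columns} for row in rows if predicate(row)]


# A made-up buffer of HR rows that has just been read from disk.
buffer_rows = [
    {"name": "Ann", "salary": 2_500_000, "dept": "exec"},
    {"name": "Bob", "salary": 85_000, "dept": "ops"},
    {"name": "Cy", "salary": 1_200_000, "dept": "sales"},
]

# "salary is greater than a million dollars", returning only name and salary.
result = scan_buffer(buffer_rows, lambda r: r["salary"] > 1_000_000, ["name", "salary"])
print(result)
```

Only the matching rows, trimmed to the requested columns, would need to travel back to the database host; the buffer can then be reused for the next I/O.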
CK: Absolutely. Now that we’ve touched on the subject of filtering — do only parallel queries benefit from this filtering push down to the Exadata cells?
KC: Right. So I think what you’re asking is whether or not, somehow, OLTP operations will benefit from what we currently offload to storage. The answer is no, but if that wasn’t your question, go ahead and state it.
CK: Yeah. No, the question would be, can you run Standard Edition on the database servers?
KC: No. That’s a topic regarding licensing, but I happen to know that this is Enterprise Edition only.
CK: Okay. Maybe you know the answer to this: does the cost that Larry included on one of his slides include the RAC and partitioning options?
KC: I do recall that he put up a slide that compared on a per-terabyte basis the HP Oracle Database Machine to Teradata and Netezza.
KC: And built into that cost-per-terabyte was the cost for software. Because this is a pre-packaged and ready-to-go deal, and because you have to have Real Application Clusters and you have to have partitioning, the answer to that is, yeah it’s built-in to that cost. What that number is, I wouldn’t quote. I don’t deal with the money.
CK: Yeah, absolutely. And talking further about the cells. In one of the datasheets it’s mentioned that backups benefit from this, and that sheet mentions specifically incremental backups being processed and filtered by the cells. Which is pretty cool. Now the question is, is the cell CPU used for compression for the backups — when you do full backups, for example?
KC: That’s an excellent question. And there’s a lot of other functionality that’s offloaded to the cells (and I’m sure we’ll talk about that) specific to backup. The answer to your [question about the] compression aspect of backup is, no: cells do not currently do the compression. Could they? Sure. We’ll talk about that in the future. So you still use host CPU to do the compression. Will that mean that, when it comes time to access compressed data, Exadata is helpless? The answer is no, because Exadata is able to do filtering and projection on data that is compressed. It understands enough about compressed data to be able to do that.
But back to the backup issue. The value proposition for backup is, if you’re doing something like, let’s say, an incremental backup, and you have to go through terabytes of data to find several blocks that have been changed before the beginning of the backup, they have to be backed up. Instead of troubling the RDBMS hosts to go looking for those blocks, what they do is send off a smart operation to Exadata cells, who then go and look for the blocks that are old enough to need to be backed up. So it’s offloading finding the blocks that need to be backed up. Did that make any sense?
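The offload Kevin describes amounts to asking each cell which of its blocks need backing up, rather than shipping every block to the host to be inspected. The sketch below assumes each block carries some marker of when it last changed and that the backup wants blocks changed since a reference point; the `change_number` field and the function name are hypothetical and not Oracle's internals.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    change_number: int  # hypothetical marker of when the block last changed


def blocks_to_back_up(cell_blocks, reference_change_number):
    """Run on the cell: return only the ids of blocks changed after the reference point."""
    return [b.block_id for b in cell_blocks if b.change_number > reference_change_number]


# One cell's blocks; the host sends the reference point and gets back a short list.
cell_blocks = [Block(1, 100), Block(2, 450), Block(3, 90), Block(4, 501)]
changed = blocks_to_back_up(cell_blocks, reference_change_number=400)
print(changed)
```

The point of the design is that the scan over terabytes stays in storage, and the host only handles the comparatively few blocks that come back.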
CK: Absolutely. So, what other operations do we have? Incremental backups, tablespace creation, joining . . . Does sorting also count? Can you offload sorting, grouping by? These are data warehouse operations that usually consume a lot of CPU. Can those be assisted by the CPU power from the Exadata cells?
KC: The answer to sorting and joining and grouping and aggregation . . . The only joining that we do is very good technology. It’s very beneficial on so many queries (I intend to blog exactly about that). It’s a bloom filter join, and we can discuss that some other time. We don’t do hash joins in storage cells because cells would have to see all of the other cells’ data, and filter idempotent storage—pods, if you will—they don’t know about each other. But as far as storage sorting and aggregation — that’s too far upstream in the Oracle kernel. It’s too far up above where Exadata adapts. Exadata is a storage that happens to be able to do a few things that are intelligent. In future, perhaps — but at this point the answer to that is, no. [22:44]
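For readers unfamiliar with the bloom filter join Kevin mentions, here is a minimal, generic sketch of the idea: the database builds a compact filter from the join keys of one table, and storage uses it to discard rows that cannot possibly match. This is a textbook Bloom filter and not Oracle's implementation; the class, sizes and sample data are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by salting a hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Database side: build the filter from the join keys of the smaller table.
dimension_keys = {10, 20, 30}
bloom = BloomFilter()
for key in dimension_keys:
    bloom.add(key)

# Storage side: each cell drops rows whose key is definitely not in the filter.
fact_rows = [(key, f"row{key}") for key in range(100)]
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]

# Database side: the real join removes any false positives that slipped through.
joined = [row for row in candidates if row[0] in dimension_keys]
print(len(fact_rows), len(candidates), len(joined))
```

Because each cell only needs the filter and its own rows, this kind of join fits the constraint Kevin gives: cells do not need to see each other's data.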
CK: It seems like there is a lot of extra things that can be added to the cells — software, et cetera. Is the roadmap for cell upgrade similar to database? Like every two years or something like that? Has it been discussed at all?
KC: Although I know a little bit about the product, what I don’t know is what the release schedules are. That would be committed by someone clear outside of my group, for certain. Do I think that the software will rapidly evolve? I would say most certainly. This is the initial release of a product that has been in development for three years. What are the odds that we knew everything over three years that we’ll find in just a short period of time [garbled]? Pretty slim. We intend to be very aggressive—not chaotic, but aggressive—in enhancing Exadata.
CK: Disk failures. How are those handled? I know that disk failures will not affect data or running queries or anything like that. But I’m curious — when you lose a disk, a single disk in a single cell, where is this handled? Is this still something that is happening on the ASM level, is it something that is happening on the cell/disk level? Where does this happen? Who notices this and who cleans up?
KC: That’s an excellent question. Disks in storage cells are treated by ASM really no differently than disks out in a fibre channel SAN. So ASM will respond to failure the same way it does to a single-disk failure in a fibre channel SAN. So that’s a two-part answer. ASM will shield the database processes from knowing anything about that disk failure, as we expect it to because it runs on a fibre channel SAN.
Now, we haven’t [garbled] about the fact that you’ve got a physical disk failure, and a human being has to get involved at some point. Exadata is physical storage management. Gone are the days when you’d have to do fdisk and all of that sort of stuff. So from soup to nuts, we manage the physical disks. And that means, when a physical disk fails, you can get an alert via email (I suppose that could be an email to an SMS or what-have-you). You’ll know the disk has failed, and there will be enough information regarding the disk failure so that you can attach to the correct cell and interface with the cell command-line interface to execute a command that will vacate that disk from Exadata ownership. You put another physical disk in there, and there’s a very short command to bring that disk online. And, just like you would in a fibre channel SAN environment, you go and ask ASM to re-balance. [25:45]
CK: So ASM actually sees each individual disk on the Exadata cells?
KC: Yes. Every disk in Exadata is an ASM disk. In fact, and nobody’s been talking about this yet because we have to roll this out — we can’t just turn on the garden hose and hose everybody down.
The way this goes is, physical disks become, conceptually, to Exadata a cell disk, and so that’s a single physical disk and a single logical management-level disk. So a single physical disk is a single cell disk. You can take cell disks and carve them up into what we refer to as grid disks. If you took a 300GB SAS drive from our SAS option and created a grid disk on it that was 100GB, you’ll get the outermost 100GB of space from the platter. Now we have a lot of shorthand for doing this. Don’t believe for a moment that it’s a bunch of laborious scripting commands. We’re very good about that. If, for instance, you wanted to (just in two commands) create cell disks on all disks, and then create a set of grid disks for something called “data”, you would just simply say, create celldisk all [garbled] initialize cell (and I’m blogging about that syntax, you’ll see that soon) . . .
initialize cell, followed by
create celldisk all, followed by
create griddisk --prefix=data --size=100G, and at that point in time you’d be able to go over on the database hosts and ASM would be able to see those disks.
CK: It’ll be like a candidate on the ASM level?
KC: It would be a candidate. It would be ready to use. So at that point in time, you would have twelve ASM disks, each of 100GB.
CK: Oh, okay! So, basically, a cell grid is like a group command, not a virtualization. In the end it will present it as twelve different ones, but it will present it as a group?
KC: Actually, more to the point, a griddisk is an ASM-usable chunk of a celldisk. And if you create, let’s say, at one blast on a cell, twelve griddisks called “data”, you would have griddisks named “data01” through “data12”, and they would each be 100GB. And when you go and add them to a diskgroup, you wouldn’t have to list out all those happenings, you could use wildcard characters to say /0/*data*, and now you have a diskgroup that consists of twelve 100GB slices of celldisk. [28:55]
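The naming and wildcard behaviour Kevin describes can be mimicked with a few lines of Python. This only illustrates the pattern of twelve prefixed names matched by a wildcard; it does not use or reproduce the real cell command-line or ASM syntax, and the extra "reco" name is invented to show a non-match.

```python
import fnmatch

DISKS_PER_CELL = 12
GRIDDISK_SIZE_GB = 100

# One grid disk per cell disk, named with the "data" prefix and a two-digit suffix.
griddisks = [f"data{n:02d}" for n in range(1, DISKS_PER_CELL + 1)]

# A wildcard selects every "data" grid disk without listing them one by one.
visible = griddisks + ["reco01"]  # hypothetical extra name that should not match
selected = [name for name in visible if fnmatch.fnmatchcase(name, "*data*")]

print(selected[0], selected[-1], len(selected) * GRIDDISK_SIZE_GB, "GB")
```

That gives the twelve 100GB slices Kevin mentions, 1,200GB in total from a single cell.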
CK: In a way, you’re exporting the slices?
CK: Making the slices visible to ASM?
KC: Yes. griddisk to a cell is a logical object. griddisk to ASM treats it just as if it was a physical disk.
CK: And what about failure groups? Is this created automatically?
KC: No, you still use the age-old ASM incantations to create your failure groups. If you just create . . . let’s say, for instance, you have the world’s smallest Exadata configuration, which is two cells. If you create a diskgroup of that, you would have the same normal redundancy, and it would create a failgroup out of that. It would all be mirrored. No disks are mirrored within a cell — it’s smart enough between the cells.
CK: The minimum number of cells is two?
KC: Yes. Because if you don’t have two cells, you have no redundancy.
CK: Perfect! Very cool! Now we can envision how this works. Well, those are my questions for now. I’m sure there will be a lot more once you start posting again and giving us more details on how all this works. Some part of me wants to ask you, do you have all these [blogs] prepared in advance and you’re just pushing the “Publish” button on a pre-determined day? I’m hoping you’re answering some of them as people ask them.
KC: Are you referring to my blog thread on Exadata Questions and Answers?
KC: Listeners will know where my blog is because when you post this, of course you’ll post up a URL to my blog. The way I’m approaching this is, you know, look, I’ve got a day-job. My night-job, at this point in time, is also very exciting because I’m taking it on myself to disseminate some information about Exadata. So what I do throughout the day, I’m just like all of you folks, connected and getting your RSS updates and what-have-you. I’ll see when people are speaking about Exadata, and if I see something that looks like it’s fodder for a blog Q&A, I just cut the question and paste it into my working sheet, and when I get around to it I type in an answer, and once I’ve got something cooked, I post it up to the blog.
CK: Filling up the gaps.
KC: Filling in the gaps. You know, honestly — this is the twenty-first century and it’s Web 2.0. I think we would be way behind the times to not be handling some of our information dissemination the way we are. Because if you wanted to collect even the information that we’ve disclosed in this conversation by going out and trolling through white papers and what-have-you, the odds that you would actually get all of the information is pretty slim. So how many books and white papers do people want to troll through? I think blogging is a very effective way to get some timely information out, and I’m hoping I’ll be able to continue that.
CK: Very cool. Alright, thank you so much for this interview, Kevin. We really appreciate your finding the time to talk to us. We’ll see you in the blogs, I guess.
KC: Indeed. And I appreciate the opportunity to do this. Gee whiz, wouldn’t it be good if we could do this again some time!
CK: Oh, absolutely! Let’s wait until the gap gets a bit bigger, and then we’ll get it in.
KC: Okay, very good!