If you are interested in understanding the HDFS permissions model, you would probably google something like “hdfs admin guide”, which will likely bring you to this page. If you start reading it, you will soon see this:
> In this release of Hadoop the identity of a client process is just whatever the host operating system says it is. For Unix-like systems,
>
> - The user name is the equivalent of `whoami`;
> - The group list is the equivalent of `bash -c groups`.
>
> In the future there will be other ways of establishing user identity (think Kerberos, LDAP, and others). There is no expectation that this first method is secure in protecting one user from impersonating another. This user identity mechanism combined with the permissions model allows a cooperative community to share file system resources in an organized fashion.
This is as scary as it looks: if somebody substitutes the `whoami` command, they will be able to impersonate any user when connecting to HDFS.
But wait, this documentation is for version 0.19.1, which is an old release. Indeed, if you look at the code for `org.apache.hadoop.security` in that version, you will see this:
```java
static String getUnixUserName() throws IOException {
  String[] result = executeShellCommand(
      new String[]{Shell.USER_NAME_COMMAND});   // <-- this is not good!
  if (result.length != 1) {
    throw new IOException("Expect one token as the result of " +
        Shell.USER_NAME_COMMAND + ": " + toString(result));
  }
  return result[0];
}
```
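For context, `Shell.USER_NAME_COMMAND` here is simply the string `"whoami"`, so the whole lookup amounts to forking a shell command and trusting whatever it prints. Below is a minimal sketch (my own illustration, not the actual Hadoop source; the class name `WhoAmIProbe` is made up) of what that boils down to:

```java
// Minimal sketch, not the actual Hadoop code: the 0.19-era lookup boils down
// to executing "whoami" and trusting its output as the HDFS user name.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class WhoAmIProbe {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Shell.USER_NAME_COMMAND in Hadoop 0.19 is just "whoami".
        Process p = new ProcessBuilder("whoami").start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            // Whatever the first "whoami" on the PATH prints becomes the
            // reported user name -- a fake script earlier on the PATH wins.
            System.out.println("HDFS would see user: " + r.readLine());
        }
        p.waitFor();
    }
}
```

Drop a two-line script named `whoami` into a directory that comes earlier on the `PATH`, and that output (and therefore the HDFS identity) becomes whatever you want.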
Okay, let’s now take a look at the same code in a more recent 0.20 version:
```java
static Subject getCurrentUser() {
  return Subject.getSubject(AccessController.getContext());
}
```
The whole user authentication module was rewritten to use Java’s security API (a JAAS `Subject` retrieved from the current `AccessControlContext`), which is not as easy to trick into thinking you are someone else. So the code was changed. Now let’s check the documentation for the latest versions. This is the latest documentation version on the Apache Hadoop page. I hope you will be as surprised and amazed as I was when I found this out: the docs were not updated and still refer to the old implementation of the security module. This is very confusing and can lead many people to think Hadoop uses an even more primitive authentication model than it actually does.
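To see what the `Subject`-based lookup actually relies on, here is a small, self-contained JAAS sketch (again my own illustration, not Hadoop’s classes; the principal name `alice` is made up): code runs inside `Subject.doAs(...)`, and `Subject.getSubject(AccessController.getContext())` then returns the `Subject` bound to the current access-control context.

```java
// Illustrative JAAS sketch (not Hadoop code): the identity is read from the
// Subject attached to the current access-control context, not from a shell call.
import java.security.AccessController;
import java.security.Principal;
import java.security.PrivilegedAction;
import javax.security.auth.Subject;

public class SubjectDemo {
    public static void main(String[] args) {
        Subject subject = new Subject();
        // Hypothetical principal for illustration; "alice" is made up.
        subject.getPrincipals().add((Principal) () -> "alice");

        Subject.doAs(subject, (PrivilegedAction<Void>) () -> {
            // This is essentially what getCurrentUser() does in 0.20:
            Subject current = Subject.getSubject(AccessController.getContext());
            String name = current.getPrincipals().iterator().next().getName();
            System.out.println("Identity seen inside doAs: " + name);
            return null;
        });
    }
}
```

As the comments below point out, this is still a client-asserted identity: nothing stops a modified client from populating the `Subject` with any principal it likes, which is exactly why the documentation itself only promises real protection from future mechanisms like Kerberos.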
3 Comments
It’s still primitive, I think. All you need is to tweak the client code (which is fully open) and you can enter as any user. Is my understanding not correct?
Alex, you are absolutely right. If you write and build your own HDFS client, you are free to do whatever you want :)
Danil, you make it sound like that’s a major undertaking. In fact, since HDFS is open source, all you need to do is take the existing code and tweak it (perhaps just reverting the change to the snippet you quote), then build it the way it’s always built. Alternatively, since all of the traffic is unencrypted, it’s pretty trivial to intercept and tamper with the traffic that contains user ID information. Or you can run inside a VM where you made up whatever identities you want. The possibilities are endless.
Anything that doesn’t *at least* involve identities that are assigned by the kernel on a machine where normal users don’t have root, plus a secure and authenticated connection from that machine, might as well just let users make up their own identity. All security less than SSL/Kerberos/GSSAPI is suspect.
Also, calling out to ‘whoami’ via the shell to get information available from getuid is not only insecure but grossly inefficient. Whoever committed that code – author *or* reviewer – deserves a place on any respectable manager or tech lead’s “do not hire” list.
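For what it’s worth, the shell-free lookup the commenter is alluding to is a one-liner in Java. The sketch below uses the standard `user.name` system property as a stand-in for a `getuid`-style lookup (my own example, not what Hadoop does); note that it is just as client-asserted, since the JVM happily accepts `-Duser.name=anything`:

```java
// Sketch of an in-process lookup: ask the JVM for the current user instead of
// forking "whoami". Still spoofable (can be overridden with -Duser.name=...),
// but it avoids the shell call entirely.
public class UserNameWithoutShell {
    public static void main(String[] args) {
        System.out.println("user.name = " + System.getProperty("user.name"));
    }
}
```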