It was revealed today that about 100 million records were leaked from Capital One, and because the press has published court documents pertaining to this case, at least some details are known. According to those documents, the data was stored in cloud storage buckets and was accessible using a service account from a compute instance that itself had a misconfigured firewall, allowing a hacker to connect to the VM and exfiltrate data from the bucket. While it is always easy to see mistakes in hindsight, the incident made me reflect on what we have been doing to protect our customers’ data in the cloud.
The first step to protecting your data is knowing your data. Which tables or datasets contain sensitive information? Are there different protection levels? Which fields within each table contain what kind of data? Where is the data stored and/or replicated? All of this metadata, together with other profiling information, should end up in a data catalog, which will serve as the central hub driving all data protection initiatives.
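To make the idea of a catalog a bit more concrete, here is a minimal sketch of what one entry could record. Everything in it is an assumption for illustration: the class names, the sensitivity levels ("internal", "restricted"), the "pii" tag and the example table are all hypothetical, and a real catalog product will have its own schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    sensitivity: str  # hypothetical levels: "public", "internal", "restricted"
    tags: frozenset = frozenset()


@dataclass
class CatalogEntry:
    dataset: str
    table: str
    locations: list  # where the data is stored and/or replicated
    columns: list

    def sensitive_columns(self):
        """Names of the columns marked with the highest protection level."""
        return [c.name for c in self.columns if c.sensitivity == "restricted"]


# Hypothetical example entry; none of these names come from a real system.
catalog = [
    CatalogEntry(
        dataset="crm",
        table="customers",
        locations=["us-east", "us-west"],
        columns=[
            Column("customer_id", "internal"),
            Column("ssn", "restricted", frozenset({"pii"})),
            Column("signup_date", "internal"),
        ],
    ),
]


def tables_with_restricted_data(entries):
    """Answer the question: which tables contain sensitive information?"""
    return [f"{e.dataset}.{e.table}" for e in entries if e.sensitive_columns()]
```

With metadata in this shape, the questions above (which tables are sensitive, which fields, stored where) become simple lookups instead of tribal knowledge.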
How your employees and your systems get access to data is a balance between convenience and security. In general, data is protected by access mechanisms such as ACLs and/or cryptography. The availability of software libraries and more powerful CPUs has made encryption a commodity. The storage bucket that held Capital One’s data was encrypted; you can’t even configure cloud storage without it. But by default, the encryption key is also managed by the cloud provider, so any user or machine that has properly authenticated can decrypt the data. In this case it sounds like the compute instance was using a service account that had at least read permissions to one or more buckets holding customer data. While convenient, this also means that anyone who gains access to the VM cannot be stopped from getting to the data. To prevent that, the use of highly privileged service accounts should be eliminated or restricted to tightly governed and controlled machines and processes. You won’t be able to avoid service accounts completely, as something has to run ETL and other processes, but those environments should not be accessible from the internet.
Regular data access should go through user identities, following the policies that your organization has set, such as location restrictions or multi-factor authentication. Another opportunity to improve has to do with what is governed in access control. The norm today is to control access through rules such as “read from bucket” or “read/write on dataset or table”. The principle of least privilege is not new by any means; it just has to be applied, and we can push the limit of what “least” means. There are two items that can be improved here. One is the granularity at which access is granted. The spectrum of possibilities stretches from “everything” to “only a single column of a certain table”. More often than not, a compromise is made in favor of coarser permissions to avoid the overhead of fine-grained control. Go back to my first point about the catalog and you will see that managing permissions is actually pretty easy if you have properly cataloged your data assets and can create an access model that is based on labels and metadata.
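As a rough illustration of a label-based access model, the sketch below decides column-level access by comparing the labels on a column with the labels a role is cleared for. The column names, labels and roles are all hypothetical, and the in-memory dictionaries stand in for whatever your catalog and identity system actually provide.

```python
# Hypothetical catalog excerpt: fully qualified column -> labels attached to it.
CATALOG = {
    "customers.customer_id": {"internal"},
    "customers.email": {"internal", "pii"},
    "customers.ssn": {"pii", "restricted"},
}

# Hypothetical roles: the set of labels each role is cleared to read.
ROLE_CLEARANCE = {
    "analyst": {"internal"},
    "support": {"internal", "pii"},
    "compliance": {"internal", "pii", "restricted"},
}


def can_read(role: str, column: str) -> bool:
    """Allow access only if the role is cleared for every label on the column."""
    labels = CATALOG.get(column)
    if labels is None:
        return False  # uncataloged data is denied by default
    return labels <= ROLE_CLEARANCE.get(role, set())


def readable_columns(role: str) -> list:
    """All cataloged columns the given role may read, in sorted order."""
    return sorted(c for c in CATALOG if can_read(role, c))
```

The point of the design is that nobody maintains a permission per column. You label the data once, define what each role is cleared for, and the fine-grained decisions fall out of the metadata.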
The other aspect to improve is to place additional restrictions on data access. Most users should never need to fetch all records from a table, but would typically have permissions to do so. This is where controls such as “retrieve a maximum of 100 records per hour during regular business hours” are useful. And those service accounts that run your ETL? An orchestration job could grant privileges just before job execution and revoke them again right after.
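Both controls can be sketched in a few lines. The code below is only an illustration under stated assumptions: `RecordQuota` and `temporary_grant` are hypothetical names, the grants live in a plain Python set rather than in a real IAM system, and the business-hours check is left out for brevity.

```python
import time
from contextlib import contextmanager


class RecordQuota:
    """Sliding-window cap on the number of records a principal may fetch."""

    def __init__(self, max_records=100, window_seconds=3600, clock=time.monotonic):
        self.max_records = max_records
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested
        self._events = []  # (timestamp, record_count) pairs

    def allow(self, record_count):
        now = self.clock()
        # Drop fetches that have aged out of the window.
        self._events = [
            (t, n) for t, n in self._events if now - t < self.window_seconds
        ]
        used = sum(n for _, n in self._events)
        if used + record_count > self.max_records:
            return False
        self._events.append((now, record_count))
        return True


@contextmanager
def temporary_grant(grants, principal, permission):
    """Grant a permission for the duration of a job and always revoke it."""
    grants.add((principal, permission))
    try:
        yield
    finally:
        # Runs even if the job raises, so the privilege never lingers.
        grants.discard((principal, permission))
```

An orchestrator would wrap each ETL step in `with temporary_grant(...)`, so the service account holds read access only while the job is actually running. In a real deployment the `add` and `discard` calls would be replaced by calls to your cloud provider's access management API.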
After encrypting data and controlling access to it, the job is not yet done. You will also want to audit every access to the data. The very worst-case scenario is that you were hacked and you don’t even know how much data was accessed and leaked. Luckily, the cloud gives you a head start, as most cloud storage and cloud databases have built-in audit logging for all access. But it is still your responsibility to use these logs. At the very least, make sure that they are retained long enough. If you want to do better, analyze the logs for fraudulent access by looking for unusual access patterns. You may or may not be fast enough to stop an attacker in the act, but at least you won’t be embarrassed by someone else uncovering the leak months or years after it actually happened.
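Here is a deliberately naive sketch of what "finding unusual access patterns" could look like: flag any principal whose total volume of records read is far above the median of its peers. The log format (`principal`, `records_read`) and the threshold factor are assumptions for illustration; real audit logs have a provider-specific schema, and real anomaly detection would use a per-principal baseline over time.

```python
from collections import defaultdict
from statistics import median


def flag_unusual_readers(log_entries, factor=10.0):
    """Return principals whose total records read exceed factor x the median total."""
    totals = defaultdict(int)
    for entry in log_entries:
        totals[entry["principal"]] += entry["records_read"]
    if not totals:
        return []
    baseline = median(totals.values())
    return sorted(p for p, n in totals.items() if n > factor * baseline)


# Hypothetical audit log excerpt for one day.
sample_log = [
    {"principal": "alice", "records_read": 120},
    {"principal": "bob", "records_read": 80},
    {"principal": "carol", "records_read": 100},
    {"principal": "svc-web", "records_read": 50_000},
    {"principal": "svc-web", "records_read": 50_000},
]
```

In this made-up sample, `flag_unusual_readers(sample_log)` singles out `svc-web`, the kind of service account that normally reads a handful of records and suddenly pulls a whole table.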
Security for your cloud-based data platform is not fundamentally different from other projects. If anything, we have more tools in our arsenal to support governance, security controls and auditing. But proper planning is crucial, and without proper oversight, compromises made for convenience or implementation speed may expose your organization to risk.