Obviously security is a key concern in making the decision to move to cloud-based genomic storage and analysis. It’s difficult (if not impossible) to quantify security in an absolute sense. But since most researchers currently use their institutional IT infrastructure for storage and analysis, it’s possible to assess current institutional IT security relative to that provided by BaseSpace.
BaseSpace has been built by Illumina on Amazon’s AWS cloud infrastructure. AWS hosts cloud-based services such as Netflix, Quora, Reddit and Foursquare, and also provides customer-facing services for government departments including Treasury, DOE, and State. Amazon publishes a security webpage, but the most useful security overview I’ve found is the white paper “Amazon Web Services: Overview of Security Processes”. Another useful resource is the AWS blog. Here are some key points to note about AWS:
- Standards and accreditation: SOC 1/SSAE 16/ISAE 3402 (auditing), FISMA moderate (US Federal Government), PCI DSS Level 1 (electronic payments), ISO 27001 (international security standard), and FIPS 140-2 (encryption). (For reference, the NIH’s own data centers are rated FISMA moderate.)
- Data centers are protected by security staff and controlled access procedures. Staff with system access undergo background checks.
- All hardware is located behind firewalls which are configured by default to block all traffic.
- Operating system security patches are applied automatically.
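To make the "block all traffic by default" point concrete, here is a minimal sketch of how a default-deny policy behaves: a connection is accepted only if an explicit allow rule matches it, and everything else is refused. This is an illustration only, not AWS's actual firewall configuration or API; the `AllowRule` class, the `is_allowed` function, and the example rule and addresses are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    """A hypothetical allow rule: protocol, port, and permitted source range."""
    protocol: str
    port: int
    source: str  # CIDR block, e.g. "0.0.0.0/0" for any IPv4 address


def is_allowed(rules, protocol, port, source_ip):
    """Default-deny check: traffic passes only if some rule explicitly matches."""
    addr = ip_address(source_ip)
    for rule in rules:
        if (rule.protocol == protocol
                and rule.port == port
                and addr in ip_network(rule.source)):
            return True
    # No rule matched, so the traffic is blocked.
    return False


# Hypothetical policy: only HTTPS (TCP port 443) is opened up.
rules = [AllowRule("tcp", 443, "0.0.0.0/0")]

print(is_allowed(rules, "tcp", 443, "203.0.113.7"))  # True: explicitly allowed
print(is_allowed(rules, "tcp", 22, "203.0.113.7"))   # False: no rule for port 22
print(is_allowed([], "tcp", 443, "203.0.113.7"))     # False: empty policy blocks everything
```

The important property is the last line of `is_allowed`: with no rules at all, nothing gets through, so every open port is the result of a deliberate decision.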
Let’s also examine a few of the more general questions that are sometimes raised about the security implications of the cloud:
Isn’t a big public cloud provider a huge target, and so inevitably vulnerable to attack?
- It’s safe to assume that the size of the prize means that AWS is under constant attack. One advantage of this is that security researchers are always (a) working to identify vulnerabilities as any discovery will be high profile and (b) informing the operator of the problem so as to be seen as one of the good guys. A recent example of this was a security issue identified in October by researchers at Germany’s Ruhr University. The vulnerability, which has not been tied to any actual attacks, was immediately addressed by AWS. And it got a lot of press for Ruhr!
- Obviously a criminal attacker who finds a vulnerability isn’t going to tell AWS about it. But in the words of the famous cartoon, “I don’t have to outrun that bear, I only have to outrun you”: if someone breaks into Amazon, their target will almost certainly be easily monetized data such as credit card numbers, not genomic data.
If my data is in the cloud, then it’s “on” the internet, and that must be risky, right?
In reality virtually all the world’s computers are connected to the internet. A computer in isolation is rare, and not terribly useful. So it’s highly likely that any existing computer that you use for storing genomic data is already connected to the internet, or at the very least on an intranet that is in turn connected to the internet. Secure isolation from the internet is typically provided by a firewall device configured to protect the internal network from outside attack. AWS computers are protected in the same way by firewalls – and AWS actively monitors its firewalls to check for vulnerabilities (a service beyond the resources of most institutions). And we also encrypt your data, something else that’s rarely done in the institutional IT setting.
My data has to travel to and from the cloud over the internet – isn’t that a big risk?
E-commerce has been with us since web retailers such as Amazon began to emerge. SSL (Secure Sockets Layer) is an internet standard developed to encrypt sensitive communications as they pass over the internet. SSL is regularly updated to account for new technologies and new threats. Every day, millions of people and institutions rely on SSL to protect financial transactions. We use SSL to protect BaseSpace data uploads and downloads. Think of it this way: most of us now access our bank accounts over the internet, so the fact that something is accessible over the internet doesn’t make it inherently insecure. It’s all about the quality of the security being implemented.
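For readers who want to see what an encrypted connection looks like from the client side, here is a minimal sketch using Python's standard `ssl` module (which implements TLS, the successor to SSL). It is not BaseSpace's actual upload code; the `make_client_context` and `open_encrypted_connection` helpers are hypothetical names chosen for illustration, and any host you pass in is your own assumption.

```python
import socket
import ssl


def make_client_context():
    """Build a client-side context with the library's secure defaults.

    ssl.create_default_context() requires the server to present a valid
    certificate and checks that the certificate matches the hostname.
    """
    return ssl.create_default_context()


def open_encrypted_connection(host, port=443):
    """Open a TCP connection and wrap it in an encrypted channel.

    Illustrative helper: the caller is responsible for closing the socket.
    """
    context = make_client_context()
    raw_socket = socket.create_connection((host, port))
    return context.wrap_socket(raw_socket, server_hostname=host)


# Inspect the defaults without touching the network.
ctx = make_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True: server certificate is mandatory
print(ctx.check_hostname)                    # True: certificate must match the hostname
```

Both checks matter: encryption keeps the data private in transit, while certificate and hostname verification confirm that the data is going to the intended server rather than an impostor.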
The entire subject of genomic data storage and analysis in the cloud is undergoing constant change, and we’d really like to get your input in the comments below. Let us know your experiences and concerns – we want to learn!