The Server Labs Blog


Genomic Processing in the Cloud


Over the last decade, a new trend has emerged in the field of genomic processing. The advent of a new generation of DNA sequencers has brought an explosion in sequencing throughput, and the cost per base of generated sequence has fallen dramatically. Consequently, the bottleneck in sequencing projects around the world has shifted from obtaining DNA reads to the alignment and post-processing of the huge amount of read data now available. To minimize both processing time and memory requirements, specialized algorithms have been developed that trade speed and memory usage against the sensitivity of their alignments. Examples include the ultrafast, memory-efficient short read aligners BWA and Bowtie, both based on the Burrows-Wheeler transform.

The need to analyse increasingly large amounts of genomics and proteomics data has meant that research labs allocate an increasing part of their time and budget to provisioning, managing and maintaining their computational infrastructure. A possible solution to meet their needs for on-demand computational power is to take advantage of the public cloud. With its on-demand operational model, labs can benefit from considerable cost savings by paying for hardware only when it is needed for their experiments. Procurement of new hardware is also simplified and more easily justified, without the need to expand in-house resources.

Not only does the cloud reduce the time to provision new hardware; it also provides significant time savings by automating the installation and customization of the software that runs on top of the hardware. A controlled computational environment for the post-processing of experiments allows results to be more easily reproduced, a key objective for researchers across all disciplines. Results can also be easily shared among researchers, as cloud-based services facilitate the publishing of data over the internet while allowing researchers to control access to it. Finally, data storage in the cloud was designed from the ground up with high availability and durability as key objectives. By storing their experiment data in the cloud, researchers can ensure their data is safely replicated among data centres. These advantages free researchers from time-consuming operational concerns, such as in-house backups and the provisioning and management of servers from which to share their experiment results.

Given the vast potential benefits of the cloud, The Server Labs is working with the Bioinformatics Department at the Spanish National Cancer Research Institute (CNIO) to develop a cloud-based solution that would meet their genomic processing needs.

An Environment for Genomic Processing in the Cloud

The first step towards carrying out genomic processing in the cloud is identifying a suitable computational environment, including hardware architecture, operating system and genomic processing tools. CNIO helped us identify the following software packages employed in their typical genomic processing workflows:

  • Burrows-Wheeler Alignment Tool (BWA): BWA aligns short DNA sequences (reads) to a sequence database such as the human reference genome.
  • Novoalign: Novoalign is a DNA short read mapper implemented by Novocraft Technologies. The tool uses spaced-seed indexing to align either single or paired-end reads by means of the Needleman-Wunsch algorithm. The source code is not available for download. However, anybody may download and use these programs free of charge for their research and any other non-profit activities as long as results are published in open journals.
  • SAM tools: After read alignment, one might want to call variants or view the alignments against the reference genome. SAM tools is an open-source package of software applications which includes an alignment viewer and a consensus base caller tool to provide lists of variants (somatic mutations, SNPs and indels).
  • BEDTools: This software facilitates common genomics tasks for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (.BED) format. BEDTools supports the comparison of sequence alignments, allowing the user to compare next-generation sequencing data with both public and custom genome annotation tracks. BEDTools source code is freely available.

Note that, except for Novoalign, all software packages listed above are open source and freely available.

One of the requirements of these tools is that the underlying hardware architecture is 64-bit. For our initial proof of concept, we decided to run a base image with Ubuntu 9.10 for 64-bit on an Amazon EC2 large instance with 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each) and 850 GB of local instance storage. Once we had selected the base image and instance type to run on the cloud, we proceeded to automate the installation and configuration of the software packages. Automating this step ensures that no additional setup tasks are required when launching new instances in the cloud, and provides a controlled and reproducible environment for genomic processing.

By using the RightScale cloud-management platform, we were able to separate out the selection of instance type and base image from the installation and configuration of software specific to genomic processing. First, we created a Server definition for the instance type and operating system specified above. We then scripted the installation and configuration of genomic processing tools, as well as any OS customizations, so that these steps can be executed automatically after new instances are first booted.

Once our new instances were up and running and the software environment finalized, we executed some typical genomic workflows suggested to us by CNIO. We found that for their typical workflow, with a raw data input of between 3 and 20 GB, the total processing time on the cloud would range between 1 and 4 hours, depending on the size of the raw data and whether the type of alignment was single or paired-end. With EC2 pricing at 38 cents per hour for large instances, and ignoring additional time required for customization of the workflow, the cost of pure processing tasks totalled less than $2 for a single experiment.
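The "less than $2" figure follows directly from the hourly price and the observed processing times. As a minimal sketch of that arithmetic (the function name experiment_cost is ours, introduced only for illustration):

```shell
# Cost of the pure processing time for one experiment on a large EC2 instance.
PRICE_PER_HOUR=0.38   # USD per hour for a large instance, as quoted in the text

experiment_cost() {
    # $1 = processing time in hours
    awk -v hours="$1" -v price="$PRICE_PER_HOUR" 'BEGIN { printf "%.2f\n", hours * price }'
}

# The shortest and longest processing times we observed (1 and 4 hours).
for hours in 1 4; do
    echo "${hours} h of processing: \$$(experiment_cost "$hours")"
done
```

Even the longest observed run of 4 hours stays under the $2 mark.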

We also found the processing times to be comparable to running the same workflow in-house on similar hardware. However, when processing on the cloud, we found that transferring the raw input data from the lab to the cloud data centre could become a bottleneck, depending on the bandwidth available. We were able to work around this limitation by processing our data on Amazon’s European data centre and avoiding peak-hours for our data uploads.

Below, we include an example workflow for paired-end alignment that we successfully carried out in the Amazon cloud:

Example single-end alignment workflow executed in the cloud.

Maximizing the Advantages of the Cloud

In the first phase, we demonstrated that genomic processing in the cloud is feasible and cost-effective, while providing performance on par with in-house hardware. To truly realize the benefits of the cloud, however, we need an architecture that allows tens or hundreds of experiment jobs to be processed in parallel. This would allow researchers, for instance, to run algorithms with slightly different parameters to analyse the impact on their experiment results. At the same time, we would like a framework which incorporates all of the strengths of the cloud, in particular data durability, publishing mechanisms and audit trails to make experiment results reproducible.

To meet these goals, The Server Labs is developing a genomic processing platform which builds on top of RightScale’s RightGrid batch processing system. Our platform facilitates the processing of a large number of jobs by leveraging Amazon’s EC2, SQS, and S3 web services in a scalable and cost-efficient manner to match demand. The framework also takes care of scheduling, load management, and data transport, so that the genomic workflow can be executed locally on experiment data available to the EC2 worker instance. By using S3 to store the data, we ensure that any input and result data is highly available and replicated between data centres, freeing our users from the need to back up their data. It also ensures that data can be more easily shared with the appropriate level of access control among institutions and researchers. In addition, the monitoring and logging of job submissions provides a convenient mechanism for the production of audit trails for all processing tasks.

The following diagram illustrates the main components of The Server Labs Genomic Processing Cloud Framework:

The Server Labs Genomic Processing Cloud Framework.

The Worker Daemon is based on The Server Labs Genomic Processing Server Template, which provides the software stack for genomic processing. It automatically pulls experiment tasks from the SQS input queue along with input data from S3 to launch the genomic processing workflow with the appropriate configuration parameters. Any intermediate and final results are stored in S3 and completed jobs are stored in SQS for auditing.
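The pull, process, store and audit cycle described above can be illustrated with a small local simulation. This is only a sketch under loud assumptions: plain directories stand in for the SQS queues and the S3 bucket, and run_workflow is a hypothetical placeholder for the real genomic workflow. It does not use the RightGrid or Amazon APIs.

```shell
# Local directories stand in for the cloud services (assumption, for illustration only):
#   $WORK/queue  -> SQS input queue (one file per task)
#   $WORK/store  -> S3 bucket holding input and result data
#   $WORK/audit  -> SQS queue of completed jobs kept for auditing
WORK=$(mktemp -d)
mkdir -p "$WORK/queue" "$WORK/store" "$WORK/audit"

# Hypothetical stand-in for the genomic workflow: counts the lines of the input file.
run_workflow() {
    wc -l < "$1" | tr -d ' ' > "$2"
}

# Handle every queued task: read the input, run the workflow, store the result, record the job.
process_queue() {
    for task in "$WORK"/queue/*.task; do
        [ -e "$task" ] || break                  # nothing left in the queue
        input_key=$(cat "$task")                 # the task body names the input object
        run_workflow "$WORK/store/$input_key" "$WORK/store/$input_key.result"
        mv "$task" "$WORK/audit/"                # completed job is kept for auditing
    done
}

# Submit one example job and let the worker process it.
printf 'read1\nread2\nread3\n' > "$WORK/store/sample.reads"
echo "sample.reads" > "$WORK/queue/job-1.task"
process_queue
cat "$WORK/store/sample.reads.result"
```

Because each task is self-describing and results go back to shared storage, any number of such workers could drain the same queue in parallel, which is the property the framework relies on.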

Cost Analysis

Given a RightGrid-based solution for genomic processing, we would like to analyse how much it would cost to run CNIO’s workflows on the Amazon cloud. Let us assume for the sake of our analysis that CNIO runs 10 experiments in the average month, each of which generates an average of 10 GB of raw input data and produces an additional 20 GB of result data. For each of these experiments, CNIO wishes to run 4 different workflows, with an average running time of 2 hours on a large EC2 instance. In addition, we assume that the experiment results are downloaded once to the CNIO in-house data centre. We also assume that the customer already has a RightScale account, the cost of which is not included in the analysis.

Amazon Service    Cost
SQS               Negligible
S3                Data transfer in: $0.10 per GB * 10 GB per workflow = $1.00 per workflow
                  Data transfer out: 1 download per workflow * 20 GB per download * $0.15 per GB = $3.00 per workflow
                  Storage: 30 GB per workflow * $0.14 per GB = $4.20 per workflow
                  Total cost: $8.20 per workflow
EC2               $0.38 per hour * 2 hours per workflow = $0.76 per workflow
All services      Total cost: $8.20 + $0.76 = $8.96 per workflow

Total Cost:
10 experiments per month * 4 workflows per experiment * $8.96 per workflow = $358.40 per month, or approximately $4,300 per year
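For readers who want to rerun the estimate with their own figures, the calculation can be sketched as a short script. The prices and volumes are the ones assumed in this analysis; the function names are ours and purely illustrative.

```shell
# Cost of a single workflow: S3 transfer and storage plus EC2 processing time.
workflow_cost() {
    awk 'BEGIN {
        transfer_in  = 0.10 * 10   # USD per GB * GB of raw input uploaded
        transfer_out = 0.15 * 20   # USD per GB * GB of results downloaded once
        storage      = 0.14 * 30   # USD per GB * GB of input plus result data stored
        ec2          = 0.38 * 2    # USD per hour * hours on a large instance
        printf "%.2f\n", transfer_in + transfer_out + storage + ec2
    }'
}

# Monthly cost: 10 experiments per month, 4 workflows per experiment.
monthly_cost() {
    awk -v per_workflow="$(workflow_cost)" 'BEGIN { printf "%.2f\n", 10 * 4 * per_workflow }'
}

# Yearly cost: twelve months at the monthly rate.
yearly_cost() {
    awk -v per_month="$(monthly_cost)" 'BEGIN { printf "%.2f\n", 12 * per_month }'
}

echo "Per workflow: \$$(workflow_cost)"
echo "Per month:    \$$(monthly_cost)"
echo "Per year:     \$$(yearly_cost)"
```

The S3 terms dominate the total, so the estimate is far more sensitive to data volumes than to processing time.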

Towards an On-demand Genomic Processing Service

By building on the RightGrid framework, The Server Labs is able to offer a robust cloud-based platform on which to perform on-demand genomic processing tasks, at the same time enabling experiment results to be more easily reproduced, stored and published. To make genomic processing on the cloud even simpler, the on-demand model can be taken one step further by providing pay-as-you-go software as a service. In such a model, researchers would not need to know that the processing of their data is done in the cloud. Instead, they would interact with the platform via a web interface, where they would be able to upload their experiment’s raw input data, select their workflow of choice, and choose whether or not to share their result data. They would then be notified asynchronously via email once the processing of their experiment data has been completed.



The Server Labs at CeBIT 2010

Come and see us at CeBIT 2010 where we will have a booth on Amazon’s Stand in the Main Exhibition Hall: Hall 2, Stand B26. We will be there on March the 4th and March the 5th.

We will be happy to chat to you about our added value in bringing IT Architecture solutions to the Cloud.

We will also be presenting the work on HPC we have been doing for the European Space Agency, at 13:30 on March the 4th, and at 11:00 on March the 5th, both in the Theatre at the back of the Amazon stand.

Complex low-cost HPC Data Processing in the Cloud

With the maturing of cloud computing, it is now feasible to run even the most complex HPC applications in the cloud. Data storage and high performance computing resources, fundamental for these applications, can be outsourced, thus leveraging scalability, flexibility and high availability at a fraction of the cost of traditional in-house data processing. This presentation evaluates the suitability of Amazon’s EC2/S3 for such a scenario by running a distributed astrometric process that The Server Labs developed for the European Space Agency’s Gaia mission in Amazon EC2.

Eating our own Dog Food! – The Server Labs moves its Lab to the Cloud!


After all these years dealing with servers, switches, routers and virtualisation technologies, we think it’s time to move our lab into the next phase, the Cloud, specifically the Amazon EC2 Cloud.

We are actively working in the Cloud now for different projects, as you’ve seen in previous blog posts. We believe this step is not only a natural one but also takes us in the right direction towards more effective management of resources and higher business agility. This fits with the needs of a company like ours and we believe it will also fit many others of different sizes and requirements.
Cloud computing is not only a niche for special projects with very specific needs. It can be used by normal companies to have a more cost-effective IT infrastructure, at least in certain areas.

In our lab we had a mixture of server configurations, comprising Sun and Dell servers running all kinds of OSs, using VMware and Sun virtualisation technology. The purpose of our Lab is to provide an infrastructure for our staff, partners and customers to perform specific tests, prototypes, PoCs, etc. The Lab is also our R&D resource for creating new architecture solutions.

Moving our Lab to the cloud will provide an infrastructure that is more flexible, manageable, powerful and simple, and definitely more elastic to set up, use and maintain, without removing any of the features we currently have. It will also allow us to concentrate more on this new paradigm, creating advanced cloud architectures and increasing our overall know-how, which can be fed back to customers and the community.

To start this small project, the first thing to do was a small feasibility study to identify the technologies to use inside the cloud, primarily to maintain confidentiality and secure access, but also to properly manage and monitor the infrastructure. Additionally, one of the main drivers of this activity was to reduce our monthly hosting cost, so we needed to calculate, based on current usage, the savings of moving to the cloud.

Cost Study

To work out the cost of moving to the cloud, we performed an inventory of the required CPU power, server instances, storage (for both Amazon S3 and EBS) and the estimated data I/O. Additionally, we estimated the volume of data transferred between nodes and between Amazon and the external world.

We initially thought of automatically shutting down and bringing up those servers that are only needed during working hours, to save more money. In the end, we will be using Amazon reserved instances, which give a lower per-hour price and result in a cost similar to the one we would get by running on-demand servers only during working hours.

Based on this inventory and these estimates, and with the help of the Amazon Cost Calculator, we reached a final monthly cost that was approximately one third of our hosting bill!

This cost considers only the physical infrastructure. On top of this we need to add the savings on hardware renewal, pure system administration and system installation. Even though we use virtualisation technologies, we’ve sometimes had to rearrange things because our physical infrastructure was limited. All of these extra costs become savings in the cloud.

Feasibility Study

Moving to the cloud gives most IT managers the feeling that they lose control and, most importantly, that they lose control of their data. While the use of hybrid clouds can allow control of the data to be retained, in our case we wanted to move everything to the cloud. We are certainly no different in this respect: we are quite paranoid about our data and how it would be stored in Amazon EC2. We also still require secure network communication between our nodes in the Amazon network and the ability to give secure external access to our staff and customers.

There is a set of open-source technologies that has helped us turn these requirements into a solution that we feel comfortable with:

  • Filesystem encryption for securing data storage in Amazon EBS.
  • Private network and IP range for all nodes
  • Cloud-wide encrypted communication between nodes within a private IP network range via OpenVPN solution
  • IPSec VPN solution for permanent external access to the Cloud Lab, for instance to connect a private cloud/network to the public EC2 Cloud
  • Use of RightScale to manage and automate the infrastructure

Overview of TSL Secure Cloud deployment

Implementation and Migration

The implementation of our Cloud Lab solution has gone very smoothly and it is working perfectly.
One of the beneficial side effects of migrating different systems into the cloud is that it forces you to be much more organised, as the infrastructure is very focused on reuse and automatic recreation of the different servers.

We have all our Lab images standardized, taking prebuilt images available in Amazon and customising them to include the security hardening, standard services and conventions we have defined. We can in a matter of seconds deploy new images and include them into our Secure VPN-Based CloudLab network ready to be used.

Our new Cloud Lab is giving us a very stable, cost-effective, elastic and secure infrastructure, which can be rebuilt in minutes using EBS snapshots.

The Server Labs @ Cloud Computing Expo 09 – Update

The presentation we gave last month at Cloud Computing Expo in Prague is now available online below.

Update 29 Jun 2009: Amazon Web Services

Amazon has just published this blog entry about The Server Labs’s Proof of Concept for ESA: Scaling to the Stars

Using RightScripts to create a Weblogic cluster in Amazon EC2

In my previous post, I described how to set up a Weblogic cluster in Amazon EC2 using the Oracle-supplied Amazon AMI image. In this post, I will describe how to create a cluster using RightScripts, an alternative technology offered by RightScale.

In Amazon EC2, you work on an AMI (installing software, configuring) until you are happy with it. Then you ‘freeze’ it, storing it in S3 so that you can create many different instances based on this AMI. Amazon give you the possibility to pass configuration data to each new instance using “user-supplied data”, which allows you to differentiate one launched instance from another.

RightScale offer you an alternative. Instead of doing all the installation and configuration work on an AMI and then freezing it, you capture all the installation and configuration work in scripts called RightScripts. Each time you start up an instance in RightScale, you decide which RightScripts to execute against a base operating system to construct the complete machine that you wish to deploy.

For example, if you want to deploy an Apache Web Server, you write a RightScript that downloads, installs and configures Apache. You start up a new instance (with a base operating system e.g. CentOS) and run the RightScript. Once it has finished, you have an Apache server up and running. You can even associate one or more RightScripts with a base AMI to make a RightScale Server Template.
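As a rough idea of what such an Apache RightScript might contain, here is a minimal sketch for a CentOS base image. It is not a script from RightScale’s library. The DRY_RUN switch and the run helper are our own additions so that the steps can be previewed safely; set DRY_RUN=0 on a real instance (as root) to execute them.

```shell
# Preview mode by default: print the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Helper (our own, for illustration): run a command, or only print it in preview mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

install_apache() {
    run yum -y install httpd    # download and install Apache from the distribution repositories
    run chkconfig httpd on      # start Apache automatically at boot
    run service httpd start     # start the server now
}

install_apache
```

Because the whole installation lives in a script rather than in a frozen image, changing the configuration later only means editing the script and running it again.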

The benefits over the use of highly personalised AMIs are:

  • It is easier to change the configuration of your machines in the future – you can just execute another RightScript ‘on the fly’
  • You are not tied to one particular cloud vendor. RightScale allow you to execute RightScripts on machines in non-Amazon clouds

This article will show you how to create a Weblogic cluster using RightScale Server Templates made up of various RightScripts. You will need some familiarity with Amazon EC2 and S3 to get the full benefits from this article and I also assume that you’ve read my previous post on this theme.

You will need to sign up for a RightScale account to follow the steps in this post.


We will create two server templates – one for the primary cluster node that runs the admin server and a managed server and one that contains just a managed server. We will create these templates from various RightScripts that completely automate the installation and configuration process and just require a few variables to be set. This should mean that we can deploy many managed server instances with just a few clicks.

Creating the RightScripts

Log into the RightScale service and click on the Design > RightScripts link in the navigation bar on the left. Click on the ‘new’ button to add a RightScript and add each of the scripts described below, naming them the same as the header:

Install Weblogic

This script installs Weblogic on top of the base operating system. It relies on a tar-gzipped copy of the Weblogic installation being stored in an S3 bucket (folder) to which you have access. To create this copy (a one-off procedure), run an instance based on AMI “ami-6a917603” and execute the following commands, substituting “my-new-bucket-name” for an Amazon S3 bucket name that does not already exist, e.g. [my-company-name]-[weblogic].

rpm -ivh s3cmd-0.9.9-1.el5.noarch.rpm
tar -czvf /tmp/oracle-weblogic-103.tgz /opt/oracle
s3cmd mb s3://my-new-bucket-name
s3cmd put /tmp/oracle-weblogic-103.tgz s3://my-new-bucket-name/oracle-weblogic-103.tgz

(For more information on s3cmd, see here).

You can check that the weblogic zip was uploaded correctly using a tool such as S3Fox.

Here is the RightScript. Substitute “my-new-bucket-name” for the bucket name you created in the step above.


# NOTE: relies on AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables having been set.

# get WL zip out of S3 using the s3cmd tool that's installed by default
s3cmd get my-new-bucket-name:oracle-weblogic-103.tgz /tmp/wl.tgz

# untar WL to install it
tar -zxvf /tmp/wl.tgz -C /

# report success
exit 0
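If the download from S3 fails or is cut short, the untar step will be working on a missing or truncated archive. As a sketch of a more defensive variant, the extraction can be guarded by first checking that the archive can be listed. The safe_extract function and the demonstration paths below are our own and purely illustrative; in the RightScript the archive would be /tmp/wl.tgz and the target /.

```shell
# Extract a gzipped tarball only if its contents can be listed without errors.
safe_extract() {
    archive=$1
    target=$2
    if ! tar -tzf "$archive" > /dev/null 2>&1; then
        echo "ERROR: $archive is missing or corrupt" >&2
        return 1
    fi
    mkdir -p "$target"
    tar -zxf "$archive" -C "$target"
}

# Demonstration with a throwaway archive in a temporary directory.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/src/opt/oracle"
echo "placeholder" > "$DEMO/src/opt/oracle/README"
tar -czf "$DEMO/wl.tgz" -C "$DEMO/src" opt

# A valid archive is extracted; a file that is not a tarball is rejected.
safe_extract "$DEMO/wl.tgz" "$DEMO/root" && echo "extracted"
echo "not a tarball" > "$DEMO/bad.tgz"
safe_extract "$DEMO/bad.tgz" "$DEMO/root" || echo "rejected"
```

Returning a non-zero status on failure lets the boot script stop early instead of reporting success for an installation that never happened.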

Install the Domain

When I say “install”, I really mean “extract”! Download the test_domain Domain from here and use S3Fox or another equivalent tool to copy it over to the “my-new-bucket-name” folder in S3 that you created earlier. The RightScript below retrieves this domain from S3, extracts it and places it in the /mnt/domains folder. You must substitute “my-new-bucket-name” for your S3 bucket name. This domain is a basic domain that has no servers or clusters configured – these will be configured later in RightScripts.


# NOTE: relies on $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY environment variables having been set.

# get domain zip out of S3 using the s3cmd tool that's installed by default
s3cmd get my-new-bucket-name:test_domain.tgz /tmp/test_domain.tgz

# untar domain to install it
tar -zxvf /tmp/test_domain.tgz -C /

# report success
exit 0

Start Weblogic Admin Server

This script simply runs the admin server and redirects output to a log file. This should probably be replaced with an init.d script in a production setup.


# create the log directory (no error if it already exists)
mkdir -p /mnt/logs/

# start up weblogic
nohup /mnt/domains/test_domain/ > /mnt/logs/weblogicAdmin.log 2>&1 &

Create Cluster

This script contacts the Administration server on a defined hostname and port and creates a Weblogic cluster using the Weblogic Scripting Tool (WLST).

#!/bin/bash -e

PY_FILE="/tmp/create-cluster_"`date "+%s"`.py

cat < $PY_FILE



/opt/oracle/weblogic/common/bin/ $PY_FILE

rm -rf $PY_FILE
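The script above follows a common pattern: write a temporary WLST (Python) file from a here-document, hand it to WLST, then delete it. The body of the generated file is not shown above, so the following is only a sketch of how such a generator could look, not the original script. The cluster name and the default values are hypothetical placeholders, and the sketch prints the generated file instead of invoking WLST.

```shell
# Hypothetical placeholder inputs; in RightScale these would be supplied as script inputs.
ADMIN_URL=${ADMIN_URL:-t3://localhost:7001}
WL_USER=${WL_USER:-weblogic}
WL_PASSWORD=${WL_PASSWORD:-weblogic}
CLUSTER_NAME=${CLUSTER_NAME:-test_cluster}

# Write a WLST script that connects to the admin server and creates the cluster.
# The here-document delimiter is unquoted, so the shell expands the variables
# before the text is written to the file.
write_cluster_script() {
    cat <<EOF > "$1"
connect('$WL_USER', '$WL_PASSWORD', '$ADMIN_URL')
edit()
startEdit()
cmo.createCluster('$CLUSTER_NAME')
save()
activate()
disconnect()
EOF
}

TMP_DIR=$(mktemp -d)
PY_FILE="$TMP_DIR/create-cluster.py"
write_cluster_script "$PY_FILE"
cat "$PY_FILE"        # a real script would pass this file to WLST instead of printing it
rm -rf "$TMP_DIR"
```

Generating the file at run time is what lets the same RightScript target a different admin server or cluster name simply by changing its inputs.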

Add Server To Domain

This script uses WLST to connect to an admin server and create a new server in the domain. The server’s name will be that of the machine’s EC2 public DNS name.

#!/bin/bash -e

SERVER_NAME=$(eval "curl")
PY_FILE="/tmp/create-server_"`date "+%s"`.py

cat < $PY_FILE



/opt/oracle/weblogic/common/bin/ $PY_FILE

rm -rf $PY_FILE

Start Managed Weblogic

This script starts up a Weblogic managed server instance which connects to the admin server and downloads all the domain info required. It relies on being able to download a security file – SerializedSystemIni.dat from your “my-new-bucket-name” folder in S3. Download this file here and upload it to S3 using S3Fox or an equivalent tool. Change the “my-new-bucket-name” folder in the script below to reflect your bucket name. This should probably be replaced with an init.d script in a production setup.



SERVER_NAME=$(eval "curl")

# create the server domain directory structure
mkdir -p $DOMAIN_HOME
mkdir -p /mnt/logs

# create the file (overwrite any previous copy so that re-running the script does not duplicate entries)
echo "username=$SERVER_USERNAME" > $BOOT_FILE
echo "password=$SERVER_PASSWORD" >> $BOOT_FILE

echo "using admin URL: http://$ADMIN_SERVER_DNS_NAME:$ADMIN_SERVER_PORT"

mkdir -p $START_ROOT/security
s3cmd get my-new-bucket-name:SerializedSystemIni.dat $START_ROOT/security/SerializedSystemIni.dat

# start up weblogic
nohup /opt/oracle/weblogic/common/bin/ $SERVER_NAME http://$ADMIN_SERVER_DNS_NAME:$ADMIN_SERVER_PORT > /mnt/logs/weblogicManaged-$SERVER_NAME.log 2>&1 &

Create the ServerTemplates

Now that we have the basic RightScripts to construct Weblogic instances, we can create the ServerTemplates that logically group the scripts together.

Click on the Design > ServerTemplates link in the navigation bar in RightScale. Click on ‘new’ to create a new Server Template and call the template “Weblogic Admin”. Choose “EC2 US” and “m1.small” for the cloud and instance type attributes respectively. Use the browser tool to select the image “RightImage CentOS5_2V4_1_10”. Leave the rest of the attributes as their defaults and click ‘save’.

Click on the ‘scripts’ tab in the newly-created server template and add the RightScripts that you created earlier as boot scripts, in the following order: Install Weblogic, Install Domain, Start Weblogic Admin Server, Create Cluster, Add Server To Domain, Start Managed Weblogic. As you can see, this should install a weblogic server, create the domain and start the admin server. It should also create a cluster and a server in the domain and then finally start up the server.


Create another server template called “Weblogic Managed” with the same cloud, instance type and image attributes as “Weblogic Admin”. Add the following scripts as boot scripts in this order to the template: Install Weblogic, Add Server To Domain, Start Managed Weblogic.


Start the instances

In the RightScale navigation bar, go to Manage > Deployments and click on ‘default’. Click on “Add EC2 US server”. For the Server Template, select Private > Weblogic Admin. Enter “Weblogic Admin 1” as the nickname and choose the SSH key that you want to use. Use a security group that has ports 7001 and 7002 open. Leave the rest of the attributes as defaults and click “Add”.

Repeat these steps to create “Weblogic Managed 1” which uses the Weblogic Managed template. You should now have two servers configured in the “Default” deployment and it should look something like the screenshot below (note that my servers are in the EU):

Servers for "Default" deployment

Launching the servers

Click on the launch icon next to the “Weblogic Admin – 1” server and you should see a screen prompting you for some parameters. These are all the input parameters required by the RightScripts that will run at boot time. Enter the information as shown in the screenshot below and click on the ‘Launch’ button:


Wait until RightScale shows the status of the server as “operational”. This can take a while (8 minutes or so), so be patient! Once it has started, click on the server name and then on the audit entries tab. The latest audit entry should have status “operational”. Click on this and you should see a list of the scripts that ran at startup and their outcomes:

Audit entries

On the info tab of the server, you should see the public DNS name of the server e.g. “”. Go to the URL: http://[public-dns-name]:7001/console and log in with the username/password weblogic/weblogic.

If you click on the environment > servers link, you should see an Admin server and a managed server configured in a cluster. The managed server name should correspond to the public DNS of the machine:

Weblogic console servers

Now, in the RightScale console, start up the other server – “Weblogic Managed – 1” using the following inputs (substituting “Weblogic Admin – 1” for “Weblogic EU Admin – 1”) :

Inputs for Weblogic Managed

Once the managed server has started up, hit refresh in the Weblogic console. You should see the second managed server appear and the screen should look something like mine below:

Weblogic console showing all connected servers


Using the RightScripts provided in this post, you can easily deploy a Weblogic cluster using just two server templates. Adding new nodes is simple – just create another server based on the “Weblogic Managed” template and when the node starts up, provide it with the parameters it needs to connect to the admin server. The RightScripts take care of the rest for you.