ALSB/OSB customization using WLST

One of the primary tasks in release management is environment promotion. From development to test or from test to production, environment promotion is a step that should be automated as much as possible.

We can use the service bus MBeans in WLST scripts to automate promotion of AquaLogic/Oracle Service Bus configurations from development environments through testing, staging, and finally to production environments.

Each environment has particularities that may require changes to the software configuration. These settings are usually centralized in property files, database tables, environment variables or some other location that facilitates environment promotion.

In AquaLogic/Oracle Service Bus there is the concept of environment values:

Environment values are certain predefined fields in the configuration data whose values are very likely to change when you move your configuration from one domain to another (for example, from test to production). Environment values represent entities such as URLs, URIs, file and directory names, server names, e-mails, and such. Also, environment values can be found in alert destinations, proxy services, business services, SMTP Server and JNDI Provider resources, and UDDI Registry entries.

For these environment values, several standard operations are available:

  • Finding and Replacing Environment Values
  • Creating Customization Files
  • Executing Customization Files

However, these operations are limited to the ‘predefined fields whose values are very likely to change’… so what happens if we need to modify one of the fields considered ‘not very likely’ to change? Whether SAP client connection parameters really are ‘not very likely’ to change in an environment promotion from test to production is another story…

To automate these changes, one option is to modify the exported configuration directly before importing it into the destination environment. In our case, however, we want to keep the exported package untouched and apply the customization after the import. We will therefore use a WLST script instead of a customization file, as the latter does not satisfy our needs.

The first thing we have to do to use WLST is add several Service Bus jar files to the WLST classpath. For example, on Windows we add the following at the beginning of the wlst.cmd file (*nix users will know how to do the equivalent in their shell):

For Aqualogic Service Bus 3.0:

SET ALSB_HOME=c:\bea\alsb_3.0
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-api.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-common.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-resources.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-impl.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\..\modules\com.bea.common.configfwk_1.1.0.0.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\..\modules\com.bea.alsb.statistics_1.0.0.0.jar

For Oracle Service Bus 10gR3:

SET ALSB_HOME=c:\bea\osb_10.3
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-api.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-common.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-resources.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\lib\sb-kernel-impl.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\..\modules\com.bea.common.configfwk_1.2.1.0.jar
SET CLASSPATH=%CLASSPATH%;%ALSB_HOME%\..\modules\com.bea.alsb.statistics_1.0.1.0.jar

In our example, we will change the HTTP timeout of the normalLoanProcessor business service present in the ALSB/OSB examples server.

normalLoanProcessor configuration

To do that, we first connect to the bus from WLST and open a session using the SessionManagementMBean:

from com.bea.wli.sb.management.configuration import SessionManagementMBean
connect("weblogic", "weblogic", "t3://localhost:7021")
domainRuntime()
sessionMBean = findService(SessionManagementMBean.NAME, SessionManagementMBean.TYPE)
sessionName = "mysession"
sessionMBean.createSession(sessionName)

mysession shown in sbconsole

Nothing new so far. The next thing we need is a reference to the component we want to modify. We chose to use a BusinessServiceQuery:

from com.bea.wli.sb.management.query import BusinessServiceQuery
from com.bea.wli.sb.management.configuration import ALSBConfigurationMBean
bsQuery = BusinessServiceQuery()
bsQuery.setLocalName("normalLoanProcessor") 
bsQuery.setPath("MortgageBroker/BusinessServices")
alsbSession = findService(ALSBConfigurationMBean.NAME + "." + sessionName, ALSBConfigurationMBean.TYPE)
refs = alsbSession.getRefs(bsQuery)
bsRef = refs.iterator().next()

After this, we have a reference to the business service we want to modify. Now the fun begins.

There is an undocumented service bus ServiceConfigurationMBean (not to be confused with the old com.bea.p13n.management.ServiceConfigurationMBean) whose description is ‘MBean for configuring Services’.

ServiceConfiguration.mysession as shown in jconsole

Among the different methods, we find one with an interesting name: getServiceDefinition

getServiceDefinition as shown in jconsole

It looks like we can call getServiceDefinition with our business service reference to obtain exactly what its name suggests.

from com.bea.wli.sb.management.configuration import ServiceConfigurationMBean
servConfMBean = findService(ServiceConfigurationMBean.NAME + "." + sessionName, ServiceConfigurationMBean.TYPE)
serviceDefinition = servConfMBean.getServiceDefinition(bsRef)

This is the result of printing the serviceDefinition variable:

(The original post printed the full service definition here, but the XML element tags were lost when the post was rendered and only the text values survived. The definition is the familiar .BusinessService XML; among the visible values are the SOAP binding name NormalLoanApprovalServiceSoapBinding with namespace http://example.org, the http transport provider, the endpoint URI http://localhost:7021/njws_basic_ejb/NormalSimpleBean, and the outbound HTTP properties – including the request method POST and the timeout value of 30 that we are about to change.)

Surprised? It’s exactly the same definition written in .BusinessService XML files. In fact, the service definition implements XMLObject.

Now it’s time to update the business service definition with our new timeout value (let’s say 5000 milliseconds) using XPath and XMLBeans. We must also take care to declare namespaces in the XPath expression the same way they are declared in .BusinessService XML files.

nsEnv = "declare namespace env='http://www.bea.com/wli/config/env' "
nsSer = "declare namespace ser='http://www.bea.com/wli/sb/services' "
nsTran = "declare namespace tran='http://www.bea.com/wli/sb/transports' "
nsHttp = "declare namespace http='http://www.bea.com/wli/sb/transports/http' "
nsIWay = "declare namespace iway='http://www.iwaysoftware.com/alsb/transports' "
confPath = "ser:endpointConfig/tran:provider-specific/http:outbound-properties/http:timeout"
confValue = "5000"
confElem = serviceDefinition.selectPath(nsSer + nsTran + nsHttp + confPath)[0]
confElem.setStringValue(confValue)

We are almost there. First we update the service.

servConfMBean.updateService(bsRef, serviceDefinition)

Modified mysession shown in sbconsole

And finally, we activate the session (see the NOTE below), just as we would do in the bus console.

sessionMBean.activateSession(sessionName, "Comments")

mysession changes shown in sbconsole

Task details of mysession

Updated normalLoanProcessor configuration

With this approach, it would be possible to build a framework that allows any field to be customized as needed.
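
If you would rather drive these MBeans from plain Java instead of WLST, the same calls can be made over a standard JMX connection to the domain runtime MBean server. The sketch below is only illustrative (host, port, credentials and the ObjectName pattern are assumptions to adapt to your own domain), but it shows the general wiring:

import java.util.Hashtable;

import javax.management.MBeanServerConnection;
import javax.management.MBeanServerInvocationHandler;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.naming.Context;

import com.bea.wli.sb.management.configuration.SessionManagementMBean;

public class OsbSessionFromJava {
    public static void main(String[] args) throws Exception {
        // Connect to the WebLogic domain runtime MBean server (values are illustrative).
        JMXServiceURL url = new JMXServiceURL("t3", "localhost", 7021,
                "/jndi/weblogic.management.mbeanservers.domainruntime");
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.SECURITY_PRINCIPAL, "weblogic");
        env.put(Context.SECURITY_CREDENTIALS, "weblogic");
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");
        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        MBeanServerConnection connection = connector.getMBeanServerConnection();

        // Obtain the session management MBean and create a session, mirroring
        // the findService()/createSession() calls in the WLST script above.
        ObjectName name = new ObjectName("com.bea:Name=" + SessionManagementMBean.NAME
                + ",Type=" + SessionManagementMBean.TYPE);
        SessionManagementMBean sessionMBean = MBeanServerInvocationHandler.newProxyInstance(
                connection, name, SessionManagementMBean.class, false);
        sessionMBean.createSession("mysession");

        // ... query, update and activate exactly as shown in the WLST script ...

        connector.close();
    }
}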

NOTE:
If you get the exception below when activating changes, please update your WebLogic Server configuration as described in “Deploy to Oracle Service Bus does not work”.

Traceback (innermost last):
  File "", line 1, in ?
com.bea.wli.config.deployment.server.ServerLockException: Failed to obtain WLS Edit lock; it is currently held by user weblogic. This indicates that you have either started a WLS change and forgotten to activate it, or another user is performing WLS changes which have yet to be activated. The WLS Edit lock can be released by logging into WLS console and either releasing the lock or activating the pending WLS changes.
        at com.bea.wli.config.deployment.server.ServerDeploymentInitiator.__serverCommit(Unknown Source)
        at com.bea.wli.config.deployment.server.ServerDeploymentInitiator.access$200(Unknown Source)
        at com.bea.wli.config.deployment.server.ServerDeploymentInitiator$1.run(Unknown Source)
        at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
        at weblogic.security.service.SecurityManager.runAs(Unknown Source)
        at com.bea.wli.config.deployment.server.ServerDeploymentInitiator.serverCommit(Unknown Source)
        at com.bea.wli.config.deployment.server.ServerDeploymentInitiator.execute(Unknown Source)
        at com.bea.wli.config.session.SessionManager.commitSessionUnlocked(SessionManager.java:420)
        at com.bea.wli.config.session.SessionManager.commitSession(SessionManager.java:339)
        at com.bea.wli.config.session.SessionManager.commitSession(SessionManager.java:297)
        at com.bea.wli.config.session.SessionManager.commitSession(SessionManager.java:306)
        at com.bea.wli.sb.management.configuration.SessionManagementMBeanImpl.activateSession(SessionManagementMBeanImpl.java:47)
[...]

Creating Sonar Reports from Hudson

Introduction

In order to guarantee the quality of software development projects, it is important to be able to verify that a continuous integration build meets a minimum set of quality control criteria. The open source project Hudson provides the popular continuous integration server we will use throughout our example. Similarly, Sonar is a leading open source tool providing a centralized platform for storing and managing this type of quality control indicator. By integrating Sonar with Hudson, we are able to extract and verify the quality control metrics stored by Sonar in an automated and recurrent manner from Hudson. By verifying these quality metrics we can qualify a given build as valid from a quality perspective, and quickly flag builds where violations occur. At the same time, it is very useful to generate summaries of key quality metrics in an automated manner, informing interested parties with a daily email.

Installing Hudson

As a first step, you will need to download and install Hudson from http://hudson-ci.org/.

Installing the Groovy Postbuild Plugin

In order to be able to extend Hudson with custom Groovy-based scripts, we will use the Groovy Postbuild Plugin. To install this plugin, you will have to click on Manage Hudson followed by Manage Plugins, as shown below:

You will then have to select the Available tab at the top, and search for Groovy Postbuild Plugin under the section Other Post-Build Actions.

Sonar Reporting the Groovy Way

Once the Groovy Postbuild Plugin has been successfully installed and Hudson restarted, you can go ahead and download the SonarReports package and extract it to ${HUDSON_HOME}, the home directory of the Hudson server (e.g. the folder .hudson under the user’s home directory on Windows systems). This zip file contains the file SonarReports.groovy under scripts/groovy, which will be created under ${HUDSON_HOME} after expansion.

Hudson Job Configuration

To facilitate reuse of our Hudson configuration for Sonar, we will first create a Sonar Metrics job to be used as a template. We can then create a new job for each project we wish to create Sonar reports for by simply copying this job template.

In the Sonar Metrics job, we first create the necessary parameters that will be used as thresholds and validated by our Groovy script. To this end, we select the checkbox This build is parameterized under the job’s configuration. We then configure the parameters as shown below, where we have provided the corresponding screenshots; a short sketch after the list illustrates how these thresholds are interpreted:

  • projectName: project name that will appear in emails sent from Hudson.
  • sonarProjectId: internal project ID used by Sonar.
  • sonarUrl: URL for the Sonar server.
  • emailRecipients: email addresses for recipients of Sonar metrics summary.
  • rulesComplianceThreshold: minimum percentage of rule compliance for validating a build. A value of false means this metric will not be enforced.
  • blockerThreshold: maximum number of blocker violations for validating a build. A value of false means this metric will not be enforced.
  • criticalThreshold: maximum number of critical violations for validating a build. A value of false means this metric will not be enforced.
  • majorThreshold: maximum number of major violations for validating a build. A value of false means this metric will not be enforced.
  • codeCoverageThreshold: minimum percentage of code coverage for unit tests for validating a build. A value of false means this metric will not be enforced.
  • testSuccessThreshold: minimum percentage of successful unit tests for validating a build. A value of false means this metric will not be enforced.
  • violationsThreshold: maximum number of violations of any type for validating a build. A value of false means this metric will not be enforced.
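
The SonarReports.groovy script itself is not reproduced in this post, but the semantics of these parameters are simple to state. The following sketch is hypothetical (it is not taken from the actual script) and only illustrates how a value of false disables a check, while any numeric value is enforced as a minimum or a maximum:

public class ThresholdCheck {

    // Returns true if the metric violates the configured threshold.
    // A threshold of "false" means the check is not enforced.
    static boolean violates(String threshold, double actual, boolean thresholdIsMinimum) {
        if ("false".equalsIgnoreCase(threshold)) {
            return false;
        }
        double limit = Double.parseDouble(threshold);
        return thresholdIsMinimum ? actual < limit : actual > limit;
    }

    public static void main(String[] args) {
        // e.g. rulesComplianceThreshold=80 (a minimum) against an actual compliance of 75.3%
        System.out.println(violates("80", 75.3, true));   // true  -> build fails
        // e.g. blockerThreshold=false -> never enforced
        System.out.println(violates("false", 12, false)); // false -> build passes
    }
}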

Finally, we enable the Groovy Postbuild plugin by selecting the corresponding checkbox under the Post-build Actions section of the job configuration page. In the text box, we include the following Groovy code to call into our script:

sonarReportsScript = "${System.getProperty('HUDSON_HOME')}/scripts/groovy/SonarReports.groovy"
shell = new GroovyShell(getBinding())
println "Executing script for Sonar report generation from ${sonarReportsScript}"
shell.evaluate(new File(sonarReportsScript))

Your Hudson configuration page should look like this:

Generating Sonar Reports

In order to automatically generate Sonar reports, we can configure our Hudson job to build periodically (e.g. daily) by selecting this option under Build Triggers. The job will then execute with the specified frequency, using the default quality thresholds we configured in the job’s parameters.

It is also possible to run the job manually to generate reports on demand at any time. In this case, Hudson will ask for the value of the threshold parameters that will be passed in to our Groovy script. These values will override the default values specified in the job’s configuration. Here is an example:

Verifying Quality Control Metrics

When the Hudson job runs, our Groovy script will verify that any thresholds defined in the job’s configuration are met by the project metrics extracted from Sonar. If the thresholds are met, the build will succeed and a summary of the quality control metrics will appear in the Hudson build. In addition, a summary email will be sent to the recipient list emailRecipients, providing interested parties with information regarding the key analyzed metrics.

On the other hand, if the thresholds are not met, the build will be marked as failed and the metric violations will be described in the Hudson build. Similarly, an email will be sent out informing recipients of the quality control violation.

Conclusion

This article demonstrates how Hudson can be extended with the use of dynamic programming languages like Groovy. In our example, we have created a Hudson job that verifies quality control metrics generated by Sonar and automatically sends quality reports by email. This type of functionality is useful in continuous integration environments, in order to extend the default features provided by Hudson or Sonar to meet custom needs.

Smooks processing recipes

Introduction

In one of our customer projects we had a requirement to import CSV, fixed length and Excel files in different formats and store records in the database. We chose Smooks to accomplish this task.

Smooks is a Java framework to read, process and transform data from various sources (CSV, fixed length, XML, EDI, …) to various destinations (XML, Java objects, database). It convinced me because:

  • it brings out-of-the-box components to read CSV and fixed-length files
  • it integrates smoothly with ORM libraries (Hibernate, JPA)
  • processing is configured with an XML configuration file – you need only a few lines of code to run the transformations
  • extensibility – implementing a custom Excel reader was relatively easy
  • low added filtering overhead – reading 100,000 CSV lines and storing them in the database using Hibernate took us less than 30 seconds

During development we had to overcome some hurdles imposed by the Smooks processing model. In this post I would like to share the practical experience we gained working with Smooks. First, I’m going to present a sample transformation use case with requirements similar to a real-world assignment. Then I will present solutions to these requirements in a ‘how-to’ style.

Use case

We are developing a ticketing application. The heart of our application is the Issue class:
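
The full Issue class is kept in the GitHub repository linked further below; as a rough orientation, a minimal sketch inferred from the fields and constructor used later in the test could look like this:

    import java.util.Date;

    // Minimal sketch only; Project and Priority (LOW, MEDIUM, HIGH) are the other
    // domain classes mentioned in the text and available in the repository.
    public class Issue {
        private String description;
        private Project project;
        private Priority priority;
        private String involvedPersons;
        private Date created;
        private Date updated;

        public Issue(String description, Project project, Priority priority,
                     String involvedPersons, Date created, Date updated) {
            this.description = description;
            this.project = project;
            this.priority = priority;
            this.involvedPersons = involvedPersons;
            this.created = created;
            this.updated = updated;
        }

        // getters, setters and equals()/hashCode() omitted for brevity
    }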

We have to write an import and conversion module for an external ticketing system. Data comes in CSV format (for the sake of simplicity). The domain model of the external system is slightly different from ours; however, issues coming from the external issue tracker can be mapped to our Issues.

The external system’s exchange format defines the following fields: description, priority, reporter, assignee, createdDate, createdTime, updatedDate, updatedTime. They should be mapped to our Issue in the following manner:

  • description property – description field
  • This is a simple Smooks mapping. No big issue.

  • project property – there is no project field. Project should be assigned manually
  • A constant object (from our domain model) must be passed to Smooks engine to be used in Java binding.
    See Assign constant object to a property.

  • priority property – priority field; P1 and P2 priorities should be mapped to Priority.LOW, P3 to Priority.MEDIUM, P4 and P5 to Priority.HIGH
  • This mapping could be done using an MVEL expression. However, we want to encapsulate this logic in a separate class that can be easily unit-tested. See Use an external object to calculate property value.

  • involvedPersons property – reporter field plus assignee field if not empty (append assignee using ‘;’ separator)
  • Set compound properties in Java binding will show how to achieve it.

  • created property – merge createdDate and createdTime fields
  • updated property – merge updatedDate and updatedTime fields
  • In Set datetime from date and time fields, two strategies will be presented.

    Before diving into details, I’m going to present the final Smooks configuration and the invocation code of the transformation (as a JUnit 4 test). Later on, in each recipe, I will explain the XML configuration and the Java code fragments relevant to that recipe.

    The remaining classes (Issue, Priority, Project, IssuePrioritizer) are not included in the text. You can browse the source code online on GitHub. To get your local copy, clone the Git repository:


    git clone git://github.com/mgryszko/blog-smooks-recipies.git


    smooks-config.xml

    (The complete smooks-config.xml is not reproduced here: the XML element tags were stripped when this post was rendered, leaving only text values such as the SAX filter setting, the MVEL expressions and the date patterns yyyy-MM-dd and HH:mm. Please refer to the file in the GitHub repository; the relevant fragments are explained recipe by recipe below.)


    SmooksRecipiesTest

    package com.tsl.smooks;
    
    // imports hidden
    
    public class SmooksRecipiesTest {
    
        private static final SimpleDateFormat DATETIME_FORMAT = new SimpleDateFormat("yyyy-MM-dd HH:mm");
    
        private Source importedIssues = new StringSource(
            "description,priority,reporter,assignee,createdDate,createdTime,updatedDate,updatedTime\n"
                + "Added phased initialization of javabean cartridge,P1,Ataulfo,Teodorico,2010-10-01,13:10,2010-10-10,20:01\n"
                + "Processing recursive tree like structures with the Javabean Cartridge,P3,Eurico,,2010-10-02,07:15,2010-11-15,09:45"
        );
    
        private Smooks smooks;
        private ExecutionContext executionContext;
    
        private List<Issue> expIssues;
        private Project expProject = new Project("Smooks");
    
        @Before
        public void initSmooks() throws Exception {
            smooks = new Smooks(getResourceFromClassLoader("smooks-config.xml"));
            executionContext = smooks.createExecutionContext();
            executionContext.getBeanContext().addBean("project", expProject);
            executionContext.getBeanContext().addBean("prioritizer", new IssuePrioritizer());
        }
    
        private InputStream getResourceFromClassLoader(String name) {
            return getClass().getClassLoader().getResourceAsStream(name);
        }
    
        @Before
        public void createExpIssues() throws Exception {
            expIssues = Arrays.asList(
                new Issue("Added phased initialization of javabean cartridge", expProject, Priority.LOW,
                    "Ataulfo;Teodorico", DATETIME_FORMAT.parse("2010-10-01 13:10"), DATETIME_FORMAT.parse("2010-10-10 20:01")
                ),
                new Issue(
                    "Processing recursive tree like structures with the Javabean Cartridge", expProject, Priority.MEDIUM,
                    "Eurico", DATETIME_FORMAT.parse("2010-10-02 07:15"), DATETIME_FORMAT.parse("2010-11-15 09:45")
                )
            );
        }
    
        @Test
        public void process() throws Exception {
            smooks.filterSource(executionContext, importedIssues);
    
            List<Issue> issues = (List<Issue>) executionContext.getBeanContext().getBean("issues");
            assertEquals(expIssues, issues);
        }
    }
    


    Assign a constant object (from your domain model) to a property

    According to the Smooks manual, the bean context is the place where the JavaBean cartridge puts newly created beans. We can also add our own bean (a Project instance):

    executionContext = smooks.createExecutionContext();
    executionContext.getBeanContext().addBean("project", new Project("Smooks"));
    

    … and reference it in the Java binding configuration:

    (Java binding snippet – the XML tags were lost in rendering. In the configuration, the Issue’s project property is wired to the project bean added above; see smooks-config.xml in the repository.)


    Use an external object to calculate property value

    Similar to the previous tip we add an additional bean (IssuePrioritizer) to the bean context:

    executionContext = smooks.createExecutionContext();
    executionContext.getBeanContext().addBean("prioritizer", new IssuePrioritizer());
    

    … and define an MVEL expression for the property. The MVEL expression uses the bean and references the value being processed (in this case coming from the CSV reader) by the implicit _VALUE variable:

    (Binding snippet – the XML tags were lost in rendering. The priority property is set with the MVEL expression prioritizer.assignPriorityFromCode(_VALUE); see smooks-config.xml in the repository.)
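
    The IssuePrioritizer class itself is not listed in the post; a plausible sketch, based only on the mapping rules stated in the use case (P1/P2 to Priority.LOW, P3 to Priority.MEDIUM, P4/P5 to Priority.HIGH), could look like this – the real implementation is in the repository:

    public class IssuePrioritizer {

        // Maps the external priority codes to our domain Priority values,
        // following the rules stated in the use case.
        public Priority assignPriorityFromCode(String code) {
            if ("P1".equals(code) || "P2".equals(code)) {
                return Priority.LOW;
            }
            if ("P3".equals(code)) {
                return Priority.MEDIUM;
            }
            if ("P4".equals(code) || "P5".equals(code)) {
                return Priority.HIGH;
            }
            throw new IllegalArgumentException("Unknown priority code: " + code);
        }
    }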


    Set compound properties in Java binding

    It is not possible to map two source fields directly to a single Java bean property: Java value and expression bindings are executed on a SAX visitAfter event bound to a single XML element/CSV field. We have to define a binding for a helper Map bean holding the fields we want to merge:

    (Binding snippet – tags lost in rendering. It defines a helper Map bean, transformedProps, that collects the raw reporter and assignee fields; see smooks-config.xml in the repository.)

    … and use an MVEL expression that concatenates two fields using the helper map bean (transformedProps):

    (Binding snippet – tags lost in rendering. The involvedPersons property is set with the following MVEL expression:)

        transformedProps["reporter"]
            + (org.apache.commons.lang.StringUtils.isNotBlank(transformedProps["assignee"]) ? ";" + transformedProps["assignee"] : "")


    Set datetime from date and time fields

    In this transformation we have to both merge and convert values of two fields.

    In the first solution, we create separate setters for the date part and the time part in the target Issue class (Smooks uses setters in Java bindings):

    public class Issue {
        ...
        public void setCreatedDatePart(Date createdDatetime) {
            createCreatedIfNotInitialized();
            copyDatePart(createdDatetime, created);
        }
    
        public void setCreatedTimePart(Date createdDatetime) {
            createCreatedIfNotInitialized();
            copyTimePart(createdDatetime, created);
        }
        ...
    }
    
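    The helper methods used by these partial setters are not shown in the post. One possible implementation, assuming java.util.Calendar is used to copy the relevant fields (the actual code is in the repository), might be:

    private void createCreatedIfNotInitialized() {
        if (created == null) {
            created = new Date(0);
        }
    }

    private void copyDatePart(Date source, Date target) {
        Calendar src = Calendar.getInstance();
        src.setTime(source);
        Calendar dst = Calendar.getInstance();
        dst.setTime(target);
        dst.set(src.get(Calendar.YEAR), src.get(Calendar.MONTH), src.get(Calendar.DAY_OF_MONTH));
        target.setTime(dst.getTimeInMillis());
    }

    private void copyTimePart(Date source, Date target) {
        Calendar src = Calendar.getInstance();
        src.setTime(source);
        Calendar dst = Calendar.getInstance();
        dst.setTime(target);
        dst.set(Calendar.HOUR_OF_DAY, src.get(Calendar.HOUR_OF_DAY));
        dst.set(Calendar.MINUTE, src.get(Calendar.MINUTE));
        target.setTime(dst.getTimeInMillis());
    }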

    … and then use a standard value binding with date decoder:

    (Binding snippet – tags lost in rendering. The createdDatePart and createdTimePart properties are bound with date decoders using the formats yyyy-MM-dd and HH:mm respectively; see smooks-config.xml in the repository.)

    The advantage of this approach is that you make use of the Smooks decoder infrastructure. You can configure the transformation with your own decoders (e.g. a custom java.util.Date decoder that accepts multiple date formats). If you are using the built-in DateDecoder, you can catch and handle a standard DataDecodeException.

    The disadvantage is that you have to change your domain model code. The new methods add complexity and must be unit tested, especially for cases where only one of the partial setters is called.

    In the second solution, you define a binding for a helper Map bean with the date and time fields. In the property binding itself you use an MVEL expression that concatenates the date and time strings and converts them to a Date (e.g. using a java.text.SimpleDateFormat instance):

    (Binding snippet – tags lost in rendering. The updated property is set with the following MVEL expression:)

        updated = transformedProps["updatedDate"] + " " + transformedProps["updatedTime"];
        new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm").parse(updated)

    The pros and cons here are the mirror image of the first solution. You don’t touch your Java classes, and it is simple – you only have to write Smooks configuration. On the other hand, if many date/time formats and combinations have to be handled, the MVEL expression defining the conversion can become complicated, and in case of an error you won’t get a DataDecodeException but an ugly, generic ExpressionEvaluationException.

    Conclusions

    Smooks is a great library that will save you from writing a lot of code when you have to process many different formats. With a few lines of code and the XML configuration you will be able to read a file and persist its contents in the database using your favourite ORM framework.

    However, the Smooks processing model and the way it is used by the built-in cartridges sometimes make it difficult to configure a transformation for a real-world requirement. The information provided in the user guide is sometimes scarce and unclear. I hope these practical cases will help you use Smooks Java bean mappings and MVEL expressions more effectively.

Genomic Processing in the Cloud

Introduction

Over the last decade, a new trend has manifested itself in the field of genomic processing. With the advent of the new generation of DNA sequencers has come an explosion in the throughput of DNA sequencing, resulting in the cost per base of generated sequences falling dramatically. Consequently, the bottleneck in sequencing projects around the world has shifted from obtaining DNA reads to the alignment and post-processing of the huge amount of read data now available. To minimize both processing time and memory requirements, specialized algorithms have been developed that trade off speed and memory requirements with the sensitivity of their alignments. Examples include the ultrafast memory-efficient short read aligners BWA and Bowtie, both based on the Burrows-Wheeler transform.

The need to analyse increasingly large amounts of genomics and proteomics data has meant that research labs allocate an increasing part of their time and budget provisioning, managing and maintaining their computational infrastructure. A possible solution to meet their needs for on-demand computational power is to take advantage of the public cloud. With its on-demand operational model, labs can benefit from considerable cost-savings by only paying for hardware when needed for their experiments. Procurement of new hardware is also simplified and more easily justified, without the need to expand in-house resources.

Not only does the cloud reduce the time to provision new hardware; it also provides significant time-savings by automating the installation and customization of the software that runs on top of the hardware. A controlled computational environment for the post-processing of experiments allows results to be more easily reproduced, a key objective for researchers across all disciplines. Results can also be easily shared among researchers, as cloud-based services facilitate the publishing of data over the internet while allowing researchers control over who can access it. Finally, data storage in the cloud was designed from the ground up with high availability and durability as key objectives. By storing their experiment data in the cloud, researchers can ensure their data is safely replicated among data centres. These advantages free researchers from time-consuming operational concerns, such as in-house backups and the provisioning and management of servers from which to share their experiment results.

Given the vast potential benefits of the cloud, The Server Labs is working with the Bioinformatics Department at the Spanish National Cancer Research Institute (CNIO) to develop a cloud-based solution that would meet their genomic processing needs.

An Environment for Genomic Processing in the Cloud

The first step towards carrying out genomic processing in the cloud is identifying a suitable computational environment, including hardware architecture, operating system and genomic processing tools. CNIO helped us identify the following software packages employed in their typical genomic processing workflows:

  • Burrows-Wheeler Alignment Tool (BWA): BWA aligns short DNA sequences (reads) to a sequence database such as the human reference genome.
  • Novoalign: Novoalign is a DNA short-read mapper implemented by Novocraft Technologies. The tool uses spaced-seed indexing to align either single or paired-end reads by means of the Needleman-Wunsch algorithm. The source code is not available for download; however, anybody may download and use the programs free of charge for research and other non-profit activities, as long as results are published in open journals.
  • SAM tools: After read alignment, one might want to call variants or view the alignments against the reference genome. SAM tools is an open-source package of software applications which includes an alignment viewer and a consensus base caller tool to provide lists of variants (somatic mutations, SNPs and indels).
  • BEDTools: This software facilitates common genomics tasks for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (.BED) format. BEDTools supports the comparison of sequence alignments, allowing the user to compare next-generation sequencing data with both public and custom genome annotation tracks. The BEDTools source code is freely available.

Note that, except for Novoalign, all software packages listed above are open source and freely available.

One of the requirements of these tools is that the underlying hardware architecture is 64-bit. For our initial proof of concept, we decided to run a base image with Ubuntu 9.10 for 64-bit on an Amazon EC2 large instance with 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each) and 850 GB of local instance storage. Once we had selected the base image and instance type to run on the cloud, we proceeded to automate the installation and configuration of the software packages. Automating this step ensures that no additional setup tasks are required when launching new instances in the cloud, and provides a controlled and reproducible environment for genomic processing.

By using the RightScale cloud-management platform, we were able to separate out the selection of instance type and base image from the installation and configuration of software specific to genomic processing. First, we created a Server definition for the instance type and operating system specified above. We then scripted the installation and configuration of genomic processing tools, as well as any OS customizations, so that these steps can be executed automatically after new instances are first booted.

Once our new instances were up and running and the software environment finalized, we executed some typical genomic workflows suggested to us by CNIO. We found that for their typical workflow with a raw data input between 3 and 20 GB, the total processing time on the cloud would range between 1 and 4 hours, depending on the size of the raw data and whether the type of alignment was single or paired-end. With an EC2 instance pricing at 38 cents per hour for large instances, and ignoring additional time required for customization of the workflow, the cost of pure processing tasks totalled less than $2 for a single experiment.

We also found the processing times to be comparable to running the same workflow in-house on similar hardware. However, when processing on the cloud, we found that transferring the raw input data from the lab to the cloud data centre could become a bottleneck, depending on the bandwidth available. We were able to work around this limitation by processing our data on Amazon’s European data centre and avoiding peak-hours for our data uploads.

Below, we include an example workflow for paired-end alignment that we successfully carried out in the Amazon cloud:

Example single-end alignment workflow executed in the cloud.

Maximizing the Advantages of the Cloud

In the first phase, we demonstrated that genomic processing in the cloud is feasible and cost-effective, while providing performance on par with in-house hardware. To truly realize the benefits of the cloud, however, we need an architecture that allows tens or hundreds of experiment jobs to be processed in parallel. This would allow researchers, for instance, to run algorithms with slightly different parameters to analyse the impact on their experiment results. At the same time, we would like a framework which incorporates all of the strengths of the cloud, in particular data durability, publishing mechanisms and audit trails to make experiment results reproducible.

To meet these goals, The Server Labs is developing a genomic processing platform which builds on top of RightScale’s RightGrid batch processing system. Our platform facilitates the processing of a large number of jobs by leveraging Amazon’s EC2, SQS, and S3 web services in a scalable and cost efficient manner to match demand. The framework also takes care of scheduling, load management, and data transport, so that the genomic workflow can be executed locally on experiment data available to the EC2 worker instance. By using S3 to store the data, we ensure that any input and result data is highly available and persisted between data centres, freeing our users from the need to backup their data. It also ensures that data can be more easily shared with the appropriate level of access control among institutions and researchers. In addition, the monitoring and logging of job submissions provides a convenient mechanism for the production of audit trails for all processing tasks.

The following diagram illustrates the main components of The Server Labs Genomic Processing Cloud Framework:

The Server Labs Genomic Processing Cloud Framework.

The Worker Daemon is based on The Server Labs Genomic Processing Server Template, which provides the software stack for genomic processing. It automatically pulls experiment tasks from the SQS input queue along with input data from S3 to launch the genomic processing workflow with the appropriate configuration parameters. Any intermediate and final results are stored in S3 and completed jobs are stored in SQS for auditing.
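
Conceptually, each worker runs a simple poll–process–store loop. The sketch below is purely illustrative: the JobQueue and ObjectStore interfaces stand in for the actual RightGrid/SQS/S3 integration, and the file locations are made up.

    // Illustrative sketch only; not the actual RightGrid worker code.
    interface JobQueue {
        Job poll();                       // pull the next experiment task, or null if none
        void reportCompleted(Job job);    // record the completed job for auditing
    }

    interface ObjectStore {
        void download(String key, java.io.File target);
        void upload(String key, java.io.File source);
    }

    class Job {
        String inputKey;        // key of the raw experiment data in the object store
        String resultKey;       // key under which results are stored
        String[] workflowArgs;  // alignment parameters for this run
    }

    class WorkerDaemon {
        private final JobQueue queue;
        private final ObjectStore store;

        WorkerDaemon(JobQueue queue, ObjectStore store) {
            this.queue = queue;
            this.store = store;
        }

        void run() {
            Job job;
            while ((job = queue.poll()) != null) {
                java.io.File input = new java.io.File("/data/input.fastq");   // hypothetical path
                store.download(job.inputKey, input);                          // fetch raw data locally
                java.io.File result = runWorkflow(input, job.workflowArgs);   // e.g. alignment pipeline
                store.upload(job.resultKey, result);                          // persist results
                queue.reportCompleted(job);                                   // audit trail
            }
        }

        private java.io.File runWorkflow(java.io.File input, String[] args) {
            // launch the genomic processing workflow, e.g. via ProcessBuilder
            return new java.io.File("/data/results");                         // hypothetical path
        }
    }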

Cost Analysis

Given a RightGrid-based solution for genomic processing, we would like to analyse how much it would cost to run CNIO’s workflows on the Amazon cloud. Let us assume for the sake of our analysis that CNIO runs 10 experiments in an average month, each of which generates an average of 10 GB of raw input data and produces an additional 20 GB of result data. For each of these experiments, CNIO wishes to run 4 different workflows, with an average running time of 2 hours on a large EC2 instance. In addition, we assume that the experiment results are downloaded once to the CNIO in-house data centre. We also assume that the customer already has a RightScale account, the cost of which is not included in the analysis.

Cost breakdown by Amazon service:

  • SQS: Negligible
  • S3:
      • Data transfer in: $0.10 per GB * 10 GB per workflow = $1.00 per workflow
      • Data transfer out: 1 download per workflow * 20 GB per download * $0.15 per GB = $3.00 per workflow
      • Storage: 30 GB per workflow * $0.14 per GB = $4.20 per workflow
      • S3 total: $8.20 per workflow
  • EC2: $0.38 per hour * 2 hours per workflow = $0.76 per workflow
  • All services: $8.20 + $0.76 = $8.96 per workflow

Total Cost:
10 experiments per month * 4 workflows per experiment * $8.96 per workflow =
$358.40 per month or $4300 per year
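
A quick sanity check of the arithmetic, using only the prices and volumes assumed above:

    public class CostEstimate {
        public static void main(String[] args) {
            double s3In = 0.10 * 10;        // data transfer in:  $1.00 per workflow
            double s3Out = 0.15 * 20;       // data transfer out: $3.00 per workflow
            double s3Storage = 0.14 * 30;   // storage:           $4.20 per workflow
            double ec2 = 0.38 * 2;          // compute:           $0.76 per workflow
            double perWorkflow = s3In + s3Out + s3Storage + ec2;  // $8.96
            double perMonth = 10 * 4 * perWorkflow;               // $358.40
            System.out.printf("workflow: $%.2f, month: $%.2f, year: $%.2f%n",
                    perWorkflow, perMonth, 12 * perMonth);        // about $4,300 per year
        }
    }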

Towards an On-demand Genomic Processing Service

By building on the RightGrid framework, The Server Labs is able to offer a robust cloud-based platform on which to perform on-demand genomic processing tasks, at the same time enabling experiment results to be more easily reproduced, stored and published. To make genomic processing on the cloud even simpler, the on-demand model can be taken one step further by providing pay-as-you-go software as a service. In such a model, researchers are agnostic to the fact that the processing of their data is done in the cloud. Instead, they would interact with the platform via a web interface, where they would be able to upload their experiment’s raw input data, select their workflow of choice, and choose whether or not to share their result data. They would then be notified asynchronously via email once the processing of their experiment data has been completed.

References

Low Cost, Scalable Proteomics Data Analysis Using Amazon’s Cloud Computing Services and Open Source Search Algorithms

How to map billions of short reads onto genomes

The RightGrid batch-processing framework

SCA Async/Conversational services Part 2: Non-SCA Web Service client

Following my previous post on the internals of asynchronous and conversational services in Tuscany SCA, which options are available for consuming these services when you cannot use a Tuscany SCA runtime on the client?

Depending on the transport binding used, you would expect to find a certain level of standardisation in how conversational/asynchronous services are implemented, allowing interoperable solutions. However, it is difficult, if not impossible, to find interoperable solutions for these types of services, even for highly standardised bindings such as SOAP Web Services. We saw in the previous post how Tuscany SCA handles these services for the WS and JMS bindings, and that, at least for Web Services, some WS-* standards were used (i.e. WS-Addressing), but there are still solution-specific details that prevent interoperability. This reflects how difficult it is for the standardisation community to define detailed standards and mechanisms that enable interoperable solutions for these kinds of services.

In this post I present an option for this situation: using a standard JAX-WS client and attaching the appropriate WS-Addressing headers using JAXB. To do that, I extend the pure Tuscany SCA client-server sample of the previous post with a non-SCA client.

You can find the extended source code of the maven projects here.

Overview

The two main pieces to set up on a JAX-WS client are:

  1. Proper WS-Addressing headers containing Tuscany’s specific information (i.e. callbackLocation, conversationID and optionally callbackID).
  2. A web service implementing the callback interface, listening at the callbackLocation URI.

JAX-WS WS-Addressing headers setup

As with normal WSDL-first client development, the first step is to generate the JAX-WS client classes from the Counter Service WSDL. This is done using the Maven jaxws-maven-plugin on the WSDL file generated for the service (i.e. http://localhost:8086/CounterServiceComponent?wsdl).

To set up the SOAP WS-Addressing headers, I follow the recommendations from the JAX-WS guide and use the WSBindingProvider interface, which offers much better control over how headers are added.

Therefore, in the constructor of CounterServiceClient.java, the WS-Addressing “TuscanyHeader” is added:

     public CounterServiceClient() {
        super();
        
        // Initialise JAX-WS Web Service Client.
        service = new CounterServiceService();
        servicePort = service.getCounterServicePort();
  
        // Set the WS-Addressing headers to use by the client.
        WSBindingProvider bp = (WSBindingProvider) servicePort;
        // Generate UUID for Tuscany headers
        conversationId = UUID.randomUUID().toString();
        callbackId = UUID.randomUUID().toString();
        ReferenceParameters rp = new ReferenceParameters(conversationId);
        rp.setCallbackId(callbackId);
        TuscanyHeader header = new TuscanyHeader(CALLBACKURI, rp);        
        bp.setOutboundHeaders(Headers.create((JAXBRIContext) jaxbContext,
                header));
        
        ...
    }        

The ReferenceParameters and TuscanyHeader classes are JAXB classes with the required information to map the TuscanyHeader Java object to the WS-Addressing XML header. For instance, the ReferenceParameters class, which includes the Tuscany SCA parameters, has the following JAXB definitions:

public class ReferenceParameters {
    private static final String TUSCANYSCA_NS = "http://tuscany.apache.org/xmlns/sca/1.0";
    /** The conversation id. */
    private String conversationId = "";
    /** The callback id. */
    private String callbackId = null;

    ...
    public void setCallbackId(String callbackId) {
        this.callbackId = callbackId;
    }

    public void setConversationId(String conversationId) {
        this.conversationId = conversationId;
    }

    @XmlElement(name = "CallbackID", namespace = TUSCANYSCA_NS, required = false)
    public String getCallbackId() {
        return callbackId;
    }

    @XmlElement(name = "ConversationID", namespace = TUSCANYSCA_NS, required = true)
    public String getConversationId() {
        return conversationId;
    }
}
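
The TuscanyHeader class is not listed in the post. Assuming, as discussed in the previous post, that Tuscany expects the callback location and the reference parameters inside the WS-Addressing From header, a hypothetical JAXB mapping could look roughly like the following (the element names and the WS-Addressing namespace are assumptions to be checked against the sample code):

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "From", namespace = TuscanyHeader.WSA_NS)
public class TuscanyHeader {
    static final String WSA_NS = "http://www.w3.org/2005/08/addressing";

    /** The callback endpoint location (callbackLocation). */
    private String address;
    /** The Tuscany-specific reference parameters (conversationID, callbackID). */
    private ReferenceParameters referenceParameters;

    public TuscanyHeader() {
    }

    public TuscanyHeader(String address, ReferenceParameters referenceParameters) {
        this.address = address;
        this.referenceParameters = referenceParameters;
    }

    @XmlElement(name = "Address", namespace = WSA_NS)
    public String getAddress() {
        return address;
    }

    @XmlElement(name = "ReferenceParameters", namespace = WSA_NS)
    public ReferenceParameters getReferenceParameters() {
        return referenceParameters;
    }

    // setters omitted; only marshalling is needed for the outbound header
}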

JAX-WS callback service setup

The callback service needs to be created on the client side. For that we define a simple web service (CounterServiceCallbackImpl.java) with the CounterServiceCallback as the contract.

/**
 * The CounterServiceCallback implementation.
 * The Web service namespace must be the same as the one defined in the Server Side.
 */
@WebService(targetNamespace = "http://server.demo.tuscany.sca.tsl.com/")
public class CounterServiceCallbackImpl implements CounterServiceCallback {
	private int count;

	@Override
	public void receiveCount(int count, int end) {
		System.out.println("CounterServiceCallback --- Received Count: " + count + " out of " + end);
		this.count = count;
	}
        ...
}

In this sample project, the web service is also started during the instantiation of the JAX-WS client. We use the embedded Sun HTTP server to publish the web service instead of relying on a web application server such as Jetty or Tomcat:

     public CounterServiceClient() {
        super();
        ...
        
        // Setup Receiver Callback Web Service
        System.out.println("Starting Callback service in " + CALLBACKURI);
        callback = new CounterServiceCallbackImpl();
        callbackServiceEp = Endpoint.create(callback);
        callbackServiceEp.publish(CALLBACKURI);
    }        

Testing the client

As in the previous post, in order to run the client tests you need to:

  • Run the server side, executing the run-server.sh under the counter-callback-ws-service directory. The server component should start and wait for client requests.
  • Go to the client project, counter-callback-ws-client-SCA and execute the tests with:
    mvn clean test
    

    This runs both the SCA and non-SCA client tests.

  • In case you want to run two clients at the same time, you need to define two separate callback URIs. For that purpose the property sca.callbackuri has been defined to configure the URI to use. Therefore, to run two clients in parallel, execute
    mvn  test -Dsca.callbackuri="http://localhost:1999/callback" -Dtest=com.tsl.sca.tuscany.demo.client.WebServiceNonSCAClientTest 
    and another client with 
    mvn  test -Dsca.callbackuri="http://localhost:2999/callback" -Dtest=com.tsl.sca.tuscany.demo.client.WebServiceNonSCAClientTest 
    

    Once running, you should see the different service calls and conversation IDs interleaved in the server-side console:

    setCounterServiceCallback on thread Thread[Thread-2,5,main] 
    ConversationID = 81e257cd-af3b-4ea6-ae69-4b8e0d9db8a9. Starting count 
    Sleeping ...
    setCounterServiceCallback on thread Thread[Thread-5,5,main]                 
    ConversationID = 70e86dcf-436c-4df1-ab6b-165b3f9070a4. Starting count   
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    ConversationID = 81e257cd-af3b-4ea6-ae69-4b8e0d9db8a9. Stopping count
    Sleeping ...
    ConversationID = 70e86dcf-436c-4df1-ab6b-165b3f9070a4. Stopping count
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    ConversationID = 81e257cd-af3b-4ea6-ae69-4b8e0d9db8a9. Restarting count
    Sleeping ...
    ConversationID = 70e86dcf-436c-4df1-ab6b-165b3f9070a4. Restarting count
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Sleeping ...
    Thread interrupted.
    Thread interrupted.
    

Conclusions

I hope that these posts have shed some light on how SCA conversational/asynchronous services are implemented, taking Tuscany as the reference SCA runtime. It is also important to know which options are available for consuming an SCA service without an SCA runtime, and how this can be done in a simple manner.