Refinery Platform Documentation

The Refinery Platform documentation is split up into three major sections: one for users, one for administrators and one for developers.

User Documentation

This section of the Refinery Platform documentation is intended for users of the web application.

Administrator Documentation

This section of the Refinery Platform documentation is intended for administrators of a Refinery instance.

Setting up a Refinery Instance from Scratch

Dependencies

Python

Refinery requires Python 2.7.3. For package dependencies please see requirements.txt. Requirements can be installed using pip as follows:

> pip install -U -r requirements.txt

You might need to install NumPy manually before running the above command (you can find the version in requirements.txt). For example:

> pip install numpy==1.7.0
Virtual Environment

We highly recommend creating a virtualenv for your Refinery installation.

External Software
PostgreSQL
Apache HTTP Server
Apache Solr

Refinery uses Solr for searching and faceted browsing.

Website
http://lucene.apache.org/solr
Version
4.0.0-alpha or later
Configuration

We recommend running Solr using the bundled Jetty web server. The Solr example configuration included in the standard download is sufficient and can be started like this:

cd <solr-download-directory>
java -Dsolr.solr.home=<refinery-installation-directory>/solr/ -jar start.jar > <path-to-solr-log-file> 2>&1 &

By default, Jetty will allow connections to Solr from any IP address. This is not secure and is not required to run Refinery. We recommend allowing connections to Solr only from localhost. Note that this requires Solr to run on the same host as Refinery. If Solr must run on another host, change the IP address used below accordingly.

To configure Jetty to only accept connections from localhost do the following:

  1. Go to <solr-download-directory>/etc.

  2. Open jetty.xml.

  3. Locate <Call name="addConnector"> in jetty.xml. Be aware that the default jetty.xml file contains an addConnector block that is commented out.

  4. Supply a default value of “127.0.0.1” for the jetty.host system property used to configure Host as follows:

    <Set name="Host"><SystemProperty name="jetty.host" default="127.0.0.1"/></Set>
    
  5. Make sure that the jetty.host system property is not set.

  6. Restart Jetty using the command shown above.

  7. In the settings_local.py of your Refinery installation configure REFINERY_SOLR_BASE_URL as follows:

    REFINERY_SOLR_BASE_URL = "http://localhost:8983/solr/"
    
  8. Restart the WSGI server running Refinery to reload your settings.

RabbitMQ

This is the preferred message broker for the Celery distributed task queue. Refinery uses Celery and RabbitMQ to handle long-running tasks.

Website
http://www.rabbitmq.com
Version
???

Settings

Refinery settings are configured in settings_local.py.

Note

To avoid conflicts when upgrading, you should never edit settings directly in settings.py.

Database Settings

DATABASES
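DATABASES follows the standard Django database configuration. A minimal PostgreSQL configuration for settings_local.py might look like the following sketch (the database name, user, and password are placeholders to adjust for your instance):

```python
# Placeholder values -- adjust NAME, USER and PASSWORD for your instance.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'refinery',
        'USER': 'refinery',
        'PASSWORD': '<password>',
        'HOST': 'localhost',
        'PORT': '',
    }
}
```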

Solr Settings
REFINERY_SOLR_BASE_URL = "http://localhost:8983/solr/"
Location of the Solr API.
Email Settings

EMAIL_HOST = 'localhost'

EMAIL_PORT = 25

DEFAULT_FROM_EMAIL = 'webmaster@localhost'

SERVER_EMAIL = 'root@localhost'
The email address that error messages come from, such as those sent to ADMINS and MANAGERS.
Customization Settings
TIME_ZONE = 'America/New_York'
Local time zone for this installation. Choices can be found at http://en.wikipedia.org/wiki/List_of_tz_zones_by_name, although not all choices may be available on all operating systems. On Unix systems, a value of None will cause Django to use the same timezone as the operating system. If running in a Windows environment this must be set to the same as your system time zone.
REFINERY_PUBLIC_GROUP_NAME = "Public"
Set the name of the group that is used to share data with all users (= “the public”)
REFINERY_PUBLIC_GROUP_ID = 100
Do not change this after initialization of your Refinery instance.

ISA_TAB_DIR = ''

FILE_STORE_DIR = 'file_store'
Location of the file store data directory relative to MEDIA_ROOT.
REFINERY_SOLR_SPACE_DYNAMIC_FIELDS = "_"
Used to replace spaces in the names of dynamic fields during Solr indexing.
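For illustration, with the default value an attribute name containing spaces would be converted like this (a simple sketch of the substitution; the actual indexing code may differ):

```python
REFINERY_SOLR_SPACE_DYNAMIC_FIELDS = "_"

# Spaces in attribute names are replaced before the name is used
# as a Solr dynamic field name.
field_name = "Cell Type".replace(" ", REFINERY_SOLR_SPACE_DYNAMIC_FIELDS)
# field_name is now "Cell_Type"
```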
REFINERY_CSS = ["styles/css/refinery-style-bootstrap.css", "styles/css/refinery-style-bootstrap-responsive.css", "styles/css/refinery-style.css" ]
List of paths to CSS files used to style Refinery pages (relative to STATIC_URL)
REFINERY_GOOGLE_ANALYTICS_ID = ""
Supply a Google Analytics ID “UA-...” (if set to “” tracking will be deactivated).
EMAIL_SUBJECT_PREFIX = '[Refinery] '
Prefix for emails sent by Refinery. Should always end with a space.
REFINERY_REPOSITORY_MODE = False
Set to True to activate Refinery repository mode.
ACCOUNT_ACTIVATION_DAYS = 7
Number of days user has to activate their account before it expires.
REFINERY_WELCOME_EMAIL_SUBJECT = 'Welcome to Refinery'
Subject of the welcome email sent to new users.
REFINERY_WELCOME_EMAIL_MESSAGE = 'Please fill out your user profile'
Message body of the welcome email sent to new users.
REFINERY_FILE_SOURCE_MAP = {}
Optional dictionary for translating file URLs into file system paths (and vice versa). Format: {‘pattern’: ‘replacement’}, where pattern is a string to search for in the source and then replace with the replacement string. May contain more than one pattern-replacement pair (only the first match will be used).
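For example, a hypothetical mapping that rewrites URLs from a local download server into file system paths (the URL and path below are placeholders) would be applied roughly like this:

```python
# Hypothetical pattern-replacement pair; adjust for your instance.
REFINERY_FILE_SOURCE_MAP = {
    'http://example.org/data/': '/data/refinery/',
}

def translate(source, source_map):
    """Sketch of the translation described above, not Refinery's actual
    implementation: substitute the first matching pattern in source."""
    for pattern, replacement in source_map.items():
        if pattern in source:
            return source.replace(pattern, replacement)
    return source
```

With this mapping, 'http://example.org/data/sample.fastq' would be translated to '/data/refinery/sample.fastq', while sources matching no pattern are returned unchanged.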
REFINERY_BANNER = ''
Optional string to display a message near the top of every page (HTML tags allowed).
REFINERY_BANNER_ANONYMOUS_ONLY = False
Optional setting to display REFINERY_BANNER to anonymous users only.
REFINERY_REGISTRATION_CLOSED_MESSAGE = ''
Optional string to display a message when REGISTRATION_OPEN = False (HTML tags allowed).
REFINERY_INNER_NAVBAR_HEIGHT = 20
Set height of navigation bar (e.g. to fit a logo).
REFINERY_MAIN_LOGO = ""
Supply a path to a logo that will become part of the branding (set navbar height correctly!)
REFINERY_EXTERNAL_AUTH = False
Use external authentication system like django-auth-ldap (disables password management URLs)
REFINERY_EXTERNAL_AUTH_MESSAGE = ''
Message to display on password management pages when REFINERY_EXTERNAL_AUTH = True
TAXONOMY_URL = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"
Location of the archive that contains the entire NCBI taxonomy database.
UCSC_URL = "hgdownload.cse.ucsc.edu/admin/hgcentral.sql"
Database of all UCSC genomes, alternate names, and their corresponding organisms.
AE_BASE_QUERY = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments?'
Base query for what kind of ArrayExpress studies to pull in (e.g. only ChIP-Seq studies, or studies updated after a certain date)
AE_BASE_URL = "http://www.ebi.ac.uk/arrayexpress/experiments"
Prefix of the URL where the MAGE-TAB files of all ArrayExpress studies can be accessed.
Authentication settings

Example for user authentication via LDAP using django-auth-ldap:

import ldap
from django_auth_ldap.config import LDAPSearch
# Baseline configuration
AUTH_LDAP_SERVER_URI = "ldap://ldap.example.com"
AUTH_LDAP_BIND_DN = ""
AUTH_LDAP_BIND_PASSWORD = ""
AUTH_LDAP_USER_SEARCH = LDAPSearch("OU=Domain Users,DC=rc,DC=Domain",
                                   ldap.SCOPE_SUBTREE, "(uid=%(user)s)")
# Populate Django user from the LDAP directory.
AUTH_LDAP_USER_ATTR_MAP = {
   "first_name": "givenName",
   "last_name": "sn",
   "email": "mail"
}
settings.AUTHENTICATION_BACKENDS += (
    'refinery.core.models.RefineryLDAPBackend',
)

Refinery is a Django-based web application implemented in Python and JavaScript. For a full list of all external dependencies please see Dependencies.

Installation

The easiest way to install Refinery is to follow the instructions in the README file. Note: the installation process will fail if any of the ports forwarded from the VM are in use on the host machine (please see the Vagrantfile for the list of ports). After the installation has finished, you will need to create a Django superuser:

> python manage.py createsuperuser

To instantiate administrator-modifiable content on the Refinery website, e.g., the contents of the “About” page, load the default content into the database:

> python manage.py loaddata core/fixtures/default-pages.json

Obtaining the Software

The source code for Refinery can be downloaded from the Github repository either by cloning the repository or by downloading a zip archive.

Settings

Before Refinery can be installed, a number of variables - so-called “settings” - have to be configured. In addition to the settings discussed here, please also see the complete list of all Refinery Settings that can be customized.

Galaxy

Galaxy is required to run analyses in Refinery and to provide support for archiving.

Website
https://bitbucket.org/galaxy/galaxy-dist
Version
Aug 12, 2013 Galaxy Distribution
Notes

Refinery running in the VM can access a Galaxy instance running on the host at http://192.168.50.1:8080

On the host you will need to:

  • Set the $GALAXY_DATABASE_DIR environment variable to the absolute path of the $GALAXY_ROOT/database folder of your local Galaxy instance installed on the host if you want to copy files directly to it.
  • Create a symlink /vagrant/media that points to the absolute path of the media subdirectory inside the Refinery project directory.

Upgrading an Existing Refinery Instance

Migrations

First:

> ./manage.py syncdb

Next:

> ./manage.py migrate --list

Preparing Galaxy Workflows for Refinery

To import a Galaxy workflow into Refinery, you first have to annotate the workflow. The amount of annotation required is minimal, and you can conveniently add the annotation for the workflow in the Galaxy workflow editor.

In a nutshell, you have to provide simple Python dictionaries (see examples below if you are not familiar with Python) in the “annotation” text fields for the workflow and corresponding tools. These fields can be found on the right side of the workflow editor.

Annotation fields must either be empty or contain correctly formatted annotation dictionaries as described below. If other information is found in an annotation field, you will not be able to import the workflow into Refinery.
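One way to check an annotation string locally before pasting it into Galaxy is to parse it with Python's ast.literal_eval. The helper below is just a convenience sketch, not part of Refinery or Galaxy:

```python
import ast

def is_valid_annotation(text):
    """Return True if text is empty or a correctly formatted Python
    dictionary, as required by Refinery annotation fields."""
    text = text.strip()
    if not text:
        return True
    try:
        return isinstance(ast.literal_eval(text), dict)
    except (ValueError, SyntaxError):
        return False
```

For example, is_valid_annotation('{"refinery_type": "analysis"}') returns True, while free-form text in the annotation field returns False.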

Workflow-Level Annotations

For Refinery to recognize a Galaxy workflow as a Refinery Workflow, you need to provide a set of simple annotations in the workflow annotation field in the Galaxy workflow editor. The annotation field is listed under “Edit Attributes” on the right side of the workflow editor.

Note

The annotation fields in the Galaxy workflow editor behave slightly differently for workflow-level and tool-level annotations. In order to confirm changes to a workflow-level annotation, move the cursor to the end of the input field and hit the Return key. This is not required in tool-level annotation fields. Be sure to save the workflow after editing an annotation field.

The workflow-level annotation is a Python dictionary with the following keys:

refinery_type: string
Required | This field is used to tell Refinery how it should treat the workflow. Refinery Workflows are either analysis or (bulk) download workflows. The outputs of analysis Workflows will be inserted into the Data Set and connected to their inputs in the experiment graph. The outputs of bulk download Workflows are assumed to be archive files (zip files, tarballs) and will be associated with the Data Set but will appear in a list of available downloads.
refinery_relationships: array of dictionaries

Optional | This field is used to describe relationships between inputs of the Workflow. For example, a Workflow that performs peak-calling on ChIP-seq data requires that each ChIP file is associated with one input file (= genomic background). Such relationships are described using a dictionary with three fields:

category: string
Required | Describes the type of the relationship between files and can be one of 1-1, 1-N, N-1, REPLICATE.
set1: string
Required | For 1-1, 1-N, N-1 and REPLICATE relationships, this must refer to the name of the corresponding workflow input, for example to the input used for the ChIP file.
set2: string
Required (not for REPLICATE relationships) | For 1-1, 1-N and N-1, this must refer to the name of the corresponding workflow input, for example to the input used for the input file (= genomic background).

Schematic workflow annotation (indentation only for better readability):

{
        "refinery_type": "<workflow_type>",
        "refinery_relationships": [
                {
                        "category": "<relationship_type>",
                        "set1": "<name_of_input_1>",
                        "set2": "<name_of_input_2>"
                }
        ]
}
Examples

A standard analysis workflow with a single input would be annotated as follows:

{
        "refinery_type": "analysis"
}

A download workflow would be annotated like this:

{
        "refinery_type": "download"
}

A more complex analysis workflow with two inputs and a 1-1 relationship between those inputs would be annotated as follows (the name fields of the two input datasets are set to “ChIP file” and “input file”, respectively):

{
        "refinery_type": "analysis",
        "refinery_relationships": [
                {
                        "category": "1-1",
                        "set1": "ChIP file",
                        "set2": "input file"
                }
        ]
}

Tool-Level Annotations

In order to import output files generated by a tool in the workflow into Refinery, the tool has to be annotated. To access the annotation field for a tool, click on the tool representation in the workflow editor. The annotation field is named “Annotation / Notes”.

Note

You have to annotate at least one tool and one output file. Workflows that do not declare outputs for import into Refinery will not be imported.

As with workflow-level annotations, the annotation needs to be provided as a Python dictionary. In order to import output files of the tool back into Refinery, the tool-level annotation dictionary needs to contain a key that matches an output declared by the tool, for example "output_file".

This key must be associated with a further dictionary that provides a name that will be used to import the file into Refinery. Optionally, a description can be provided to further explain the content of the output file, as well as a file type if the file extension provided by Galaxy is not sufficient to detect the actual file type automatically. This is typically the case when Galaxy uses “data” as the file extension.

name: string
Required | A descriptive name for the output file. If output files from multiple tools in the workflow are imported back into Refinery, it is recommended to include the name of the tool in the file name.
description: string
Optional | A description of the file. This will be shown in the description of the workflow outputs.
type: string
Optional | The abbreviation/extension of a file type registered in Refinery.

Schematic tool annotation (indentation only for better readability)

{
        "<tool_output_1>": {
                "name": "<filename_1>",
                "description": "<description_1>",
                "type": "<extension_1>"
        },
        "<tool_output_2>": {
                "name": "<filename_2>",
                "description": "<description_2>",
                "type": "<extension_2>"
        }
}
Example

The following example uses indentation for better readability. Indentation is not required.

{
        "output_narrow_peak": {
                "name": "spp_narrow_peak",
                "description": "",
                "type": "bed"
        },
        "output_region_peak": {
                "name": "spp_region_peak",
                "description": "",
                "type": "bed"
        },
        "output_plot_file": {
                "name": "spp_plot_file",
                "description": "",
                "type": "pdf"
        }
}

Importing Galaxy Workflows into Refinery

Before you can import Workflows from a Galaxy installation into Refinery, the following requirements have to be met:

  • You have to add a Galaxy Instance for the Galaxy installation in question to Refinery through the admin UI.

  • You have to create a Workflow Engine for this Galaxy Instance using the create_workflowengine command, which requires a Galaxy Instance id and the name of a group that should own the workflow engine, e.g. “Public”.

    > python manage.py create_workflowengine <instance_id> "<group_name>"
    

    Alternatively, you can create a workflow engine through the admin UI; in that case, however, you have to manually assign ownership to the managers of the group that should own the workflow engine.

  • You have to annotate all workflows in the Galaxy installation that you want to import.

Once these requirements have been met, run the import_workflows command:

> python manage.py import_workflows

This command will attempt to import Workflows from all Workflow Engines registered in your Refinery server. All Galaxy workflows that are annotated as Refinery Workflows will be parsed and imported if annotated correctly. Annotation errors will be reported, as well as the total number of Workflows imported from each Workflow Engine.

Existing Workflows in your Refinery server will be deactivated but not deleted. Deactivated workflows can no longer be executed but their information can be accessed through the Analyses in which they were run.

Adding a Genome Build into Refinery for Visualizations

Adding a genome build to Refinery is a two-step process: first, you add the taxon information for your organism, and then you create the associated genome build. Both steps are done through the admin interface.

Before logging into the admin interface, however, we need to obtain the taxonomy information for our organism, so go to NCBI’s taxonomy browser (http://www.ncbi.nlm.nih.gov/taxonomy) and search for your organism. You should eventually end up on a page that looks something like this:

_images/taxonpage.png

Keep the page open and go to the Refinery admin interface. After logging in, navigate to the Annotation Server and click on Taxons, then click on Add taxon. From there, you will be brought to a form with four fields:

  • Taxon id: NCBI taxon ID
  • Name: the name of the organism
  • Unique name: the scientific name of the organism if the name provided in the above field is not unique across all possible names (e.g. C. elegans can refer to multiple species)
  • Type: type of name (e.g. scientific name, an abbreviation, the common name, etc.)

The NCBI taxonomy page in your browser will help you fill in all of the values. See the picture below for what information on the page goes where. Please note that you need to create a new entry for every name that you use. So in our example below, if you wished to put all of these names in the database, you would create separate entries for Homo sapiens, human, man, and Homo sapiens Linnaeus, 1758 in addition to any other names you might wish to create (e.g. H. sapiens).

_images/taxonpage-markedup.png

In the above image, the important pieces of information have been highlighted in colored boxes. Below are two examples.

_images/taxon-example-human.png

Because the scientific name has no type associated with it, please annotate the type field with “scientific name.” This is the official type designated by NCBI.

_images/taxon-example-homosapiens.png

Even though H. sapiens is not listed on the taxonomy page, many people use a species’ abbreviated name when annotating their data, so fill out the form accordingly.

_images/taxon-example-hsapiens.png

Now that the taxon information has been filled in for your organism, you can input the information for the genome build you’d like to support. Click Annotation server again in the admin interface and this time click Genome builds. Fill in the fields to the best of your knowledge, making sure to have the species point to the taxon that uses the full scientific name. Below are two examples. Please make sure only one genome build for each organism is selected as the default.

_images/genomebuild-example-hg19.png _images/genomebuild-example-grch37.png

Please note that while it is not required that you fill in a UCSC equivalent for any non-UCSC genome builds provided, we are currently considering the UCSC genome builds to be the standard, so we’d prefer that it exist.

Developer Documentation

This section of the Refinery Platform documentation is intended for developers who are contributing to the Refinery core and extensions.

The source code for the Refinery Platform is available in the repository.

Development Environment

This section of the Refinery Platform documentation describes setting up Eclipse for Refinery development.

Eclipse Project defaults

Main Module:

${workspace_loc:refinery-platform}/${DJANGO_MANAGE_LOCATION}

Program arguments:

runserver --noreload

Working directory:

${workspace_loc:}

Git

Make sure to use the SSH repository URL (instead of HTTPS) if you want to push code to Github without entering username and password.

> git remote set-url origin git@github.com:parklab/refinery-platform.git

License

The Refinery Platform license is very similar to the MIT License but contains an additional clause that prohibits the use of the names of the copyright holders in most circumstances.

Copyright (c) 2011-2013 The President and Fellows of Harvard College.
All rights reserved. Copyright (c) 2011-2013 Boston Children's
Hospital. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

Except as contained in this notice, the name of Harvard University
and Boston Children's Hospital or any affiliate shall not be used in
advertising, publicity, news release or otherwise to promote the
sale, use or other dealings in this Software without prior written
authorization by Harvard University and Boston Children's Hospital.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.