Most proteins are composed of smaller (structural) modules: protein domains. The evolution of proteins involves the rearrangement, loss or gain of protein domains in a process referred to as modular evolution.
DoMosaics is a tool for analysing and visualizing aspects of modular protein evolution. Starting with a set of related protein sequences, users can annotate protein domains (using different domain annotation methods) and visualize domain arrangements (the N- to C-terminal order of domains in a protein) along a phylogenetic tree. It can be used to find out whether a domain of interest was lost, whether a group of proteins differs in its domain arrangements, or what the characteristic domains for a phylogenetic group are.
Beyond domain annotation, DoMosaics allows users to examine domain annotations to see whether the domains have been correctly identified. You can create dotplots, run a context-dependent similarity analysis and more. Furthermore, you can also run some general phylogenetic profiling, create domain network graphs, find domain insertion/deletion events or simply visualize and manipulate phylogenetic trees. All output can be saved as PDF or as SVG (which can be viewed in Firefox or edited using e.g. Adobe Illustrator or Inkscape) as well as in a number of common bitmap formats.
What DoMosaics is
What DoMosaics is not
This is the guide / documentation of DoMosaics. Use the menu at the top to browse by topic. If you have a problem, or cannot find what you are looking for, please do not hesitate to get in touch with us at firstname.lastname@example.org.
DoMosaics requires a working Java installation. Furthermore, we recommend that users who plan to run local HMMSCAN jobs have at least a dual-core system. You can find the appropriate Java version for your system at http://java.com/en/download/index.jsp. If you plan to run HMMSCAN you will furthermore require the HMMER package, which you can obtain from http://hmmer.janelia.org/software.
To get DoMosaics, navigate to the website www.domosaics.net and go to the download section. Here, you can run DoMosaics using Java Web Start. To do so, select the JVM size (e.g. 500MB) and press 'Start DoMosaics'. Your browser will download a JNLP file and ask you to confirm the security exception. Note that DoMosaics requires full permission to your system in order to interact with processes, write files etc. Once you confirm the security settings, the program should start. If you prefer, you can also download the jar and run it manually (or compile the program from source). Choose the tab 'Direct Download' to get the binaries and source files.
If your browser does not autostart the jar file after download, it is likely that your browser is not set up to handle JNLP files. If so, when you start DoMosaics you will see a dialog asking you what to do with the file. In the window, click on 'Open with', select the javaws executable (depending on your operating system, the name can be different) and press OK. If the 'Open with' option does not display the Java Web Start executable, map the .jnlp file to Java Web Start manually. This procedure differs between browsers, so it is best to search for the right solution for your OS/browser setup.
[12:59] radmoore@joshua ~/Download $ java -jar domosaics.jar
Users of other systems should be able to double-click the jar file.
InterPro (read section Interproscan below) can be used to find domains in a set of protein sequences. This service is offered by the EBI. In order to use this free public service, users must provide a valid email address. While an email address can be provided in the user interface of interproscan and for each scan separately, it will be lost as soon as the dialog is closed. Users can provide a global email address here, which will be pre-filled in the interproscan form.
Note: the email address provided is used for interproscan only. Find out more on the conditions for using EBI Webservices here.
If you use a local HMMER installation, you can pre-define the binaries and profiles you use for scanning here. For more information on what these fields mean (and what they are good for), consult the section on Hmmscan below.
Workspace
DoMosaics project data are saved in folders, with all relevant data saved in xml files. The project folders are saved in a user-defined workspace folder (typically $user_home/domosaics-workspace). If DoMosaics does not have a workspace folder set upon start up, users are prompted to enter one. The workspace folder can be changed any time here.
Users are prompted whether the workspace is to be synchronized upon exit (in which case any changes made to existing projects, as well as any new projects, will be written to the workspace folder). This can be set as default by selecting the check-box 'Save Workspace on Exit'. If a project already exists it can be overwritten (users will normally be prompted whether existing project files are to be overwritten). This too can be set as default.
Help improve
Should any errors occur while using DoMosaics, a simple error report can be submitted to a dedicated bugzilla instance. By agreeing to 'Help improve DoMosaics' you make us aware of any Java exceptions which might occur while using DoMosaics.
Submitted error reports do not contain any information on the data used. However, they do contain a dump of the current user's environment, which often contains the username of the current user as well as directory names and paths (along with other, purely machine-based information). The bug reports are used for the sole purpose of simplifying the reconstruction of the error (so that it can be fixed). We greatly appreciate your help in eradicating any errors which may occur.
For example, the sequence entry

>P82176
MKCLLYLCLWCYCVLVSSSIVLICNGGHEYYECGGACDNVCADLHIQNKTNCPIINIRCN
DKCYCEDGYARDVNGKCIPIKDCPKIRSRRSIGIPVDKKCCTGPNEHYDEEKVSCPPETC
ISLVAKFSCIDSPPPSPGCSCNSGYLRLNLTSPCIPICDCPQMQHSPDCQ

would match a protein in the domain view named P82176. Everything between the '>' and the first white space is treated as the ID.
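The ID rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the parser DoMosaics uses; the function name `fasta_ids` is hypothetical.

```python
def fasta_ids(text):
    """Return the ID of every FASTA header line found in `text`.

    The ID is everything between '>' and the first white space.
    `fasta_ids` is a hypothetical helper, not part of DoMosaics.
    """
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()  # split on any white space
            ids.append(fields[0] if fields else "")
    return ids


entry = """>P82176
MKCLLYLCLWCYCVLVSSSIVLICNGGHEYYECGGACDNVCADLHIQNKTNCPIINIRCN
"""
print(fasta_ids(entry))
```

IDs obtained this way are what has to agree with the protein names in the domain view for sequences and arrangements to be matched.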
>ENSP00000376776 617
57 171 DOMON 2.0e-25
213 341 Cu2_monooxygen 7.5e-43
360 521 Cu2_monoox_C 2.3e-52

ENSP00000376776 consists of three domains (DOMON, Cu2_monooxygen, Cu2_monoox_C) and has a sequence length of 617. In the xdom format, domains are sorted in N- to C-terminal order. Each domain line contains the start and stop position, some form of ID or accession number and an E-value.
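To make the layout concrete, here is a minimal Python sketch of reading xdom text. It assumes one header line per protein (`>ID length`) followed by one line per domain (`start stop name E-value`), as in the example entry; the class and function names are hypothetical and this is not the DoMosaics parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Domain:
    start: int
    stop: int
    name: str
    evalue: float


@dataclass
class Arrangement:
    protein_id: str
    length: int
    domains: List[Domain] = field(default_factory=list)


def parse_xdom(text):
    """Parse xdom-formatted text into a list of Arrangement objects."""
    arrangements = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith(">"):
            # Header line: ">ID length" (assumed layout)
            arrangements.append(Arrangement(fields[0][1:], int(fields[1])))
        else:
            if not arrangements:
                raise ValueError("domain line found before any header line")
            start, stop, name, evalue = fields[:4]
            arrangements[-1].domains.append(
                Domain(int(start), int(stop), name, float(evalue))
            )
    return arrangements


xdom_text = """>ENSP00000376776 617
57 171 DOMON 2.0e-25
213 341 Cu2_monooxygen 7.5e-43
360 521 Cu2_monoox_C 2.3e-52
"""
for arrangement in parse_xdom(xdom_text):
    print(arrangement.protein_id, [d.name for d in arrangement.domains])
```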
Note: Running an interproscan search requires a working internet connection.
Hmmscan is a program from the HMMER package. It is used to search protein sequences against a database of profile hidden Markov models (HMMs), such as those stored in Pfam. HMMSCAN is used to detect the presence of domains in proteins.
Note: you can set default values for the binaries and profiles in the settings, found under Main menu, File, Settings.
Both Hmmscan and Hmmpress binaries can be downloaded as part of an archive from http://hmmer.janelia.org. Once downloaded, extract the archive and look for the 'binaries' folder. In there, you should find the required binaries (Hmmscan and Hmmpress). If you cannot find an archive for your OS, or the binaries do not work, you will have to compile from source - please check the HMMER manual for instructions on how to do so.
HMMER3 scan bin
Note that you can conduct searches and construct profiles on the HMMER website.
To assert the presence of a domain in a protein, thresholds are generally used to minimize the number of false positives. The threshold can be either manually curated for each domain model (see the HMMER manual for a description of the gathering threshold) or automatically set to an E-value (confidence) cut-off for all models. In DoMosaics, if the confidence cutoff is selected, the per-model gathering threshold will be used. Otherwise, a global E-value cut-off can be entered.
Filters
HMMER3 applies a number of filters which amount to a drastic increase in speed. However, the downside of these filters is that false negatives can be expected (domains wrongly discarded by filters). HMMER, and DoMosaics, provide the option to turn off some of the filters. The "bias filter" corrects for biased amino-acid composition (low complexity), and the "max filter" option deactivates all filters (at the cost of much longer running times).
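As a rough illustration of how threshold, filter and CPU settings map onto an hmmscan command line, the following Python sketch assembles an argument list. The function and its parameters are hypothetical, and this is not necessarily how DoMosaics invokes hmmscan internally; the flags themselves (`--cut_ga`, `-E`, `--domE`, `--nobias`, `--max`, `--cpu`, `--domtblout`) are standard HMMER3 options.

```python
def build_hmmscan_command(hmmscan_bin, profile_db, fasta_file, out_file,
                          use_gathering=True, evalue=None,
                          disable_bias_filter=False,
                          disable_all_filters=False, cpus=2):
    """Assemble an hmmscan argument list (hypothetical helper).

    use_gathering        -> --cut_ga (per-model gathering thresholds)
    evalue               -> -E / --domE (global E-value cut-off)
    disable_bias_filter  -> --nobias
    disable_all_filters  -> --max (slowest, most sensitive)
    """
    cmd = [hmmscan_bin, "--domtblout", out_file, "--cpu", str(cpus)]
    if use_gathering:
        cmd.append("--cut_ga")
    elif evalue is not None:
        cmd += ["-E", str(evalue), "--domE", str(evalue)]
    if disable_all_filters:
        cmd.append("--max")
    elif disable_bias_filter:
        cmd.append("--nobias")
    # hmmscan expects the profile database first, then the sequence file
    cmd += [profile_db, fasta_file]
    return cmd


print(" ".join(build_hmmscan_command(
    "hmmscan", "Pfam-A.hmm", "sequences.fasta", "hits.domtblout")))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. The file names are placeholders, and the profile database has to be prepared with hmmpress before hmmscan can use it.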
The Number of CPUs specifies on how many CPUs the scan will be performed.
Post processing
Hmmscan returns a list of all domain occurrences detected in a protein, which might contain overlaps. This frequently occurs with homologous domain families (specializations of an ancestral domain). DoMosaics has different methods for dealing with overlaps:
As part of the post processing of domain annotation, DoMosaics allows for context-dependent annotation as a means to provide more complete domain annotation. The objective of context-dependent annotation is to detect remote/divergent but relevant additional domains by using similar arrangements to find possibly missed annotation.
Confidence thresholds are meant to guarantee the (quasi-)absence of false positives. However, such a stringent policy leads to numerous missed domains, especially in remote or divergent species. This can be addressed by lowering the detection threshold, but at the cost of an exponentially increasing number of false positives. The Co-Occurring Domain Detection (CODD) approach (described in Terrapon et al., Bioinformatics, 2009) exploits the tendency of domain families to appear with a highly reduced set of "collaborating" other families. It filters out the false positives and retains only the most trustworthy domains that barely miss the recommended (pre-defined) thresholds.
For more information on RADS, visit http://rads.uni-muenster.de
DoMosaics provides an interface to the RADS web service, and allows users to find arrangements in UniProt which are similar to a given query. There are two distinct ways of accessing the RADS interface from within DoMosaics:
Note: It is important to note that
You can load a fasta entry from a file, or load an existing sequence view. If this route is chosen, the sequences are first annotated (remotely, via the RADS webservice) using the Pfam domain definitions. After this step, the resulting arrangement is used as a query to find similar arrangements in UniProt.
Note that if your file or view contains multiple entries, only a single entry will be chosen for the scan. For a file with multiple fasta entries the first entry will be chosen, whereas for a sequence view with multiple entries an arbitrary sequence from the view will be used (as the entries in a sequence view are not ordered). In other words, using a view to search only makes sense if you know all arrangements are the same (or if you expect them to be very similar, as could be the case for a set of orthologous proteins).
Loading a domain arrangement
You can also load a domain arrangement (in xdom format) from a file. As with sequences, only one entry will be chosen from the file (the first); if a domain view is chosen, the selection will be (as for sequences) arbitrary. Again, this only makes sense if the arrangements in the arrangement view are the same (or very similar). Furthermore, as the current implementation of RADS/RAMPAGE is based on Pfam domain definitions, your search will be unsuccessful if you use arrangements with domains other than Pfam. If you have arrangements with other domain definitions, consider using the sequence instead.
Choosing a search algorithm
Both RADS and RAMPAGE can be used for the search (see above). A RAMPAGE search requires sequences, and implies an initial RADS search. In most cases, the score values can remain as they are ('Set defaults' will restore the default scores if you have changed them).
The results table
Once the search has completed, 'Show results' will list all hits that were found (ordered by score). Some information about the query, including a graphical representation of its arrangement, is displayed at the top. Select hits of interest by clicking the import checkbox (or use the select all button to import all hits). Click on 'Import selection' to import your selection. You can associate the scan hits with an existing project from the dropdown menu.
Note: It is important to know that DoMosaics is not a program for building trees.
The proper construction of a phylogenetic tree is a complex endeavor, and is best done with dedicated software such as RAxML or FastTree. That said, DoMosaics is able to perform a simple tree construction by using the PAL java library for phylogenetic analysis. In the first step, the sequences are aligned using EBI's interface to ClustalW. The resulting alignment is then used to construct a tree.
Choose a dataset
If you do not have sequences associated with the arrangements, you can build a tree based on a distance matrix created from differences in the domain arrangements (see using domains for tree creation below). Of course, you can select a regular sequence view for tree creation. Note that selecting a sequence view will clear any arrangement views you may have selected.
Creating the tree
After selecting the dataset, click 'Next'. Select a substitution matrix from the top dropdown, and choose a tree construction method from the bottom one. Click 'Finish' and, once the tree construction is complete, give the new tree view a name (which must be unique) and associate it with a project.
Once you have constructed a tree, you might consider creating a domain-tree view.
A domain tree is a tree with leaves associated to arrangements. It is a central element of DoMosaics, both for the analysis of domain arrangements across a protein family, as well as for the visualization of related proteins with their domain structure:
Views belong to projects and are displayed on the left-hand side of the workspace. Every view in a project is one of four different view types.
Each view is associated with its own view panel (main window on the right), and comes with a distinct set of functionalities organised in a main menu (at the top of the view panel), and context menus available by a mouse right-click on items within the view. Within one project in the workspace, each of these four view types is a separate category (node in the tree):
Note that each of these four types is associated with a specific icon. Domain arrangements that are associated with sequences are signified by an additional 's' in the icon. Views can be selected by double-clicking. An active view is displayed in the main view panel on the right.
There are a number of functionalities in each view. Here, we will only illustrate some of the features specific to DoMosaics. All views provide means for exporting back-end data to files (e.g. sequence view to fasta) or to graphics (pixel and vector).
The sequence view is the simplest of all views. Sequence views provide means for the user to maintain the association between a set of sequences and corresponding domain annotation or trees. Furthermore, sequence views can be selected wherever sequences are needed (e.g. for domain annotation). They can also be created from other views, for example through the use of the 'Select sequence' functionality in the Domain-view (see below).
When a tree view is open, the main view panel (on the right) will display the tree. A number of tree-specific options are available under 'View' in the main menu (e.g. expand leaves or show bootstrap values). Tree visuals can be adjusted under 'Edit' (e.g. change edge weights or fonts).
A right mouse click on nodes in the tree will show the node context menu, from which a number of operations are available such as highlight subtree, rotate children, re-root tree, collapse/expand etc. Note that functionalities available in a tree are also available in a domain tree.
The domain view is the most central view in DoMosaics. As with the sequence and tree views, domain views have a distinct menu along the top of the main view panel. This menu provides the following functionalities:
Beyond the main menu, a number of operations are triggered in context, that is, by right-clicking on an arrangement (that is, between domains of an arrangement) or within a domain of an arrangement. The former provides the context menu of the arrangement, while the latter provides the context menu of the selected domain.
A domain tree view is a combination of a tree view and a domain view. The view panel menu and context menus are mixed, and allow access to the view-specific functionality from both the domain and the tree view.
The Jaccard similarity coefficient is used for comparing sets, and is computed by dividing the size of the intersection of two given sets by the size of their union. Ergo, a Jaccard index of 1 signifies that two given sets are identical. The Jaccard distance is computed by subtracting the Jaccard similarity coefficient from 1; it ranges between 0 and 1. In DoMosaics, the compared sets correspond to domain sets.
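A small Python sketch of both quantities follows. The function names are hypothetical, and treating two empty domain sets as identical is an assumption of this sketch rather than documented DoMosaics behaviour.

```python
def jaccard_similarity(domains_a, domains_b):
    """Size of the intersection divided by size of the union of two domain sets."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def jaccard_distance(domains_a, domains_b):
    """1 minus the Jaccard similarity; ranges from 0 (identical) to 1 (disjoint)."""
    return 1.0 - jaccard_similarity(domains_a, domains_b)


full = ["DOMON", "Cu2_monooxygen", "Cu2_monoox_C"]
partial = ["DOMON", "Cu2_monooxygen"]
print(jaccard_similarity(full, partial))  # 2 shared domains out of 3 in the union
print(jaccard_distance(full, partial))
```

Because sets are compared, the order of domains and the number of repeats of a domain do not influence this distance.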
The Domain distance is an edit distance, in which the difference between two strings is measured as the number of operations necessary to transform one string into the other. Arrangements are strings of a protein's constituent domains. As such, two arrangements can be compared by their edit distance, using insertions and deletions as operations. A large domain distance indicates that a large number of operations are necessary to go from one arrangement to another, and may be indicative of a large evolutionary distance between the two considered proteins.
The distance matrix provides an overview of the distances between arrangements; users can choose which distance is to be used under 'Options'. The IDs/names of the proteins as they appear in the corresponding view are found along the axes; choosing a cell in the matrix will highlight the two proteins which are being compared. The distances can be exported to a .csv file for further processing.
Datasets can be filtered by these distance matrices. To do so, select an arrangement in a domain or domain-tree view, go to 'Edit' in the submenu and choose 'Collapse by similarity'. For example, you can filter the dataset by changing the maximum allowed number of edit operations using a sliding bar.
Heads up! The domain graph functionality of DoMosaics is still experimental.
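The idea can be sketched with a standard dynamic-programming edit distance restricted to insertions and deletions of whole domains. This is an illustrative implementation under that assumption, with a hypothetical function name; the exact algorithm used inside DoMosaics is not described here.

```python
def domain_distance(arrangement_a, arrangement_b):
    """Minimum number of domain insertions and deletions needed to turn
    arrangement_a into arrangement_b (both are lists of domain names in
    N- to C-terminal order). Substitutions are not used."""
    # prev[j] holds the distance between the already processed prefix of
    # arrangement_a and the first j domains of arrangement_b
    prev = list(range(len(arrangement_b) + 1))
    for i, dom_a in enumerate(arrangement_a, start=1):
        curr = [i]  # deleting the first i domains of arrangement_a
        for j, dom_b in enumerate(arrangement_b, start=1):
            if dom_a == dom_b:
                curr.append(prev[j - 1])  # domains match, no operation
            else:
                # delete dom_a (prev[j]) or insert dom_b (curr[j - 1])
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


full = ["DOMON", "Cu2_monooxygen", "Cu2_monoox_C"]
print(domain_distance(full, ["DOMON", "Cu2_monoox_C"]))  # one deletion
```

Unlike a set-based comparison, this distance is sensitive to domain order and to repeated domains.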
In a domain graph, domains are displayed as nodes and edges connect domains which co-occur within at least one protein. Domain graphs can be undirected, where the edges have no direction and hence are indicative only of co-occurrences, or directed, where edges have directions indicating the order in which they appear in proteins. In DoMosaics, domain graphs are directed (from N to C-terminus).
The threshold slider at the bottom can be used to visually highlight parts of the graph which have at least x neighbours. This is particularly useful for very large datasets.
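The following Python sketch shows one way such a directed graph and the neighbour threshold could be computed. It is a hedged illustration: it connects every pair of domains that co-occur in a protein (not only adjacent ones), ignores self-loops from repeated domains, and uses hypothetical function names; DoMosaics itself may construct its graph differently.

```python
def build_domain_graph(arrangements):
    """Map each domain to the set of domains found C-terminal of it in at
    least one arrangement (each arrangement is a list of domain names in
    N- to C-terminal order)."""
    graph = {}
    for domains in arrangements:
        for i, upstream in enumerate(domains):
            targets = graph.setdefault(upstream, set())  # node exists even without edges
            for downstream in domains[i + 1:]:
                if downstream != upstream:  # assumption: no self-loops for repeats
                    targets.add(downstream)
    return graph


def nodes_with_min_neighbours(graph, x):
    """Return the nodes that have at least x distinct neighbours,
    counting edges in either direction."""
    neighbours = {node: set() for node in graph}
    for source, targets in graph.items():
        for target in targets:
            neighbours[source].add(target)
            neighbours.setdefault(target, set()).add(source)
    return {node for node, adjacent in neighbours.items() if len(adjacent) >= x}


arrangements = [["A", "B", "C"], ["A", "C"], ["D"]]  # placeholder domain names
graph = build_domain_graph(arrangements)
print(graph)
print(nodes_with_min_neighbours(graph, 2))
```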
A Dotplot is a graphical method for conducting a pairwise sequence comparison. By comparing all positions of the first sequence to all positions of the second, matches (regions of similarity) of a given length (the word length) can be identified. Using the two sequences as axes of a rectangular plot, a dot is placed wherever words from the two sequences are similar.
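A minimal Python sketch of the word-based comparison is given below. For simplicity it places a dot only where two words are identical, which is an assumption of this sketch; a scored comparison using a substitution matrix and a score cut-off would replace the exact lookup. The function name is hypothetical.

```python
def dotplot(seq_a, seq_b, word_length=3):
    """Return (i, j) pairs at which the word starting at position i of seq_a
    equals the word starting at position j of seq_b (0-based positions)."""
    # Index every word of seq_b by its start positions
    index = {}
    for j in range(len(seq_b) - word_length + 1):
        index.setdefault(seq_b[j:j + word_length], []).append(j)
    dots = []
    for i in range(len(seq_a) - word_length + 1):
        for j in index.get(seq_a[i:i + word_length], []):
            dots.append((i, j))
    return dots


print(dotplot("MKCLL", "KCLLY", word_length=3))
```

Runs of dots along a diagonal correspond to longer regions of similarity between the two sequences.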
Domain-dotplots are an extension of a regular dotplot. Given two selected domain arrangements with an underlying sequence attached (recognizable by a rectangular [s] which appears on mouse over on domains in the domain-view), a standard dotplot of amino-acids is created (with customisable substitution matrix, word length and score).
Additionally, each domain of one protein is paired with its counterparts (domains of the same family) in the other protein, and each pairing is represented by a square annotated with the average conservation. This adds to the global sequence conservation the information about which domains are the most conserved and, for example in the case of multiple repeats, which domains are likely paralogs.
This tool has been designed to post process domain annotation with loose (not stringent) E-value thresholds. It includes 3 filters of domain annotations: an e-value filter, an overlap resolver, and the Co-Occurring Domain Detection (CODD) method.
The E-value filter allows a dynamic visualization of the domain arrangements of proteins. Starting with a domain annotation conducted with very high E-value thresholds, the user can play with the threshold settings and see the impact on domain arrangements (as opposed to iteratively re-annotating with varying thresholds).
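Conceptually, such a filter only needs the E-value stored with each domain, so it can be re-applied instantly with different cut-offs. The Python sketch below illustrates this under the assumption that domains are held as `(start, stop, name, evalue)` tuples; the function name is hypothetical.

```python
def filter_by_evalue(domains, max_evalue):
    """Keep only the domains whose E-value does not exceed max_evalue.

    Each domain is a (start, stop, name, evalue) tuple; the input order
    (N- to C-terminal) is preserved.
    """
    return [dom for dom in domains if dom[3] <= max_evalue]


annotation = [
    (57, 171, "DOMON", 2.0e-25),
    (213, 341, "Cu2_monooxygen", 7.5e-43),
    (360, 521, "Cu2_monoox_C", 2.3e-52),
]
# Try a loose and a strict cut-off on the same annotation
for cutoff in (1e-10, 1e-30):
    print(cutoff, [dom[2] for dom in filter_by_evalue(annotation, cutoff)])
```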
The overlap resolver cleans up a domain annotation which, after a "complete domain annotation process", might contain overlapping domains at the same position. Only one domain occurrence is supposed to be present/active at any one position of a protein. The usual overlap resolver is based on E-values: it retains the domain occurrences with the best E-values and incrementally removes the domains that overlap them. Additionally, we propose a resolver that maximizes the coverage of the protein by domains. These options make it possible to retrieve "standard arrangements" from overlapping ones. However, in some cases a user might be interested in keeping overlaps, as they need not result from erroneous annotation caused by misleading amino-acid sequence similarities, but might instead reflect evolutionary relationships between domains. In this latter case, one must question which function is actually present, especially when scores are high and E-values are significant.
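Both strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not the DoMosaics implementation: domains are `(start, stop, name, evalue)` tuples with inclusive coordinates, the E-value resolver is a simple greedy pass, and the coverage resolver is a textbook weighted interval scheduling. All names and the example coordinates are hypothetical.

```python
from bisect import bisect_right


def overlaps(a, b):
    """True if two domains share at least one position (inclusive coordinates)."""
    return a[0] <= b[1] and b[0] <= a[1]


def resolve_by_evalue(domains):
    """Greedy: visit domains from best to worst E-value and keep a domain
    only if it does not overlap one that was already kept."""
    kept = []
    for dom in sorted(domains, key=lambda d: d[3]):
        if not any(overlaps(dom, other) for other in kept):
            kept.append(dom)
    return sorted(kept, key=lambda d: d[0])


def resolve_by_coverage(domains):
    """Choose non-overlapping domains covering the most residues."""
    doms = sorted(domains, key=lambda d: d[1])  # sort by stop position
    stops = [d[1] for d in doms]
    best = [(0, [])]  # best[k]: (coverage, chosen) using the first k domains
    for k, dom in enumerate(doms):
        # p = number of earlier domains that end before this one starts
        p = bisect_right(stops, dom[0] - 1, 0, k)
        covered = best[p][0] + (dom[1] - dom[0] + 1)
        if covered > best[k][0]:
            best.append((covered, best[p][1] + [dom]))
        else:
            best.append(best[k])
    return best[-1][1]


# Hypothetical overlapping annotation
hits = [(10, 100, "A", 1e-20), (80, 150, "B", 1e-30), (140, 300, "C", 1e-5)]
print([d[2] for d in resolve_by_evalue(hits)])
print([d[2] for d in resolve_by_coverage(hits)])
```

In this made-up example the two resolvers disagree: the E-value resolver keeps only the most significant hit, whereas the coverage resolver prefers the two flanking domains because together they cover more of the protein.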
The Co-Occurring Domain Detection (CODD) method (Terrapon et al., 2009) has been designed for the context-dependent detection of divergent domains in remote species, or of domains in species which exhibit a high amino-acid composition bias. CODD works as a filter on putative domains (domains which do not satisfy the defined thresholds) by removing the most doubtful domains and retaining the most reliable ones. The filter takes advantage of the known property of domain co-occurrence: most domain families are observed in proteins together with only a few other "favorite" (function collaboration/compatibility) domain families. Hence, using a list of correlated domain pairs, CODD can certify the presence of putative domains based on the asserted presence of others. This method is able to increase domain annotation in cases of divergence/bias while rejecting the majority of the false putative domains that result from loosening the E-value thresholds.
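The core certification step can be illustrated with a deliberately simplified Python sketch. It assumes that the list of correlated domain pairs is already available and that a putative domain is accepted as soon as it forms a correlated pair with at least one asserted domain in the same protein. How the pair list is derived and any further checks performed by the published method are outside this sketch; the function and family names are hypothetical.

```python
def codd_filter(asserted, putative, correlated_pairs):
    """Return the putative domains supported by at least one asserted domain.

    asserted         -- domain families that passed the regular thresholds
    putative         -- families detected only with loosened thresholds
    correlated_pairs -- iterable of (family_1, family_2) pairs known to co-occur
    """
    pairs = {frozenset(pair) for pair in correlated_pairs}  # unordered pairs
    certified = []
    for candidate in putative:
        if any(frozenset((candidate, known)) in pairs for known in asserted):
            certified.append(candidate)
    return certified


# Placeholder family names, for illustration only
pairs = [("FamA", "FamB"), ("FamC", "FamD")]
print(codd_filter(asserted=["FamA"], putative=["FamB", "FamD"], correlated_pairs=pairs))
```

Here only FamB is retained, because FamD has no asserted partner in the protein.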