MMTSB
Tool Set Documentation

cluster.pl

Latest revision as of 03:27, 28 July 2009

Usage

usage:   cluster.pl [options] [files]
options: [-jclust] [-kclust]
         [-maxnum value] [-minsize value] [-maxlevel value]
         [-radius value] [-[no]iterate]
         [-mode rmsd|contact|phi|psi|phipsi|mix]
         [-contmaxdist value] [-mixfactor value]
         [-pdb | -sicho]
         [-selmode ca|cb|cab|heavy|all]
         [-l min:max[=min:max ...]] [-fitxl]
         [-[no]lsqfit]
         [-centroid] [-centout template]
         [-log file]


Description

This script applies a clustering algorithm to a set of files. The list of files can be passed either as the last arguments on the command line or through standard input. The clustering result is written to standard output.
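
As a rough illustration of the two ways of passing the file list, the Python sketch below builds either an argument list with the files as the last arguments or a text payload for standard input. The helper names are hypothetical, and the one-file-name-per-line layout for standard input is an assumption; the description above does not state the expected separator.

```python
import subprocess
from typing import List, Sequence


def build_argv(options: Sequence[str], files: Sequence[str] = ()) -> List[str]:
    """Return the command line: cluster.pl, then options, then files last."""
    return ["cluster.pl", *options, *files]


def stdin_payload(files: Sequence[str]) -> str:
    """Join file names for standard input (ASSUMPTION: one name per line)."""
    return "".join(f"{name}\n" for name in files)


def run_cluster(options: Sequence[str], files: Sequence[str],
                use_stdin: bool = False) -> str:
    """Run cluster.pl and return the clustering result from standard output."""
    if use_stdin:
        argv, payload = build_argv(options), stdin_payload(files)
    else:
        argv, payload = build_argv(options, files), None
    done = subprocess.run(argv, input=payload, capture_output=True,
                          text=True, check=True)
    return done.stdout
```

Passing the list through standard input may be convenient when a shell wildcard would expand to more arguments than the command line allows.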

Two clustering methods are available. Hierarchical clustering with automatic stopping criteria is used by default or if the option -jclust is given. With this method the ideal number of clusters is determined automatically, up to a maximum number of clusters. The default maximum is 4 clusters; it may be changed with -maxnum. For each of these initially identified clusters the clustering procedure is then reapplied to determine subclusters if the number of elements in the cluster is larger than a given threshold. This threshold is set to 400 structures by default and can be changed with -minsize. The procedure is repeated recursively as long as there are large enough subclusters or until a maximum recursion level is reached. The level is set to 999 by default, which means practically no limit; it can be changed with -maxlevel.
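
The interplay of -maxnum, -minsize and -maxlevel can be pictured with the following Python sketch of the recursion control only. It is not the hierarchical clustering binary's algorithm: the split function is a caller-supplied stand-in for the automatic determination of clusters, the toy splitter halve exists purely for demonstration, and both the level counting (top level = 1) and the boundary case of the size threshold are assumptions.

```python
from typing import Callable, Dict, List, Sequence

# A splitter takes the elements and the maximum number of clusters.
Splitter = Callable[[Sequence[str], int], List[List[str]]]


def subcluster(items: Sequence[str], split: Splitter, maxnum: int = 4,
               minsize: int = 400, maxlevel: int = 999,
               level: int = 1) -> Dict:
    """Recursively split `items`; returns a nested dict describing the tree."""
    node = {"elements": list(items), "children": []}
    # Stop once the recursion limit is exceeded (level counting is assumed).
    if level > maxlevel:
        return node
    # `split` is a hypothetical stand-in returning at most `maxnum` groups.
    groups = split(items, maxnum)
    if len(groups) < 2:
        return node  # nothing to subdivide
    for group in groups:
        if len(group) >= minsize:  # boundary case (>= vs >) is an assumption
            child = subcluster(group, split, maxnum, minsize,
                               maxlevel, level + 1)
        else:
            child = {"elements": list(group), "children": []}
        node["children"].append(child)
    return node


def halve(items: Sequence[str], maxnum: int) -> List[List[str]]:
    """Toy splitter for demonstration only: cut the list into two halves."""
    if len(items) < 2 or maxnum < 2:
        return [list(items)]
    mid = len(items) // 2
    return [list(items[:mid]), list(items[mid:])]
```

With the defaults, a cluster is only subdivided further when it is large, so small data sets typically end after the first level.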

An alternative clustering method employs a fixed cluster radius and is selected with -kclust. Subclusters are not generated in this case, and the resulting number of clusters cannot be limited with -maxnum. The fixed cluster radius is given with -radius.

Depending on the cluster method, different cluster modes are available that use different criteria for measuring distances between the input structures. The cluster mode is selected with -mode. Both cluster methods support clustering based on the Cartesian coordinate RMSD between structures (rmsd). With the hierarchical clustering algorithm it is also possible to cluster based on similarities in the contact map (contact). In this case the option -contmaxdist determines the maximum distance between residues for a pair to be counted in the comparison. The default value is 12.0 Å.

The fixed radius clustering method supports dihedral-based clustering using RMSD values for phi angles (phi), psi angles (psi), or both (phipsi). A mixing mode is also supported, in which the distance measure is the sum of two terms: the sum of the phi and psi RMSD values, divided by 20 and multiplied by a mixing factor, plus the Cartesian space RMSD multiplied by 1 minus the mixing factor. This mode is selected with mix. The mixing factor can be set with -mixfactor.
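
The mixed distance described above can be written out as a small Python function. This is a direct transcription of the sentence, not code taken from the clustering binary; the function name is hypothetical, and the units (degrees for the dihedral RMSD values, Å for the Cartesian RMSD) are an assumption.

```python
def mixed_distance(phi_rmsd: float, psi_rmsd: float, cartesian_rmsd: float,
                   mixfactor: float) -> float:
    """Combine dihedral and Cartesian RMSD as described for mode mix."""
    # Sum of phi and psi RMSD, scaled down by the fixed divisor 20.
    dihedral_term = (phi_rmsd + psi_rmsd) / 20.0
    # mixfactor weights the dihedral term; the remainder weights the RMSD.
    return mixfactor * dihedral_term + (1.0 - mixfactor) * cartesian_rmsd
```

A mixing factor of 0 reduces the measure to the plain Cartesian RMSD, and a factor of 1 leaves only the scaled dihedral term.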

In the default mode of operation the input structures are expected to be in PDB format (option: -pdb). Alternatively, SICHO lattice chains can be used as input if -sicho is given.

For loop/fragment modeling it is possible to restrict the comparison to a range of residues specified with the -l option. In this case the fit between structures that precedes the RMSD calculation may be based either on the loop/fragment residues (default) or, if -fitxl is specified, on the rest of the protein excluding the loop/fragment.
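
To make the range syntax and the effect of -fitxl concrete, the Python sketch below parses a specification of the form min:max[=min:max ...] and selects the residues that would be used for the fit. The function names are hypothetical, and the sketch only mirrors the description above; it does not reproduce the script's own parsing code. Range ends are treated as inclusive, in line with the example that describes -l 10:21 as residues 10 through 21.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]


def parse_residue_ranges(spec: str) -> List[Range]:
    """Parse 'min:max[=min:max ...]' into a list of (min, max) tuples."""
    ranges = []
    for part in spec.split("="):
        low_text, high_text = part.split(":")
        low, high = int(low_text), int(high_text)
        if low > high:
            raise ValueError(f"invalid residue range: {part}")
        ranges.append((low, high))
    return ranges


def in_ranges(resnum: int, ranges: Sequence[Range]) -> bool:
    """True if the residue number lies in any range (ends inclusive)."""
    return any(low <= resnum <= high for low, high in ranges)


def fit_residues(residues: Sequence[int], ranges: Sequence[Range],
                 fitxl: bool = False) -> List[int]:
    """Residues used for the fit: the range itself, or the rest with fitxl."""
    return [r for r in residues if in_ranges(r, ranges) != fitxl]
```

In both cases the RMSD itself is still evaluated over the residues selected with -l; only the set of residues used for the superposition changes.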

A log file can be requested if the option -log is used. It is also possible to request output of the centroids of each cluster with -centroid. The option -centout is used to provide a template for the centroid file names. The centroids are written in PDB format if the clustering is based on RMSD comparison. For contact map based clustering the centroids are RGB maps with the contact map represented as a bitmap.

This tool calls external binaries to carry out the clustering. The environment variables KCLUST and JCLUST can be used to set the location of the binaries for K-means and hierarchical clustering, respectively.
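
When cluster.pl is started from another program, the two variables can be placed in the child environment. The Python sketch below shows one way to do this; the helper name and the example paths are hypothetical, and it is an assumption that each variable holds the full path of the corresponding binary.

```python
import os
from typing import Dict, Mapping, Optional


def cluster_environment(kclust: Optional[str] = None,
                        jclust: Optional[str] = None,
                        base: Optional[Mapping[str, str]] = None
                        ) -> Dict[str, str]:
    """Return a copy of the environment with KCLUST and/or JCLUST set."""
    env = dict(os.environ if base is None else base)
    if kclust is not None:
        env["KCLUST"] = kclust  # location of the K-means binary
    if jclust is not None:
        env["JCLUST"] = jclust  # location of the hierarchical binary
    return env


# Hypothetical paths, for illustration only.
example_env = cluster_environment(kclust="/opt/mmtsb/bin/kclust",
                                  jclust="/opt/mmtsb/bin/jclust")
```

The resulting dictionary can be handed to subprocess.run through its env argument, which leaves the environment of the calling process unchanged.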

Options

-help 
usage information
-jclust 
invoke hierarchical clustering algorithm
-maxnum value 
maximum number of clusters for hierarchical clustering
-minsize value 
minimum size of cluster to perform second level of hierarchical subclustering
-maxlevel value 
maximum number of clustering levels for hierarchical clustering
-kclust 
invoke K-means clustering algorithm
-radius value 
K-means clustering radius
-[no]iterate 
go (not) through multiple rounds of K-means clustering
-mode rmsd|contact|phi|psi|phipsi|mix 
cluster according to RMSD, contact map, dihedrals, or a combination of contact map and RMSD (mix)
-contmaxdist value 
maximum value for contacts to be considered
-mixfactor value 
weight of contact map vs. RMSD in mixed mode
-pdb 
process files as PDB structures
-sicho 
process files as SICHO models
-selmode ca|cb|cab|heavy|all 
specify which atoms are used for the calculation of RMSD values
-l min:max[=...] 
limit RMSD calculation to specified residue range
-fitxl 
perform least-squares superposition for all residues except residue range given with -l
-[no]lsqfit 
superimpose structures before calculating RMSD values
-centroid 
write out centroid structures
-centout template 
provide template for centroid file names
-log file 
generate log file

Examples

cluster.pl 1vii.sample.*.pdb
performs hierarchical clustering of the input files 1vii.sample.*.pdb based on mutual RMSD.

# cluster file
# automatically generated on: Tue Sep 25 14:08:51 2001
# mode: rmsd, filetype: pdb, lsqfit: 1, selmode: cab
@cluster t has 16 elements, 3 subclusters
1 1vii.sample.1.pdb
2 1vii.sample.10.pdb
3 1vii.sample.11.pdb
4 1vii.sample.12.pdb
5 1vii.sample.13.pdb
6 1vii.sample.14.pdb

...


cluster.pl -maxnum 3 -mode contact -contmaxdist 8.0 1vii.sample.*.pdb
performs hierarchical clustering of the input files 1vii.sample.*.pdb based on differences in the residue contact maps, allowing a maximum of 3 clusters. Residue separations of more than 8 Å are ignored in the comparison.

# cluster file
# automatically generated on: Tue Sep 25 14:09:23 2001
# mode: contact, filetype: pdb, lsqfit: 1, selmode: cab
@cluster t has 16 elements, 2 subclusters
1 1vii.sample.1.pdb
2 1vii.sample.10.pdb
3 1vii.sample.11.pdb
4 1vii.sample.12.pdb
5 1vii.sample.13.pdb
6 1vii.sample.14.pdb

...


cluster.pl -maxnum 3 -minsize 5 -l 10:21 -fitxl -log cluster.log 1vii.sample.*.pdb
performs hierarchical clustering of the input files 1vii.sample.*.pdb based on mutual RMSD for residues 10 through 21. Before RMSD values are calculated, each pair of structures being compared is superimposed with a least-squares fit on the rest of the protein. At each clustering level a maximum of three clusters is selected, and subclusters are identified for clusters containing 5 or more structures. A log file cluster.log is produced.

# cluster file
# automatically generated on: Tue Sep 25 14:10:13 2001
# mode: rmsd, filetype: pdb, lsqfit: 1, selmode: cab
@cluster t has 16 elements, 2 subclusters
1 1vii.sample.1.pdb
2 1vii.sample.10.pdb
3 1vii.sample.11.pdb
4 1vii.sample.12.pdb
5 1vii.sample.13.pdb
6 1vii.sample.14.pdb

...


cluster.pl -maxnum 3 -minsize 10 -centroid -centout sample 1vii.sample.*.pdb
performs hierarchical clustering of the input files 1vii.sample.*.pdb based on mutual RMSD. At each clustering level a maximum of three clusters is selected, and subclusters are identified for clusters containing 10 or more structures. In addition, the centroids of each cluster are written out in PDB format to files beginning with sample.

# cluster file
# automatically generated on: Mon Jun 18 10:09:08 2001
# mode: rmsd, filetype: pdb, lsqfit: 1, selmode: cab
@cluster t has 16 elements, 3 subclusters
1 1vii.sample.1.pdb
2 1vii.sample.10.pdb
3 1vii.sample.11.pdb
4 1vii.sample.12.pdb
5 1vii.sample.13.pdb
6 1vii.sample.14.pdb

...


cluster.pl -kclust -radius 3 1vii.sample.*.pdb
performs fixed radius clustering of the input files 1vii.sample.*.pdb based on mutual RMSD. The radius is set to 3 Å.

# cluster file
# automatically generated on: Tue Sep 25 14:10:58 2001
# mode: rmsd, filetype: pdb, lsqfit: 1, selmode: cab
@cluster t has 16 elements, 8 subclusters
1 1vii.sample.1.pdb
2 1vii.sample.10.pdb
3 1vii.sample.11.pdb
4 1vii.sample.12.pdb
5 1vii.sample.13.pdb
6 1vii.sample.14.pdb

...


cluster.pl -kclust -mode phi -radius 30 1vii.sample.*.pdb
performs fixed radius clustering of the input files 1vii.sample.*.pdb based on phi dihedral RMSD. The radius is set to 30 degrees.

# cluster file
# automatically generated on: Tue Sep 25 14:12:19 2001
# mode: phi, filetype: pdb, lsqfit: 1, selmode: cab
@cluster t has 16 elements, 7 subclusters
1 1vii.sample.1.pdb
2 1vii.sample.10.pdb
3 1vii.sample.11.pdb
4 1vii.sample.12.pdb
5 1vii.sample.13.pdb
6 1vii.sample.14.pdb

...
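
The output shown in these examples can be read back with a few lines of Python. The sketch below handles only the line shapes visible in the truncated listings above (comment lines starting with #, @cluster header lines, and numbered member lines); anything else, including whatever the elided parts contain, is silently skipped. The function name and the returned layout are hypothetical.

```python
import re
from typing import Dict, List

HEADER = re.compile(
    r"^@cluster\s+(\S+)\s+has\s+(\d+)\s+elements,\s+(\d+)\s+subclusters")
MEMBER = re.compile(r"^(\d+)\s+(\S+)\s*$")


def parse_cluster_output(text: str) -> List[Dict]:
    """Collect @cluster headers and the member lines that follow each one."""
    clusters: List[Dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        header = HEADER.match(line)
        if header:
            clusters.append({"name": header.group(1),
                             "elements": int(header.group(2)),
                             "subclusters": int(header.group(3)),
                             "members": []})
            continue
        member = MEMBER.match(line)
        if member and clusters:
            # (index, file name) as printed under the current header
            clusters[-1]["members"].append(
                (int(member.group(1)), member.group(2)))
    return clusters
```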