SUPPLEMENTARY DATA

 

Building Multiclass Classifiers for Remote Homology Detection and Fold Recognition

By Huzefa Rangwala (rangwala AT cs . umn dot edu) & George Karypis (karypis@ cs. umn dot edu)

Computer Science Department at the University of Minnesota-Twin Cities


DATASETS

Four datasets were used in the paper:
  sf95 and sf40 are set up for remote homology detection (RH), whereas fd25 and fd40 are set up for the fold recognition problem (FD).

sf95 was taken from a similar study (Multi-class protein fold recognition using adaptive codes by Eugene Ie et al.) and can be obtained by clicking this link.

Note that sf95 uses SCOP version 1.65 with ASTRAL filtering at 95%.

sf40, fd25 and fd40 were generated by us from version 1.67 of the SCOP database with ASTRAL filtering at 40%, 25% and 40%, respectively.

Below we provide the sequences that were downloaded from the ASTRAL website at these specific percent identities, the sequence identifiers that formed the overall training, cross-validation and test sets, and the final untouched test sets used for evaluation for each dataset. Using this information, one can repeat our experiments for comparative studies.

Dataset  Sequences in fasta format  Sequence Identifiers for dataset*  Test Set Identifiers         Class Definitions
sf40     SCOP v 1.67 (Astral 40%)   Identifiers (1119 sequences)       Identifiers (238 sequences)  RH Identifiers (37 superfamilies)
fd25     SCOP v 1.67 (Astral 25%)   Identifiers (1294 sequences)       Identifiers (278 sequences)  FD Identifiers (25 folds)
fd40     SCOP v 1.67 (Astral 40%)   Identifiers (1651 sequences)       Identifiers (344 sequences)  FD Identifiers (27 folds)
* The sequence identifiers contain the identifiers for the test sequences as well.
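
Because the full identifier lists also contain the test identifiers, the training and cross-validation pool is the full list minus the test list. The following Python sketch shows one way to rebuild both sets from a FASTA file and the two identifier lists. It is only an illustration and not the code used in the paper: the function names, the toy records and the assumption that each identifier is the first whitespace-delimited token of the FASTA header line are ours.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping identifier -> sequence.

    Assumption: the identifier is the first whitespace-delimited token
    after '>' on each header line.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {ident: "".join(parts) for ident, parts in records.items()}


def split_dataset(records, all_ids, test_ids):
    """Return (train_pool, test_set) as dicts of identifier -> sequence.

    all_ids includes the test identifiers, so the training and
    cross-validation pool is all_ids minus test_ids.
    """
    missing = [i for i in all_ids if i not in records]
    if missing:
        raise KeyError(f"identifiers not found in FASTA file: {missing}")
    test_lookup = set(test_ids)
    train_pool = {i: records[i] for i in all_ids if i not in test_lookup}
    test_set = {i: records[i] for i in all_ids if i in test_lookup}
    return train_pool, test_set


# Toy data (made up for illustration; not taken from the real datasets).
toy_fasta = """>seqA a.1.1.1 (A:) toy record
MKV
LLA
>seqB b.1.1.1 (A:) toy record
GGS
>seqC c.1.1.1 (A:) toy record
PQR
""".splitlines()

toy_records = read_fasta(toy_fasta)
train_pool, test_set = split_dataset(
    toy_records, all_ids=["seqA", "seqB", "seqC"], test_ids=["seqB"]
)
print(sorted(train_pool), sorted(test_set))
```

With the real files, the lines of the downloaded FASTA file and the two identifier lists would take the place of the toy inputs.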


CODES

The programs used in this study were:

  1. Support vector machines (SVM Light) for building the base classifiers.
  2. SVM-Struct for second-level learning.
  3. Ranking perceptron algorithm (our own implementation). Please contact Huzefa to get this code; a similar MATLAB implementation (by Eugene Ie et al.) can be obtained by clicking this link.
  4. Direct K-way classifier using the Crammer-Singer approach.
Please feel free to contact us to get the various extensions for SVM-Struct, the direct K-way classifier and the ranking perceptron that were used in this study.
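
For readers who want a rough idea of what a ranking perceptron does before requesting the code, here is a generic margin-based multiclass perceptron in Python. It is a textbook-style sketch, not the implementation used in this study. The function names, the toy data and the choice of update rule are our assumptions, and each input vector merely stands in for the outputs of K base classifiers on one sequence.

```python
def class_scores(W, x):
    """Score of every class for input x: one dot product per weight vector."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]


def train_ranking_perceptron(X, y, n_classes, epochs=20, margin=0.0, lr=1.0):
    """Learn one weight vector per class so the true class ranks first.

    Whenever the true class does not beat the best-scoring wrong class by
    more than `margin`, move the true class's weights towards x and the
    rival's weights away from x.
    """
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        updates = 0
        for x, label in zip(X, y):
            scores = class_scores(W, x)
            rival = max(
                (k for k in range(n_classes) if k != label),
                key=lambda k: scores[k],
            )
            if scores[label] - scores[rival] <= margin:
                for j, x_j in enumerate(x):
                    W[label][j] += lr * x_j
                    W[rival][j] -= lr * x_j
                updates += 1
        if updates == 0:  # every example is ranked correctly
            break
    return W


def predict(W, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(W, x)
    return max(range(len(W)), key=lambda k: scores[k])


# Toy inputs (made up): each row imitates the outputs of three
# one-vs-rest base classifiers for one sequence.
toy_outputs = [
    [1.0, -0.5, -0.8],
    [-0.7, 0.9, -0.2],
    [-0.6, -0.4, 1.2],
    [0.8, -0.9, -0.3],
    [-0.2, 1.1, -0.7],
    [-0.9, -0.1, 0.7],
]
toy_labels = [0, 1, 2, 0, 1, 2]

weights = train_ranking_perceptron(toy_outputs, toy_labels, n_classes=3)
predictions = [predict(weights, x) for x in toy_outputs]
print(predictions)
```

A positive `margin` asks the true class to win by a gap rather than merely rank first; which variant matches the experiments in the paper is not stated here.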

Corresponding Author: Huzefa Rangwala
Last Updated on August 18, 2006