README

To download a dataset, click on the file link for the required dataset: rch.tgz, dpx.tgz, etc.

NOTE: Each file is around 180M in size and may take a long time to download.

Caution: the RCHr datasets are here named RCH, the RCH datasets are here named RCHA, and the RD datasets are here named DPX.

Once downloaded, unpack the file, e.g.:

  > tar -xzvf rch.tgz

or

  > gunzip rch.tgz; tar -xvf rch.tar

This will create 6 folders (directories) in the working folder for that particular experiment. For example:

  RCHA1 contains the data used in the RCH experiment where type 1 input attributes were employed (just a local window of residues).

  SA6 contains the data used in the SA experiment where type 6 input attributes were employed (a local window of residues plus predicted secondary structure data and the predicted average value of the feature for each protein studied).

Within each folder there are three subfolders, Q2.uf, Q3.uf and Q5.uf, containing datasets with the target variable discretized into 2, 3 and 5 classes respectively (using uniform frequency). Each of these subfolders contains the 10 pairs of training and test sets used in the 10-fold cross-validation experiments:

  trainFold00.w4, testFold00.w4 ...

where w4 indicates use of a window of 4 residues either side of the target. The training sets contain 950 chains each and the test sets 100 chains each (as described in the paper). In all of these files, the instances extracted from the protein chains are concatenated.

These files are in the WEKA Machine Learning Package ARFF file format.
http://www.cs.waikato.ac.nz/~ml/weka/arff.html

For example, file rch1/Q2.uf/testFold00.w4 begins:

  @relation rch1.q2.uf.w4.testFold00
  @Attribute att_0 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
  @Attribute att_5 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_6 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_7 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute att_8 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
  @Attribute class {0,1}
  @data
  x,x,x,x,A,E,I,K,H,0
  x,x,x,A,E,I,K,H,Y,0
  x,x,A,E,I,K,H,Y,Q,0
  x,A,E,I,K,H,Y,Q,F,0
  A,E,I,K,H,Y,Q,F,N,0
  E,I,K,H,Y,Q,F,N,V,0
  I,K,H,Y,Q,F,N,V,V,0
  K,H,Y,Q,F,N,...
  ....

The first line (@relation) describes the dataset. Subsequent lines (@attribute) describe each of the attributes of the data and the values that each attribute may take. In the data above there are 9 input attributes (plus the class). att_4 is the target residue; the other attributes are the four residues either side of the target (here x is the "end-of-chain" value, which the flanking residues may take).

More complex input types (datasets rchA2-6 etc.) may include integer and real valued attributes as follows:

  Input Data Type 1 -- att_0 .. att_8  => Residues in Window -4..0..+4 (9 attributes)
  Input Data Type 2 -- att_0 .. att_10 => Predicted SS, PredSS Confidence, Residues in Window -4..0..+4 (9 attributes)
  Input Data Type 3 -- att_0 .. att_29 => Chain Length, Prop Ala, Prop Cys, .... Prop Tyr (20 attributes), Residues in Window -4..0..+4 (9 attributes)
  Input Data Type 4 -- att_0 .. att_31 => Predicted SS, PredSS Confidence, Chain Length, Prop Ala, Prop Cys, .... Prop Tyr (20 attributes), Residues in Window -4..0..+4 (9 attributes)
  Input Data Type 5 -- att_0 .. att_9  => Predicted Average Value for Feature, Residues in Window -4..0..+4 (9 attributes)
  Input Data Type 6 -- att_0 .. att_11 => Predicted SS, PredSS Confidence, Predicted Average Value for Feature, Residues in Window -4..0..+4 (9 attributes)

The prediction target variable (class) in this case can have values 0 or 1 (2-state prediction -- Q2).

The actual data itself begins after the line "@data" -- here we see the end-of-chain values "xxxx" flanking the N-terminus. In subsequent instances (lines) the window is moved along the chain. In the data shown above, the N-terminal target residues happen to have class 0 (low hull numbers -- i.e. toward the surface).
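
If you want to inspect a fold file without WEKA, the layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the tooling used for the experiments: the function name parse_arff is hypothetical, and it assumes the simple layout shown in the excerpt (one @attribute per line, comma-separated dense data rows, no quoted values). Numeric attribute types, as found in input types 2-6, are kept as the raw type string.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows).

    Hypothetical helper: handles only the plain layout shown in this README.
    """
    relation = None
    attributes = []  # list of (name, spec); spec is a list for nominal types
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{") and spec.endswith("}"):
                spec = [v.strip() for v in spec[1:-1].split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Rebuild the first lines of the type 1 example shown in this README.
RESIDUES = "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y"
header = ["@relation rch1.q2.uf.w4.testFold00"]
for i in range(9):
    # att_4 (the target residue) has no end-of-chain value x
    values = RESIDUES if i == 4 else RESIDUES + ",x"
    header.append("@Attribute att_%d {%s}" % (i, values))
header += ["@Attribute class {0,1}", "@data"]
SAMPLE = "\n".join(header + [
    "x,x,x,x,A,E,I,K,H,0",
    "x,x,x,A,E,I,K,H,Y,0",
    "x,x,A,E,I,K,H,Y,Q,0",
])

relation, attributes, rows = parse_arff(SAMPLE)
window = rows[0][:-1]  # att_0 .. att_8
target = window[4]     # att_4 is the target residue for type 1 data
label = rows[0][-1]    # class value
print(relation, len(attributes), target, label)
```

To read a real file, pass its contents instead of SAMPLE, e.g. parse_arff(open("rch1/Q2.uf/testFold00.w4").read()). For input types 2-6 the residue window sits at the end of the attribute list, so the position of the target residue differs from the type 1 case shown here.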