README

To download the dataset, click on the file link for the required dataset:

    SoftcomputingPaperDatasets.tgz

NOTE: This file is around 1 GB in size and may take a long time to download.

Caution: The CN datasets are named dc10 here, after the distance cutoff value of 10 Angstroms.

Once downloaded, unpack the file, e.g.:

    > tar -xzvf SoftcomputingPaperDatasets.tgz

or

    > gunzip SoftcomputingPaperDatasets.tgz; tar -xvf SoftcomputingPaperDatasets.tar

This will create 6 folders (directories) for each property studied:

    dc10 = Coordination Number (CN)
    dt   = Delaunay tessellation (DT)
    gg   = Gabriel graph
    rng  = relative neighbourhood graph
    mst  = minimum spanning tree

These directories are labelled with the property name and a suffix 1 .. 6 giving the input attribute type. For example:

    dc101 contains the data used in the dc10 (CN) experiment where type 1 input attributes were employed (just a local window of residues).
    dt6 contains the data used in the DT experiment where type 6 input attributes were employed (a local window of residues plus predicted secondary structure data and the predicted average value of the feature for each protein studied).

Within each folder there are three subfolders: Q2.uf, Q3.uf and Q5.uf. These contain datasets with the target variable discretized into 2, 3 and 5 classes respectively (using uniform frequency).

Each of these subfolders contains the 10 pairs of training and test sets used in the 10 fold cross validation experiments:

    trainFold00.w4, testFold00.w4 ...

where w4 indicates the use of a window of 4 residues either side of the target. The training sets contain 950 chains each and the test sets 100 chains each (as described in the paper). In all of these files, the instances extracted from the protein chains are concatenated.

These files are in the WEKA Machine Learning Package ARFF file format:

    http://www.cs.waikato.ac.nz/~ml/weka/arff.html

For example, the file dt1/Q2.uf/testFold00.w4 begins:

    @relation dt1.q2.uf.w4.testFold00
    @Attribute att_0 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    @Attribute att_5 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_6 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_7 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute att_8 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,x}
    @Attribute class {0,1}
    @data
    x,x,x,x,A,E,I,K,H,0
    x,x,x,A,E,I,K,H,Y,0
    x,x,A,E,I,K,H,Y,Q,0
    x,A,E,I,K,H,Y,Q,F,0
    A,E,I,K,H,Y,Q,F,N,0
    E,I,K,H,Y,Q,F,N,V,1
    I,K,H,Y,Q,F,N,V,V,1
    K,H,Y,Q,F,N....
    ....

The first line (@relation) describes the dataset. Subsequent lines (@attribute) describe each attribute of the data and the values that it may take. In the data above there are 9 input attributes: att_4 is the target residue and the other attributes are the four residues on each side of the target. Here x is the "end-of-chain" value, which only the flanking residues may take; the target attribute att_4 does not list it.

More complex input types (datasets dt2 .. dt6 etc.) may include integer and real valued attributes as follows:

    Input Data Type 1 -- att_0 .. att_8  => Residues in Window -4..0..+4 (9 attributes)
    Input Data Type 2 -- att_0 .. att_10 => Predicted SS, PredSS Confidence, Residues in Window -4..0..+4 (9 attributes)
    Input Data Type 3 -- att_0 .. att_29 => Chain Length, Prop Ala, Prop Cys, .... Prop Tyr (20 attributes), Residues in Window -4..0..+4 (9 attributes)
    Input Data Type 4 -- att_0 .. att_31 => Predicted SS, PredSS Confidence, Chain Length, Prop Ala, Prop Cys, .... Prop Tyr (20 attributes), Residues in Window -4..0..+4 (9 attributes)
    Input Data Type 5 -- att_0 .. att_9  => Predicted Average Value for Feature, Residues in Window -4..0..+4 (9 attributes)
    Input Data Type 6 -- att_0 .. att_11 => Predicted SS, PredSS Confidence, Predicted Average Value for Feature, Residues in Window -4..0..+4 (9 attributes)

The prediction target variable (class) in this case can have the values 0 or 1 (2 state prediction, Q2). NB: these are the classes (e.g. above or below average) for the property under consideration, NOT the actual values of the property for the target residue.

The actual data begins after the line "@data". In the first instance we see the end-of-chain values "xxxx" flanking the N-terminus. In subsequent instances (lines) the window is moved along the chain. In the data shown above, the N-terminal target residues happen to have class 0 (low Delaunay tessellation numbers, i.e. toward the surface).
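
The layout and file format described above can be explored with a few lines of Python. The following is only a minimal sketch, not part of the original distribution: the function names (dataset_path, read_arff, split_type1_row) are hypothetical, the fold numbering 00 .. 09 is an assumption based on the "trainFold00.w4, testFold00.w4 ..." naming, and the reader handles only the small subset of ARFF shown in this README (it keeps every value as a string). The sample text is rebuilt from the seven complete data rows quoted above.

```python
import io

RESIDUES = "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y"


def dataset_path(prop, input_type, n_classes, fold, split):
    """Build a relative path such as dt1/Q2.uf/testFold00.w4.

    Assumption: folds are numbered 00 .. 09 with two digits.
    """
    return "%s%d/Q%d.uf/%sFold%02d.w4" % (prop, input_type, n_classes, split, fold)


def read_arff(handle):
    """Read @relation, @attribute and @data from a simple ARFF file.

    Returns (relation, attributes, rows). Each attribute is a pair
    (name, allowed_values), where allowed_values is a list for nominal
    attributes and None otherwise (e.g. numeric attributes).
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        if in_data:
            rows.append(line.split(","))
            continue
        keyword = line.split(None, 1)[0].lower()  # ARFF keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            values = None
            if spec.startswith("{"):
                values = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, values))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


def split_type1_row(row):
    """Split a type 1 instance into (left flank, target, right flank, class)."""
    return row[0:4], row[4], row[5:9], row[9]


def build_sample():
    """Rebuild the start of dt1/Q2.uf/testFold00.w4 as quoted in this README."""
    lines = ["@relation dt1.q2.uf.w4.testFold00"]
    for i in range(9):
        # Only the target residue (att_4) lacks the end-of-chain value x.
        values = RESIDUES if i == 4 else RESIDUES + ",x"
        lines.append("@Attribute att_%d {" % i + values + "}")
    lines += [
        "@Attribute class {0,1}",
        "@data",
        "x,x,x,x,A,E,I,K,H,0",
        "x,x,x,A,E,I,K,H,Y,0",
        "x,x,A,E,I,K,H,Y,Q,0",
        "x,A,E,I,K,H,Y,Q,F,0",
        "A,E,I,K,H,Y,Q,F,N,0",
        "E,I,K,H,Y,Q,F,N,V,1",
        "I,K,H,Y,Q,F,N,V,V,1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(dataset_path("dt", 1, 2, 0, "test"))
    relation, attributes, rows = read_arff(io.StringIO(build_sample()))
    print(relation, len(attributes), "attributes,", len(rows), "instances")
    for row in rows:
        left, target, right, label = split_type1_row(row)
        print("".join(left), target, "".join(right), "class", label)
```

To read a real file, pass an open file handle instead of the in-memory sample, e.g. read_arff(open(dataset_path("dt", 1, 2, 0, "test"))). Because the instances from all chains in a file are concatenated, reading the target column from top to bottom gives the residues of one chain after another; a new chain starts wherever the four left-flank values are all x. For input types 2 to 6 the residue window is preceded by the additional attributes listed above, so the fixed offsets in split_type1_row apply to type 1 data only.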