Open Source Codes


Kaggle Data Science Bowl 2017 - Lung Cancer Detection Challenge

For more details, please refer to our solution document.

Solution Document

Source Code (Please contact me for the codes)

Differential Privacy

You can run DP_init to generate the type of each attribute (i.e., binary or continuous).
The command is DP_init [data file] [output of type] (see init.bat)
Before running this program, the data should be scaled to [-1,1].
Attributes that take only the values -1 or 1 are considered "binary", while all others are considered "continuous".
You can modify the result if the automatic detection is not correct. (see type.txt)
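
The scaling and type-detection steps described above can be sketched as follows. This is only an illustration, not the DP_init source: the function names are hypothetical, min-max scaling is one assumed way to reach [-1,1], and the actual layout of type.txt may differ from the plain list of labels shown here.

```python
def scale_to_unit_range(rows):
    """Min-max scale every column of `rows` to [-1, 1] (assumed scaling method)."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi == lo:
            # Constant column: map to 0.0 to avoid dividing by zero (arbitrary choice).
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([2.0 * (v - lo) / (hi - lo) - 1.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def detect_types(rows):
    """Label a column "binary" if all its values are -1 or 1, else "continuous"."""
    types = []
    for col in zip(*rows):
        is_binary = all(v in (-1.0, 1.0) for v in col)
        types.append("binary" if is_binary else "continuous")
    return types


# Three records with three attributes each (made-up values).
raw = [
    [0, 5.0, 10.0],
    [1, 7.5, 30.0],
    [1, 10.0, 20.0],
]
scaled = scale_to_unit_range(raw)
print(scaled)
print(detect_types(scaled))
```

Note that a continuous attribute whose sampled values happen to fall only on -1 and 1 would be labelled "binary" by this rule, which is one reason a manual correction step is useful.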

For attributes of type "binary", the synthetic database contains only two values, -1 or 1.
For attributes of type "other", the values in the synthetic database lie between -1 and 1.

Then you can run DP_LP to generate the synthetic database.
The command is DP_LP [config file name] (see train.bat)

The config file should contain the parameters of this program (see config.txt). "config.txt" contains our default parameters.

The program will write the synthetic database to the output file named in the config file.

Format of input database: each line contains one record, with values separated by commas. (see WDBC_Scaled.csv)

The meaning of each parameter:
R_size 64: the number of basis functions
C_size 10000: the number of points uniformly sampled from the ellipsoid obtained by PSI
N 100: the number of discretized points in each dimension
eps 0.5: the privacy parameter
m 10000: the size of synthetic database
para_t 3: the upper bound of the degree of basis functions
sigma 4: the variance of the Gaussian kernel function
n_sample_D 1000: the number of test points
n_sample_Q 10000: the number of test queries, the worst-case error is over these 10000 queries
csv_filename WDBC_scaled.csv: the scaled database(input)
output_filename output.txt: the output synthetic database(output)
type_filename type.txt: the file indicating the type of each attribute
k 10: the number of the top eigenvectors and eigenvalues
n_PSI 5: the number of iterations
eps_PSI 1: privacy parameter for the Private PCA
delta_PSI 0.00001: privacy parameter for the Private PCA
ellipsR 3: the ratio of the square root of the eigenvalues to the radius of the ellipsoid along each eigenvector
pMean_eps 1: privacy parameter for obtaining the private mean of database
isPm1Shrinked 1: restrict the sampling points in [-1,1]^d
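
For scripting around DP_LP, the parameters can be loaded into a dictionary. The sketch below assumes that each line of the config file has the form "name value", as the list above suggests; the real layout of config.txt is not specified here, so check it against the shipped file. The function name parse_config is hypothetical.

```python
def parse_config(text):
    """Parse lines of the assumed form "name value" into a dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, value = parts[0], parts[1]
        # Try integer first, then float; fall back to a string (e.g. file names).
        for cast in (int, float):
            try:
                params[name] = cast(value)
                break
            except ValueError:
                continue
        else:
            params[name] = value
    return params


# A few of the default values listed above, in the assumed layout.
sample = """R_size 64
eps 0.5
delta_PSI 0.00001
csv_filename WDBC_scaled.csv
"""
config = parse_config(sample)
print(config)
```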

Source Code

Executable File(C++)

IMED

Task
Given the size of images, create the metric matrix G.
Calculate the distances between every two images.
Calculate A*G^(1/2) (A is the image matrix, defined below).

Input: r,c //number of rows and columns of the image
m //number of images
images shown as row vectors
Output: (according to your request)
the distances between every two images
or A*G^(1/2)

Input Format
r,c // size of image
m //number of images
A(m*n) // m images, each with n = r*c pixels
the ith line gives the pixels of the ith image

Output Format
Case : your request
0 the distances between every two images.
Format:
the ith line contains d(i,i+1),d(i,i+2)...,d(i,m-1) (i=0...m-1)
1 A*G^(1/2)
Format:
H(m*n) // the m*n elements of the matrix H = A*G^(1/2)
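
The distance computation for request type 0 can be sketched as below. This is an illustration, not the released source: it assumes the Gaussian form of the metric matrix commonly used for IMED, g_ij = 1/(2*pi*sigma^2) * exp(-|P_i - P_j|^2 / (2*sigma^2)) with sigma = 1, and row-major pixel ordering; the executables may use different settings. All function names are hypothetical.

```python
import math


def metric_matrix(r, c, sigma=1.0):
    """Build the (r*c) x (r*c) matrix G from Gaussian weights on pixel distances."""
    coords = [(i, j) for i in range(r) for j in range(c)]  # row-major pixel order
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [
        [norm * math.exp(-((ar - br) ** 2 + (ac - bc) ** 2) / (2.0 * sigma ** 2))
         for (br, bc) in coords]
        for (ar, ac) in coords
    ]


def imed(x, y, G):
    """Distance sqrt((x - y)^T G (x - y)) between two flattened images."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    total = sum(diff[i] * G[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding


def pairwise_rows(A, G):
    """Row i holds d(i,i+1), ..., d(i,m-1), following the type 0 output layout."""
    m = len(A)
    return [[imed(A[i], A[j], G) for j in range(i + 1, m)] for i in range(m)]


# Three 1x3 images, each with a single bright pixel at a different position.
G = metric_matrix(1, 3)
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
for row in pairwise_rows(A, G):
    print(",".join(f"{d:.4f}" for d in row))
```

In this example the first image is closer to the second than to the third, because the bright pixel has moved by one position instead of two. The plain Euclidean distance would treat both pairs as equally far apart.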

Arguments of the main program

Input file name
Output file name
Type of your request
0 to calculate the distances between every two images.
1 to calculate A*G^(1/2)

Source Code

Executable file(C#)

Executable file(C++)

DBoost

Task

Train a (multi-class) voting classifier using DBoost based on dissimilarity descriptions.
The weak learners are dissimilarity-based decision stumps.

Input: dissimilarity matrix, encoded labels of the examples
Output: the classifier

Input Format

Format: n // number of examples
m // codelength (We use one-vs-all, so codelength is the number of classes)
l(n*m) // labels of the examples. If label is "i", the ith element of the label vector is 1, and -1 (or 0) for all the others
A(n*n) // dissimilarity matrix, only upper triangle is saved
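
The two less obvious parts of this input, the one-vs-all label codewords and the upper-triangle storage of the dissimilarity matrix, can be sketched as follows. The helper names are hypothetical, class labels are assumed to be 0-based, and whether the diagonal is stored in the file is not stated here, so it is left as a parameter.

```python
def encode_one_vs_all(label, num_classes, negative=-1):
    """Codeword with 1 at position `label` and `negative` (-1 or 0) elsewhere."""
    return [1 if k == label else negative for k in range(num_classes)]


def pack_upper_triangle(D, include_diagonal=False):
    """Flatten the upper triangle of a square matrix row by row."""
    n = len(D)
    start = 0 if include_diagonal else 1
    return [D[i][j] for i in range(n) for j in range(i + start, n)]


def unpack_upper_triangle(values, n, include_diagonal=False):
    """Rebuild the full symmetric n x n matrix from a packed upper triangle."""
    D = [[0.0] * n for _ in range(n)]
    it = iter(values)
    start = 0 if include_diagonal else 1
    for i in range(n):
        for j in range(i + start, n):
            D[i][j] = D[j][i] = next(it)
    return D


# Three examples from three classes, with a made-up symmetric dissimilarity matrix.
labels = [encode_one_vs_all(y, 3) for y in (0, 2, 1)]
D = [[0.0, 1.5, 2.0],
     [1.5, 0.0, 0.5],
     [2.0, 0.5, 0.0]]
packed = pack_upper_triangle(D)
print(labels)
print(packed)
```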
Output Format

Format:
Classifier.data
Format: T // number of base classifiers
Thresh(T)// threshold for the decision stump
(i(T))
(j(T)) // example pair (determine the pseudo hyperplane)
(S1(T), S2(T)) // the label subsets of the two nodes
Arguments of main program
Input FileName
Output FileName
Number of Rounds

Source Code

Executable file(C#)

Executable file(C++)

DBoost Predict

Task

Multiclass prediction using DBoost for dissimilarity-based representations

Input: dissimilarity matrix, labels of the test examples
parameters of the classifier
Output: prediction result of every test example and the error rate

Input Format
Format:
Classifier.data
Format: T // number of base classifiers
Thresh(T)// threshold for the decision stump
(i(T))
(j(T)) // example pair (determine the pseudo hyperplane)
(S1(T), S2(T)) // the label subsets of the two nodes

test.data
Format: n // number of test examples
m // number of train examples
l // code length (number of classes, we use one-vs-all here)
C(n,l) // codeword of the test examples. If the label is "i", then the ith element is 1, and 0 for all the other elements
A(n*m) // dissimilarity matrix,
// for n test and m training examples
// the training examples must be in the
// order identical to the training process
Output:
Format:
P(i) // the predicted result of every test example
error //the prediction error rate
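
The reported error rate can be checked independently from the output. The sketch below assumes that each P(i) is a 0-based class index and that each codeword in C contains exactly one 1; both are assumptions about the file contents, and the function names are hypothetical.

```python
def decode_codeword(codeword):
    """Return the index of the single 1 in a one-vs-all codeword."""
    return codeword.index(1)


def error_rate(predictions, codewords):
    """Fraction of test examples whose predicted class differs from the true class."""
    truth = [decode_codeword(c) for c in codewords]
    wrong = sum(1 for p, t in zip(predictions, truth) if p != t)
    return wrong / len(truth)


# Four test examples over three classes (made-up values); the third is misclassified.
C = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
P = [0, 1, 1, 1]
print(error_rate(P, C))
```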
Arguments of main program
Input FileName
Classifier FileName
Output FileName
Number of base classifiers

Source Code

Executable file(C#)

Executable file(C++)

Notes

To run the C# executable files, you need to install the Microsoft .NET Framework 3.5 on Windows XP/2000.

You may download it from http://down.tech.sina.com.cn/content/35600.html.

If you have any problems with these executable files, please e-mail us.