(1) Given a distance measure between two vectors X and Y,
D(X,Y) = 1⁄2 { (var(X) + var(Y))2 - √[ (var(X) + var(Y))2 - 4 var(X) var(Y) (1-ρ(X,Y)2 ] }
ρ(x,y) = cov(X,Y)/√(var(X)var(Y)).
var(X) = (1/n) (∑(X-meanX)2), where n is the dimensionality of vector X.
cov(X,Y) = (1/n) (∑(X-meanX)(Y-meanY)), where both X and Y have the same dimensionality, n.
(2) Given a number of features (n), there exists a feature set O = {Fi, i= 1, 2... n}, which may be reducible. For the sake of this test, the features are represented by ‘feature vectors’ (e.g., X=<FV1, FV2, ... FVn>), where the dimensions of each and every vector is equal to the number of randomly-picked instances of a dataset used to construct the feature vectors (say 10% of the available instances).
(3) Implement the following cluster-density-reducing unsupervised feature selection algorithm (or CDR)
For every feature vector, compute the D distance between the vector and all other vectors; Find the smallest distance (>0) between any two vectors: call that D_min_global;
Let e = k x D_min_global (where k is a real-valued constant =>1 entered by the user);
Repeat {
For all vectors whose minimum D distance to another vector =~ D_min_global Do
{
Find a nearest vector (this will give you a pair of vectors, Va and Vb); Discard one of the pair of vectors (Va, Vb); // function description below
}
Re-compute D_min_global; // as the elimination of features will alter it }
Until (D_min_global > e)
Output retained features; // just a list of feature indexes, such as FV1, FV3, FV14.
Discard one of the pair of vectors (Va, Vb) { Create a vector_cluster = {Va,Vb};
Let k=2; // meaning 2nd nearest neighbour
10: If there are no more nearest neighbours Then { Randomly discard Va or Vb; Flip switch; Exit function };
}
Find the k-nearest_neighbour of Va and of Vb; // they may be two distinct vectors or the same one Let vector_cluster = vector_cluster + k-nearest_neighbour of Va and of Vb;
Calculate D_average of Va and of Vb to the other vectors in the vector_cluster;
If D_average of (Va>Vb) OR (Va<Vb) Do
{
If (switch = true) Then discard Va or Vb with lowest D_average Else discard the other; Switch = NOT(switch); // flip the Switch, where switch is a global Boolean variable
}
Else { k++; go to 10 } // in order to increase the size of the vector_cluster
(4) To solve the test, you will need to:
- Implementing & debugging the distance measure (described above)
- Possibly, adapt a nearest neighbor classifier to the needs of applying CDR to one of the data set below.
Description: [login to view URL]
Example code in Java: [login to view URL]
- Design & Implement a function to read from one of the data sets at:
[login to view URL] (e.g., REALDISP, Image Segmentation and Robot Execution
Failures datasets).
- Incrementally Implement & Debug the algorithm (CDR): first to compute D, then the Discard function and
finally, to implement the whole algorithm.
- Provide a theoretical assessment of the time complexity of CDR as a function of n (the initial number of
features); test your prediction empirically by running your program on sub-sets with different n values.
- Present the program & output file and a written report on the results- especially, theoretical & empirical
time scalability results.
Marks are out of 100% and they are assigned as follows: 35% accurate & efficient Java or C++ program implementation of the CDR algorithm (USB key); 15% a complete output file exhibiting feature reduction for one dataset (USB key); 15% correct theoretical analysis of the order of time complexity of your implementation (paper); 35% correct analysis of the empirical time-performance of your program, with justification for any divergence between it and the theoretical analysis (on paper). Attach a completed & signed statement of originality form to your submission.
Hello
I'm interesting your project very well
I'm a Good C/C++, Java, Math, Algorithm expert.
I understand your req exactly.
I m quite well experienced in these assignment jobs.
Let's go ahead with me
I want to service for you continously.
Thanks