Closed

Source code matching application

This project received 1 bid from talented freelancers with an average bid price of $701 USD.

Employer working
Project Budget: N/A
Total Bids: 1
Project Description

FOR ACCOMPLISHED EXPERT DEVELOPERS ONLY!!!!! NO NEWBIES PLEASE.

This must be as fast as it can possibly be, so preferably written in C, C++ or Java.

The purpose of this application is to identify computer source code that has been wholly copied, partially copied, or where certain segments of File A have been copied, changed (for example, by renaming the variables) and then used in File B.

In comparing two sets of computer code, there will be hundreds of files from SET A to compare against SET B. The application you will write will need to compare every file in SET A against every file in SET B.

The purpose of this application is not to create a tool for ***precisely*** pinpointing plagiarized code, but rather a tool to ***assist*** me in finding plagiarized code.

I need to be able to narrow my focus from hundreds of thousands of lines of code in hundreds of files down to dozens of lines of code in dozens of files.



The application front end will ask for an original source folder (where all of the original SET A code will be found), a comparison source folder (where all of the suspected SET B code will be found), and it will then perform all five of the comparisons below, file by file.
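The "every file in SET A against every file in SET B" requirement can be sketched as a simple cross product of the two folders. This is only an illustration in Python (the real tool is requested in C, C++ or Java); the function names `list_source_files` and `file_pairs` are hypothetical, and searching sub-folders recursively is an assumption the description does not state.

```python
from itertools import product
from pathlib import Path


def list_source_files(folder):
    """Return every regular file under `folder` (recursively), in sorted order."""
    return sorted(p for p in Path(folder).rglob("*") if p.is_file())


def file_pairs(set_a_folder, set_b_folder):
    """Yield every (SET A file, SET B file) combination to be compared."""
    files_a = list_source_files(set_a_folder)
    files_b = list_source_files(set_b_folder)
    return product(files_a, files_b)
```

With hundreds of files on each side this produces tens of thousands of pairs, which is why the speed of each individual test matters.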




## Deliverables




The application must be able to read all forms of text file, including the different Unicode encodings. It must provide an option for selecting the comparison language. Each comparison language has a list of keywords to exclude from the Whole Word Matching test described below; for example, if I select 'java' as the comparison language, the application will load a list of all current Java keywords from an editable text file. The expected languages are C, C++, C#, VB, Java, JavaScript, PHP, Perl and Python. They should be presented in a DROPDOWN list so that I can add to it later if need be without having to make graphical changes (such as adding checkboxes).
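As a rough sketch of the file-reading and keyword-list requirements, the Python below decodes a file by its byte-order mark and parses a keyword list. Several things here are assumptions rather than part of the description: the helper names `decode_text` and `parse_keywords`, the one-keyword-per-line file layout, and the fallback order (UTF-8 first, then Latin-1) for files that carry no BOM.

```python
import codecs

# BOMs are checked with UTF-32 first, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(raw):
    """Decode bytes using the BOM if present, else UTF-8, else Latin-1."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)  # these codecs consume the BOM
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # assumed last resort; cannot fail


def parse_keywords(text):
    """Parse a keyword list: one keyword per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}
```

Keeping the keywords in a plain text file per language means a new language only needs a new file and a new dropdown entry.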



Once the comparison language and the two folders to compare have been selected, the application will compare the files against each other as follows, performing all five tests on each file pair.

The application will generate reports that properly describe the results of each test for each file comparison, and it will calculate an overall similarity index for the two files, averaged across the 5 tests.



Here are the tests required.


**WHOLE Word Matching**

For each file pair, the CodeMatch Word Matching algorithm counts the number of matching words that are not programming-language keywords.

To determine whether a word is a programming-language keyword, comparison is done with a list of programming-language keywords. For example, the word ***while*** in a C source code file would be ignored as a keyword by this algorithm.

In some programming languages such as C and Java, keywords are case sensitive. In other programming languages like Basic, keywords are not case sensitive.

Case sensitivity can be turned ON or OFF depending on the language being examined. For a case-sensitive language like C, the word ***While*** would not be considered a language keyword and would not be ignored; in a case-insensitive language such as Basic, however, the word ***While*** would be considered a language keyword and would be ignored.

In either case, when comparing non-keyword words in the file pairs, case is ignored so that the word ***Index*** in one file would be matched with the word ***index*** in the other.

This case-insensitive comparison is done to prevent being fooled by simple case changes in plagiarized code.
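The whole-word test described above might look roughly like this in Python. It is a sketch, not the CodeMatch implementation: the function names are hypothetical, a "word" is assumed to be an identifier-like token, and the result is the set of distinct matching words (its size being the count), which is one possible reading of "counts the number of matching words".

```python
import re

# Assumption: a "word" is an identifier-like token.
WORD_RE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def non_keyword_words(text, keywords, case_sensitive_keywords=True):
    """Return the lower-cased words in `text` that are not language keywords."""
    if not case_sensitive_keywords:
        keywords = {k.lower() for k in keywords}
    words = set()
    for word in WORD_RE.findall(text):
        probe = word if case_sensitive_keywords else word.lower()
        if probe not in keywords:
            words.add(word.lower())  # non-keyword comparison always ignores case
    return words


def whole_word_matches(text_a, text_b, keywords, case_sensitive_keywords=True):
    """Return the non-keyword words that appear in both files."""
    words_a = non_keyword_words(text_a, keywords, case_sensitive_keywords)
    words_b = non_keyword_words(text_b, keywords, case_sensitive_keywords)
    return words_a & words_b
```

For example, with `keywords = {"while"}`, comparing `"while (Index < n) While = 1;"` against `"index = 0; WHILE = 2;"` matches both `index` and `while` when keywords are case sensitive (as in C), but only `index` when they are not (as in Basic).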



**PARTIAL Word Matching**

The Partial Word Matching algorithm examines each non-keyword word in the source code of one file and finds all words that match a sequence within one or more non-keyword words in the other file.

Like the word-matching algorithm, this one is also case insensitive, so the word ***abc*** in file 1 can be found within the words ***aabc***, ***abc1111111***, and ***xcabcxxxyz*** in file 2. The length of a partial word match is configurable.

**THIS CODE IS ALREADY DEVELOPED IN C++ AND IS AVAILABLE TO THE WINNING DEVELOPER**
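For orientation only, here is a naive Python sketch of the partial-word idea; it is not the existing C++ code. It assumes the inputs are already non-keyword words, treats the configurable length as a minimum length for the word being searched for (`min_length`, a hypothetical parameter), and leaves exact matches to the whole-word test.

```python
def partial_word_matches(words_a, words_b, min_length=3):
    """Map each file-1 word to the longer file-2 words that contain it.

    Both inputs are assumed to be non-keyword words; case is ignored.
    """
    lowered_b = sorted({w.lower() for w in words_b})
    matches = {}
    for word in sorted({w.lower() for w in words_a}):
        if len(word) < min_length:
            continue  # very short words would match almost everything
        hits = [other for other in lowered_b if word in other and word != other]
        if hits:
            matches[word] = hits
    return matches
```

This compares every word against every other word, so a production version would likely need an index (for example a suffix structure) to stay fast on large files.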



**Source Line Matching**

The Source Line Matching algorithm compares each line of source code from both files, ignoring case.

By source lines I mean functional programming-language lines; comment lines are *excluded*.

Also, sequences of whitespace are converted to single spaces so that the syntax structure of the line is preserved.

Note that a line of source code may have a comment at the end. The comment is stripped off for this comparison.

Source lines that contain only programming-language keywords are ***not*** considered matching.

For source lines to be considered matches, they must contain at least ***one*** non-keyword such as a variable name or function name; otherwise, lines containing only basic operations would be reported as matching.
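A minimal Python sketch of the source-line rules above follows. It makes simplifying assumptions that a real tool would have to lift: only single-line comments introduced by a marker such as `//` are handled (no block comments, no markers inside string literals), and the keyword list is assumed to be supplied in lower case because lines are lower-cased before comparison. All function names are hypothetical.

```python
import re

WORD_RE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def normalize_source_line(line, comment_marker="//"):
    """Strip a trailing comment, collapse whitespace runs, and lower-case."""
    code = line.split(comment_marker, 1)[0]
    return " ".join(code.split()).lower()


def has_non_keyword(line, keywords):
    """True if the line contains at least one word that is not a keyword."""
    return any(word not in keywords for word in WORD_RE.findall(line))


def source_line_matches(text_a, text_b, keywords, comment_marker="//"):
    """Return the normalized source lines that occur in both files."""
    def usable_lines(text):
        lines = set()
        for raw in text.splitlines():
            line = normalize_source_line(raw, comment_marker)
            if line and has_non_keyword(line, keywords):
                lines.add(line)
        return lines

    return usable_lines(text_a) & usable_lines(text_b)
```

A line such as `} else {` contains only a keyword and is therefore never reported, while `int  Total = 0;   // running sum` and `INT total = 0;` normalize to the same string and match.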



**Comment Line Matching**

The Comment Line Matching algorithm compares each line of comments from both files, again ignoring case.

As before, sequences of whitespace are converted to single spaces so that the syntax structure of the line is preserved.

Note that a line of source code may have a comment at the end.

The source code is stripped off for this comparison.

The entire comment is compared, ***regardless*** of whether there are keywords in the comment or not.
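The comment-line test is the mirror image of the source-line test, and could be sketched in Python as below. As an assumption for illustration, only single-line comments introduced by a marker such as `//` are recognized; block comments and markers inside string literals are not handled. The names `extract_comment` and `comment_line_matches` are hypothetical.

```python
def extract_comment(line, comment_marker="//"):
    """Return the normalized comment text on a line, or '' if there is none."""
    _, marker, comment = line.partition(comment_marker)
    if not marker:
        return ""
    return " ".join(comment.split()).lower()


def comment_line_matches(text_a, text_b, comment_marker="//"):
    """Return the normalized comments that occur in both files."""
    def comments(text):
        found = {extract_comment(raw, comment_marker) for raw in text.splitlines()}
        found.discard("")  # lines without a comment, or with an empty one
        return found

    return comments(text_a) & comments(text_b)
```

No keyword filtering is applied here, so a comment such as `// Update   the WHILE counter` still matches `// update the while counter`.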



**Semantic Sequence Matching**

The Semantic Sequence Matching algorithm compares the ***first word*** of every source line in the pair of files, ignoring blank lines and comment lines.

This algorithm finds sequences of code that appear to perform the same functions despite changed comments and identifier names found in different positions.

The algorithm finds the longest common semantic sequence within both files.

This algorithm yields a score representing the number of source lines in the longest semantically matching sequence in the two files.
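One way to read the semantic-sequence test is sketched below in Python. The interpretation is an assumption: the "longest common semantic sequence" is taken to be the longest contiguous run of equal first words, the first word is the first identifier-like token on the line, lines with no such token (for example a lone `}`) are skipped, and only `//`-style comments are stripped. The function names are hypothetical.

```python
import re

FIRST_WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def first_words(text, comment_marker="//"):
    """Return the lower-cased first word of every source line, in order."""
    words = []
    for raw in text.splitlines():
        code = raw.split(comment_marker, 1)[0]
        match = FIRST_WORD_RE.search(code)
        if match:  # blank lines and comment-only lines have no match
            words.append(match.group().lower())
    return words


def longest_common_run(seq_a, seq_b):
    """Length of the longest contiguous run shared by both sequences."""
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for item_a in seq_a:
        current = [0] * (len(seq_b) + 1)
        for j, item_b in enumerate(seq_b, start=1):
            if item_a == item_b:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def semantic_sequence_score(text_a, text_b, comment_marker="//"):
    """Score a file pair by its longest run of matching first words."""
    return longest_common_run(first_words(text_a, comment_marker),
                              first_words(text_b, comment_marker))
```

The dynamic-programming table keeps only two rows, so memory stays proportional to the length of one file even when both files are long.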



**Overall Match Score**

Finally, a single score is given for the similarity of each file pair, based on the 5 tests.

If a file pair has a higher score, it implies that these files are more similar and may be plagiarized from each other or from a common third file.

The scores from the five individual algorithms are each weighted and then added to give the Total Match Score.

These weights can be adjusted to give the optimal results.
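The weighted sum could be sketched in Python as follows. The test names in `TEST_NAMES`, the function `total_match_score`, and the example weights are all hypothetical; the description does not say what the default weights should be or whether the individual scores are normalized first.

```python
# Hypothetical identifiers for the five tests.
TEST_NAMES = (
    "whole_word",
    "partial_word",
    "source_line",
    "comment_line",
    "semantic_sequence",
)


def total_match_score(scores, weights):
    """Weight each of the five test scores and add them together."""
    missing = [n for n in TEST_NAMES if n not in scores or n not in weights]
    if missing:
        raise ValueError("missing score or weight for: " + ", ".join(missing))
    return sum(weights[name] * scores[name] for name in TEST_NAMES)


# Example with made-up numbers: semantic sequences weighted most heavily.
example_scores = {"whole_word": 12, "partial_word": 4, "source_line": 7,
                  "comment_line": 3, "semantic_sequence": 5}
example_weights = {"whole_word": 1.0, "partial_word": 0.5, "source_line": 2.0,
                   "comment_line": 2.0, "semantic_sequence": 4.0}
example_total = total_match_score(example_scores, example_weights)
```

Sorting all file pairs by this total, highest first, gives the short list of files worth inspecting by hand.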
