Plagiarism detection - NEED IT IN 5 days TOPS!
Find common phrases and sentences between documents (source and suspicious). Find all plagiarized parts. There can be 4 cases:
- Copy paste
- Copy paste + word order change
- Copy paste + paraphrasing
- Copy paste + word order change and paraphrasing
I suggest using a multithreaded Needleman-Wunsch algorithm for document similarity comparison and plWordNet for synonym and paraphrase checks.
There are sets of document pairs: documents suspected of plagiarism and the files they may previously have been plagiarized from. They are in plain text format, named [url removed, login to view] and [url removed, login to view], where XXXX is the pair number; the first file contains something plagiarized from the second one.
If plagiarism was detected, we want to save all information in an XML file as follows:
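To make the suggestion above concrete, here is a minimal, single-threaded sketch of Needleman-Wunsch global alignment scoring. The scoring values (match=1, mismatch=-1, gap=-1) and the `same` hook are assumptions for illustration; a plWordNet synonym lookup could be plugged in through `same`, but no real plWordNet API is used here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1,
                     same=lambda x, y: x == y):
    """Return the global alignment score of two sequences.

    `a` and `b` can be strings or lists of word tokens. `same` decides
    whether two items count as a match (hypothetical hook, e.g. for a
    synonym check).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per item.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if same(a[i - 1], b[j - 1]) else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align both items
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]
```

Working on word tokens instead of characters keeps the matrix small, and because each pair of sentences is scored independently, the pairs could be distributed over workers.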
<?xml version="1.0" encoding="UTF-8"?>
<alignment document="[url removed, login to view]" source="[url removed, login to view]">
<passage documentFrom="123" documentTo="123" sourceFrom="123" sourceTo="123" />
<passage documentFrom="234" documentTo="234" sourceFrom="234" sourceTo="234" />
</alignment>
The passage tag means that plagiarism was detected:
• documentFrom – beginning index of the recognized plagiarized fragment from [url removed, login to view],
• documentTo – ending index of the recognized plagiarized fragment from [url removed, login to view],
• sourceFrom – beginning index of the recognized plagiarized fragment from [url removed, login to view],
• sourceTo – ending index of the recognized plagiarized fragment from [url removed, login to view].
Save everything to suspiciousXXXX-sourceXXXX.xml. For the entire task, the result will be a set of XML files.
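A minimal sketch of producing this XML with Python's standard library is shown below. The function name and the tuple layout of `passages` are assumptions for illustration, and the file names used in the usage note are placeholders, since the real input names are not shown in this description. `xml_declaration=True` requires Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

def build_alignment_xml(document_name, source_name, passages):
    """Build the <alignment> document as UTF-8 bytes.

    `passages` is an iterable of
    (document_from, document_to, source_from, source_to) tuples
    (assumed layout, not prescribed by the task).
    """
    root = ET.Element("alignment", document=document_name, source=source_name)
    for doc_from, doc_to, src_from, src_to in passages:
        ET.SubElement(
            root, "passage",
            documentFrom=str(doc_from), documentTo=str(doc_to),
            sourceFrom=str(src_from), sourceTo=str(src_to),
        )
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

For example, `build_alignment_xml("susp.txt", "src.txt", [(10, 50, 100, 140)])` returns bytes that can be written to the result file opened in binary mode.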
To measure quality, I will use:
• precision: Sammut, Claude, and Geoffrey I. Webb, “Encyclopedia of Machine Learning and Data Mining”, 2017, precision,
• recall: Sammut, Claude, and Geoffrey I. Webb, “Encyclopedia of Machine Learning and Data Mining”, 2017, precision and recall,
• granularity: Potthast, Martin, et al. “An evaluation framework for plagiarism detection.” Proceedings of the 23rd international conference on computational linguistics: Posters. Association for Computational Linguistics, 2010,
• plagdet score (main score): Potthast, Martin, et al. “An evaluation framework for plagiarism detection.” Proceedings of the 23rd international conference on computational linguistics: Posters. Association for Computational Linguistics, 2010.
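For orientation, the plagdet score from Potthast et al. combines the other three measures: it is the F1 score (harmonic mean of precision and recall) divided by log2(1 + granularity). A small sketch, with function names chosen here for illustration:

```python
import math

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def plagdet(precision, recall, granularity):
    """plagdet = F1 / log2(1 + granularity); granularity is at least 1."""
    return f1_score(precision, recall) / math.log2(1 + granularity)
```

Plugging in the baseline figures reported further down in this description (precision 0.861901, recall 0.123821, granularity 1.352459) gives approximately 0.17545, which agrees with the reported plagdet. It also shows that low recall, not precision, is what holds the baseline score down.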
The trial set is attached:
• pl/en – division between PL and EN documents,
• src (inside pl/en) – source documents,
• susp (inside pl/en) – suspicious documents,
• xml (inside pl/en) – proper answers.
The evaluator is attached as a JAR file that needs the newest Java 8. Its options are:
• -e evaluation method,
• -i path to ZIP file with resulting XML files,
• -t path to folder with answers.
java -jar [url removed, login to view] -i c:\\[url removed, login to view] -t c:\\dataset -e TASK1
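Since the evaluator takes a ZIP file with the resulting XML files, the results have to be packaged first. A minimal sketch follows; it assumes the evaluator accepts the XML files at the top level of the archive, which is not stated in this description, and the function name is hypothetical.

```python
import zipfile
from pathlib import Path

def zip_results(xml_dir, zip_path):
    """Pack every *.xml file from xml_dir into zip_path (flat layout).

    Returns the list of archived file names.
    """
    xml_files = sorted(Path(xml_dir).glob("*.xml"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for xml_file in xml_files:
            # arcname drops the directory part so files sit at the top level.
            archive.write(xml_file, arcname=xml_file.name)
    return [xml_file.name for xml_file in xml_files]
```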
The baseline solution to this task can, in general, be based on a suffix array, used to find the Longest Common Substring between documents.
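As an illustration of the idea, the sketch below finds the longest common substring of two strings with a suffix array. It builds the array naively by sorting suffixes, which is fine for short texts but far slower than a linear-time construction such as induced sorting; the separator character is an assumption and must not occur in either input.

```python
def longest_common_substring(a, b):
    """Longest common substring of a and b via a (naively built) suffix array."""
    sep = "\x00"  # assumed to be absent from both inputs
    s = a + sep + b
    # Naive construction: sort suffix start positions by suffix text.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    best_len, best_pos = 0, 0
    for x, y in zip(suffix_array, suffix_array[1:]):
        # Only neighbouring suffixes that start in different documents count.
        if (x < len(a)) == (y < len(a)):
            continue
        # Length of the common prefix, never crossing the separator.
        n = 0
        while (x + n < len(s) and y + n < len(s)
               and s[x + n] == s[y + n] and s[x + n] != sep):
            n += 1
        if n > best_len:
            best_len, best_pos = n, x
    return s[best_pos:best_pos + best_len]
```

The longest match between the two documents always shows up between two neighbouring entries of the sorted suffix list, which is why one linear scan over adjacent pairs is enough.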
During pre-processing, the documents are processed as follows:
• Remove special characters,
• Normalize whitespace in the text,
• Remove EN stop-words,
• Remove PL stop-words.
The data is then divided into 15-gram phrases and put into a suffix array. The results are as follows:
• precision: 0.861901, recall: 0.123821, granularity: 1.352459, plagdet: 0.175451
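The pre-processing and n-gram steps above can be sketched as follows. This is not the baseline code: it assumes the 15-grams are word n-grams, uses a tiny illustrative stop-word set instead of real EN/PL lists, and finds shared phrases with a plain set intersection instead of a suffix array.

```python
import re

# Tiny illustrative stop-word set; real EN and PL lists would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "i", "w", "z", "na"}

def preprocess(text, stop_words=STOP_WORDS):
    """Remove special characters, normalize whitespace, drop stop-words."""
    cleaned = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, keeps PL letters
    tokens = cleaned.split()                 # split() collapses all whitespace runs
    return [t for t in tokens if t.lower() not in stop_words]

def ngrams(tokens, n=15):
    """All consecutive word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shared_ngrams(suspicious_text, source_text, n=15):
    """Word n-grams that occur in both documents after pre-processing."""
    suspicious = set(ngrams(preprocess(suspicious_text), n))
    source = set(ngrams(preprocess(source_text), n))
    return suspicious & source
```

Exact n-gram matching of this kind only catches near-verbatim copies, which fits the baseline's profile of high precision and low recall.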
Nong, Ge, Sen Zhang, and Wai Hong Chan. “Linear suffix array construction by almost pure induced-sorting.” Data Compression Conference, 2009. DCC’09. IEEE, 2009.
I have been studying this topic a lot recently and I think the tool would work best if it worked as follows:
1. split text into sentences - for this I have this tool: https://mega.nz/#!Y1V3mK4S!FChLKqtWfKM_Ezs4cpKbrvWI5O982cfZZ7dmOFPlqDE
2. pre-process text using https://mega.nz/#!t5liXD6T!MhC5wzqg-BZ-XWfAWWwuFu_w5Wy3DGyOGMhUTzAWDAE
3. lowercase all
4. for each sentence do LSA (Latent Semantic Analysis) using https://mega.nz/#!t4cAiB6I!aTPOQrvW5jbHrmZzbpfdE6WyL-8tS5D-RrD5hisvPAg
So we need to use train and testMP.py from this archive. It would be best to have a parameter to choose between LSA and WordNet. It might be slow, because it will generate a lot of other sentences for each sentence.
5. compare each original sentence to all generated sentences from the suspicious file. Doing this sentence by sentence preserves the information about which sentence was plagiarized. I suggest using the Needleman-Wunsch algorithm here, ideally a multithreaded version
6. save results in XML as previously described
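The key point of the steps above is that working sentence by sentence keeps the positions needed for the XML output. Below is a minimal sketch of that bookkeeping. It uses a naive regex splitter in place of the linked sentence tool and a simple token-overlap score in place of LSA or Needleman-Wunsch; all function names and the 0.8 threshold are assumptions, and indices are treated as character offsets with an exclusive end.

```python
import re

def sentence_spans(text):
    """Return (start, end, sentence) triples with character offsets into text."""
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = m.group()
        stripped = raw.strip()
        if stripped:
            start = m.start() + (len(raw) - len(raw.lstrip()))
            spans.append((start, start + len(stripped), stripped))
    return spans

def token_overlap(a, b):
    """Jaccard overlap of lowercased tokens (placeholder similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_sentences(suspicious, source, threshold=0.8, score=token_overlap):
    """Pair each suspicious sentence with its best-scoring source sentence."""
    passages = []
    source_spans = sentence_spans(source)
    for d_from, d_to, d_sent in sentence_spans(suspicious):
        best = max(source_spans, key=lambda s: score(d_sent, s[2]), default=None)
        if best is not None and score(d_sent, best[2]) >= threshold:
            passages.append({"documentFrom": d_from, "documentTo": d_to,
                             "sourceFrom": best[0], "sourceTo": best[1]})
    return passages
```

Each returned dictionary maps directly onto one passage tag, and the `score` parameter is where a different similarity function could be swapped in.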