
Parsing data files with Perl script and Java

$100-500 USD

Cancelled
Posted over 12 years ago

Paid on delivery
You are given one file containing 10,000-20,000 URLs ([login to view URL]) and another file with the same number of lines containing the corresponding human-judged category labels ([login to view URL]). You need to run the Liblinear package, a support vector machine implementation, on this data. Specifically, we want you to do the following. We would ideally like this task done in Perl, since we want our tests to be easy to run from the command line, but we also want you to be proficient in Java, since we will eventually push the optimally trained webpage classifier into production, and all our production systems run in Java.

## Deliverables

1. Parse the URLs file and create another file containing a text dump of each URL, with newline characters eliminated so that the new file contains the text of a URL on the corresponding line (ideally all page text with HTML tags removed). Your Perl script should be easily modifiable to allow terms (words) in different webpage sections to be treated differently. Let's call this file [login to view URL]

2. Do one pass over the text data and assign unique integers contiguously from 1 to N, one for each unique word that appears.
Store this mapping in a separate file, one entry per line, in the format: <integer><tab separator><word>

3. Do a pass over the URLs file and assign unique integers N+1 to N+M to each unique domain (e.g. the domain for [login to view URL] would be yahoo.com). You can simply store these in the format: <integer><tab separator><domain>

4. Do a pass over the URLs file and assign unique integers N+M+1 to N+M+P to each unique sub-domain (e.g. the sub-domain for [login to view URL] would be movie.yahoo.com). You can simply store these in the format: <integer><tab separator><sub-domain>

5. Do one pass over [login to view URL] and build a unique mapping from the distinct labels to the integers 1 to K, where K is the number of distinct labels you see in the file.

Now load the mappings from (2), (3), (4) and (5) into main memory and do one more pass over [login to view URL] and [login to view URL] to prepare a file [login to view URL] that can be accepted directly by the SVM software: [login to view URL]~cjlin/liblinear/

The data format is: <label> <feature1:value1> ... <feature_i:value_i>

You simply print the label by looking up the corresponding integer for this label. To create the rest of the "feature vector", you need to do the following:

1. Populate a HashMap of features and values, where the value for features of categories 3 and 4 is simply 1.0 for the correct domain and sub-domain. For the word features, the value is the number of times the word occurred.

2. Print the feature vector to the file. Liblinear requires that features be printed in increasing order of feature id, so make sure you emit the HashMap contents in that order. The feature vector needs to be sparse, so iterate only over words that actually occur on this URL's page.

3. Run Liblinear in its default 5-fold cross-validation mode and report the accuracy of this method to us.

----------------------------------

Once this is done, compare the following variants and report the above cross-validation accuracy for each:
1. Normalize the whole feature vector: let x be the sum of squares of all the feature values for a line, and divide each value by sqrt(x) before printing it to the file.

2. Normalize only the word part of the feature vector: let x be the sum of squares of all the word feature values (i.e., those from (2)) for a line, and divide each word feature value by sqrt(x) before printing it to the file.

3. There is probably a convergence parameter in the SVM. See if you can get better results by reducing its value to 1/10 of the default.

4. The default value of the SVM C parameter is 1. See if there is any improvement if you set it to 0.1 or 10.

Estimated time: 20-30 hours (including all testing)
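As a rough illustration of the feature-vector construction described above, the following Java sketch builds one sparse Liblinear input line (`<label> <id:value> ...`, feature ids in increasing order) from a word-to-id mapping and the ids assigned to a URL's domain and sub-domain. All names and the sample ids are invented for the example; a `TreeMap` stands in for the "HashMap purged in increasing feature-id order":

```java
import java.util.*;

public class SparseVectorSketch {
    // Build one Liblinear input line: "<label> <id:value> ..." with ids ascending.
    // wordIds: word -> feature id (1..N); pageWords: tokens of one page's text dump;
    // domainId / subDomainId: the feature ids assigned to this URL's domain features.
    public static String liblinearLine(int label, Map<String, Integer> wordIds,
                                       List<String> pageWords,
                                       int domainId, int subDomainId) {
        // TreeMap keeps feature ids in increasing order, as Liblinear requires.
        TreeMap<Integer, Double> features = new TreeMap<>();
        for (String w : pageWords) {
            Integer id = wordIds.get(w);
            if (id != null) {
                features.merge(id, 1.0, Double::sum); // word value = occurrence count
            }
        }
        features.put(domainId, 1.0);     // domain feature is simply 1.0
        features.put(subDomainId, 1.0);  // sub-domain feature is simply 1.0

        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> wordIds = new HashMap<>();
        wordIds.put("movie", 1);
        wordIds.put("review", 2);
        List<String> page = Arrays.asList("movie", "review", "movie");
        // label 3, domain feature id 5, sub-domain feature id 7 (made-up ids)
        System.out.println(liblinearLine(3, wordIds, page, 5, 7));
    }
}
```

Because only features present in the map are printed, the output stays sparse, as required.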
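Variant 1 above amounts to scaling the whole sparse vector to unit Euclidean (L2) length; variant 2 applies the same scaling only to the word-feature entries. A minimal Java sketch of the whole-vector case (the method name is mine):

```java
import java.util.*;

public class NormalizeSketch {
    // Divide each feature value by sqrt(sum of squares of all values),
    // i.e. scale the sparse vector to unit Euclidean (L2) length.
    public static void l2Normalize(SortedMap<Integer, Double> features) {
        double sumSquares = 0.0;
        for (double v : features.values()) {
            sumSquares += v * v;
        }
        if (sumSquares == 0.0) return; // empty/zero vector: nothing to scale
        double norm = Math.sqrt(sumSquares);
        features.replaceAll((id, v) -> v / norm);
    }

    public static void main(String[] args) {
        SortedMap<Integer, Double> vec = new TreeMap<>();
        vec.put(1, 3.0);
        vec.put(2, 4.0);
        l2Normalize(vec); // norm is 5, so values become 0.6 and 0.8
        System.out.println(vec);
    }
}
```

For variant 2, the same idea would be applied after summing squares over word-feature ids only (ids 1 to N), leaving the domain and sub-domain values at 1.0.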
Project ID: 2703108

About the project

6 proposals
Remote project
Active 12 yrs ago

6 freelancers are bidding on average $381 USD for this job
$425 USD in 5 days, 4.8 (58 reviews), 6.6
$400.35 USD in 5 days, 5.0 (76 reviews), 6.1
$400.35 USD in 5 days, 5.0 (112 reviews), 6.0
$425 USD in 5 days, 4.3 (61 reviews), 6.1
$272 USD in 5 days, 5.0 (106 reviews), 5.9
$361.25 USD in 5 days, 5.0 (41 reviews), 5.8

About the client

Mountain View, United States
5.0
230
Member since Apr 12, 2008

Client Verification
