
Parsing data files with Perl script and Java

$100-500 USD

Cancelled
Posted over 12 years ago

Paid on delivery
You are given one file containing 10,000-20,000 URLs ([login to view URL]) and another file with the same number of lines containing the corresponding human-judged category labels ([login to view URL]). You need to run the Liblinear package, a support vector machine implementation, on this data. Specifically, we want you to do the following. We would ideally like this task done in Perl, since we want our tests to be easy to run from the command line, but we also want you to be proficient in Java, since we will eventually push the optimally trained webpage classifier into production, and all our production systems run in Java.

## Deliverables

1. Parse the URLs file and create another file containing a text dump of each URL, with newline characters eliminated so that the new file contains the text of a URL on the corresponding line (ideally all page text with HTML tags removed). Your Perl script should be easily modifiable to allow terms (words) in different webpage sections to be treated differently. Let's call this file [login to view URL]

2. Do one pass over the text data and assign unique integers contiguously from 1 to N, one for each unique word that appears.
Store this mapping in a separate file, one entry per line, in the format: <integer><tab separator><word>

3. Do a pass over the URLs file and assign unique integers N+1 to N+M to each unique domain (e.g. the domain for [login to view URL] would be yahoo.com). You can simply store these in the format: <integer><tab separator><domain>

4. Do a pass over the URLs file and assign unique integers N+M+1 to N+M+P to each unique sub-domain (e.g. the sub-domain for [login to view URL] would be movie.yahoo.com). You can simply store these in the format: <integer><tab separator><sub-domain>

5. Do one pass over [login to view URL] and build a unique mapping from the distinct labels to the integers 1 to K, where K is the number of distinct labels you see in the file.

Now load the mappings from (2), (3), (4) and (5) into main memory and do one more pass over [login to view URL] and [login to view URL] to prepare a file [login to view URL] that can be accepted directly by the SVM software: [login to view URL]~cjlin/liblinear/

The data format is: <label> <feature1:value1> ... <feature_i:value_i>

You simply print the label by looking up the corresponding integer for this label. To create the rest of the "feature vector", you need to do the following:

1. Populate a HashMap of features and values, where the value for features of categories 3 and 4 is simply 1.0 for the correct domain and sub-domain. For the word features, the value is the number of times the word occurred.

2. Print the feature vector to the file. Liblinear requires that features be printed in increasing order of feature id, so make sure you emit the HashMap contents in that order. The feature vector needs to be sparse, so iterate only over words that actually occur on this URL's page.

3. Run Liblinear in its default 5-fold cross-validation mode and report the accuracy of this method to us.

----------------------------------

Once this is done, compare the following variants and report the above cross-validation accuracy for each:
1. Normalize the whole feature vector: let x be the sum of squares of all the feature values for a line, and divide each value by sqrt(x) before printing it to the file.

2. Normalize only the word part of the feature vector: let x be the sum of squares of all the word feature values (i.e., those from (2)) for a line, and divide each word feature value by sqrt(x) before printing it to the file.

3. There is probably a convergence parameter in the SVM. See if you can get better results by reducing its value to 1/10 of the default.

4. The default value of the SVM C parameter is 1. See if there is any improvement if you set it to 0.1 or 10.

Estimated time: 20-30 hours (including all testing)
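As a rough illustration of the feature-vector construction described above, the following Java sketch builds one sparse Liblinear input line (`<label> <id:value> ...`, feature ids in increasing order) from a word-to-id mapping and the ids assigned to a URL's domain and sub-domain. All names and the sample ids are invented for the example; a `TreeMap` stands in for the "HashMap purged in increasing feature-id order":

```java
import java.util.*;

public class SparseVectorSketch {
    // Build one Liblinear input line: "<label> <id:value> ..." with ids ascending.
    // wordIds: word -> feature id (1..N); pageWords: tokens of one page's text dump;
    // domainId / subDomainId: the feature ids assigned to this URL's domain features.
    public static String liblinearLine(int label, Map<String, Integer> wordIds,
                                       List<String> pageWords,
                                       int domainId, int subDomainId) {
        // TreeMap keeps feature ids in increasing order, as Liblinear requires.
        TreeMap<Integer, Double> features = new TreeMap<>();
        for (String w : pageWords) {
            Integer id = wordIds.get(w);
            if (id != null) {
                features.merge(id, 1.0, Double::sum); // word value = occurrence count
            }
        }
        features.put(domainId, 1.0);     // domain feature is simply 1.0
        features.put(subDomainId, 1.0);  // sub-domain feature is simply 1.0

        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> wordIds = new HashMap<>();
        wordIds.put("movie", 1);
        wordIds.put("review", 2);
        List<String> page = Arrays.asList("movie", "review", "movie");
        // label 3, domain feature id 5, sub-domain feature id 7 (made-up ids)
        System.out.println(liblinearLine(3, wordIds, page, 5, 7));
    }
}
```

Because only features present in the map are printed, the output stays sparse, as required.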
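Variant 1 above amounts to scaling the whole sparse vector to unit Euclidean (L2) length; variant 2 applies the same scaling only to the word-feature entries. A minimal Java sketch of the whole-vector case (the method name is mine):

```java
import java.util.*;

public class NormalizeSketch {
    // Divide each feature value by sqrt(sum of squares of all values),
    // i.e. scale the sparse vector to unit Euclidean (L2) length.
    public static void l2Normalize(SortedMap<Integer, Double> features) {
        double sumSquares = 0.0;
        for (double v : features.values()) {
            sumSquares += v * v;
        }
        if (sumSquares == 0.0) return; // empty/zero vector: nothing to scale
        double norm = Math.sqrt(sumSquares);
        features.replaceAll((id, v) -> v / norm);
    }

    public static void main(String[] args) {
        SortedMap<Integer, Double> vec = new TreeMap<>();
        vec.put(1, 3.0);
        vec.put(2, 4.0);
        l2Normalize(vec); // norm is 5, so values become 0.6 and 0.8
        System.out.println(vec);
    }
}
```

For variant 2, the same idea would be applied after summing squares over word-feature ids only (ids 1 to N), leaving the domain and sub-domain values at 1.0.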
Project ID: 2703108

About the project

6 proposals
Remote project
Active 12 yrs ago

6 freelancers are bidding on average $381 USD for this job
$425 USD in 5 days, 4.8 (58 reviews), 6.6
$400.35 USD in 5 days, 5.0 (76 reviews), 6.1
$400.35 USD in 5 days, 5.0 (112 reviews), 6.0
$425 USD in 5 days, 4.3 (61 reviews), 6.1
$272 USD in 5 days, 5.0 (106 reviews), 5.9
$361.25 USD in 5 days, 5.0 (41 reviews), 5.8

About the client

Mountain View, United States
5.0
230
Member since Apr 12, 2008

Client Verification
