Closed

Rapidminer Ninja wanted / Webscraping using Rapidminer

This project received 2 bids from talented freelancers with an average bid price of $49 USD.

Employer working
Project Budget
N/A
Total Bids
2
Project Description

** Your knowledge/skills

Mandatory

- You are an experienced user of Rapidminer 5.2

- You already have previous experience of successful webscraping using Rapidminer 5.2

** Your work habits

Mandatory

- You respect the deadlines (you will proactively report any hurdles)

- You will answer emails within 24 hours

- You will not outsource the job, in whole or in part

** Your personality

- You don’t hesitate to provide input/ideas that could bring added value to the project

- You are interested in a long term collaboration on further webscraping projects.

** Your task will be

Your mission is to create a webscraping process in Rapidminer where the input is a set of keywords, and the output is a single Excel spreadsheet (.xls or .xlsx).

- Take as an example the following set of keywords: US “trade balance” (with trade balance between quotes)

- The process will search the 9 following websites for these keywords

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

- For each website, the process will retrieve the 3 (default value) most recent articles. This number must be configurable per website, i.e. we may configure 5 articles for the NY Times but only 2 for the WSJ.

- The process will save the content of each article (only the article, not the full webpage) in an Excel spreadsheet where the columns are ordered as follows:

+ Column 1: publishing date of the article

The date format differs from one website to another. For example:

On Reuters : Tue Sep 20, 2011 11:40pm EDT

On Bloomberg : Sep 18, 2011 9:00 PM GMT+0200

On Businessweek : August 04, 2011, 4:45 PM EDT

On WSJ : September 27, 2011, 7:30 PM IST

On FT : September 11, 2011 4:24 pm

Etc.

+ Column 2: direct link to the article on the website (the source webpage that has been processed)

+ Column 3: title of the article (without html tags)

+ Column 4: content of the article (without html tags)

- The file “[url removed, login to view]” will be saved under c:\rapidminer\
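
To make the expected behaviour concrete, here is a minimal sketch in Python of the per-website article limit, the date normalization and the column order described above. It is not a Rapidminer process and is not meant as the implementation, only as an illustration. The names `ARTICLES_PER_SITE`, `article_limit`, `parse_date`, `strip_tags` and `to_row` are hypothetical, as is the example URL. The sketch assumes an English locale and simply drops the time zone suffix, which a real process would probably want to keep or convert. Writing the rows to .xls/.xlsx is left out.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

DEFAULT_ARTICLES = 3
# Hypothetical per-website overrides; any site not listed uses the default.
ARTICLES_PER_SITE = {"NY Times": 5, "WSJ": 2}


def article_limit(site):
    """Number of most recent articles to retrieve for a website."""
    return ARTICLES_PER_SITE.get(site, DEFAULT_ARTICLES)


# One pattern per date style quoted in the description (time zone removed first).
DATE_FORMATS = [
    "%a %b %d, %Y %I:%M%p",  # Reuters: Tue Sep 20, 2011 11:40pm
    "%b %d, %Y %I:%M %p",    # Bloomberg: Sep 18, 2011 9:00 PM
    "%B %d, %Y, %I:%M %p",   # Businessweek / WSJ: August 04, 2011, 4:45 PM
    "%B %d, %Y %I:%M %p",    # FT: September 11, 2011 4:24 pm
]

# Trailing zone such as " EDT", " IST" or " GMT+0200" (but never a bare AM/PM).
_TZ_SUFFIX = re.compile(r"\s+(?!(?:AM|PM)\b)[A-Z]{2,5}(?:[+-]\d{4})?$")


def parse_date(text):
    """Parse a publishing date in any of the known styles (zone is dropped)."""
    cleaned = _TZ_SUFFIX.sub("", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % text)


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed and spaces collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def to_row(raw_date, url, title_html, body_html):
    """Build one spreadsheet row: date, link, title, content (columns 1 to 4)."""
    return [parse_date(raw_date), url, strip_tags(title_html), strip_tags(body_html)]
```

For instance, `to_row("Tue Sep 20, 2011 11:40pm EDT", "https://example.com/article", "<h1>US <b>trade balance</b> narrows</h1>", "<p>Body text.</p>")` yields a row whose first cell is a date object rather than a string, so the spreadsheet can be sorted by date regardless of the source website.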

** You will deliver

Mandatory

- You will test the process before delivery in order to ensure it works as described

- You will provide the .RMP file.
