Web Data Extraction

CLOSED
Bids
2
Avg Bid (USD)
$4378
Project Budget (USD)
$30 - $100

Project Description:
We require a script that automates data extraction without requiring any programming from the user. It must go beyond simple screen scraping or cutting and pasting information from a website and extract information intelligently. It has to log in to websites automatically, account for changes in the source website, extract the information, and copy it reliably to another application in a format specified by us.

## Deliverables


### Description



| I need a customized web crawling program to scrape data from an extensive business contact database that contains millions of members. This program must be able to circumvent server detection, either by bandwidth throttling or by another mechanism. The database will require multiple extraction templates; however, the end user must be able to determine the specific crawling rules, keywords, and crawl depth. The end product must be convertible into an Excel file, a Microsoft Access database, or a MySQL database.
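
As a rough illustration of the output requirement, the sketch below writes extracted records to CSV text, which Excel can open directly. The function name `rows_to_csv`, the column names, and the sample records are hypothetical; Access or MySQL output would need a database driver and is not shown.

```python
import csv
import io

def rows_to_csv(rows, columns):
    """Serialize a list of dict records to CSV text with a header row.

    Missing or None values become empty cells, so optional fields
    do not shift the remaining columns.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow({c: "" if row.get(c) is None else row[c] for c in columns})
    return buffer.getvalue()

# Hypothetical sample records; the second one lacks a phone number.
sample_rows = [
    {"name": "Acme & Co", "phone": "020 7946 0000"},
    {"name": "Bolt Ltd", "phone": None},
]
csv_text = rows_to_csv(sample_rows, ["name", "phone"])
```

Writing `csv_text` to a file with a `.csv` extension is enough for a spreadsheet import; the column list fixes the column order regardless of how each record was built.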

--

Features that are desired:

1. Multiple Data Types in Single Extraction Template
(i.e., Free Text, Tabled Information, Multiple Tables)
2. Multiple Types of File Lists or Data Inputs
(i.e., Excel, Access, MySQL, SQL, etc.)
3. Multiple Extraction Datastores
(i.e., Excel, Access, MySQL, SQL, etc.)
4. Automatic Table Creation during Extraction
(Supported in Excel, MySQL, SQL)
5. SQL 2005 Express Instance
(Stores Meta-Data, Program Variables, and can store Extracted Data)
6. Comprehensive Meta-Data Logging
(For Auditing, Data Cleansing, and Data Joining)
7. Wizard driven DataSet Initialization String Creation
(The HTML for Extraction Area Start and Row Start)
8. Manual Editing of DataSet initialization String
(User Defined DataSet HTML)
9. Automatic Table Row Count Calculation
(Automatically Calculates Number of Tables Rows on each HTML Page)
10. Wizard driven Field Creation
(The HTML for Data Extraction Start and Column Start)
11. Manual Editing of Field Start and Stop HTML
(User Defined Start and Stop Tags)
12. Supports Optional Fields
(Accurate Extraction of data that appears in some rows, but not in others)
13. Built In Data Cleansing
(Remove HTML, Preserve Text Whitespace, Full URL from Relative, and more)
14. Test Extraction w/Step by Step Replay for Troubleshooting
(Expedites Troubleshooting)
15. One-Click Save to Datastore Option
(Extract while browsing in the DataPage Editor)
16. Basic Automation Wizard
(Simple Extraction Automation via File List from Excel, Access, MySQL and SQL)
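
Several of the features above (user-defined start and stop tags, optional fields, built-in cleansing) can be sketched in a few lines. This is a minimal illustration under assumptions, not the requested product: the function names, the sample HTML, the field definitions, and the `example.com` base URL are all hypothetical, and this version collapses whitespace rather than preserving it.

```python
import html
import re
from urllib.parse import urljoin

def between(text, start, stop, pos=0):
    """Return (value, end_index) for the text between start and stop,
    searching from pos. Returns (None, pos) if either marker is missing."""
    i = text.find(start, pos)
    if i == -1:
        return None, pos
    i += len(start)
    j = text.find(stop, i)
    if j == -1:
        return None, pos
    return text[i:j], j + len(stop)

def clean(value):
    """Remove HTML tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", value)
    return " ".join(html.unescape(no_tags).split())

def extract_rows(page, row_start, row_stop, fields, base_url, url_fields=("url",)):
    """Split page into rows, then pull each field by its start/stop markers.

    A field whose markers are absent from a row is stored as None
    (optional field). Fields named in url_fields are made absolute.
    """
    rows = []
    pos = 0
    while True:
        row, next_pos = between(page, row_start, row_stop, pos)
        if row is None:
            break
        pos = next_pos
        record = {}
        for name, (start, stop) in fields.items():
            raw, _ = between(row, start, stop)
            record[name] = clean(raw) if raw is not None else None
        for name in url_fields:
            if record.get(name):
                record[name] = urljoin(base_url, record[name])
        rows.append(record)
    return rows

# Hypothetical page fragment: the second row has no phone cell.
SAMPLE = """
<table>
<tr><td class="name">Acme &amp; Co</td><td class="phone">020 7946 0000</td><a href="/biz/acme">more</a></tr>
<tr><td class="name"><b>Bolt Ltd</b></td><a href="/biz/bolt">more</a></tr>
</table>
"""
FIELDS = {
    "name": ('<td class="name">', "</td>"),
    "phone": ('<td class="phone">', "</td>"),
    "url": ('<a href="', '"'),
}
rows = extract_rows(SAMPLE, "<tr>", "</tr>", FIELDS, "http://example.com/list")
```

Marker-based extraction like this is simple to edit by hand, which matches the manual editing features in the list, but it breaks when the source markup changes; an HTML parser would be more robust at the cost of a less direct template format.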

Packages

1. WinHTTP Stack
(Server-quality HTTP platform that allows downloading up to 10 pages per second)
2. Multi-Step Task Execution
(Simulate user tasks like Log-in, get Cookie or SessionID, Submit Searches)
3. Bandwidth Throttling
(Scale from 10 requests/second down to 1 request/hour to simulate a real user)
4. Download Images and Files
(Edit File Path and File Naming Conventions)
5. Customize User Agent, Referrer URL, Relative URL, Cookies, and more.
6. Powerful SQL based File List Manipulation and Concatenation
7. Package Run Scheduling
(Run Normally or Silently from Windows Scheduler or other program interface)
9. Create URL File Lists
(Manually or using Excel, Access, MySQL, and SQL)
10. Advanced Web Crawler
(Control depth, number of pages, and parameters of links to be crawled or ignored)
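
The crawler controls named in the last item (depth, page count, and which links to follow or ignore) together with a fixed delay between requests could look roughly like the sketch below. It is an assumption-laden illustration: `crawl`, its parameters, and the in-memory `SITE` are hypothetical, and the `fetch` callable is injected so the example runs without any network access.

```python
import re
import time
from collections import deque
from urllib.parse import urldefrag, urljoin

# Naive link pattern for the sketch; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(start_url, fetch, max_depth=2, max_pages=100, allow=None, delay=1.0):
    """Breadth-first crawl starting at start_url.

    fetch(url) must return the page HTML as a string.
    allow(url), if given, decides whether a discovered link is queued.
    delay is the pause in seconds between consecutive fetches.
    Returns the list of URLs in the order they were fetched.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited and delay > 0:
            time.sleep(delay)  # pace requests after the first one
        page = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in LINK_RE.findall(page):
            link = urldefrag(urljoin(url, href)).url
            if link in seen:
                continue
            if allow is not None and not allow(link):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited

# Hypothetical in-memory site standing in for real HTTP responses.
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a> <a href="/private/x">X</a>',
    "http://example.com/a": '<a href="/a/deep">deep</a> <a href="/">home</a>',
    "http://example.com/b": "",
    "http://example.com/a/deep": '<a href="/a/deeper">deeper</a>',
}
order = crawl(
    "http://example.com/",
    lambda u: SITE.get(u, ""),
    max_depth=1,
    allow=lambda u: "/private/" not in u,
    delay=0,
)
```

With `max_depth=1` the crawl stops at the pages linked from the start page, and the `allow` callback plays the role of the "crawled or ignored" link parameters. Replacing the lambda with a real HTTP call and raising `delay` is all that changes for a live run.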

--
This program must be able to avoid automated detection or blocking from the host. Remember, it must be able to extract entries in the millions at very high speeds. The database which I need to scrape is www.yell.com


The information required includes:

All as per [www.yell.com][1] |

Skills required:
Anything Goes, Database Administration, Microsoft Access, MySQL, Oracle, Software Architecture, SQL, Visual Foxpro, Windows Desktop
Additional Files: vw_2011___06___17___webDataExtImg_RAC_NameCryptedToProtectYourPrivacy_X201161774652838377953623104315263521084.gif
About the employer:
Verified


Hire mastirlaa
$ 680
in 14 days
Hire matfizvw
$ 8075
in 14 days