Web Data Extraction

This project received 2 bids from talented freelancers with an average bid price of $4378 USD.

Employer working
Project Budget
$30 - $100 USD
Total Bids
Project Description

We require a script that automates data extraction without requiring any programming from the user. Going beyond simple screen scraping or cutting and pasting information from a website, the script has to extract information intelligently. It has to log in to websites automatically, account for changes in the source website, extract the information, and copy it reliably to another application in a format specified by us.

## Deliverables

### Description

| I need a customized web crawling program to scrape data off an extensive business contact database that contains millions of members. This program must be able to circumvent server detection, either by bandwidth throttling or by another means. The database will require multiple templates for extraction; however, the end user must be able to set the specific crawling rules, keywords, and depth of crawl. The end product must be convertible to an Excel file, a Microsoft Access database, or a MySQL database.


Features that are desired:

1. Multiple Data Types in Single Extraction Template
(i.e., Free Text, Tabled Information, Multiple Tables)
2. Multiple Types of File Lists or Data Inputs
(i.e., Excel, Access, MySQL, SQL, etc.)
3. Multiple Extraction Datastores
(i.e., Excel, Access, MySQL, SQL, etc.)
4. Automatic Table Creation during Extraction
(Supported in Excel, MySQL, SQL)
5. SQL 2005 Express Instance
(Stores Meta-Data, Program Variables, and can store Extracted Data)
6. Comprehensive Meta-Data Logging
(For Auditing, Data Cleansing, and Data Joining)
7. Wizard-driven DataSet Initialization String Creation
(The HTML for Extraction Area Start and Row Start)
8. Manual Editing of DataSet Initialization String
(User Defined DataSet HTML)
9. Automatic Table Row Count Calculation
(Automatically Calculates Number of Table Rows on each HTML Page)
10. Wizard-driven Field Creation
(The HTML for Data Extraction Start and Column Start)
11. Manual Editing of Field Start and Stop HTML
(User Defined Start and Stop Tags)
12. Supports Optional Fields
(Accurate Extraction of data that appears in some rows, but not in others)
13. Built In Data Cleansing
(Remove HTML, Preserve Text Whitespace, Full URL from Relative, and more)
14. Test Extraction w/Step by Step Replay for Troubleshooting
(Expedites Troubleshooting)
15. One-Click Save to Datastore Option
(Extract while browsing in the DataPage Editor)
16. Basic Automation Wizard
(Simple Extraction Automation via File List from Excel, Access, MySQL and SQL)
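Features 7 through 13 describe extraction driven by start and stop HTML strings, with optional fields and built-in cleansing. A minimal sketch of that idea in Python follows. The function names (`extract_between`, `clean`, `absolute_url`, `extract_rows`), the sample page, and the field markers are all hypothetical and chosen only for illustration; the posting does not specify an implementation or a language.

```python
import html
import re
from urllib.parse import urljoin


def extract_between(text, start, stop, pos=0):
    """Return (value, end_index) for the first span between start and stop
    found at or after pos, or (None, pos) when either marker is missing."""
    i = text.find(start, pos)
    if i == -1:
        return None, pos
    i += len(start)
    j = text.find(stop, i)
    if j == -1:
        return None, pos
    return text[i:j], j + len(stop)


def clean(value):
    """Remove HTML tags, decode entities, and trim surrounding whitespace."""
    without_tags = re.sub(r"<[^>]+>", "", value)
    return html.unescape(without_tags).strip()


def absolute_url(page_url, link):
    """Resolve a relative link against the URL of the page it appeared on."""
    return urljoin(page_url, link)


def extract_rows(page, row_start, row_stop, fields):
    """Split the page into rows, then pull each field out of each row.
    A field whose markers are absent from a row is stored as None, so
    optional fields do not shift the values of their neighbours."""
    rows = []
    pos = 0
    while True:
        row, next_pos = extract_between(page, row_start, row_stop, pos)
        if row is None:
            break
        pos = next_pos
        record = {}
        for name, (start, stop) in fields.items():
            raw, _ = extract_between(row, start, stop)
            record[name] = clean(raw) if raw is not None else None
        rows.append(record)
    return rows


# Hypothetical sample data: the second row has no phone cell.
SAMPLE_PAGE = (
    "<table>"
    "<tr><td class='name'><b>Acme Ltd</b></td><td class='phone'>555-0100</td></tr>"
    "<tr><td class='name'>Smith &amp; Co</td></tr>"
    "</table>"
)
FIELDS = {
    "name": ("<td class='name'>", "</td>"),
    "phone": ("<td class='phone'>", "</td>"),
}

rows = extract_rows(SAMPLE_PAGE, "<tr>", "</tr>", FIELDS)
print(len(rows), "rows extracted")
print(rows)
```

The row count falls out of the row markers alone (feature 9), and user-edited start and stop strings (features 8 and 11) would simply replace the values in `FIELDS`. String markers like these are brittle when the source site changes its markup, so a real tool would likely need a fallback or a proper HTML parser.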


1. WinHTTP Stack
(Server-quality HTTP platform that allows downloading up to 10 pages per second)
2. Multi-Step Task Execution
(Simulate user tasks like Log-in, get Cookie or SessionID, Submit Searches)
3. Bandwidth Throttling
(Scale from 10 requests/second down to 1 request/hour to simulate a real user)
4. Download Images and Files
(Edit File Path and File Naming Conventions)
5. Customize User Agent, Referrer URL, Relative URL, Cookies, and more.
6. Powerful SQL based File List Manipulation and Concatenation
7. Package Run Scheduling
(Run Normally or Silently from Windows Scheduler or other program interface)
9. Create URL File Lists
(Manually or using Excel, Access, MySQL, and SQL)
10. Advanced Web Crawler
(Control Depth, Number of pages, and parameters of Link to be crawled or ignored)
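The bandwidth throttling and crawler controls listed above (request rate, depth, page count, link filtering) can be sketched as follows. This is only an illustration under assumptions: the `Throttle` class, the `crawl` function, and the `fetch_links` callback are hypothetical names, and the link graph is an in-memory stand-in for real page downloads.

```python
import time
from collections import deque


class Throttle:
    """Enforce a minimum interval between requests.
    10 requests/second is Throttle(10); 1 request/hour is Throttle(1 / 3600)."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(start_url, fetch_links, max_depth, max_pages,
          should_follow=lambda url: True, throttle=None):
    """Breadth-first crawl. fetch_links(url) is assumed to download a page
    and return the links found on it. Links are followed only while the
    current page is above max_depth and the link passes should_follow."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if throttle is not None:
            throttle.wait()
        links = fetch_links(url)
        visited.append(url)
        if depth >= max_depth:
            continue
        for link in links:
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for a real site.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d", "skip"], "d": ["e"]}

pages = crawl("a", lambda url: GRAPH.get(url, []), max_depth=2, max_pages=10,
              should_follow=lambda url: url != "skip")
print(pages)
```

Passing the clock and sleep functions into `Throttle` keeps the pacing logic testable without real delays. The request rate a site tolerates is set by its operator, so any rate used in practice should respect the site's terms and `robots.txt`.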

This program must be able to avoid automated detection or blocking from the host. Remember, it must be able to extract entries in the millions at very high speeds. The database which I need to scrape is [url removed, login to view]

The information required includes:

All as per [[url removed, login to view]][1] |
