Java web crawler and text extraction modules

This project received 18 bids from talented freelancers with an average bid price of $716 USD.

Project Description

Part A) Crawl a given set of URLs (bid URLs), each of which links to many PDFs in Spanish, and extract text from those PDFs using regular expressions.
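As a rough illustration of the crawling step, the sketch below collects PDF links from the HTML of a bid page. It assumes the page has already been downloaded as a string and that the PDFs are linked through plain `href` attributes ending in `.pdf`; the class name `PdfLinkFinder` and the regular expression are hypothetical, and a real HTML parser would likely be more robust than a regex.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds PDF links in the HTML of one bid page.
final class PdfLinkFinder {
    // Assumption: PDFs are linked as href="...pdf" (any letter case, single or double quotes).
    private static final Pattern PDF_HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"']+\\.pdf)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns absolute PDF URIs in document order, without duplicates.
    // Note: URI.resolve throws IllegalArgumentException for malformed hrefs (e.g. unescaped spaces).
    static Set<URI> findPdfLinks(URI pageUri, String html) {
        Set<URI> links = new LinkedHashSet<>();
        Matcher m = PDF_HREF.matcher(html);
        while (m.find()) {
            links.add(pageUri.resolve(m.group(1)));
        }
        return links;
    }
}
```

Each returned URI would then be downloaded and passed to the text extraction step.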


For example, the URL [url removed, login to view] should produce the following: Gerente de proyecto, Desarrollador Java, Desarrollador PHP, Desarrollador Forms, Desarrollador .NET, Arquitecto de Software. This text is on page 47 of one of the files listed at the URL. Keep in mind that you have to parse all the documents at the URL.
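The regular-expression step might look roughly like the sketch below. It assumes the text of a PDF page has already been extracted into a string (for instance with a PDF library such as Apache PDFBox, not shown here). The class name `RoleExtractor` and the pattern are hypothetical and cover only the role titles from the example above; the real criteria would have to come from the client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls role titles out of already-extracted PDF text.
final class RoleExtractor {
    // Assumption: roles look like the ones in the example output
    // ("Gerente de proyecto", "Desarrollador <tecnología>", "Arquitecto de Software").
    private static final Pattern ROLE = Pattern.compile(
            "\\b(Gerente de [Pp]royecto|Desarrollador [\\w.]+|Arquitecto de Software)",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Returns every matching role title in the order it appears in the text.
    static List<String> extractRoles(String pageText) {
        List<String> roles = new ArrayList<>();
        Matcher m = ROLE.matcher(pageText);
        while (m.find()) {
            roles.add(m.group(1));
        }
        return roles;
    }
}
```

The `[\w.]+` part is what lets "Desarrollador .NET" match; it would also swallow a sentence-ending period directly after a technology name, so the pattern would need tightening against the real documents.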

Part B) After extracting the text, store the parts that match certain criteria in a relational database (MySQL). For the example above, the idea would be to store a row in a table with three fields:


| [url removed, login to view] | Gerente de Proyecto | Ingeniero de Sistemas

Un (1) año en Gerencia de proyectos informáticos | 1
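A minimal JDBC sketch of the storage step is shown below. The table name `bid_roles`, the column names, and the mapping of the example row onto three fields (source URL, role, requirements) are all assumptions, since the posting does not name them; the trailing "1" in the example row is not modelled.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical DAO: writes one extracted role into a three-column table.
final class BidRoleStore {
    // Assumed schema: bid_roles(source_url, role, requirements).
    static final String INSERT_SQL =
            "INSERT INTO bid_roles (source_url, role, requirements) VALUES (?, ?, ?)";

    // Inserts one row and returns the number of rows affected.
    static int save(Connection conn, String sourceUrl, String role, String requirements)
            throws SQLException {
        // A prepared statement keeps accents and quotes in the Spanish text safely escaped.
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, sourceUrl);
            ps.setString(2, role);
            ps.setString(3, requirements);
            return ps.executeUpdate();
        }
    }
}
```

The `Connection` would typically come from `DriverManager.getConnection` with a MySQL JDBC URL, which requires the MySQL Connector/J dependency in the Maven project.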


1. Automatic replies that do not ask for specific information will be automatically discarded.

2. The deliverable MUST be configured as a working Java Maven project and does NOT have to be a web application.

3. Only one payment will be made, once the deliverables work and are fully tested.

4. Project will be awarded to the first programmer to submit a working prototype of part A.
