Parsing semi-structured Word documents into MySQL

This project was awarded to setsailgo for $100 USD.

Project Budget
$30 - $250 USD
Project Description

This should be an easy one.

I have a series of MS Word documents that are somewhat structured and I need a parser written to capture each section of the document and insert it into a MySQL table. The documents contain information on different cities (schools, entertainment, how to get around, etc.) so each document details a different city/area.

The structure of the documents is as follows:


Getting Around (MS Word H1 style)

text for "Getting Around"... (MS Word Normal style)


By Car (MS Word H1 style)

text for "By Car"... (MS Word Normal style)



Freeways (MS Word H1 style)

text for "Freeways"... (MS Word Normal style)

Drive Time & Distance (MS Word H1 style)

text for "Drive Time & Distance"... (MS Word Normal style)

Drivers License (MS Word H1 style)

text for "Drivers License"... (MS Word Normal style)
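The capture step implied by this layout can be sketched in Java as follows. This is only an illustration under stated assumptions: the paragraphs are assumed to have already been read out of the Word file (for example with a library such as Apache POI) into style-name and text pairs, heading paragraphs are assumed to carry a style name starting with "Heading", and the class and method names (`SectionSplitter`, `split`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SectionSplitter {
    /** One paragraph as extracted from the document: its style name and its text. */
    static final class Paragraph {
        final String style;
        final String text;
        Paragraph(String style, String text) { this.style = style; this.text = text; }
    }

    /** A heading plus all the body text that follows it, up to the next heading. */
    static final class Section {
        final String title;
        final StringBuilder body = new StringBuilder();
        Section(String title) { this.title = title; }
    }

    /** Groups a flat paragraph list into sections, one per heading paragraph. */
    static List<Section> split(List<Paragraph> paragraphs) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (Paragraph p : paragraphs) {
            String text = p.text.trim();
            if (text.isEmpty()) {
                continue; // skip blank spacer paragraphs
            }
            if (p.style.startsWith("Heading")) { // assumed style naming
                current = new Section(text);
                sections.add(current);
            } else if (current != null) {
                if (current.body.length() > 0) {
                    current.body.append('\n');
                }
                current.body.append(text);
            }
            // Body text that appears before the first heading is ignored in this sketch.
        }
        return sections;
    }
}
```

Keeping the splitting logic separate from the file reading means it can be checked against hand-built paragraph lists before any real documents are involved.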

...and so on, so we have a structure like:



-----Elementary Schools

--------Here are some of the schools in the area...

-----High Schools

--------Here's a list of high schools

--Getting Around

-----By Car

--------Driver's License

-----------Driver's License info...

--------Public Transportation

-----------Public Transportation info...
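One way to recover this nesting is to walk the headings in document order while keeping a stack of the currently open ancestors. The sketch below assumes each heading comes with a numeric outline level (1 for top level, 2 for the next, and so on); how that level is obtained from the Word styles is not covered here, and the names `HeadingTree` and `parentIndexes` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class HeadingTree {
    /** One heading as read from the document: outline level (1 = top) and title. */
    static final class Heading {
        final int level;
        final String title;
        Heading(int level, String title) { this.level = level; this.title = title; }
    }

    /**
     * Returns, for each heading, the index of its parent heading in the input
     * list, or -1 when the heading is at root level.
     */
    static int[] parentIndexes(List<Heading> headings) {
        int[] parents = new int[headings.size()];
        Deque<Integer> open = new ArrayDeque<>(); // indexes of open ancestors, innermost on top
        for (int i = 0; i < headings.size(); i++) {
            int level = headings.get(i).level;
            // Close every open heading at the same or a deeper level.
            while (!open.isEmpty() && headings.get(open.peek()).level >= level) {
                open.pop();
            }
            parents[i] = open.isEmpty() ? -1 : open.peek();
            open.push(i);
        }
        return parents;
    }
}
```

For the "Getting Around" branch above (levels 1, 2, 3, 3), this yields "By Car" as the parent of both "Driver's License" and "Public Transportation".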


The MySQL schema I have set up is a simple, single table that captures the hierarchy of these parsed sections:

Field         Type           Null  Key  Default  Extra
id            int(11)        NO    PRI  NULL     auto_increment
area          int(11)        NO
type          int(11)        NO
parent        int(11)        NO         0
content_html  varchar(1000)  NO
status        int(11)        NO
create_date   datetime       NO
edit_date     datetime       NO
created_by    int(11)        NO
edited_by     int(11)        NO

So, basically, a document would be parsed on its title (with a lookup to an "area" table to grab the id, and another lookup for "type" for FK references in the table above), and then each section parsed ([[title]], etc.) would be a new row in this table. Nested sections would have the id of their parent section in the "parent" column, and root-level sections would have a "parent" value of 0.
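The parent-id bookkeeping described here might look roughly like the Java sketch below. Everything not named in the schema is an assumption: the table name `sections`, the status value 1, the `userId` parameter, and the `SectionLoader`, `RowSink`, and `jdbcSink` names are all hypothetical, and the area and type ids are taken as already looked up. Sections are inserted in document order, so a parent's auto-increment id is always known before its children are written.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

class SectionLoader {
    /** A parsed section; parentIndex points into the same list, or is -1 for a root section. */
    static final class Section {
        final int parentIndex;
        final String contentHtml;
        Section(int parentIndex, String contentHtml) {
            this.parentIndex = parentIndex;
            this.contentHtml = contentHtml;
        }
    }

    /** Stores one row and returns the id the database assigned to it. */
    interface RowSink {
        long insert(int area, int type, long parent, String contentHtml) throws SQLException;
    }

    /** Inserts sections in document order and returns the generated id of each one. */
    static long[] load(List<Section> sections, int area, int type, RowSink sink)
            throws SQLException {
        long[] ids = new long[sections.size()];
        for (int i = 0; i < sections.size(); i++) {
            Section s = sections.get(i);
            long parent = s.parentIndex < 0 ? 0L : ids[s.parentIndex]; // 0 marks a root-level section
            ids[i] = sink.insert(area, type, parent, s.contentHtml);
        }
        return ids;
    }

    /** JDBC-backed sink; "sections" is a hypothetical table name. */
    static RowSink jdbcSink(Connection conn, int userId) {
        String sql = "INSERT INTO sections (area, type, parent, content_html, status, "
                + "create_date, edit_date, created_by, edited_by) "
                + "VALUES (?, ?, ?, ?, ?, NOW(), NOW(), ?, ?)";
        return (area, type, parent, html) -> {
            try (PreparedStatement ps =
                         conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
                ps.setInt(1, area);
                ps.setInt(2, type);
                ps.setLong(3, parent);
                ps.setString(4, html);
                ps.setInt(5, 1); // assumed status code
                ps.setInt(6, userId);
                ps.setInt(7, userId);
                ps.executeUpdate();
                try (ResultSet keys = ps.getGeneratedKeys()) {
                    if (!keys.next()) {
                        throw new SQLException("no generated key returned");
                    }
                    return keys.getLong(1);
                }
            }
        };
    }
}
```

Hiding the INSERT behind the small `RowSink` interface is a design choice for this sketch: it lets the parent-id mapping be exercised with an in-memory stand-in before pointing it at a real MySQL connection.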

I can provide sample documents and a full schema (including one manually parsed document) upon an accepted bid.

Sounds easy enough, right? You pick the language as long as it's Perl, PHP, Java, or VB (that's what the maintenance programmer is familiar with).
