This should be an easy one.
I have a series of MS Word documents that are somewhat structured and I need a parser written to capture each section of the document and insert it into a MySQL table. The documents contain information on different cities (schools, entertainment, how to get around, etc.) so each document details a different city/area.
The structure of the documents is as follows:
Getting Around (MS Word H1 style)
text for "Getting Around"... (MS Word Normal style)
By Car (MS Word H1 style)
text for "By Car"... (MS Word Normal style)
Freeways (MS Word H1 style)
text for "Freeways"... (MS Word Normal style)
Drive Time & Distance (MS Word H1 style)
text for "Drive Time & Distance"... (MS Word Normal style)
Drivers License (MS Word H1 style)
text for "Drivers License"... (MS Word Normal style)
...and so on, so we have a structure like:
--------Here are some of the schools in the area...
--------Here's a list of high schools
-----------Driver's License info...
-----------Public Transportation info...
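As a rough illustration of the first step, here is a minimal Java sketch that groups a document's paragraphs into sections by style. It assumes the paragraphs have already been pulled out of the Word file by some library as (style name, text) pairs, and that heading styles are named "Heading 1", "Heading 2", and so on. The `Paragraph`, `Section`, and `SectionGrouper` names are hypothetical and only serve the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical input type: one paragraph already extracted from the Word file.
final class Paragraph {
    final String style; // e.g. "Heading 1" or "Normal" (assumed style names)
    final String text;
    Paragraph(String style, String text) { this.style = style; this.text = text; }
}

// Hypothetical output type: a heading, its level, and the body text under it.
final class Section {
    final int level;
    final String title;
    final StringBuilder body = new StringBuilder();
    Section(int level, String title) { this.level = level; this.title = title; }
}

final class SectionGrouper {
    // Returns N for a "Heading N" style, or 0 for any other style.
    static int headingLevel(String style) {
        if (style != null && style.startsWith("Heading ")) {
            try {
                return Integer.parseInt(style.substring("Heading ".length()).trim());
            } catch (NumberFormatException e) {
                return 0;
            }
        }
        return 0;
    }

    // Walks the paragraphs in order: each heading opens a new section and
    // every non-heading paragraph is appended to the most recent section.
    static List<Section> group(List<Paragraph> paragraphs) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (Paragraph p : paragraphs) {
            int level = headingLevel(p.style);
            if (level > 0) {
                current = new Section(level, p.text);
                sections.add(current);
            } else if (current != null) {
                if (current.body.length() > 0) current.body.append('\n');
                current.body.append(p.text);
            }
            // Text that appears before the first heading is skipped in this sketch.
        }
        return sections;
    }
}
```

The output is still a flat list. Each section keeps its heading level so that the nesting can be worked out in a later pass.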
The MySQL schema I have set up is a simple, single table that captures the hierarchy of these parsed sections:
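+--------------+---------------+------+-----+---------+----------------+
| Field        | Type          | Null | Key | Default | Extra          |
+--------------+---------------+------+-----+---------+----------------+
| id           | int(11)       | NO   | PRI | NULL    | auto_increment |
| area         | int(11)       | NO   |     |         |                |
| type         | int(11)       | NO   |     |         |                |
| parent       | int(11)       | NO   |     | 0       |                |
| content_html | varchar(1000) | NO   |     |         |                |
| status       | int(11)       | NO   |     |         |                |
| create_date  | datetime      | NO   |     |         |                |
| edit_date    | datetime      | NO   |     |         |                |
| created_by   | int(11)       | NO   |     |         |                |
| edited_by    | int(11)       | NO   |     |         |                |
+--------------+---------------+------+-----+---------+----------------+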
So, basically, a document would be parsed on its title (with a lookup to an "area" table to grab the id, and another lookup for "type" for FK references in the table above), and then each parsed section ([[title]], etc.) would be a new row in this table. Nested sections would have the id of their parent section in the "parent" column, and root-level sections would have a "parent" value of 0.
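The parent/child rule above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the heading levels in the usage note are made up, the `SectionRow` and `ParentAssigner` names are hypothetical, and a simple counter stands in for the auto_increment id. In a real import, the id would instead be read back from MySQL after each INSERT, for example through JDBC's `Statement.getGeneratedKeys()`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical row shape: only the columns that matter for the hierarchy.
final class SectionRow {
    final int id;
    final int parent;  // 0 for root-level sections
    final int level;   // heading level, used only while building the tree
    final String title;
    SectionRow(int id, int parent, int level, String title) {
        this.id = id;
        this.parent = parent;
        this.level = level;
        this.title = title;
    }
}

final class ParentAssigner {
    // levels[i] is the heading level of titles[i], in document order.
    static List<SectionRow> assign(int[] levels, String[] titles) {
        List<SectionRow> rows = new ArrayList<>();
        Deque<SectionRow> open = new ArrayDeque<>(); // ancestors of the current position
        int nextId = 1; // stands in for MySQL's auto_increment value
        for (int i = 0; i < levels.length; i++) {
            // Drop every open section at the same or a deeper level;
            // whatever remains on top is the nearest ancestor.
            while (!open.isEmpty() && open.peek().level >= levels[i]) {
                open.pop();
            }
            int parent = open.isEmpty() ? 0 : open.peek().id;
            SectionRow row = new SectionRow(nextId++, parent, levels[i], titles[i]);
            rows.add(row);
            open.push(row);
        }
        return rows;
    }
}
```

For example, with assumed levels 1, 2, 3, 3, 2, 1 for six consecutive headings, the sections receive ids 1 through 6 and parent values 0, 1, 2, 2, 1, 0.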
I can provide sample documents and a full schema (including one manually parsed document) upon an accepted bid.
Sounds easy enough, right? You pick the language as long as it's Perl, PHP, Java, or VB (that's what the maintenance programmer is familiar with).