Parsing Pages of an old HTML site into a Database
Project ID:
743168
Project Type:
Fixed
Budget:
$750-$1500 USD
Project Description:
www.atecorp.com is a site we are moving from static HTML to a database-driven site. What I need is all the essentials parsed out logically into a spreadsheet: the standard stuff like title tags, meta tags, image name, alt text and description.
The site is 15 years old and has had several different developers work on it, so not every page has the same layout. 95% of the pages are product pages, and those are what I need parsed.
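To illustrate the kind of extraction described above, here is a minimal sketch using only the Python standard library. The class name, function names and spreadsheet column names are hypothetical, not taken from sample_parse.xlsx, and it assumes the first image on a page is the product image, which may not hold for every layout on the site.

```python
import csv
import io
from html.parser import HTMLParser


class PageFieldParser(HTMLParser):
    """Collects the title, named meta tags, and image name/alt text of one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}      # lowercased meta name -> content
        self.images = []    # (file name, alt text) pairs in page order
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            file_name = attrs["src"].rsplit("/", 1)[-1]
            self.images.append((file_name, attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html):
    """Return one spreadsheet row (as a dict) for a single page."""
    parser = PageFieldParser()
    parser.feed(html)
    parser.close()
    # Assumption: the first <img> is the product image.
    image_name, alt_text = parser.images[0] if parser.images else ("", "")
    return {
        "title": parser.title.strip(),
        "meta_description": parser.meta.get("description", ""),
        "meta_keywords": parser.meta.get("keywords", ""),
        "image_name": image_name,
        "alt_text": alt_text,
    }


def rows_to_csv(rows):
    """Serialize parsed rows as CSV text that a spreadsheet can open."""
    columns = ["title", "meta_description", "meta_keywords", "image_name", "alt_text"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the pages do not all share one layout, a real run would probably need per-layout rules for the description field; this sketch only covers the fields that live in standard tags.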
Additional Project Description:
07/21/2010 at 12:08 EDT:
Please see the Project Clarification Board for updated information and details.
Skills required:
HTML,
Perl,
Python,
Web Scraping,
XML
Additional Files:
sample_parse.xlsx
Public Clarification Board
5 messages
-
Regarding the long description:
To confirm, the long description begins with the first paragraph of the product description and ends after the table of product specifications.
Should all HTML tags be preserved? They are not for the last item in the sample.
Should line breaks and white space be preserved?
Thanks
over 1 year ago
-
Hello Mr. Brendezzi,
I think I don't need an FTP account to get the data that you need, like in
sample_parse.xlsx, as long as it's public.
over 1 year ago
-
I think FTP access is not necessary as long as all the files are public.
over 1 year ago
-
Thanks everyone for putting in your bids and for all initial efforts you've put forward. Here are some additional details:
Total Product Pages: 3,433
The site does have .asp file extensions but the site is not dynamic.
This project will not require any assistance with the database integration.
I will most likely be unable to get access granted to you for FTP. If this is essential please help me out and message me a reason so I can submit a request to IT.
An issue we are trying to determine internally: synchronization of products to web content. The product database is different from the static site to be scraped. Most likely this will have to be done manually, with everything done alphabetically based off a product ID tag that was embedded in the pages when we attempted to convert the site to a dynamic (.asp) one. We will review internally and determine how to approach this, and I will update. It doesn't sound like it will affect the bid, but if it does then you can go ahead and update the bid.
Please allow a few days for the update and feel free to post any questions that may have arisen.
We will determine our contractor by next Friday and begin the following Monday (August 2nd). FYI, we are on Pacific Standard Time.
over 1 year ago
-
I newly registered on this site, so my portfolio here is thin.
I have been working on web robots for 10 years,
and I have worked on download.com, imdb.com, etc.
I want to work on this project,
and I can also parse this site with extra features.
For example, you may want the content as
blob data, but I can parse the product features and specifications line by line, and I can send it as an XML or MS Access database file.
over 1 year ago
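
As a rough illustration of the line-by-line specification parsing mentioned in the last message, here is a minimal Python sketch. It assumes the specifications sit in a simple two-column table (name, value) with no nested tables; the class name, function name and XML element names are hypothetical and not taken from the actual site or sample file.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collects table cell text grouped by row, tolerating missing </td> and </tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # closes a previous row left open
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()         # closes a previous cell left open
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def specs_to_xml(html, product_id):
    """Return an XML string with one <spec> element per two-column table row."""
    parser = SpecTableParser()
    parser.feed(html)
    parser.close()
    product = ET.Element("product", {"id": product_id})
    for row in parser.rows:
        if len(row) >= 2:            # skip header or spacer rows with one cell
            spec = ET.SubElement(product, "spec", {"name": row[0]})
            spec.text = row[1]
    return ET.tostring(product, encoding="unicode")
```

Keeping each specification as its own element, rather than storing the table as one blob, makes it easier to map values onto database columns later.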