Parsing Pages of an old HTML site into a Database

  • Bids: 61
  • Avg Bid: N/A
  • Status: CLOSED
  • Project ID: 743168
  • Project Type: Fixed
  • Budget: $750-$1500 USD

Project Description:

www.atecorp.com is a site we are moving from static HTML to a database-driven site. What I need is all the essentials parsed out logically into a spreadsheet: the standard items such as title tags, meta tags, image names, alt text, and descriptions.

The site is 15 years old and several different developers have worked on it, so not every page has the same layout. 95% of the pages are product pages, and those are what I need parsed.
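As a rough illustration of the extraction described above, here is a minimal Python sketch that pulls the title, meta tags, image file names, and alt text out of one page and returns a spreadsheet-style row. The class name `PageFieldParser`, the function `page_to_row`, the column names, and the sample URL are all assumptions made for this example; they do not come from the project or from sample_parse.xlsx. It uses only the standard library and does not attempt the product description, whose boundaries differ between page layouts.

```python
import csv
import posixpath
import sys
from html.parser import HTMLParser


class PageFieldParser(HTMLParser):
    """Collects the title, named meta tags, and image name/alt pairs of one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}      # meta name (lowercased) -> content
        self.images = []    # list of (file name, alt text)
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "img" and attrs.get("src"):
            name = posixpath.basename(attrs["src"])
            self.images.append((name, attrs.get("alt", "")))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_to_row(url, html):
    """Return one spreadsheet row (as a dict) for a single page."""
    parser = PageFieldParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": " ".join(parser.title.split()),
        "meta_description": parser.meta.get("description", ""),
        "meta_keywords": parser.meta.get("keywords", ""),
        "image_names": "; ".join(name for name, _ in parser.images),
        "alt_texts": "; ".join(alt for _, alt in parser.images),
    }


if __name__ == "__main__":
    # Hypothetical page content, used only to demonstrate the output shape.
    sample = (
        "<html><head><TITLE> Example  Meter </TITLE>"
        '<meta name="Description" content="A sample product page.">'
        '</head><body><img src="/images/meter_front.jpg" alt="Front view">'
        "</body></html>"
    )
    row = page_to_row("http://example.invalid/meter.asp", sample)
    writer = csv.DictWriter(sys.stdout, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
```

Writing CSV keeps the example dependency-free; the rows could be saved as a spreadsheet in whatever format is agreed with the project owner.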

Additional Project Description:

07/21/2010 at 12:08 EDT:
Please see the Project Clarification Board for updated information and details.



Skills required:

HTML, Perl, Python, Web Scraping, XML

Additional Files:

sample_parse.xlsx

Project posted by:

brendezzi United States
(3 Reviews)

Public Clarification Board

5 messages

  • FranciscoAZ

    Regarding the long description:
    To confirm, the long description begins with the first paragraph of the product description and ends after the table of product specifications.
    Should all HTML tags be preserved? They are not for the last item in the sample.
    Should line breaks and white space be preserved?

    Thanks

    over 1 year ago

  • qdoth

    Hello Mr. Brendezzi,

    I don't think I need an FTP account to get the data you need, like that in sample_parse.xlsx, as long as it's public.

    over 1 year ago

  • cybertone

    I think FTP access is not necessary since all the files are public.

    over 1 year ago

  • brendezzi

    Thanks everyone for putting in your bids and for all initial efforts you've put forward. Here are some additional details:

    Total Product Pages: 3,433
    The site does have .asp file extensions but the site is not dynamic.
    This project will not require any assistance with the database integration.
    I will most likely be unable to get access granted to you for FTP. If this is essential please help me out and message me a reason so I can submit a request to IT.

    An issue we are trying to resolve internally: synchronization of products to web content. The product database is different from the static site to be scraped. Most likely this will have to be done manually, with everything done alphabetically based on a product ID tag that was embedded in the pages when we attempted to convert the site to a dynamic (.asp) one. We will review internally, determine how to approach this, and I will post an update. It doesn't sound like it will affect the bid, but if it does, you can go ahead and update your bid.

    Please allow a few days for the update and feel free to post any questions that may have arisen.

    We will determine our contractor by next Friday and begin the following Monday (August 2nd). FYI we are on Pacific Standard Time.

    over 1 year ago
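The synchronization issue raised above can be pictured with a small Python sketch: pages whose embedded product ID is found in the product database are matched automatically, and the rest are listed alphabetically for manual review. Everything here is hypothetical, including the function name `reconcile` and the `product_id` and `title` keys; the thread does not say how the ID tag is embedded or how the product database is structured.

```python
def reconcile(scraped_pages, db_products):
    """Split scraped pages into matched and unmatched lists by product ID.

    Both arguments are lists of dicts. The 'product_id' and 'title' keys are
    assumptions made for this sketch, not names taken from the project.
    """
    db_ids = {product["product_id"] for product in db_products}
    matched, unmatched = [], []
    # Sort alphabetically by title so the manual-review list is easy to scan.
    for page in sorted(scraped_pages, key=lambda p: p["title"].lower()):
        if page.get("product_id") in db_ids:
            matched.append(page)
        else:
            unmatched.append(page)
    return matched, unmatched
```

Pages without an embedded ID simply fall into the unmatched list, which is the part that would still need to be handled by hand.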

  • yelmer

    I registered on this site only recently, so my portfolio here is thin.

    I have been working on web robots for 10 years,
    on sites such as download.com, imdb.com, etc.

    I want to work on this project,

    and I can also parse this site with extra features.

    For example, you may want the content as blob data, but I can parse the product features and specifications line by line and send them as XML or as an MS Access database file.

    over 1 year ago
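To make the line-by-line idea in the message above concrete, here is a minimal Python sketch that reads a specification table and emits one XML element per row. It assumes the specifications sit in a two-column HTML table with properly closed `tr` and `td` tags, which may not hold for every layout on a 15-year-old site. The names `SpecTableParser` and `specs_to_xml`, the XML element names, and the sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collects the cell text of every table row as a list of strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                # Collapse runs of whitespace inside the cell.
                self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def specs_to_xml(html, product_id):
    """Return an XML string with one <spec> element per two-column table row."""
    parser = SpecTableParser()
    parser.feed(html)
    parser.close()
    product = ET.Element("product", {"id": product_id})
    for row in parser.rows:
        if len(row) < 2:
            continue  # skip rows with fewer than two cells, e.g. spacer rows
        spec = ET.SubElement(product, "spec", {"name": row[0]})
        spec.text = row[1]
    return ET.tostring(product, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical table content, used only to show the output shape.
    table = (
        "<table><tr><td>Frequency Range</td><td>9 kHz to 3 GHz</td></tr>"
        "<tr><td>Weight</td><td>12 lb</td></tr></table>"
    )
    print(specs_to_xml(table, "demo-1"))
```

Pages that omit closing tags or nest tables would need a more tolerant parser, so this should be read as a starting point only.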

