HTML to XML converter

Closed Posted Dec 12, 2007 Paid on delivery

We currently convert HTML to XML using MSHTMLParser, which comes with IE. When we submit an HTML document to the parser, we receive back an IXMLDocument model. Our problem with this approach is that MSHTMLParser is very slow and very CPU-intensive. We use it to parse a large number of URLs, and it does not work well in a high-load server environment. We may need to process 20-30 or more HTML pages concurrently, and our server chokes.

We need a new HTML-to-XML parser that is very fast (we expect it to process a 200 KB HTML page in less than 1 second on average). The output should be the same as the IXMLDocument we get today. The new parser should work with ANY HTML page, whether it is well-formed or malformed. If the document is malformed, the parser should handle broken tags properly and must not fail. We need the same behavior as MSHTMLParser, which is what IE uses. We are open to using existing components, and in fact encourage it, provided the component's license allows its use in commercial applications. We recommend taking one of the available components as the base; it is probably not reasonable to create the parser from scratch within the scope of this project. The new parser should be written in VC++ or C#. No Java or other languages. If written in VC++, it must be a COM component or a .NET assembly.
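To make the speed target concrete, a bidder might check it with something like the sketch below. It is only an illustration under stated assumptions: the `Converter` signature (HTML text in, XML text out), the helper names, and the synthetic test page are all hypothetical and are not part of this request. It needs a C++11 or later compiler.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical converter signature (an assumption): HTML text in, XML text out.
using Converter = std::function<std::string(const std::string&)>;

// Returns the wall-clock time, in milliseconds, of one conversion call.
double time_conversion_ms(const Converter& convert, const std::string& html) {
    const auto start = std::chrono::steady_clock::now();
    const std::string xml = convert(html);
    const auto stop = std::chrono::steady_clock::now();
    (void)xml;  // the output is not inspected here, only the elapsed time
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Builds a synthetic page of at least `bytes` bytes from a deliberately
// unclosed fragment, so the input is malformed as well as large.
std::string make_sample_html(std::size_t bytes) {
    std::string html = "<html><body>";
    while (html.size() < bytes) {
        html += "<p>unclosed paragraph<br>";
    }
    return html;
}

// True when a single conversion of a page of `page_bytes` bytes finishes
// in under `budget_ms` milliseconds.
bool meets_budget(const Converter& convert, std::size_t page_bytes, double budget_ms) {
    return time_conversion_ms(convert, make_sample_html(page_bytes)) < budget_ms;
}
```

Calling `meets_budget(my_converter, 200 * 1024, 1000.0)` would then test the 200 KB in under 1 second figure for a single page. A fair measurement would average over many real pages and run 20-30 conversions concurrently, which this sketch does not attempt.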

We will test the component as follows:

We will run over 10000 different URLs through MSHTMLParser and save each IXMLDocument to an XML file. Then we will run your parser on the same URLs and save its output to files. All files have to match in order for us to accept the component.
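The acceptance check described above could be approximated before delivery with a sketch like the following. It assumes, hypothetically, that the reference files saved from MSHTMLParser and the candidate files from the new parser sit in two directories under identical file names, and that "match" means byte-for-byte equality; neither detail is specified in this request. It requires C++17 for `std::filesystem`.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Returns true when both paths are regular files with identical bytes.
bool files_match(const fs::path& a, const fs::path& b) {
    std::error_code ec;
    if (!fs::is_regular_file(a, ec) || !fs::is_regular_file(b, ec)) return false;
    if (fs::file_size(a, ec) != fs::file_size(b, ec)) return false;
    std::ifstream fa(a, std::ios::binary);
    std::ifstream fb(b, std::ios::binary);
    if (!fa || !fb) return false;
    return std::equal(std::istreambuf_iterator<char>(fa), std::istreambuf_iterator<char>(),
                      std::istreambuf_iterator<char>(fb), std::istreambuf_iterator<char>());
}

// Returns the sorted names of reference files that are missing from, or
// differ from, the file of the same name in candidate_dir.
std::vector<std::string> find_mismatches(const fs::path& reference_dir,
                                         const fs::path& candidate_dir) {
    std::vector<std::string> mismatches;
    for (const auto& entry : fs::directory_iterator(reference_dir)) {
        if (!entry.is_regular_file()) continue;
        const fs::path name = entry.path().filename();
        if (!files_match(entry.path(), candidate_dir / name)) {
            mismatches.push_back(name.string());
        }
    }
    std::sort(mismatches.begin(), mismatches.end());
    return mismatches;
}
```

An empty result from `find_mismatches(reference_dir, candidate_dir)` would mean every reference file has an identical counterpart. If the buyer's comparison turns out to tolerate differences such as whitespace or attribute order, a byte-level check like this would be stricter than necessary.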

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):

a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.

b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

* * *

This broadcast message was sent to all bidders on Thursday Dec 13, 2007 1:36:01 PM:

Hello, We added a [url removed, login to view] file with a few conversion examples. Also, there is a [url removed, login to view] file that provides additional details on what exactly needs to be converted.

## Platform

Windows 2003 Server

Apple Safari, C Programming, C# Programming, Engineering, Google Chrome, Microsoft, MySQL, PHP, Software Architecture, Software Testing, Windows Desktop, XML, XSLT

Project ID: #3554928

About the project

3 proposals Remote project Active Jan 3, 2008

3 freelancers are bidding on average $496 for this job

stilgarvw

See private message.

$552.5 USD in 14 days
(74 Reviews)
6.3

alex13vw

See private message.

$510 USD in 14 days
(47 Reviews)
5.2

zeroid

See private message.

$425 USD in 14 days
(27 Reviews)
4.9