Make a Program to Convert HTML Documents to LaTeX

Status: IN PROGRESS
Bids: 1
Avg Bid (USD): $200
Project Budget (USD): $30 - $250

Project Description:
Hi,

Before wasting your time reading this, YOU SHOULD BE FAMILIAR WITH LaTeX ALREADY.

Now, I'm looking for a program that can take as input a folder filled with HTML files and output a PDF of each HTML file, converted using LaTeX. All of the HTML files are in the standardized format used on the SEC Edgar website, an example of which is here:

http://www.sec.gov/Archives/edgar/data/899881/000103570408000095/0001035704-08-000095.txt

(just rename as .htm and open in browser; also, I would want the "-----BEGIN PRIVACY-ENHANCED MESSAGE-" part removed).
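As a rough illustration of the header removal, here is a minimal Python sketch. It assumes (this is not confirmed by the posting) that the filing body starts at the first <DOCUMENT> tag, or at the first <html> tag if there is no <DOCUMENT> tag, and that everything before it can be dropped. The name strip_header is hypothetical.

```python
import re

# Assumption: the real content starts at the first <DOCUMENT> tag, or at the
# first <html> tag when no <DOCUMENT> tag exists. Everything before that point
# (including the "-----BEGIN PRIVACY-ENHANCED MESSAGE-" block) is discarded.
BODY_START = re.compile(r"<DOCUMENT>|<html\b", re.IGNORECASE)


def strip_header(raw: str) -> str:
    """Return the filing text without the leading header block."""
    match = BODY_START.search(raw)
    if match is None:
        # Unknown layout: leave the file untouched rather than guess.
        return raw
    return raw[match.start():]
```

If real filings use a different marker, only BODY_START should need to change.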

Here is the idea: I read these documents all day, and I don't think they look very nice. However, I love the output produced by LaTeX, examples of which are here:

http://pangea.stanford.edu/computerinfo/unix/formatting/latexexample.html
http://htmltolatex.sourceforge.net/samples/sample2.pdf


Now, you might think this problem is easy, since there are several programs out there to convert html to latex:

http://www.iwriteiam.nl/html2tex.html
http://htmltolatex.sourceforge.net/#samples
http://html2latex.sourceforge.net/doc/html2latex-man.html


The problem is, the SEC Edgar documents tend to have large tables, as well as other complications, and none of the programs I have tried quite get this right. I also have the ability to download these SEC documents in MS Word format. There is a commercial program to do this conversion from .doc:

http://www.grindeq.com/index.php?p=word2latex


I have tried all of these things so far, and nothing really works. Here are examples of my failed efforts (you will need to compile these in LaTeX yourself):

http://www.sendspace.com/file/j8k66t
http://www.sendspace.com/file/0el6ru


The main problem is getting the tables to look right.

The other big issue is that the HTML file has built-in page breaks and page numbering. I would want this stripped out and the document re-paginated in some optimal way to make it look better when it is converted with LaTeX. However, on the line in the converted document where a new page would start in the original HTML file, I want a small page number reference off to the side of the converted PDF. This is because I often need to refer to a certain page of the filing when talking to other people who are looking at the original HTML version, and it is vital for me to direct them to the page I mean. This page number feature should be able to be toggled on and off with a checkbox control on program startup.
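One possible way to handle the page references is sketched below in Python. It assumes the original page breaks appear as a <PAGE> marker in the source (the real files may use something else, so the pattern is kept in one place), and it uses the standard LaTeX \marginpar command for the note in the margin. The function name mark_page_breaks and the show_refs flag (standing in for the checkbox) are hypothetical.

```python
import re

# Assumption: an original page break is marked by a <PAGE> tag. Change this
# pattern to match whatever the real filings use.
PAGE_BREAK = re.compile(r"\s*<PAGE>\s*", re.IGNORECASE)


def mark_page_breaks(text: str, show_refs: bool = True) -> str:
    """Remove original page breaks; optionally leave a margin note at each one.

    The text before the first break is treated as original page 1, so the
    first note reads "p. 2".
    """
    parts = PAGE_BREAK.split(text)
    out = [parts[0]]
    for page_no, part in enumerate(parts[1:], start=2):
        if show_refs:
            # \marginpar puts a small note in the page margin.
            out.append("\\marginpar{\\scriptsize p.\\,%d}" % page_no)
        out.append(part)
    return "\n".join(out)
```

Note that \marginpar does not work inside floats, so a break that falls in the middle of a table would need separate handling.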

If you really know LaTeX well (which is required for this project!), this should be very easy to do, for these reasons:

1) There are several open source programs to do most of the work. They just need some tweaking to work with the Edgar files.

2) The Edgar files are created in a highly standardized way, which you can easily see if you try searching for a few different stock "tickers" on this site (try searching for "MSFT", "PLD", or "GGP"): http://www.sec.gov/edgar/searchedgar/companysearch.html

Your program should require no user intervention to function correctly. Ideally, I will simply fill up a folder with files similar to the 0001035704-08-000095.txt referenced above, and your program will simply show a progress bar and then confirm completion or report errors (I would also like it to support "drag and drop" functionality in Windows). Everything else (compiling the output into PDF, for example) should happen in the background automatically. You can assume that I have the latest version of MiKTeX installed on Windows.
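The unattended batch run could look roughly like the Python sketch below. It assumes pdflatex (shipped with MiKTeX) is on the PATH, and it takes the per-file conversion step as a parameter because that step is the actual project. The names pdflatex_command, compile_pdf and convert_folder are hypothetical, and the printed counter only stands in for a real progress bar.

```python
import subprocess
from pathlib import Path


def pdflatex_command(tex_file: Path, out_dir: Path) -> list[str]:
    """Build a pdflatex call that never stops to ask the user anything."""
    return [
        "pdflatex",
        "-interaction=nonstopmode",  # do not prompt on errors
        "-halt-on-error",            # stop at the first error instead
        f"-output-directory={out_dir}",
        str(tex_file),
    ]


def compile_pdf(tex_file: Path, out_dir: Path) -> None:
    """Run pdflatex in the background and raise if it fails."""
    result = subprocess.run(
        pdflatex_command(tex_file, out_dir), capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"pdflatex failed for {tex_file.name}")


def convert_folder(folder: Path, convert_one) -> dict[str, str]:
    """Apply convert_one to every filing in folder; return errors by file name."""
    files = sorted(
        p for p in folder.iterdir() if p.suffix.lower() in {".txt", ".htm", ".html"}
    )
    errors: dict[str, str] = {}
    for index, path in enumerate(files, start=1):
        try:
            convert_one(path)
        except Exception as exc:  # one bad filing should not stop the batch
            errors[path.name] = str(exc)
        print(f"[{index}/{len(files)}] {path.name}")
    return errors
```

An empty result from convert_folder means every file converted; otherwise the dictionary is the error report to show at the end.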

The final feature I would like is the option to select between the standard LaTeX serif font and the standard sans-serif font, and between a small and a large font size, separately for the regular text and for the text/numbers included in tables. That is, something like this should display at program startup:

Body Text
Serif _____ Sans-Serif___X____
Small ____ Large ___X____

Tables
Serif _____ Sans-Serif___X____
Small __X_ Large _______
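The four choices above could be turned into a LaTeX preamble along the lines of this Python sketch. It assumes the article class, maps "Small"/"Large" body text to the 10pt/12pt class options, maps "Small"/"Large" table text to \footnotesize/\normalsize, and uses \AtBeginEnvironment from the etoolbox package to set the font inside every tabular. Those mappings and the name build_preamble are my own assumptions, not part of the requirements.

```python
def build_preamble(body_sans: bool, body_large: bool,
                   table_sans: bool, table_large: bool) -> str:
    """Return LaTeX preamble lines for the chosen body and table fonts."""
    size = "12pt" if body_large else "10pt"  # assumed meaning of Large/Small
    lines = [
        f"\\documentclass[{size}]{{article}}",
        "\\usepackage{etoolbox}",  # provides \AtBeginEnvironment
    ]
    if body_sans:
        # Make the sans-serif family the document default.
        lines.append("\\renewcommand{\\familydefault}{\\sfdefault}")
    table_family = "\\sffamily" if table_sans else "\\rmfamily"
    table_size = "\\normalsize" if table_large else "\\footnotesize"
    # Apply the table font at the start of every tabular environment.
    lines.append(
        f"\\AtBeginEnvironment{{tabular}}{{{table_family}{table_size}}}"
    )
    return "\n".join(lines)
```

The selection shown in the example form (sans-serif large body, sans-serif small tables) would be build_preamble(True, True, True, False).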


I hope it is clear how I want these files to look after conversion. It should all look nice, with nothing going off the page or too close to the edge of the page, with the body text justified, with each table presented in an optimal size to fill up the page well and never broken across 2 pages, and with the table borders and grid-lines thin, light, and attractive to look at. I know that all of this is possible in LaTeX (almost anything is possible in TeX!). If you are confused about what I am talking about, please ask for clarification on my PMB before bidding!
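To make the table requirements more concrete, here is one way they might be approached, sketched in Python. It assumes each table has already been parsed into rows of cell strings. It wraps the tabular in a table float (floats are not split across pages), scales it to the text width with \resizebox from graphicx, and sets thin grey rules via xcolor's table option. The names TABLE_PREAMBLE, escape and table_to_latex are hypothetical, and the column layout (first column left, the rest right) is only a guess at what suits financial tables.

```python
# Preamble lines assumed by the tables below: graphicx for \resizebox,
# xcolor[table] for \arrayrulecolor, and a thinner rule width.
TABLE_PREAMBLE = "\n".join([
    r"\usepackage{graphicx}",
    r"\usepackage[table]{xcolor}",
    r"\arrayrulecolor{black!30}",
    r"\setlength{\arrayrulewidth}{0.2pt}",
])

LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape(cell: str) -> str:
    """Escape characters that have a special meaning in LaTeX."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in cell)


def table_to_latex(rows: list[list[str]]) -> str:
    """Render rows as a float scaled to the text width."""
    ncols = max(len(row) for row in rows)
    spec = "l" + "r" * (ncols - 1)  # labels left, numbers right
    body = []
    for row in rows:
        padded = row + [""] * (ncols - len(row))  # pad short rows
        body.append(" & ".join(escape(c) for c in padded) + r" \\ \hline")
    return "\n".join([
        r"\begin{table}[htbp]",
        r"\centering",
        r"\resizebox{\textwidth}{!}{%",
        r"\begin{tabular}{" + spec + "}",
        r"\hline",
        *body,
        r"\end{tabular}}",
        r"\end{table}",
    ])
```

Scaling with \resizebox also scales the text, so it interacts with the table font size option, and a table taller than one page would still overflow. Both cases would need extra handling in a real converter.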

In order to complete the project, your program will have to pass a test, in which I give you 5 randomly selected filings, and your program has to convert them all correctly to pdf format using any of the options described above.

I don't care what programming language you use, as long as I can keep the source code. Also, feel free to use any open source code inside your program.

Thanks for bidding, and let me know if you have any questions.

Skills required:
C Programming, Perl, Python
About the employer:
Verified

