Make a Program to Convert HTML Documents to LaTeX

  • Status Closed
  • Budget $30 - $250 USD
  • Total Bids 1

Project Description


Before wasting your time reading this, YOU SHOULD BE FAMILIAR WITH LaTeX ALREADY.

Now, I'm looking for a program that can take as input a folder filled with HTML files and output a PDF of each HTML file converted using LaTeX. All of the HTML files are in the standardized format used on the SEC Edgar website, an example of which is here:

[url removed, login to view]

(just rename as .htm and open in browser; also, I would want the "-----BEGIN PRIVACY-ENHANCED MESSAGE-" part removed).
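As a rough illustration of the header removal, here is a minimal Python sketch. It assumes the privacy-enhanced-message header starts on the first line of the file and ends at the first blank line after it; that layout is an assumption and should be checked against real filings. The function name `strip_pem_header` is hypothetical.

```python
PEM_MARKER = "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"


def strip_pem_header(text: str) -> str:
    """Drop the privacy-enhanced-message header, if the file starts with one.

    Assumption: the header runs from the marker line to the first blank line.
    """
    lines = text.splitlines(keepends=True)
    if not lines or not lines[0].startswith(PEM_MARKER):
        return text  # no header at the top of the file, nothing to strip
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "":
            # Keep everything after the blank line that closes the header.
            return "".join(lines[i + 1:])
    return text  # no blank line found: leave the file untouched rather than guess
```

Files that do not begin with the marker are returned unchanged, so the function can be applied to every file in the folder.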

Here is the idea: I read these documents all day, and I don't think they look very nice. However, I love the output produced by LaTeX, an example of which is here:

[url removed, login to view]

[url removed, login to view]

Now, you might think this problem is easy, since there are several programs out there to convert html to latex:

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

The problem is, the SEC Edgar documents tend to have large tables, as well as other complications, and none of the programs I have tried quite gets this right. I also have the ability to download these SEC documents in MS Word format. There is a commercial program to do this conversion from .doc:

[url removed, login to view]

I have tried all of these things so far, and nothing really works. Here are examples of my failed efforts (you will need to compile these in LaTeX yourself):

[url removed, login to view]

[url removed, login to view]

The main problem is getting the tables to look right.
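One part of the table problem is that filings are full of characters such as `$`, `%`, and `&`, which are special in LaTeX. The sketch below, using only Python's standard `html.parser`, pulls the cell text out of table rows and escapes it. The class and function names are hypothetical, and the sketch assumes well-formed `<tr>`/`<td>` markup and ignores `colspan`, nested tables, and alignment, all of which a real converter would have to handle.

```python
from html.parser import HTMLParser

# LaTeX special characters and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape characters that LaTeX would otherwise interpret."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


class TableRows(HTMLParser):
    """Collect the text of each cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_latex_rows(html: str) -> list[str]:
    """Return one 'a & b & c' string per table row, with cells escaped."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [" & ".join(escape_latex(c) for c in row) for row in parser.rows]
```

Each returned string is the body of one `tabular` row; the row terminator and rules are left to whatever code assembles the final table.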

The other big issue is that the HTML file has built-in page breaks and page numbering. I would want this stripped out and the document re-paginated in some optimal way to make it look better when it is converted with LaTeX. However, on the line in the converted document where there would be a new page in the original HTML file, I want a small page number reference off to the side of the converted PDF. This is because I often need to refer to a certain page of the filing when talking to other people who are looking at the original HTML version, and it is vital for me to direct them to the page I mean. This page number feature should be able to be toggled on and off with a checkbox control on program startup.
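The page reference idea could be sketched roughly as follows in Python. The sketch assumes that the original page boundaries show up as a `<PAGE>` tag in the source, which is an assumption to verify against real filings, and it places the reference with LaTeX's standard `\marginpar` command. The function name `mark_page_breaks` and the `show_numbers` flag (standing in for the checkbox) are hypothetical. Removing the printed page numbers themselves is not covered here.

```python
import re

# Assumption: each original page boundary is marked by a <PAGE> tag.
PAGE_TAG = re.compile(r"<PAGE>", re.IGNORECASE)


def mark_page_breaks(text: str, show_numbers: bool = True) -> str:
    """Replace each original page break with a margin note, or remove it.

    The text before the first break is original page 1, so the first
    break introduces page 2, the second page 3, and so on.
    """
    counter = 1

    def replacement(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        if not show_numbers:
            return ""  # checkbox off: strip the break, add no reference
        return r"\marginpar{\footnotesize p.~%d}" % counter

    return PAGE_TAG.sub(replacement, text)
```

In a full converter the margin note would be emitted at the LaTeX generation stage, after the surrounding text has been escaped, so that the backslashes and braces in the note are not escaped along with the filing text.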

If you really know LaTeX well (which is required for this project!), this should be very easy to do, for these reasons:

1) There are several open source programs to do most of the work. They just need some tweaking to work with the Edgar files.

2) The Edgar files are created in a highly standardized way, which you can easily see if you try searching for a few different stock "tickers" on this site (try searching for "MSFT", "PLD", or "GGP"): [url removed, login to view]

Your program should require no user intervention to function correctly. Ideally, I will simply fill up a folder with files similar to the [url removed, login to view] referenced above, and your program will simply show a progress bar and then confirm completion or report errors (I would also like it to support "drag and drop" functionality in Windows). Everything else (compiling the output into PDF, for example) should happen in the background automatically. You can assume that I have the latest version of MiKTeX installed on Windows.
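The background compilation step might look something like this Python sketch. It assumes the converter has already written one `.tex` file per filing into the folder and that `pdflatex` is on the PATH (MiKTeX provides it). The function names are hypothetical, and the progress bar and drag-and-drop handling are left out.

```python
import subprocess
from pathlib import Path


def pdflatex_command(tex_file: Path, out_dir: Path) -> list[str]:
    """Build a pdflatex call that never stops to ask the user anything."""
    return [
        "pdflatex",
        "-interaction=nonstopmode",        # do not pause on errors
        f"-output-directory={out_dir}",    # write the PDF next to the sources
        str(tex_file),
    ]


def compile_folder(folder: Path) -> dict[str, bool]:
    """Compile every .tex file in the folder; map file name to success."""
    results = {}
    for tex_file in sorted(folder.glob("*.tex")):
        proc = subprocess.run(
            pdflatex_command(tex_file, folder),
            capture_output=True,  # keep the TeX log off the screen
        )
        results[tex_file.name] = proc.returncode == 0
    return results
```

The returned dictionary gives the program what it needs to confirm completion or list the files that failed.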

The final feature I would like is the option to select between the standard LaTeX serif font and the standard sans-serif font, for both the regular text and the text/numbers included in tables, and between a small and a large font for each. That is, something like this should display at program startup:

Body Text

Serif _____ Sans-Serif___X____

Small ____ Large ___X____

Table Text

Serif _____ Sans-Serif___X____

Small __X_ Large _______
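To show how these four choices could map onto LaTeX, here is a small Python sketch that builds a preamble from them. The mapping of "small" and "large" to specific sizes (10pt/12pt for the body, `\footnotesize`/`\normalsize` for tables) is an assumption, as is the function name `build_preamble`. The table font is set with `\AtBeginEnvironment` from the `etoolbox` package.

```python
def build_preamble(body_sans: bool, body_large: bool,
                   table_sans: bool, table_large: bool) -> str:
    """Return LaTeX preamble lines reflecting the four startup options."""
    lines = [
        # Assumed size mapping: small = 10pt, large = 12pt.
        r"\documentclass[%s]{article}" % ("12pt" if body_large else "10pt"),
        r"\usepackage{etoolbox}",
    ]
    if body_sans:
        # Make the sans-serif family the document default.
        lines.append(r"\renewcommand{\familydefault}{\sfdefault}")
    table_family = r"\sffamily" if table_sans else r"\rmfamily"
    # Assumed size mapping for tables, relative to the body size.
    table_size = r"\normalsize" if table_large else r"\footnotesize"
    # Apply the table font at the start of every tabular environment.
    lines.append(r"\AtBeginEnvironment{tabular}{%s%s}" % (table_family, table_size))
    return "\n".join(lines) + "\n"
```

For the selections shown in the mock-up above (sans-serif and large body text, sans-serif and small table text), the call would be `build_preamble(True, True, True, False)`.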

I hope it is clear how I want these files to look after conversion. It should all look nice: nothing going off the page or too close to the edge of the page, the body text justified, each table presented in an optimal size to fill up the page well without being broken over 2 pages, and the table borders and grid lines thin, light, and attractive to look at. I know that all of this is possible in LaTeX (almost anything is possible in TeX!). If you are confused about what I am talking about, please ask for clarification on my PMB before bidding!
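As one possible way to get thin, light grid lines and tables that stay within the text width, the Python sketch below emits LaTeX that relies on the `xcolor` package (with its `table` option) and the `adjustbox` package. The rule width, the grey level, and the names `TABLE_PREAMBLE` and `wrap_table` are illustrative choices, not requirements. A `tabular` is not split across pages by LaTeX, but this sketch does nothing about tables taller than a page.

```python
# Preamble lines for light, thin table rules (values are illustrative).
TABLE_PREAMBLE = "\n".join([
    r"\usepackage[table]{xcolor}",          # loads colortbl for \arrayrulecolor
    r"\usepackage{adjustbox}",
    r"\setlength{\arrayrulewidth}{0.2pt}",  # thin rules
    r"\arrayrulecolor{black!30}",           # light grey rules
])


def wrap_table(colspec: str, rows: list[str]) -> str:
    """Wrap pre-escaped rows ('a & b') in a tabular that shrinks to fit.

    adjustbox's 'max width' only scales the table down when it is wider
    than the text block; narrower tables keep their natural size.
    """
    body = "\n".join(row + r" \\ \hline" for row in rows)
    return "\n".join([
        r"\begin{center}",
        r"\begin{adjustbox}{max width=\textwidth}",
        r"\begin{tabular}{%s}" % colspec,
        r"\hline",
        body,
        r"\end{tabular}",
        r"\end{adjustbox}",
        r"\end{center}",
    ])
```

A call such as `wrap_table("|l|r|", ["Revenue & 10", "Cost & 4"])` produces a two-column table with a rule above the first row and below every row.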

In order to complete the project, your program will have to pass a test, in which I give you 5 randomly selected filings, and your program has to convert them all correctly to pdf format using any of the options described above.

I don't care what programming language you use, as long as I can keep the source code. Also, feel free to use any open source code inside your program.

Thanks for bidding, and let me know if you have any questions.
