The objective is to extract a section of TEXT from a document that may or may not contain HTML tags.
The program should be in PYTHON.
- The program should first download a file from a given URL.
- The program should then strip all HTML tags from the file and retain only the text. I prefer removing the tags without REGEX, using sgmllib or another parsing library.
- Then it should use REGEX to extract a section of the document based on some specified rules.
- It should then save the extracted text to a location on the local hard drive.
- It should delete the PARENT file that was downloaded from the URL.
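
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses html.parser from the Python 3 standard library in place of sgmllib (which is not available in Python 3), and the names TextExtractor, strip_tags and fetch_extract_save are hypothetical, as is the choice to skip the contents of script and style tags.

```python
import os
import re
import tempfile
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; skips <script>/<style> bodies (an assumption)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(raw):
    """Return the text of `raw` with HTML tags removed (no regex involved).

    Input without any tags passes through unchanged.
    """
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.parts)


def fetch_extract_save(url, pattern, out_path):
    """Download `url`, strip tags, save the first regex match to `out_path`.

    The downloaded PARENT file is always deleted. Returns True if a
    section was found and saved, False otherwise.
    """
    fd, parent_path = tempfile.mkstemp(suffix=".download")
    os.close(fd)
    try:
        urllib.request.urlretrieve(url, parent_path)
        # Assumption: the file is UTF-8 compatible; bad bytes are replaced.
        with open(parent_path, encoding="utf-8", errors="replace") as fh:
            text = strip_tags(fh.read())
        match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
        if match is None:
            return False
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(match.group(0))
        return True
    finally:
        os.remove(parent_path)  # delete the PARENT file in every case
```

A call might look like fetch_extract_save(url, r"Item 7\..*?(?=Item 8\.)", "item7.txt"), where the pattern is just a placeholder for whatever extraction rule is finally agreed on.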
As an example, I have a document with a list of URLs. Let us suppose we want to extract the text in the section titled "Item 7. Management Discussion and Analysis" based on the rule. Please feel free to experiment with this test case and with the REGEX rules for extraction. Please note that the format of the documents changes between the first and the last URLs in the list, i.e. the first ones are not HTML.
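
One possible extraction rule for this test case is sketched below, purely as a starting point. The assumptions are mine and not part of the requirement: the section is taken to run from a line starting with "Item 7. Management" up to the next line starting with "Item 8.", and when the heading occurs more than once (for example in a table of contents) the longest match is kept. The names ITEM7 and extract_item7 are hypothetical.

```python
import re

# Assumed rule: start at a line beginning with "Item 7. Management",
# stop just before a line beginning with "Item 8." (or at end of text).
ITEM7 = re.compile(
    r"^[ \t]*Item\s+7\.\s*Management.*?(?=^[ \t]*Item\s+8\.|\Z)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def extract_item7(text):
    """Return the longest "Item 7" section found in `text`, or None.

    Taking the longest match is a heuristic to skip short hits such as
    a table-of-contents entry; it is not guaranteed to be right.
    """
    matches = ITEM7.findall(text)
    if not matches:
        return None
    return max(matches, key=len).strip()
```

The function expects text that has already had its HTML tags removed, so the same rule can be applied to both the plain-text and the HTML documents.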