I would like to scrape data from the SEC.gov website. I am interested in scraping the DEF 14A filings. I want to scrape data from at least 5000 reports, preferably more.
I want to extract just two fields from each report: the name of the company and the percentage of the company owned by the board members.
I would like this information to be scraped and sent to me in Excel format.
Scraping this information will be fairly challenging because the HTML pages are unstructured. The target text does not appear at the same predictable location on each page.
The only way to locate the relevant text is to make use of some kind of advanced Boolean proximity search.
The target text is normally preceded by a number of recognizable terms.
The target text is normally followed by a percentage symbol.
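Because the pages are unstructured HTML, one possible first step is to flatten each page to plain text before searching for the phrases. The following is a minimal sketch using only the Python standard library; the names `TextExtractor` and `html_to_text` and the sample markup are hypothetical, and real filings may need extra handling (for example, skipping script or style content).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace (including non-breaking spaces) so that
    # a table row reads as a single line of text.
    return " ".join(" ".join(parser.parts).split())

# Hypothetical markup, not copied from a real filing.
sample = "<tr><td>All directors as a group (12&nbsp;persons)</td><td>2.6%</td></tr>"
print(html_to_text(sample))
```

Working on flattened text means the search does not depend on how each company lays out its table.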
Here is an example of a DEF 14A filing:
http://www.sec.gov/Archives/edgar/data/789019/000119312512418708/d375562ddef14a.htm
The relevant table appears on page 11.
The table lists the names of the board members and the percentage of the company that they personally own. Collectively the board members own 9.46 percent of the company. In this example I would be looking to extract the word "Microsoft" and the number 9.46 and place this in Excel format. In this example I would be looking to extract data from the 2013 Microsoft filing and all previous Microsoft filings on record.
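For the output step, a rough sketch of writing the two fields to a file that Excel can open is shown here. It assumes a CSV file is an acceptable stand-in for a native Excel workbook (producing .xlsx directly would need a third-party library); the function name `write_results` and the column headings are hypothetical.

```python
import csv
import io

def write_results(rows, handle):
    """Write (company, percentage) pairs as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["Company", "Board ownership (%)"])
    writer.writerows(rows)

# An in-memory buffer is used here; a real run would pass a file opened
# with open("results.csv", "w", newline="").
buffer = io.StringIO()
write_results([("Microsoft", 9.46)], buffer)
print(buffer.getvalue())
```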
The big problem is that not all companies use the same language. Here is a list of different phrases that various companies use. In each case the percentage figure at the very end is what I would be aiming to extract:
"All directors and current executive officers as a group (12 persons) 3,991,056 6,348,957 10,340,013 2.6%"
"Executive officers and directors as a group (13 persons)(19) 1,490,847 6.7%"
"All directors and executive officers as a group (18 persons) 661,671 1,440,299 269,802,371,776 4.3%"
"All Company directors and executive officers as a group (19 persons) 433,960 596,312 1,030,272 1.5%"
"All nominees, continuing directors and executive officers as a group (20 persons) 5,944,103 16,824,264 139,82 8,03,926,234 (4) 23%"
"All directors, director nominees and executive officers as a group (12 persons) 13,412,40 17.0%"
"All current executive officers and directors as a group (10 persons) (7)........ 19,059,809 1,275,405 52.1%"
"All directors and executive officers as of November 13, 2012 as a group (13 persons) 17,011,477 624,969 17,636,446 54.8%"
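As a rough illustration, every sample row above ends with the target figure, so a single regular expression anchored at the end of the row can pull it out. This sketch assumes each table row has already been reduced to one line of text; the function name `trailing_percentage` is hypothetical.

```python
import re

# A number (optionally with a decimal part) immediately before a percent
# sign at the very end of the row.
TRAILING_PCT = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*$")

def trailing_percentage(row):
    """Return the percentage at the end of a row as a float, or None."""
    match = TRAILING_PCT.search(row.strip())
    return float(match.group(1)) if match else None

rows = [
    "All directors and current executive officers as a group (12 persons) 3,991,056 6,348,957 10,340,013 2.6%",
    "Executive officers and directors as a group (13 persons)(19) 1,490,847 6.7%",
    "All nominees, continuing directors and executive officers as a group (20 persons) 5,944,103 16,824,264 139,82 8,03,926,234 (4) 23%",
]
for row in rows:
    print(trailing_percentage(row))
```

Anchoring on the end of the row sidesteps the share counts and footnote markers that sit between the phrase and the percentage.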
A program could be built that recognizes the term "as a group (?? persons)". A wildcard search would have to be used because the number of persons varies, but it is always a two-digit figure. The distance between the term and the relevant percentage also varies, but it is normally less than 30 characters.
The relevant percentage figure is normally preceded by the term "as a group (?? persons)" and it is normally followed by a % symbol.
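The proximity search described above could be sketched roughly as follows. This is one possible approach, not a finished scraper: the names `find_group_percentage` and `window` are hypothetical, and the pattern accepts one to three digits for the head count, which is an assumption slightly looser than the two-digit figure mentioned above.

```python
import re

# "as a group (NN persons)" with a wildcard head count.
GROUP_PHRASE = re.compile(r"as a group\s*\(\s*\d{1,3}\s+persons\s*\)", re.IGNORECASE)
# A number (optionally with a decimal part) followed by a percent sign.
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")

def find_group_percentage(text, window=30):
    """Return the first percentage found within `window` characters after
    the group phrase, or None if no such percentage exists."""
    for phrase in GROUP_PHRASE.finditer(text):
        nearby = text[phrase.end():phrase.end() + window]
        pct = PERCENT.search(nearby)
        if pct:
            return float(pct.group(1))
    return None

short_row = "Executive officers and directors as a group (13 persons)(19) 1,490,847 6.7%"
long_row = "All directors and current executive officers as a group (12 persons) 3,991,056 6,348,957 10,340,013 2.6%"

print(find_group_percentage(short_row))
print(find_group_percentage(long_row))
print(find_group_percentage(long_row, window=60))
```

The window is left as a parameter because in some of the sample rows listed earlier the percentage sits more than 30 characters after the phrase, so a wider window may be needed in practice.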
The program would not be 100% accurate, but that does not matter in my case.