I need a web scraper that can go through http://phones.whitepages.com/.
On the right-hand side there is a column called "US Area Codes".
I need the web scraper to go through all categories (200s, 300s, and so on) and dive into all subcategories.
For example, when it goes to the 200s, the first entry that appears is 201 New Jersey.
It would follow that link (http://phones.whitepages.com/201/), where you'll see a list of longer prefixes.
For example: (201) 200- Hudson Jersey City Verizon New Jersey, Inc Landline
I want the web scraper to go through all these pages and write that information to a single CSV file (or, better yet, to MySQL).
The web scraper should have an on/off switch (argument) to go even deeper: following the (201) 200 link, or whatever is there, to get more info from a page similar to this one: http://phones.whitepages.com/201-992. From that page I'm interested in the City/State (Newark, NJ) and Zip code.
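To illustrate the kind of switch I mean, here is a minimal sketch. The flag name --deep and the script name scraper.php are just assumptions for the example; any on/off argument would do.

```php
<?php
// Hypothetical flag name "--deep"; the requirement is only "some on/off switch".
function deepModeEnabled(array $options): bool
{
    // getopt() reports a value-less long option as ['deep' => false],
    // so check for the key rather than the value.
    return array_key_exists('deep', $options);
}

// getopt() returns false on failure, so fall back to an empty array.
$options = getopt('', ['deep']) ?: [];

echo deepModeEnabled($options) ? "deep crawl on\n" : "deep crawl off\n";
```

With this sketch, "php scraper.php --deep" would turn the deeper crawl on, and "php scraper.php" would leave it off.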
Since there are multiple pages for each prefix (for example, there are 8 pages for the (201) prefix), your web scraper should go through all of these pages too.
Info should be stored in this format:
Prefix | County | City | State | StateABR | Carrier | Line type | Timezone | Zipcode
1201992 | Essex | Newark | New Jersey | NJ | Broadview Networks, Inc - NJ | Landline | Eastern | 0712
(Note that the prefix format is 1201992, not (201)-992-.)
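To make the prefix format concrete, here is a rough sketch of the conversion I have in mind. The function name normalizePrefix is hypothetical, and it assumes every listed prefix is a 3-digit area code followed by a 3-digit exchange, with a leading 1 added in front.

```php
<?php
// Hypothetical helper: turn a displayed prefix such as "(201) 992-"
// into the stored form "1201992" (leading 1 + area code + exchange).
function normalizePrefix(string $displayed): string
{
    // Keep digits only, dropping parentheses, spaces, and dashes.
    $digits = preg_replace('/\D/', '', $displayed);

    // Assumption: a valid prefix always has exactly 6 digits.
    if (strlen($digits) !== 6) {
        throw new InvalidArgumentException("Unexpected prefix: $displayed");
    }

    return '1' . $digits;
}

echo normalizePrefix('(201) 992-'), "\n"; // prints 1201992
```

Rejecting anything that is not exactly 6 digits is a design choice for the sketch; it keeps malformed rows out of the CSV or MySQL table instead of storing them silently.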
I also need some additional info that is located on the same page: the short description (such as "New jersey area code 201 has been in service since 1947, it has 530 landlines..... [...] and Kearny. State: New Jersey ; Time Zone: Eastern .."). This info should be stored in a separate table.
I would prefer code written in PHP with MySQL.