We would like someone to build a PHP crawler/scraper using cURL.
The application should have a form with 2 input fields.
Input 1: a URL
Input 2: a text string to search for
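To illustrate the two inputs, here is a minimal sketch of how the form submission might be validated. The field names `start_url` and `search_text` and the function name `validateCrawlForm` are hypothetical, and restricting Input 1 to http(s) URLs is an assumption rather than a stated requirement.

```php
<?php
// Validate the two form inputs.
// 'start_url' and 'search_text' are hypothetical field names (not given in this brief).
function validateCrawlForm(array $post): array
{
    $errors = [];
    $url    = trim((string) ($post['start_url'] ?? ''));
    $needle = trim((string) ($post['search_text'] ?? ''));

    // Assumption: only http and https URLs are acceptable starting points.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (filter_var($url, FILTER_VALIDATE_URL) === false
        || !in_array($scheme, ['http', 'https'], true)) {
        $errors[] = 'Input 1 must be an http(s) URL.';
    }
    if ($needle === '') {
        $errors[] = 'Input 2 must not be empty.';
    }

    return ['url' => $url, 'needle' => $needle, 'errors' => $errors];
}
```

The form handler could call `validateCrawlForm($_POST)` and only start a crawl when `errors` is empty.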
Input 1 is the URL from which to start crawling a web directory. The application will crawl the directory and follow outgoing links to the websites listed in it.
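As a rough sketch of the "follow outgoing links" step, the function below pulls the external links out of one directory page. The name `extractOutgoingLinks` is hypothetical, and treating any absolute http(s) link to a different host as a listed website is an assumption; a real directory may need stricter rules.

```php
<?php
// Return the unique absolute http(s) links on a directory page that point to other hosts.
// Assumption: "outgoing" means the link's host differs from the directory's own host.
function extractOutgoingLinks(string $html, string $pageUrl): array
{
    if ($html === '') {
        return [];
    }
    $pageHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href   = trim($anchor->getAttribute('href'));
        $host   = strtolower((string) parse_url($href, PHP_URL_HOST));
        $scheme = strtolower((string) parse_url($href, PHP_URL_SCHEME));
        // Skip relative links, links back into the directory, and non-web schemes.
        if ($host === '' || $host === $pageHost || !in_array($scheme, ['http', 'https'], true)) {
            continue;
        }
        $links[$href] = true; // array keys de-duplicate repeated links
    }
    return array_keys($links);
}
```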
For each listed website, it should search the HTML code for the text string specified in Input 2, checking a maximum of 5 pages per site.
If the text string is not found in any of the first 5 pages of the site, the application should stop crawling that site. That domain should be stored in the database so that it is not crawled again in the future.
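The 5-page check could look roughly like the sketch below. `fetchPage` and `siteContainsText` are hypothetical names, the timeout and redirect values are placeholders, and the case-insensitive match is an assumption. The page fetcher is passed in as a callable so the check can be exercised without network access.

```php
<?php
// Fetch one page with cURL; returns the body, or null on failure.
// Timeout and redirect limits are illustrative values only.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Check at most $maxPages pages of a site for the search string.
// $pageUrls is assumed to be in the order the crawler discovered the pages.
function siteContainsText(array $pageUrls, string $needle, callable $fetch, int $maxPages = 5): bool
{
    foreach (array_slice($pageUrls, 0, $maxPages) as $url) {
        $html = $fetch($url);
        // Assumption: the match is case-insensitive.
        if ($html !== null && stripos($html, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

With real pages the call would be `siteContainsText($urls, $needle, 'fetchPage')`. When it returns false, the caller would record the domain so it is skipped on later runs.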
If it finds the text string in the code, the scraper should crawl the entire site and retrieve the following content:
URL
Title
Meta Description Tag
Email Address: each email address should be associated with the domain it was found on, not the page URL it was acquired from.
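A minimal sketch of pulling this content out of one page follows. The function name `extractPageData` and the array keys are hypothetical, and the email pattern is a deliberately simple one that will miss obfuscated addresses. Emails are returned lowercased and de-duplicated alongside the domain, so the caller can store them per domain rather than per page.

```php
<?php
// Extract the title, meta description and email addresses from one page.
function extractPageData(string $html, string $pageUrl): array
{
    $data = [
        'domain'      => strtolower((string) parse_url($pageUrl, PHP_URL_HOST)),
        'url'         => $pageUrl,
        'title'       => null,
        'description' => null,
        'emails'      => [],
    ];
    if ($html === '') {
        return $data;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $data['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    // Pragmatic pattern, run on the raw HTML so mailto: links are included.
    preg_match_all('/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i', $html, $matches);
    $data['emails'] = array_values(array_unique(array_map('strtolower', $matches[0])));

    return $data;
}
```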
This data is to be placed into a MySQL database. One table should contain the Domain, URL, Title and Meta Description Tag. A second table should contain the Domain and email information.
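One possible layout for the two tables is sketched below using PDO. The table names `pages` and `emails`, the column names and sizes, and all function names are hypothetical; the index sizes assume a reasonably recent MySQL with InnoDB. The unique key on (domain, email) is one way to keep a single row per address per domain.

```php
<?php
// DDL for the two tables described above (names and column sizes are assumptions).
function schemaStatements(): array
{
    return [
        'CREATE TABLE IF NOT EXISTS pages (
            id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            domain VARCHAR(253) NOT NULL,
            url VARCHAR(2048) NOT NULL,
            title VARCHAR(512) NULL,
            meta_description TEXT NULL,
            KEY idx_pages_domain (domain)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4',
        'CREATE TABLE IF NOT EXISTS emails (
            id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            domain VARCHAR(253) NOT NULL,
            email VARCHAR(254) NOT NULL,
            UNIQUE KEY uniq_domain_email (domain, email)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4',
    ];
}

function createSchema(PDO $pdo): void
{
    foreach (schemaStatements() as $sql) {
        $pdo->exec($sql);
    }
}

// Store one crawled page; $page uses the keys domain, url, title, description.
function savePage(PDO $pdo, array $page): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO pages (domain, url, title, meta_description) VALUES (?, ?, ?, ?)'
    );
    $stmt->execute([$page['domain'], $page['url'], $page['title'], $page['description']]);
}

// Store emails against the domain; INSERT IGNORE skips addresses already recorded.
function saveEmails(PDO $pdo, string $domain, array $emails): void
{
    $stmt = $pdo->prepare('INSERT IGNORE INTO emails (domain, email) VALUES (?, ?)');
    foreach ($emails as $email) {
        $stmt->execute([$domain, $email]);
    }
}
```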
We would also like a throttle function to control the number of URLs the program crawls at a given time.
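One simple way to read the throttle requirement is a cap on concurrent requests, sketched below with PHP's `curl_multi_*` functions. `fetchBatch` and `crawlThrottled` are hypothetical names and the cURL options are placeholder values. This version works in fixed batches: it waits for a whole batch to finish before starting the next, which is simpler than a rolling window but can be slower when one URL in a batch is slow.

```php
<?php
// Fetch a batch of URLs in parallel; returns [url => body or null].
function fetchBatch(array $urls): array
{
    $multi   = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 15, // illustrative value
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        // An empty body is treated as a failed fetch; it is useless for scraping anyway.
        $results[$url] = ($body === null || $body === '') ? null : $body;
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}

// Crawl $urls with at most $maxConcurrent requests in flight at once.
// $fetchBatch can be swapped out, which keeps the throttle logic testable offline.
function crawlThrottled(array $urls, int $maxConcurrent, ?callable $fetchBatch = null): array
{
    $fetchBatch = $fetchBatch ?? 'fetchBatch';
    $results = [];
    foreach (array_chunk($urls, max(1, $maxConcurrent)) as $batch) {
        $results += $fetchBatch($batch);
    }
    return $results;
}
```

For example, `crawlThrottled($urls, 10)` would keep no more than 10 transfers open at any moment.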