I need a PHP 5.3+ CLI crawler that uses cURL and DOM to extract host and domain names from websites. Crawling must read and follow [url removed, login to view] files. It must be multi-threaded so that crawling is fast and efficient. The crawler is given a host name, supplied by a JSON data feed ([url removed, login to view]), and returns a list of ALL domains and hostnames that the site links to, in JSON format. This must be a unique list, so no hostname / domain is repeated. The list will then be submitted via an API to another script. The system MUST be very memory efficient and follow PHP 5.3+ recommended programming standards.
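A minimal sketch of how the "fast and efficient" fetching could be approached. PHP has no threads out of the box on the CLI, so this sketch swaps in the curl_multi functions, which run several transfers concurrently inside one process rather than in real threads. The function name fetchAll and the timeout value are assumptions made for illustration only.

```php
<?php
/**
 * Hypothetical helper (name and defaults are illustrative).
 * Fetches several URLs concurrently and returns array(url => body or null).
 */
function fetchAll(array $urls, $timeout = 10)
{
    $multi = curl_multi_init();
    $handles = array();

    // De-duplicate first so each URL owns exactly one handle.
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    $running = 0;
    do {
        // Older libcurl builds may ask to be called again immediately.
        do {
            $status = curl_multi_exec($multi, $running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);

        // Wait for socket activity instead of spinning; -1 means "could not wait".
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = array();
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $bodies[$url] = is_string($body) ? $body : null;
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch); // frees the handle on PHP 5.x; has no effect on PHP 8
    }
    curl_multi_close($multi);

    return $bodies;
}
```

Every body in a batch is held in memory until the function returns, so passing the URLs in small batches and parsing each batch before fetching the next would help keep memory use low.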
Items to check for host / domain names should be images, scripts and href values, but the code should allow more item types to be added later.
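A minimal sketch of the extraction step, assuming the page HTML has already been downloaded. The function name extractHosts and the tag-to-attribute map are hypothetical; the map is what leaves room for expansion, since a new item type is one more entry. Array keys are used to keep the list unique.

```php
<?php
/**
 * Hypothetical helper (name and parameters are illustrative).
 * $sources maps a tag name to the attribute that holds a URL.
 */
function extractHosts($html, array $sources)
{
    if ($html === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hosts = array();
    foreach ($sources as $tag => $attribute) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // Relative URLs have no host component and are skipped.
            $host = parse_url($node->getAttribute($attribute), PHP_URL_HOST);
            if (is_string($host) && $host !== '') {
                $hosts[strtolower($host)] = true; // keys guarantee uniqueness
            }
        }
    }

    return array_keys($hosts);
}

$page = '<a href="http://example.com/a">a</a><img src="https://cdn.example.net/i.png">';
$map = array('a' => 'href', 'img' => 'src', 'script' => 'src');
echo json_encode(extractHosts($page, $map)), "\n";
```

Note that parse_url only recognises protocol-relative URLs such as //example.com/x from PHP 5.4.7 onward, so a strict PHP 5.3 target would need to handle those separately.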
This script will only be run from the Debian command line using PHP, so make sure you really know CLI before bidding. This is the first of MANY small projects that will link together, so a clear, well-documented approach is essential.
Update: Must support UTF-8 and international domain names. cURL requests should use compression when the remote server supports it, to reduce bandwidth usage.
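A minimal sketch of both update items, under stated assumptions. The function names toAsciiHost and curlOptions and the timeout values are hypothetical. International domain names are converted with idn_to_ascii, which is only available when the intl extension is loaded; without it the sketch falls back to returning the host with ASCII letters lower-cased. Setting CURLOPT_ENCODING to an empty string makes cURL advertise every compression encoding its build supports and decode the response transparently.

```php
<?php
/**
 * Hypothetical helper: convert a host name to lower-case ASCII (Punycode)
 * when the intl extension is available; otherwise leave it as supplied.
 */
function toAsciiHost($host)
{
    if (function_exists('idn_to_ascii')) {
        // Prefer the UTS #46 variant where this PHP/ICU build defines it.
        $ascii = defined('INTL_IDNA_VARIANT_UTS46')
            ? idn_to_ascii($host, 0, INTL_IDNA_VARIANT_UTS46)
            : idn_to_ascii($host);
        if (is_string($ascii) && $ascii !== '') {
            $host = $ascii;
        }
    }

    return strtolower($host);
}

/**
 * Hypothetical helper: cURL options with compression enabled.
 * Timeout values are illustrative assumptions.
 */
function curlOptions()
{
    return array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '', // empty string: accept all supported encodings
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
    );
}

$ch = curl_init('http://' . toAsciiHost('Example.COM') . '/');
curl_setopt_array($ch, curlOptions());
```

With intl loaded, a name such as bücher.de would be converted to xn--bcher-kva.de before it is placed in the request URL or in the unique host list.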