Web crawler that checks websites for AdSense script code and stores each site's contact information in a database.

Cancelled Posted Jul 26, 2011 Paid on delivery

Web crawler that checks websites on the internet for Google AdSense script code. When the code is found, the site's contact information is stored in a database. If no AdSense script is found, the contact information for the site should be stored in a separate ("no Google AdSense") database.
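The storage rule above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the posting: it uses SQLite, models the two "data bases" as two tables in one file, and the names `adsense_sites`, `no_adsense_sites`, `open_store`, `store_site` and the column layout (url, contact, variant) are all hypothetical.

```python
import sqlite3

# Hypothetical table names; the posting only asks for two separate stores.
TABLES = ("adsense_sites", "no_adsense_sites")

def open_store(path=":memory:"):
    """Create both tables if needed and return the open connection."""
    conn = sqlite3.connect(path)
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "url TEXT PRIMARY KEY, contact TEXT, variant TEXT)"
        )
    return conn

def store_site(conn, url, contact, variant):
    """Write the record to the AdSense table if a variant was found, else to the other."""
    table = "adsense_sites" if variant else "no_adsense_sites"
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (url, contact, variant) VALUES (?, ?, ?)",
        (url, contact, variant),
    )
    conn.commit()
```

Two physically separate database files would work the same way; one connection per file would replace the table switch.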

There are two different AdSense source codes in use. One contains the string "google_ad_client" and the other the string "GA_googleFillSlot". Depending on which string is found, the matching value must be inserted as the third database value.
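A minimal sketch of this check, assuming the page source is already available as a string. The function name `detect_adsense_variant` is hypothetical, and a plain substring search is only one possible way to do the detection.

```python
# The two marker strings named in the posting.
ADSENSE_MARKERS = ("google_ad_client", "GA_googleFillSlot")

def detect_adsense_variant(html):
    """Return the first marker string found in the page source, or None."""
    for marker in ADSENSE_MARKERS:
        if marker in html:
            return marker
    return None
```

The returned string (or None) could then serve as the value that decides which database the site goes into.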

If the site links to a contact page, the crawler should visit that page and try to find an email address (only the first one if more than one is available). Webmasters often mask the email address to prevent spiders from grabbing it. The spider (crawler) should be able to find patterns which COULD be an email address. The crawler does not need to decrypt the masked address; that is not its job. It should just find patterns that look like an email address and write what it finds into the database. Pattern markers such as @ or "dot" in braces are very good indicators.

The easiest instruction for your crawler is this: if one of the following expressions appears in the source code, grab everything before and after the marker (including the marker itself) up to the next HTML or script tag:

mailto: or at | AT | dot | DOT

opening brace: [ or ( or {

closing brace: ] or ) or }

The crawler should recognize, for example, (at) and (" dot' ] as markers. Finding the first marker is enough. The crawler should then grab everything between the HTML tag before the marker and the HTML tag after it.
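The marker rule described above might look roughly like this. It is a sketch under assumptions: the regular expression is one reading of the marker list (mailto:, @, or at/dot inside any opening and closing brace, with optional stray quotes), and the names `MARKER` and `find_email_candidate` are hypothetical. Both "<" and ">" are treated as tag boundaries so that a marker inside an attribute does not pull in neighbouring markup.

```python
import re

# "mailto:", "@", or at/dot wrapped in any brace pair; \x27 is a single quote.
MARKER = re.compile(
    r'mailto:|@|[\[({]\s*["\x27]?\s*(?:at|dot)\s*["\x27]?\s*[\])}]',
    re.IGNORECASE,
)

def find_email_candidate(html):
    """Return the text around the first marker, cut at the nearest tag boundaries."""
    m = MARKER.search(html)
    if m is None:
        return None
    # Nearest tag boundary before the marker (rfind gives -1 if none, so start is 0).
    start = max(html.rfind("<", 0, m.start()), html.rfind(">", 0, m.start())) + 1
    # Nearest tag boundary after the marker, or end of document.
    ends = [i for i in (html.find("<", m.end()), html.find(">", m.end())) if i != -1]
    end = min(ends) if ends else len(html)
    return html[start:end].strip()
```

The result is stored as found; no attempt is made to turn the masked text back into a real address.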

The crawler should process 5 web addresses simultaneously.
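One way to keep five addresses in flight at once is a thread pool, sketched below. This assumes a thread-based design using only the Python standard library; the names `MAX_PARALLEL`, `fetch` and `crawl` are hypothetical, and the `fetch_page` parameter exists only so the download step can be swapped out.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_PARALLEL = 5  # the posting asks for 5 addresses at a time

def fetch(url, timeout=10):
    """Download one page and return its source as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(urls, fetch_page=fetch):
    """Fetch every url with at most MAX_PARALLEL in flight; map url -> source or None."""
    urls = list(urls)

    def safe(url):
        try:
            return fetch_page(url)
        except Exception:
            return None  # unreachable or broken site: no source available

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # pool.map preserves input order, so zip pairs each url with its result.
        return dict(zip(urls, pool.map(safe, urls)))
```

For example, `crawl(["http://example.com/"])` would return a dictionary with the page source for that address, or None if the download failed.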

If you have any questions, please ask.

Software Architecture

Project ID: #3468675

About the project

Remote project Active Jul 26, 2011