Web Crawler in PHP (repost)

Completed Posted Feb 22, 2011 Paid on delivery

Hello!

I need a web crawler that checks the websites it crawls for AdSense script code and, if it finds any, looks for a contact page and saves the results to a local database.

You can find the complete project description here on this site as well.

Thank you,

Marc

## Deliverables

The spider's main job is to find websites that contain AdSense code and, whenever it finds one, to look for a contact mail address on the site's standard contact page and save the website address and the mail address to a database. That's the general overview. Now let me start with the details.

The crawler should start with a certain web address, let us say [[url removed, login to view]][1]. The crawler has two jobs here: check whether AdSense code is present, and grab new website addresses for further crawling. In detail:

Job 1: The crawler has to find out whether AdSense code is part of this website's source. This is very easy to find out. The answer is "yes" if the source contains the word "google_ad_client" or the word "GA_googleFillSlot". If the answer is "no", only write the website's address into the database, so that later links to this website are recognized and it is not crawled again.

$insert = "INSERT into visited_websites (address,adsense,impressum) VALUES ('[url removed, login to view]', 0, 0)";

$ret = mysql_query($insert);
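The check for Job 1 could be sketched roughly like this. The two marker strings come from the description above; the function name `detect_adsense` and the choice of a case-sensitive search are assumptions made for illustration.

```php
<?php
// Marker strings taken from the project description.
const ADSENSE_MARKERS = ['google_ad_client', 'GA_googleFillSlot'];

/**
 * Returns the first AdSense marker found in the page source,
 * or null if the source contains neither marker.
 * (Hypothetical helper name; the search is case-sensitive by assumption.)
 */
function detect_adsense(string $html): ?string
{
    foreach (ADSENSE_MARKERS as $marker) {
        if (strpos($html, $marker) !== false) {
            return $marker;
        }
    }
    return null;
}
```

A `null` result corresponds to the "no" case above. Otherwise the returned word is the one to store in the adsense column.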

If the answer is "yes", i.e. AdSense code was found, the crawler should look for a link to the "impressum" (the standard German word for the "contact" page, found on nearly every German website). The word "impressum" can be part of the linked address (file or folder name) or part of the link text. Here are some examples of what this link to the impressum may look like:

<a class="link" href="[url removed, login to view]">Impressum</a>

<a href="<[url removed, login to view]>" class="dropdown">Impressum</a>

<a href="<[url removed, login to view]>" title="Impressum">Impressum</a>

<a href="<[url removed, login to view]>" title="Impressum">Impressum</a>

<a rel="nofollow" href="[url removed, login to view]" target="_blank" title="Internetradio"><img src="<[url removed, login to view]>" border="0" alt="Impressum"></a>

<a href="/Impressum-(Info)">Impressum</a>

<link rel="copyright" href="/de/impressum/" title="Copyright" />

<link rel="bookmark" href="/impressum/" title="Impressum" />

<a href="javascript:openNewWindow('[url removed, login to view]')">Impressum</a>

<a href="/impressum/" class="impressum" rel="nofollow">Impressum & AGB</a>
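One possible way to cover the link variants listed above is a regular-expression scan over `<a>` and `<link>` tags. This is only a sketch: the function name `find_impressum_href` is hypothetical, and a real HTML parser may be more robust against unusual markup.

```php
<?php
/**
 * Returns the href of the first <a> or <link> tag that mentions "impressum"
 * (case-insensitive) in its attributes or link content, or null if none does.
 * Hypothetical helper, regex-based by assumption.
 */
function find_impressum_href(string $html): ?string
{
    // Groups 1+2: attributes and inner HTML of <a>...</a>; group 3: attributes of <link>.
    $pattern = '~<a\b([^>]*)>(.*?)</a>|<link\b([^>]*)>~is';
    if (!preg_match_all($pattern, $html, $tags, PREG_SET_ORDER)) {
        return null;
    }
    foreach ($tags as $tag) {
        $attributes = $tag[3] ?? $tag[1]; // group 3 is only present for <link> tags
        $inner = $tag[2];                 // empty string for <link> tags
        if (stripos($attributes . ' ' . $inner, 'impressum') === false) {
            continue;
        }
        // Pull the quoted href value out of the tag's attributes.
        if (preg_match('~href\s*=\s*(["\'])(.*?)\1~is', $attributes, $href)) {
            return $href[2];
        }
    }
    return null;
}
```

The returned value is the raw href, so relative paths and `javascript:` links like the ones in the examples still need to be resolved into a full address before the page is fetched.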

If the crawler is NOT able to find a link to the impressum, it should write into the database:

$insert = "INSERT into visited_websites (address,adsense,impressum) VALUES ('[url removed, login to view]', 'google_ad_client', 0)";

$ret = mysql_query($insert);

Comment: There are two different AdSense source codes. One contains the word "google_ad_client" and the other the word "GA_googleFillSlot". The adsense value in the database must be set to whichever word was found. In my example it was "google_ad_client".

If there IS a link to an impressum page, the crawler should parse that page and try to find a mail address (only the first one if there is more than one). Webmasters often mask the mail address to prevent spiders from grabbing it. You don't have to make the spider (crawler) Einstein-like, but it should be able to find patterns which COULD be a mail address. The crawler does not have to decode the masked mail address! This is not your job. Just find patterns which look like a mail address and write what was found into the database. Here are some examples of what such masked mail addresses on the impressum page could look like:

Scid1[at][url removed, login to view]

info (at) krankenversicherungprivat (dot) org

matthias<at>[url removed, login to view]

redaktion [at] freeware [punkt] de

So the pattern is that an "at", "dot" or "punkt" in braces is a very good marker for realizing: Hey! Here is a mail address! (By the way, "punkt" is the German word for "dot".) So I think the easiest instruction for your crawler is this: if one of the following expressions appears in the impressum's source code, grab everything before and after this marker (including the marker) up to the next HTML or script tag:

marker word: at | AT | dot | DOT | punkt | PUNKT | @ | et | ET

opening brace: [ or ( or {

closing brace: ] or ) or }

Between an opening or closing brace and the marker word, any of the following characters may be present: " or ' or a space, or nothing, or a combination of them. There must not be more than two characters between the marker word and the brace.

Example:

<font face="verdana">hier ist meine mailadresse: info (at) krankenversicherungprivat (" dot' ] org</font>

Here your crawler should have recognized (at) and (" dot' ] as markers (of course it is enough to find the first marker). So the crawler should grab everything between the HTML tag before the marker and the HTML tag after it. And that is

hier ist meine mailadresse: info (at) krankenversicherungprivat (" dot' ] org

An example that is not a marker is { " AT "}, because there are more than 2 chars between { and AT.
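The marker rule described above might translate into a regular expression along the following lines. This is a sketch under stated assumptions: the function name `find_masked_mail` is hypothetical, only the three listed brace types are accepted (so an angle-bracket form such as `<at>` is not covered), and "up to the next tag" is read as the nearest `>` before and the nearest `<` after the first marker.

```php
<?php
/**
 * Finds the first masked-mail marker and returns the text surrounding it,
 * bounded by the nearest '>' before and the nearest '<' after the marker.
 * Returns null when no marker is present. Hypothetical helper.
 */
function find_masked_mail(string $html): ?string
{
    // Brace, up to two of " ' or space, marker word, up to two of " ' or space, brace.
    $marker = '~[\[({]["\' ]{0,2}(?:at|AT|dot|DOT|punkt|PUNKT|@|et|ET)["\' ]{0,2}[\])}]~';
    if (!preg_match($marker, $html, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $pos = $m[0][1]; // byte offset of the first marker

    // Start right after the closest '>' before the marker, or at the start of the input.
    $before = strrpos(substr($html, 0, $pos), '>');
    $start = ($before === false) ? 0 : $before + 1;

    // End right before the closest '<' after the marker, or at the end of the input.
    $after = strpos($html, '<', $pos + strlen($m[0][0]));
    $end = ($after === false) ? strlen($html) : $after;

    return trim(substr($html, $start, $end - $start));
}
```

Applied to the `<font>` example above, this returns the sentence between the two font tags, while `{ " AT "}` yields `null` because three characters separate the brace from the marker word.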

Now the database command is

$insert = "INSERT into visited_websites (address,adsense,impressum, mail_pattern) VALUES ('[url removed, login to view]', 'google_ad_client', 1, 'hier ist meine mailadresse: info (at) krankenversicherungprivat (\" dot\' ] org')";

$ret = mysql_query($insert);

Oh, very important: in most cases the mail address is written unmasked. Your crawler must of course find such mail addresses as well! :-) Please apply standard pattern recognition for mail addresses first and only then, if necessary, look for masked mail addresses.
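For the unmasked case, a simple regular expression is probably enough. The description does not define the "standard pattern", so the deliberately simplified expression below (which is not a full RFC 5322 validator) and the function name `find_plain_mail` are assumptions.

```php
<?php
/**
 * Returns the first plain (unmasked) mail address in the page source,
 * or null if there is none. Hypothetical helper with a simplified pattern.
 */
function find_plain_mail(string $html): ?string
{
    $pattern = '~[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}~';
    return preg_match($pattern, $html, $m) ? $m[0] : null;
}
```

The crawler would call this first and fall back to the masked-address search only when it returns `null`.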

If your crawler does NOT find anything that could be a mail address, the database command is

$insert = "INSERT into visited_websites (address,adsense,impressum, mail_pattern) VALUES ('[url removed, login to view]', 'google_ad_client', 1, 0)";

$ret = mysql_query($insert);

Job 2: The second job is to scan this website for other website addresses which can be crawled later. The links must

a) point to other domains

b) point to .de or .at domains (German and Austrian domains)

The links found must be written to the database, but reduced to the start page. Example: if the crawler is analysing the website [[url removed, login to view]][1] for other domains and finds the link [[url removed, login to view]][3], it should only save [[url removed, login to view]][4] to the database.

$insert = "INSERT into websites_to_visit (address) VALUES ('[url removed, login to view]')";

$ret = mysql_query($insert);
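Job 2 could be sketched as follows. The function name `extract_start_sites` is hypothetical and the hosts used in the usage note are placeholders. "Other domain" is interpreted here as "different host name", so `www.example.de` and `example.de` would count as different sites. Only absolute http/https links are considered.

```php
<?php
/**
 * Collects the start pages of external .de/.at sites linked from $html.
 * $currentHost is the host name of the page being crawled.
 * Hypothetical helper; host comparison is a plain string comparison.
 */
function extract_start_sites(string $html, string $currentHost): array
{
    $sites = [];
    if (!preg_match_all('~href\s*=\s*(["\'])(https?://.*?)\1~i', $html, $matches)) {
        return $sites;
    }
    foreach ($matches[2] as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // malformed URL
        }
        $host = strtolower($host);
        if ($host === strtolower($currentHost)) {
            continue; // rule a): only links to other domains
        }
        if (!preg_match('~\.(?:de|at)$~', $host)) {
            continue; // rule b): only .de and .at domains
        }
        $scheme = strtolower(parse_url($url, PHP_URL_SCHEME));
        $sites[$scheme . '://' . $host . '/'] = true; // array keys remove duplicates
    }
    return array_keys($sites);
}
```

For a page on `start.de` that links to `http://www.example.de/foo/bar.html`, only `http://www.example.de/` is returned, which is the address that would go into websites_to_visit.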

From this database table 'websites_to_visit' your crawler can pick up new website addresses for crawling when it has finished crawling the current website.

$select = mysql_query("SELECT address FROM websites_to_visit WHERE crawled = 0 LIMIT 1");

$update = "UPDATE websites_to_visit set crawled = 1 WHERE address = '[url removed, login to view]'";

The update query is for marking a web address as already crawled.

Okay, this is the description of the crawler. The only crawler setting I need is a variable that declares how many simultaneous crawling processes are allowed. For example, if

$simultanous = 5;

the crawler should process 5 web addresses simultaneously.

That's it! If you have any questions, please ask me!

Engineering MySQL PHP Project Management Software Architecture Software Testing Web Hosting Website Management Website Testing

Project ID: #3122256
