Hi
Looking to have a web spider built; the spider must adhere to the following guidelines.
1) Completely obey robots.txt files and robots meta tags in web pages.
2) Only request the robots.txt file once when indexing a website, i.e. when spidering a site, request its robots.txt file only once for all pages in that website, and store the robots.txt file in table at_Robot_Txt
2 a) Check whether my Spidername is blocked by the website; if it is not blocked, continue to index pages
2 b) Insert the robots.txt contents into database table at_Robot_Txt and use that information to determine which pages can and cannot be indexed
Columns:
• URL_Robot_Idx INT Primary Key
• BaseURL VarChar(100)
• RobotTxt VarChar(7500)
If no robots.txt file is found, enter “No text file found in Site”
3) Allow me to enter my own user-agent name, e.g. "Spidername"
4) Read from a list of banned words and permitted words.
5) If it finds any banned words, ignore the page
6) If it finds any permitted words, index the page
7) If it finds neither of the above, ignore the page (see the word-filter sketch after this list)
8) Must be able to index 60,000+ pages a day.
9) Must run on any Windows platform: Windows 2000 Professional, XP or Server.
10) The user interface must be easy to use and should let me see how the spider is progressing, similar to Visual Web Spider.
11) Take the list of URLs to crawl from the SQL Server 2005 table at_URLsToIndex (see the database sketches after this list).
Columns:
• URLID INT Primary Key
• URL VarChar(300)
12) When indexing a page, insert data into the following table, at_SpideredWebsites:
Columns:
• PageURL VarChar(300)
• BaseURL VarChar(100)
• PageTitle VarChar(200) maximum 20 words
• PageParagraph VarChar(6000)
• PageSize VarChar(6) in KB
• PageLastUpdated VarChar(10) Format: 23 May 07
• ServerIpAddr VarChar(50)
• PageLevel INT, e.g. top-level page = 100, then 75, 50, 25 and 0 for pages deeper in the site
• PageSpidered SmallDateTime Format: 23 May 07
13) Only index URLs that begin with http://
14) Remove all HTML tags before inserting into the database
15) Ignore URLs that are invalid, i.e. malformed addresses or broken schemes, etc.
16) For body text, take all text except text in drop-down lists (see the tag-stripping sketch after this list)
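
To show exactly what I mean by items 5 to 7, here is a rough sketch in Python; the file names, word lists and function names are placeholders only, not the final design:

    # Rough sketch only: load banned/permitted words from plain text files
    # (one word per line, names are placeholders) and decide per page.
    def load_words(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    BANNED = load_words("banned_words.txt")        # assumed file name
    PERMITTED = load_words("permitted_words.txt")  # assumed file name

    def should_index(page_text):
        words = set(page_text.lower().split())
        if words & BANNED:      # any banned word found -> ignore page (item 5)
            return False
        if words & PERMITTED:   # any permitted word found -> index page (item 6)
            return True
        return False            # neither found -> ignore page (item 7)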
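
For item 11, a sketch of reading the work queue from SQL Server 2005; I am assuming pyodbc and a made-up connection string purely for illustration, so use whatever data access approach fits the final build:

    import pyodbc

    # Connection details below are placeholders; the real server and database
    # will be supplied to the winning bidder.
    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )

    def get_urls_to_index(conn):
        """Read the crawl queue from at_URLsToIndex (URLID, URL)."""
        cursor = conn.cursor()
        cursor.execute("SELECT URLID, URL FROM at_URLsToIndex")
        return cursor.fetchall()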
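
And for item 12, the matching insert into at_SpideredWebsites; this is only a sketch, and how each field is built (title, paragraph, size, level and so on) is assumed rather than specified here:

    def save_page(conn, page):
        """Insert one spidered page; page is assumed to be a dict whose keys
        match the at_SpideredWebsites columns listed in item 12."""
        cursor = conn.cursor()
        cursor.execute(
            """INSERT INTO at_SpideredWebsites
               (PageURL, BaseURL, PageTitle, PageParagraph, PageSize,
                PageLastUpdated, ServerIpAddr, PageLevel, PageSpidered)
               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            page["PageURL"][:300], page["BaseURL"][:100], page["PageTitle"][:200],
            page["PageParagraph"][:6000], page["PageSize"][:6],
            page["PageLastUpdated"], page["ServerIpAddr"][:50],
            page["PageLevel"], page["PageSpidered"],
        )
        conn.commit()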
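
For items 13, 14 and 16, a sketch of the URL check and the tag stripping with drop-down text excluded; Python's html.parser is just one way to do it:

    from html.parser import HTMLParser

    def is_valid_url(url):
        # Items 13 and 15: only URLs beginning http:// are accepted.
        return url.lower().startswith("http://")

    class BodyTextExtractor(HTMLParser):
        """Items 14 and 16: keep body text, drop tags, scripts, styles and
        anything inside <select> (drop-down list text lives there)."""
        SKIP = {"select", "script", "style"}

        def __init__(self):
            super().__init__()
            self.skip_depth = 0
            self.parts = []

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self.skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self.skip_depth:
                self.skip_depth -= 1

        def handle_data(self, data):
            if self.skip_depth == 0:
                self.parts.append(data)

        def text(self):
            return " ".join(" ".join(self.parts).split())

    # Usage: extractor = BodyTextExtractor(); extractor.feed(html); body = extractor.text()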
I cannot stress strongly enough that the spider must obey robots.txt files and only request the file once per site in a session, no matter how many threads are running; if the spider is stopped and restarted later, again request each site's robots.txt file only once and update the at_Robot_Txt table in the database.
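
To be completely clear about that single-request rule, here is a rough Python sketch of a shared, thread-safe robots.txt cache; the at_Robot_Txt database write is only stubbed as a comment:

    import threading
    import urllib.robotparser
    from urllib.parse import urlparse

    _lock = threading.Lock()
    _robots = {}  # base URL -> parsed robots.txt, shared by every spider thread

    def robots_for(page_url):
        """Fetch and cache robots.txt once per site, however many threads ask."""
        base = "{0.scheme}://{0.netloc}".format(urlparse(page_url))
        with _lock:
            if base not in _robots:
                rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
                try:
                    rp.read()  # the one and only request for this site
                    # TODO: also store the raw robots.txt text in at_Robot_Txt here
                except OSError:
                    pass       # site unreachable; record "No text file found in Site"
                _robots[base] = rp
            return _robots[base]

    def can_fetch(page_url, user_agent="Spidername"):
        return robots_for(page_url).can_fetch(user_agent, page_url)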
If you cannot achieve the above, please do not apply for this project, as you will just be wasting my time and yours.
The budget for this project is $500, but get this right and I'll use you for the crawler that needs building.
George