Web Crawling utility that crawls a specific site, parses the data according to a template, and then inserts the extracted data into a database.
Inputs (with examples given):
URL Root: [login to view URL]
Starting Integer: **10000**
Ending Integer: **10100**
Wait Interval: **2**
Template name: **[login to view URL]**
Destination database: **testdb**
Destination stored procedure: **testrun**
The user is responsible for setting up a stored procedure that receives the data.
In the above example, the program retrieves 101 web pages (10000 through 10100, inclusive), as specified by the range, starting with
[login to view URL]
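The range arithmetic above can be sketched as follows. This is only an illustration: it assumes each page URL is formed by appending the integer directly to the URL root, and the `UrlRange` class name and the `example.com` root in the usage note are hypothetical, since the real URL root is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class UrlRange {
    private UrlRange() {}

    /**
     * Builds one URL per integer in [start, end], inclusive on both ends.
     * Assumption: the page URL is simply urlRoot followed by the integer.
     */
    static List<String> build(String urlRoot, int start, int end) {
        if (end < start) {
            throw new IllegalArgumentException("end must be >= start");
        }
        List<String> urls = new ArrayList<>(end - start + 1);
        for (int id = start; id <= end; id++) {
            urls.add(urlRoot + id);
        }
        return urls;
    }
}
```

For example, `UrlRange.build("http://example.com/msg?id=", 10000, 10100)` yields 101 URLs, because both endpoints are included.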
It reads the [login to view URL] file, which contains pattern information defining the following for each field:
(1) field name
(2) type of data (integer, date, or character)
(3) maximum field length
(4) are null values allowed?
(5) starting pattern to match
(6) ending pattern
The template might look something like this, although you can define your own "template language":
PROFILETOID
type:**integer**
nulls:**no**
start:**[login to view URL]**
end:**">**
ALIASTO
type:**char**
nulls:**no**
start:
end:**</a>**
PROFILEFROMID
type:**integer**
nulls:**no**
start:**[login to view URL]**
end:**">**
and so on for the fields **ALIASFROM, DATEPOSTED, MSGNUMBER, MAXMESSAGE, MSGBODY**.
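One possible way to read a template in the format shown above is sketched here. It is not a required design: it assumes that a line beginning with a known key followed by a colon is a property of the most recent field, and that any other non-blank line starts a new field. The `maxlength` key is hypothetical (the sample template does not show how the maximum length is written), as are the `TemplateParser` and `Field` names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TemplateParser {
    private TemplateParser() {}

    /** One template field: its name plus its key/value properties in file order. */
    static final class Field {
        final String name;
        final Map<String, String> props = new LinkedHashMap<>();

        Field(String name) {
            this.name = name;
        }
    }

    /** Parses template lines into field definitions, preserving template order. */
    static List<Field> parse(List<String> lines) {
        List<Field> fields = new ArrayList<>();
        Field current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Split on the FIRST colon only, so patterns may contain colons.
            int colon = line.indexOf(':');
            String key = colon < 0 ? "" : line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (isPropertyKey(key)) {
                if (current == null) {
                    throw new IllegalArgumentException("Property before any field name: " + line);
                }
                // Assumption: surrounding whitespace in values is not significant.
                current.props.put(key, line.substring(colon + 1).trim());
            } else {
                current = new Field(line);
                fields.add(current);
            }
        }
        return fields;
    }

    private static boolean isPropertyKey(String key) {
        // "maxlength" is a hypothetical key name, not taken from the sample template.
        return key.equals("type") || key.equals("nulls") || key.equals("start")
                || key.equals("end") || key.equals("maxlength");
    }
}
```

A `start:` line with nothing after the colon is stored as an empty string, which a later step can interpret as "begin at the current position".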
The program starts at the beginning of the retrieved page and searches for the string **[login to view URL]**. It then takes the text that immediately follows, up to but not including **">**, and extracts it as the **PROFILETOID** field.
For the **ALIASTO** field, since no start pattern is given, the program begins reading at the current position (immediately after the previous match) and stops at **</a>**.
The remaining fields are extracted the same way, in order, to the end of the template file.
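The sequential matching described above can be sketched with a cursor that advances through the page. This is a minimal sketch, assuming plain substring matching (no regular expressions) and assuming the cursor moves to just past each end pattern. The `PatternExtractor` name and the sample HTML in the usage note are hypothetical.

```java
final class PatternExtractor {
    private final String page;
    private int cursor = 0; // position in the page where the next search begins

    PatternExtractor(String page) {
        this.page = page;
    }

    /**
     * Returns the text between the start and end patterns, searching forward
     * from the current cursor, then moves the cursor past the end pattern.
     * An empty start pattern means "begin reading at the cursor".
     * Returns null (cursor unchanged) if either pattern is not found.
     */
    String next(String start, String end) {
        int from = cursor;
        if (!start.isEmpty()) {
            int s = page.indexOf(start, cursor);
            if (s < 0) {
                return null;
            }
            from = s + start.length();
        }
        int e = page.indexOf(end, from);
        if (e < 0) {
            return null;
        }
        cursor = e + end.length();
        return page.substring(from, e);
    }
}
```

Given a page containing `<a href="profile?id=123">alice</a>`, calling `next("profile?id=", "\">")` returns `123`, and a following `next("", "</a>")` returns `alice`.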
## Deliverables
(More info that wouldn't fit above)
As the page is being read, the results are validated to make sure that every field is filled with the correct type of data. If nothing is found between the start and end patterns, the field is valid only if NULLs are allowed.
If one of the fields is invalid, an error dialog box pops up and says which URL failed and why. Example error messages:
"Failed to parse #10002 because [ALIAS] exceeded the maximum length of 20"
"Failed to parse #10002 because [ALIAS] start pattern not found."
Even if there is an error, the program continues once the user clicks OK in the dialog box. The dialog should include an "Ignore errors" checkbox so that the program can continue without further interruption.
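The validation rules could be expressed roughly as below. This sketch returns the message text only and leaves the dialog box to the caller. It assumes a maximum length of 0 means "no limit", and it omits the date check because the expected date format is not specified here. Apart from the "exceeded the maximum length" wording, the message texts and the `FieldValidator` name are hypothetical.

```java
final class FieldValidator {
    private FieldValidator() {}

    /**
     * Validates one extracted value. Returns null when the value is valid,
     * otherwise an error message naming the page number and the field.
     * Date validation is omitted: the date format is not given in the spec.
     */
    static String validate(int pageId, String field, String type,
                           int maxLength, boolean nullsAllowed, String value) {
        String prefix = "Failed to parse #" + pageId + " because [" + field + "] ";
        if (value == null || value.isEmpty()) {
            // Nothing between the patterns: valid only if NULLs are allowed.
            return nullsAllowed ? null : prefix + "is empty but NULLs are not allowed.";
        }
        if (maxLength > 0 && value.length() > maxLength) {
            return prefix + "exceeded the maximum length of " + maxLength;
        }
        if (type.equals("integer") && !value.matches("-?\\d+")) {
            return prefix + "is not a valid integer.";
        }
        return null;
    }
}
```

Returning a message instead of showing a dialog directly keeps the check usable when "Ignore errors" is ticked: the caller can still set the ERROR flag without interrupting the run.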
After each page is read and validated, the program calls a stored procedure with all of the template fields, along with RAWDATA, a text field containing the HTML of the retrieved page, and an ERROR variable that indicates whether the page contained a validation error.
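If the Java option is chosen, the stored procedure call might look like the following JDBC sketch. Several things here are assumptions rather than requirements: the parameter order (ERROR, then RAWDATA, then the template fields in template order), passing every field value as a string and leaving type conversion to the stored procedure, and the `StoredProcCaller` name. Obtaining the `Connection` is left to the caller.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;
import java.util.Collections;
import java.util.List;

final class StoredProcCaller {
    private StoredProcCaller() {}

    /** Builds the JDBC escape syntax for a procedure call, e.g. "{call testrun(?, ?)}". */
    static String callSyntax(String procedure, int paramCount) {
        String marks = String.join(", ", Collections.nCopies(paramCount, "?"));
        return "{call " + procedure + "(" + marks + ")}";
    }

    /**
     * Calls the procedure once for one page.
     * Assumed parameter order: ERROR, RAWDATA, then the field values.
     */
    static void insertPage(Connection conn, String procedure, boolean error,
                           String rawData, List<String> fieldValues) throws SQLException {
        String sql = callSyntax(procedure, fieldValues.size() + 2);
        try (CallableStatement stmt = conn.prepareCall(sql)) {
            stmt.setBoolean(1, error);
            stmt.setString(2, rawData);
            for (int i = 0; i < fieldValues.size(); i++) {
                String value = fieldValues.get(i);
                if (value == null) {
                    stmt.setNull(i + 3, Types.VARCHAR); // missing field becomes SQL NULL
                } else {
                    stmt.setString(i + 3, value);
                }
            }
            stmt.execute();
        }
    }
}
```

With the eight template fields plus ERROR and RAWDATA, `callSyntax("testrun", 10)` produces a call with ten parameter markers.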
The wait interval of "2" is the number of seconds to wait between page requests (so as not to overload the target site). If set to "0", there is no wait between requests.
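The pacing behaviour can be sketched as a simple loop. In this sketch the per-page work (fetch, parse, validate, store) is passed in as a callback so that the timing logic stays separate. The `PacedLoop` name is hypothetical, and skipping the pause after the final page is a design choice of this sketch, not something the requirement states.

```java
import java.util.function.IntConsumer;

final class PacedLoop {
    private PacedLoop() {}

    /**
     * Calls processPage for every integer in [start, end], pausing
     * waitSeconds between consecutive requests. A wait of 0 means no pause.
     */
    static void run(int start, int end, int waitSeconds, IntConsumer processPage) {
        for (int id = start; id <= end; id++) {
            processPage.accept(id);
            if (waitSeconds > 0 && id < end) {
                try {
                    Thread.sleep(waitSeconds * 1000L);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop crawling early.
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

With the example inputs, `PacedLoop.run(10000, 10100, 2, id -> { /* fetch and process page id */ })` would process 101 pages with a two-second pause between requests.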
Deliverables:
1) Complete and fully functional working program(s) in executable form, as well as complete source code of all work done. Must run on Visual Studio .NET in VB, C#, or Java.
**Please configure your code to work with SQL Server using (localhost) as the server, "sa" as the userid, and "crawler" as the password.**
2) Sample database named testdb, containing the testrun stored procedure and 1 table with the following 10 fields: **ERROR, RAWDATA, PROFILETO, ALIASTO, PROFILEFROM, ALIASFROM, DATEPOSTED, MSGNUMBER, MAXMESSAGE, MSGBODY**
When I run the program without alteration, the above table should be filled in.
3) Installation instructions
## Platform
Windows XP, SQL Server 2000, [login to view URL]