PROJECT OVERVIEW -- Website Webcrawler
Website Webcrawler receives as input --
a. the url of a website
b. a list of keywords separated by commas
Given this input, the app crawls only those links WITHIN the website and returns specific intelligence on that website. The output intelligence includes --
a. All Email Addresses that have the website as their domain name. So, if the website is [url removed, login to view], it would return only email addresses with [url removed, login to view] as the domain root.
b. All Webpages that have KEYWORD METADATA, URL LINKS, or CONTENT matching one or more of the keywords provided. For example, if one of the keywords were "Search," you would flag any webpage within that website that had the word "Search" in its metadata, in its url, or within the page content.
c. A list of All Forms and the Action= post page for each form.
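To make the three outputs above concrete, here is a minimal Python sketch of how a single page could be scanned. It is only an illustration under stated assumptions, not the required implementation: the names PageScanner, site_emails and keyword_matches are hypothetical, keyword matching is assumed to be a case-insensitive substring test, email addresses on subdomains of the website are assumed to count, and the match types are numbered 1 to 3 to mirror the MatchTypeID values defined for KeywordList further down.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Group 1 captures the domain part of each address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


class PageScanner(HTMLParser):
    """Collects same-site links, form actions, meta keywords and text from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc.lower()
        self.links, self.forms = set(), []
        self.meta_keywords, self.text = "", []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            url = urljoin(self.page_url, a["href"])
            if urlparse(url).netloc.lower() == self.host:  # stay WITHIN the website
                self.links.add(url)
        elif tag == "form":
            # An empty or missing action means the form posts back to the page itself.
            self.forms.append(urljoin(self.page_url, a.get("action") or ""))
        elif tag == "meta" and (a.get("name") or "").lower() == "keywords":
            self.meta_keywords = a.get("content") or ""

    def handle_data(self, data):
        # Note: this also picks up script/style text; a real crawler may filter it.
        self.text.append(data)


def site_emails(html, domain):
    """Return addresses whose domain is `domain` or (assumption) one of its subdomains."""
    domain = domain.lower()
    found = set()
    for m in EMAIL_RE.finditer(html):
        d = m.group(1).lower()
        if d == domain or d.endswith("." + domain):
            found.add(m.group(0).lower())
    return sorted(found)


def keyword_matches(scanner, keywords):
    """Return (keyword, match type) pairs: 1 = meta keywords, 2 = URL, 3 = content."""
    fields = {1: scanner.meta_keywords, 2: scanner.page_url, 3: " ".join(scanner.text)}
    return [(kw, type_id) for kw in keywords for type_id, text in fields.items()
            if kw.lower() in text.lower()]
```

A caller would fetch each page, feed its HTML to a PageScanner, and queue the collected links for the next round of crawling. The comma-separated keyword input could be split with something like [k.strip() for k in keywords.split(",")].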
Upon initiation, the application --
a. Uses the parameters set in a config file to connect to a Remote or Local Database. The config file would have the following parameters (I prefer xml, but name / value pairs also work) --
DBHost: ip address
If DBTrustedConnection is True, it connects locally to DBName. If False, it connects to DBName using DBHost, DBUser, and DBPassword.
b. In a CONTINUOUS LOOP, the application would call a Stored Procedure (GetWebsite) in the SqlServer Database with NO PARAMETERS. The continuous loop ends once NO RESULTS are returned.
c. The results from the SP are returned in a 1 Row SELECT STATEMENT / dataset with the following columns--
d. You would process this result set according to the logic in the above OVERVIEW SECTION and return the results by calling the PostWebsite SP.
e. The PostWebsite SP would have the following input parameters --
iii. EmailList -- XML FORMAT
<email emailID="" />
<email emailID="" />
iv. Formlist -- xml formatted list of forms, such as
<form webpage="" action=""/>
v. KeywordList -- xml formatted list of the keywords matched on each webpage, such as
<keyword name="search" matchTypeID="1" webpage="abc/search.aspx"/>
MatchTypeID=1 (keyword was found in Metadata keywords)
MatchTypeID=2 (keyword was found in URL)
MatchTypeID=3 (keyword was found in Webpage Content)
f. Upon completion of PostWebsite, the application would call GetWebsite for the next set of Website info.
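The startup steps above could be sketched in Python roughly as follows. Treat it as a rough outline under assumptions rather than the expected design: the function names are hypothetical, the config is assumed to be a flat xml file with one element per parameter under a <config> root, the connection string uses ODBC-style keywords with an assumed driver name, and the stored procedure calls are hidden behind caller-supplied functions because the GetWebsite columns and the full PostWebsite parameter list are not repeated here.

```python
import xml.etree.ElementTree as ET


def connection_string(config_xml, driver="ODBC Driver 17 for SQL Server"):
    """Build an ODBC-style connection string from the xml config (layout assumed)."""
    cfg = {child.tag: (child.text or "").strip() for child in ET.fromstring(config_xml)}
    prefix = f"DRIVER={{{driver}}};"
    if cfg.get("DBTrustedConnection", "").lower() == "true":
        # Trusted connection: local server, no user name or password.
        return prefix + f"SERVER=localhost;DATABASE={cfg['DBName']};Trusted_Connection=yes;"
    return prefix + (f"SERVER={cfg['DBHost']};DATABASE={cfg['DBName']};"
                     f"UID={cfg['DBUser']};PWD={cfg['DBPassword']};")


def to_xml_fragment(tag, rows):
    """Serialize each row (a dict of attributes) as one empty element per line."""
    return "\n".join(
        ET.tostring(ET.Element(tag, {k: str(v) for k, v in row.items()}), encoding="unicode")
        for row in rows
    )


def run(get_website, crawl, post_website):
    """Keep calling get_website() until it returns None, posting each crawl result.

    get_website and post_website are hypothetical wrappers around the
    GetWebsite and PostWebsite stored procedures; crawl does the actual work.
    """
    processed = 0
    while True:
        row = get_website()
        if row is None:  # NO RESULTS returned: the continuous loop ends
            break
        post_website(row, crawl(row))
        processed += 1
    return processed
```

For example, to_xml_fragment("keyword", [{"name": "search", "matchTypeID": 1, "webpage": "abc/search.aspx"}]) yields one <keyword .../> element in the same shape as the KeywordList sample, and the same helper covers EmailList and FormList. Keeping the database calls behind plain functions also lets the loop be tested without a SqlServer instance.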
That is all folks!! Simple, right?? :-)