Crawl the internet for PDF files

CLOSED
Bids: 34
Avg Bid (EUR): 979
Project Budget (EUR): €750 - €1500

Project Description:
Specifications

Short description:
- Need a program that finds all hyperlinks for a certain URL (domain) and checks whether any of the hyperlinks link to a PDF file. Scan all pages and subpages.

How it should work:
1. User enters a baseurl.
2. The baseurl is saved into a table.
3. The web crawler takes the given baseurls and starts crawling each URL to find all hyperlinks (main page and all subpages).
4. All found URLs are saved to a hyperlink table where, besides the baseurl, all found hyperlinks are stored and identified (relations between URLs via parentid):
4a: When a link is a PDF file: mark in a column that the link is a PDF file (skip the external domain check if the file is directly linked).
4b: When a link points to another domain (for example from cnet.com to google.com): mark in a column that the link goes external. Skip crawling this URL any deeper and continue with the next URL. (A sketch of this classification step follows below.)
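As a rough illustration of steps 3-4b, here is a minimal Python sketch (Python being acceptable per item 17 below) of fetching one page and classifying its links. It assumes the requests and beautifulsoup4 packages; the function name, the column layout and the detection of PDFs by file extension are illustrative assumptions, not requirements of this spec.

```python
# Minimal sketch, assuming the requests and beautifulsoup4 packages are available.
# Function name, column layout and the ".pdf" extension check are illustrative only.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def classify_links(base_url, page_url, parent_id):
    """Fetch one page and classify every hyperlink found on it.

    Returns rows for the hyperlink table: (parent_id, url, is_pdf, is_external).
    """
    base_host = urlparse(base_url).netloc.lower()
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(page_url, anchor["href"])               # resolve relative links
        is_pdf = urlparse(url).path.lower().endswith(".pdf")  # step 4a
        # step 4b: external when the host differs from the baseurl's host;
        # the external check is skipped when the link is a direct PDF (step 4a)
        is_external = (not is_pdf) and urlparse(url).netloc.lower() != base_host
        rows.append((parent_id, url, is_pdf, is_external))
    return rows
```

Each returned row could then be inserted into the hyperlink table with its parentid pointing back to the page it was found on.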

Extra specs:
1. User can enter how deep to crawl a website (parameter per baseurl): 0 = only the baseurl page, 1-998 = levels deep, 999 = unlimited.
2. User can enter how fast to crawl a website (parameter per baseurl): 0 = as fast as possible, 1-8000 ms delay per check.
3. User can define how to work through the baseurl queue: first in, first out, or by priority: high, middle, low.
4. User can define how long to crawl a baseurl: for example, crawl the same baseurl (read: the server IP of the baseurl) for at most 0 to 600000 minutes, then wait 0 to 600000 minutes before crawling this baseurl again.
5. The application should run as an MS Windows service that can be started and stopped.
6. Data is stored in a MySQL or MSSQL database.
7. The application can be started/stopped from the command line.
8. New baseurls and parameter changes can be supplied while the application is running.
9. The application should detect when the crawler is being blocked, for example 25 identical page responses in a row. It should immediately stop crawling and disable crawling on that baseurl/server IP until the user wants to start crawling again. (See the blocking-detection sketch after this list.)
10. The following should be logged in a logfile: start, stop, new commands given, new baseurls given, parameter changes, crawler blocked, any basic errors, and when pages have been fully crawled.
11. For a future multithreading extension: a parameter that specifies, for each baseurl, which service on which server crawls it.
12. Prevent looping while crawling, for example by checking whether a certain hyperlink has already been crawled. (See the crawl-loop sketch after this list.)
13. Parameter to enable "nofollow" behaviour for the bot.
14. Parameters for the DB connection.
15. Parameters for service/bot identification.
16. Parameters for the logfile location.
17. Prefer .NET solution but Python is also okay.
18. Code should be open and readable.
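To show how extra specs 1 (crawl depth), 2 (crawl speed) and 12 (loop prevention) might fit together, here is a minimal sketch that builds on the classify_links() helper sketched above; the queue layout and function name are assumptions, while the 999 = unlimited convention comes from spec 1.

```python
# Minimal sketch of a depth-limited crawl with loop prevention and a configurable
# delay (extra specs 1, 2 and 12). Builds on the classify_links() sketch above;
# names and structure are assumptions, 999 = unlimited follows the spec.
import time
from collections import deque


def crawl_baseurl(base_url, max_depth=999, delay_ms=0):
    visited = set()                       # spec 12: never crawl the same URL twice
    queue = deque([(base_url, 0, None)])  # (url, depth, parent url)
    results = []

    while queue:
        url, depth, parent = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        rows = classify_links(base_url, url, parent)
        results.extend(rows)

        for _, link, is_pdf, is_external in rows:
            # steps 4a/4b: PDFs and external links are recorded but not crawled further
            if is_pdf or is_external:
                continue
            # spec 1: 0 = only the baseurl page, 999 = unlimited depth
            if max_depth != 999 and depth + 1 > max_depth:
                continue
            queue.append((link, depth + 1, url))

        time.sleep(delay_ms / 1000.0)     # spec 2: 0 = as fast as possible
    return results
```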
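Extra spec 9 (blocking detection) could be sketched as a simple counter of identical consecutive responses; the class name and the hashing approach are assumptions, with the threshold of 25 taken from the spec.

```python
# Minimal sketch of the blocking heuristic from extra spec 9: if the same response
# body comes back `threshold` times in a row (25 in the spec), assume the crawler
# is blocked. Class name and hashing approach are assumptions.
import hashlib


class BlockDetector:
    def __init__(self, threshold=25):
        self.threshold = threshold
        self.last_digest = None
        self.repeat_count = 0

    def register(self, page_body: str) -> bool:
        """Record one response body; return True when the crawler looks blocked."""
        digest = hashlib.sha256(page_body.encode("utf-8", "replace")).hexdigest()
        if digest == self.last_digest:
            self.repeat_count += 1
        else:
            self.last_digest = digest
            self.repeat_count = 1
        return self.repeat_count >= self.threshold
```

The crawler would call register() after every fetch; when it returns True, crawling on that baseurl/server IP would be disabled and a "crawler blocked" entry written to the logfile (spec 10).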

Skills required:
.NET, Python, Web Scraping, Web Search
Project posted by: rkillaars (Netherlands, verified)