I am looking for someone to design, implement, and run a couple of trial runs of a crawler that would collect detailed information from a large portion of YouTube videos.
I would like the output to be a set of records with the following information: video id, date uploaded, user, number of views, number of comments, time length, category.
Ideally the program would crawl using a search method that generates a large, unbiased sample; by large I mean on the order of 10 million records.
I would also like to be able to start the crawl from several different seeds and have the output augmented as new videos are found, so I can do several runs.
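To make the seeded, augmentable crawl concrete, here is a minimal sketch of what I have in mind. It is only an illustration, not a required design: `related_videos` is a hypothetical callable standing in for whatever YouTube lookup the crawler actually uses, and the shared `seen` set is what keeps records distinct when the crawl is restarted from new seeds.

```python
from collections import deque

def crawl(seeds, related_videos, seen=None, max_records=1000):
    """Breadth-first crawl starting from several seed video IDs.

    `related_videos` is a hypothetical function mapping a video ID to the
    IDs linked from its page (a real crawler would query YouTube here).
    Passing the same `seen` set across runs lets later seed runs augment
    the output with only new videos.
    """
    if seen is None:
        seen = set()
    queue = deque(seeds)
    found = []
    while queue and len(found) < max_records:
        vid = queue.popleft()
        if vid in seen:
            continue
        seen.add(vid)
        found.append(vid)  # a real crawler would extract the full record here
        queue.extend(related_videos(vid))
    return found

# Toy usage with an in-memory "related videos" graph:
graph = {"a": ["b", "c"], "b": ["a", "d"]}
print(crawl(["a"], lambda v: graph.get(v, [])))  # -> ['a', 'b', 'c', 'd']
```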
I have no preference on the specific language and software infrastructure as long as it is mostly open source stuff.
If you already have this done I would be willing to hire you just for running the program for me to my specs.
I will only consider proposals from people who have done this, or something extremely similar, before.
1. 10 million to 20 million *distinct*, random YouTube video records, collected within one week, in a format readable by MS Access. Each record has the following format:
YouTube video ID-> number of views, date uploaded, number of comments, time length, uploader, category, available resolutions.
I cannot provide a server - running the query is part of the job.
The delivered records will be audited as follows: 10,000 random records will be checked using an Amazon Mechanical Turk project. I will accept the deliverable if 9,801 or more (98.01%) are validated.
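To be concrete about the deliverable format, each record could be one row of a CSV file, which MS Access imports directly. This is just an illustration; the field names and the sample row below are mine, and I am open to other delimited formats:

```python
import csv
import io

# Column names mirroring the record definition above (illustrative only).
FIELDS = ["video_id", "views", "date_uploaded", "comments",
          "length_seconds", "uploader", "category", "resolutions"]

def write_records(records, out):
    """Write dict records as CSV that MS Access can import."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

# Example with a single made-up record:
buf = io.StringIO()
write_records([{"video_id": "dQw4w9WgXcQ", "views": 1000,
                "date_uploaded": "2009-10-25", "comments": 42,
                "length_seconds": 212, "uploader": "example_user",
                "category": "Music", "resolutions": "360p|720p"}], buf)
print(buf.getvalue())
```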
2. The script (source code) used to crawl and extract the information.
Other: It is important for me to understand how the crawling occurs (the algorithm).
3. Please look carefully at all the fields in the record definition above and ensure that you KNOW how to extract that information. I will accept bids for shorter or simpler records if they come with an explanation of why what I am looking for is not doable or is too hard.
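As one example of the field-extraction work I expect bidders to have thought through: if the data comes from the YouTube Data API (an assumption on my part), video length arrives as an ISO 8601 duration string and must be converted before it is useful as a numeric field. A sketch of that conversion:

```python
import re

def iso8601_duration_to_seconds(duration):
    """Convert an ISO 8601 duration such as 'PT3M32S' (the form the
    YouTube Data API uses for video length) into whole seconds."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not m:
        raise ValueError(f"unrecognized duration: {duration}")
    hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds

print(iso8601_duration_to_seconds("PT3M32S"))  # -> 212
```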