We need a script written in Node.js that does the following:
Receive URLs from a RabbitMQ server
Read the content type of each URL
If the content type is text or HTML:
Download the URL
Extract the title (try h1 first, then fall back to title)
Extract the text with readability
Write the result back to the RabbitMQ server
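To illustrate the content-type check and the title fallback described above, here is a minimal sketch. The function names `isTextOrHtml`, `stripTags`, and `extractTitle` are hypothetical, treating `application/xhtml+xml` as HTML is an assumption, and the regex matching is a simplification (no entity decoding, no handling of malformed markup). A real implementation would more likely use an HTML parser.

```javascript
// Hypothetical helper: does a Content-Type header value count as text or HTML?
function isTextOrHtml(contentType) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  // Assumption: XHTML is treated as HTML as well.
  return mediaType.startsWith('text/') || mediaType === 'application/xhtml+xml';
}

// Remove tags and collapse whitespace inside a matched fragment.
function stripTags(fragment) {
  return fragment.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: try the first <h1>, then fall back to <title>.
// Returns null when neither yields any text.
function extractTitle(html) {
  for (const tag of ['h1', 'title']) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, 'i');
    const match = pattern.exec(html);
    if (match) {
      const text = stripTags(match[1]);
      if (text) return text;
    }
  }
  return null;
}
```

An empty h1 is skipped, so a page with a blank h1 and a filled title still gets a title.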
It should be possible to download a configurable number of URLs (1-100) in parallel.
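One possible way to cap the number of parallel downloads is a small promise pool. This is only a sketch: `runWithLimit` and its `worker` callback are hypothetical names, and the 1-100 range check simply mirrors the requirement above.

```javascript
// Hypothetical helper: run worker(item) for every item, with at most
// `limit` calls in flight at any time. Never rejects because of a worker
// failure; each slot in the result records success or error.
async function runWithLimit(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new RangeError('limit must be an integer between 1 and 100');
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each lane pulls the next unclaimed item until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, lane));
  return results;
}
```

When the URLs arrive through amqplib, setting `channel.prefetch(n)` before consuming may achieve a similar cap, since the broker then holds back further messages until earlier ones are acknowledged.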
The script should:
be absolutely non-blocking
have proper error handling
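As a sketch of what non-blocking with proper error handling could mean per URL: the function below awaits every I/O step, enforces a timeout, and always resolves to a result object instead of throwing. The name `processUrl`, the result shape, the `fetchImpl` and `timeoutMs` options, and the 10-second default are all assumptions. It also assumes a runtime with `AbortController` and a fetch-compatible function (built into Node.js 18 and later), and it reads the content type from the GET response headers rather than from a separate HEAD request.

```javascript
// Hypothetical per-URL step: never throws, always resolves to a result
// object with status 'ok', 'skipped', or 'error'.
async function processUrl(url, { fetchImpl = globalThis.fetch, timeoutMs = 10000 } = {}) {
  const controller = new AbortController();
  // Abort the request if it takes longer than timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, { signal: controller.signal });
    if (!response.ok) {
      return { url, status: 'error', reason: `HTTP ${response.status}` };
    }
    const contentType = response.headers.get('content-type') || '';
    if (!contentType.toLowerCase().startsWith('text/')) {
      // Not text or HTML: skip without reading the body.
      return { url, status: 'skipped', reason: `content type ${contentType || 'unknown'}` };
    }
    const html = await response.text();
    return { url, status: 'ok', contentType, html };
  } catch (error) {
    const reason = error && error.name === 'AbortError' ? 'timeout' : String(error && error.message);
    return { url, status: 'error', reason };
  } finally {
    clearTimeout(timer);
  }
}
```

Because failures come back as data, the caller can publish an error result to RabbitMQ for a bad URL and keep processing the rest.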
We can provide a Python script that did almost the same thing.