An automated webpage scanning application needs to gather information from a specific list of websites (about 10k) and store it in a Java object for further processing. Each programmer on this project is given a group of 100 sites and implements the scanners for those sites, according to their HTML structure.
The required information is usually organized in a highly structured manner, so gathering it can be easily implemented as an iteration over the entries.
The programmer is given a class library which the implemented scanners must comply with. Moreover, the provided library already contains a high-level API that abstracts and automates the scanning process. If the site is well-structured, the implementor simply needs to specify in a jQuery-like fashion where the required information is located. Nevertheless, the programmer is allowed to correct scanners that do not work in order to obtain full payment.
The application is written in Java 7, so JDK 7 is required to compile the scanners.
The application depends on two other libraries: jsoup 1.7.2 (to parse the HTML pages) and Apache Commons Lang 3.0.1 (general purpose). In most cases, the implementor will not need to use either of them directly.
To ease the structural detection of the HTML pages before implementing the related scanner, the use of Firefox with the Firebug plugin is highly recommended.
A medium or good knowledge of Java is required, in order to produce good-quality code.
Since jQuery-like selectors are used to navigate through the HTML of the scanned page, the programmer must know how to write them; in any case, the jsoup library API docs contain a list of the supported selectors.
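To give an idea of what such selectors look like, here is a minimal sketch that uses jsoup directly rather than the application library (whose API is not shown in this description). The HTML fragment, the class names "entry", "name" and "phone", and the SelectorSketch class are all made up for illustration; a real scanner would use whatever structure the target page has.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {
    // Hypothetical markup: every entry is a div.entry holding a name and a phone number.
    static final String HTML =
          "<div id=\"list\">"
        + "<div class=\"entry\"><span class=\"name\">Mario Rossi</span>"
        + "<span class=\"phone\">06 1234</span></div>"
        + "<div class=\"entry\"><span class=\"name\">Anna Bianchi</span>"
        + "<span class=\"phone\">02 5678</span></div>"
        + "</div>";

    // Returns one {name, phone} pair per entry found in the given HTML.
    static List<String[]> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String[]> rows = new ArrayList<String[]>();
        // "div#list > div.entry" selects the direct div.entry children of div#list.
        for (Element entry : doc.select("div#list > div.entry")) {
            String name = entry.select("span.name").text();
            String phone = entry.select("span.phone").text();
            rows.add(new String[] { name, phone });
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : extract(HTML)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

The same pattern (select the repeating container, then select the fields inside each one) is what "an iteration on each entry" amounts to in practice.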
The websites to be scanned are in Italian, but no particular knowledge of this language is needed.
The programmer will be provided with the API docs for the application library and some example scanners.
By “site” we mean a domain (e.g. [url removed, login to view]); if a site contains more than one webpage with the information we look for (e.g. [url removed, login to view], [url removed, login to view]), we call these “sub-sites”; if a sub-site with a long list of entries is divided into several numbered pages, such pages belong to the same sub-site and all of them must be scanned.
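The rule that all numbered pages of a sub-site must be scanned can be sketched as a simple loop. This is only an illustration under stated assumptions: PageReader and scanAllPages are hypothetical names, not part of the application library, and the stop condition (an empty page means the list is over) is one possible convention among several.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PagedSubSiteSketch {
    /** Hypothetical abstraction: returns the entries found on one numbered page. */
    interface PageReader {
        List<String> readEntries(int pageNumber);
    }

    // Collects entries from page 1 onwards until a page is empty or maxPages is reached.
    static List<String> scanAllPages(PageReader reader, int maxPages) {
        List<String> all = new ArrayList<String>();
        for (int page = 1; page <= maxPages; page++) {
            List<String> entries = reader.readEntries(page);
            if (entries.isEmpty()) {
                break; // assumed convention: an empty page marks the end of the sub-site
            }
            all.addAll(entries);
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake reader standing in for real HTTP fetching: three pages with two entries each.
        PageReader fake = new PageReader() {
            public List<String> readEntries(int pageNumber) {
                if (pageNumber > 3) {
                    return Collections.<String>emptyList();
                }
                List<String> entries = new ArrayList<String>();
                entries.add("entry " + pageNumber + "a");
                entries.add("entry " + pageNumber + "b");
                return entries;
            }
        };
        System.out.println(scanAllPages(fake, 50).size()); // prints 6
    }
}
```

The maxPages cap is there so that a site which never returns an empty page cannot keep the scanner running forever.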
Hence, the programmer has to implement a scanner for a given list of sub-sites (coming from 100 sites, as said above), keeping in mind that the HTML structure of sub-sites within the same site is often the same.
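Since sub-sites of the same site often share their HTML structure, it can help to group the assigned sub-site URLs by domain before writing any scanner. The sketch below does this with the standard library only; the class name and the example.com / example.org URLs are placeholders, not real assignments.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SubSiteGroupingSketch {
    // Groups sub-site URLs by host, keeping the order in which hosts first appear.
    static Map<String, List<String>> groupBySite(List<String> subSiteUrls) {
        Map<String, List<String>> bySite = new LinkedHashMap<String, List<String>>();
        for (String url : subSiteUrls) {
            // getHost() returns null for URLs without an authority part;
            // such URLs end up together under the null key.
            String host = URI.create(url).getHost();
            List<String> group = bySite.get(host);
            if (group == null) {
                group = new ArrayList<String>();
                bySite.put(host, group);
            }
            group.add(url);
        }
        return bySite;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.example.com/list-a",
            "http://www.example.com/list-b",
            "http://www.example.org/directory");
        for (Map.Entry<String, List<String>> e : groupBySite(urls).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " sub-site(s)");
        }
    }
}
```

Each group is then a candidate for one shared set of selectors, to be checked against the actual pages.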
As already said, the produced code must comply with the application library we give. The programmer must provide the source code of the implemented scanners and, optionally, the compiled class files.
We reserve the right to verify that the produced scanners actually work, and to pay only for the working ones. However, the programmer is allowed to fix the non-working ones to obtain full payment.