We wish to build an intelligent spider that can learn from a user's drag-and-drop actions. Our objective is to provide a GUI for learning spidering rules.
The rules will be based on a combination of knowing the HTML page structure and being able to extract elements, e.g. a table of values and thereafter fields within the table, which may repeat.
Extracted data should be written to a database, i.e. we should be able to drag and drop the extracted field values onto a database structure (auto-mapping).
There may be several rows of data, and therefore several database commits, per HTML page parsed.
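To illustrate the extract-then-map step, here is a minimal sketch in Python using only the standard library (the final system could equally be built in PHP). The class `TableRowParser`, the function `store_rows`, the `column_map` argument and the `extracted` table name are all hypothetical names chosen for this example. The mapping of cell position to database column stands in for what the user would define by drag and drop.

```python
import sqlite3
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made only of <th> cells are skipped
                self.rows.append(self._row)
            self._row = None


def store_rows(html, column_map, conn, table="extracted"):
    """Write one database record per table row.

    column_map maps a cell position to a column name, e.g. {0: "name"}.
    Table and column names are assumed to be trusted identifiers.
    """
    parser = TableRowParser()
    parser.feed(html)
    columns = list(column_map.values())
    col_sql = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
    placeholders = ", ".join("?" for _ in columns)
    records = [
        tuple(row[i] for i in column_map)
        for row in parser.rows
        if len(row) > max(column_map)  # ignore rows too short for the mapping
    ]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', records)
    conn.commit()
    return len(records)
```

Calling `store_rows(page_html, {0: "name", 1: "price"}, sqlite3.connect(":memory:"))` would insert one record per data row and return the number of rows written.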
Extraction rules are twofold:
1) The HTML structure itself, tagged to support drag and drop and the selection of groups of elements
2) Automatic regex pattern learning based on examples within 1), or combined with 1)
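As one possible reading of rule type 2), the sketch below generalizes a regex from a handful of example values. It is a deliberately naive approach, not a prescribed algorithm: each example is reduced to runs of digits, ASCII letters and literal punctuation, and a pattern is produced only when all examples share the same shape. The function names `learn_pattern`, `_char_class` and `_signature` are hypothetical.

```python
import re
from itertools import groupby


def _char_class(ch):
    """Map a character to a regex fragment: digit, ASCII letter, or literal."""
    if ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)


def _signature(example):
    """Collapse runs of digits or letters into 'class+'; keep other characters literal."""
    parts = []
    for cls, run in groupby(example, key=_char_class):
        length = len(list(run))
        if cls in (r"\d", "[A-Za-z]"):
            parts.append(cls + "+")
        else:
            parts.append(cls * length)
    return parts


def learn_pattern(examples):
    """Return a compiled regex covering all examples, or None if their shapes differ."""
    signatures = {tuple(_signature(e)) for e in examples}
    if len(signatures) != 1:
        return None  # the examples disagree; the user must supply more guidance
    return re.compile("".join(signatures.pop()))
```

For instance, the examples `12/03/2009` and `1/4/2010` have the same shape (digits, slash, digits, slash, digits), so the learned pattern would also match `7/11/2008`. Mixing in an example such as `March 2009` makes the shapes differ, and the function returns `None` rather than guessing.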
Given a URL, the page should be parsed so that its HTML elements are ready for drag and drop in the GUI.
One of the key concerns is detecting any ambiguity in the rules. For example, a table may occur 20 times on a page, and we may need only the 4th table, or only the table with a particular style (<table columns=30 style="example">).
It is important that recurring elements are supported.
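A rough sketch of such an ambiguity check, in Python with the standard library only: a rule names a tag and may optionally be narrowed by attribute values or by position on the page. The names `TagCollector` and `check_rule`, and the three status strings, are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the attributes of every occurrence of one tag, in document order."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found = []  # one attribute dict per occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found.append(dict(attrs))


def check_rule(html, tag, attrs=None, index=None):
    """Return (status, positions) for a rule.

    status is 'unique', 'ambiguous' or 'no match'; positions are the 1-based
    positions of the matching elements among all elements with that tag.
    """
    collector = TagCollector(tag)
    collector.feed(html)
    wanted = attrs or {}
    positions = [
        pos
        for pos, found in enumerate(collector.found, start=1)
        if all(found.get(k) == v for k, v in wanted.items())
    ]
    if index is not None:
        positions = [pos for pos in positions if pos == index]
    if not positions:
        return "no match", positions
    if len(positions) == 1:
        return "unique", positions
    return "ambiguous", positions
```

On a page with several tables, `check_rule(page, "table")` would report the rule as ambiguous, while adding `attrs={"style": "example"}` or `index=4` narrows it to a single table if exactly one qualifies. The GUI could use the ambiguous status to prompt the user for a position or a distinguishing attribute.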
Grouping and nested elements:
It should be possible to first group elements using a rubber-banding GUI technique (e.g. select a table), then drag and drop the selection from the resulting page into a box. The system should then be able to visit the page URL and follow the drag-and-drop example, i.e. extract the table.
It should be possible for the system, based on an example, to know which HTML elements to extract from a page based on a user's drag-and-drop action. Regex rules and wildcard patterns should be self-learning in order to extract elements from within a table.
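One way to turn a drag-and-drop example into a replayable rule is to record the structural path of the selected element and look up the same path on later pages. The Python sketch below assumes reasonably well-formed markup and assumes the GUI reports the selection as the 0-based position of the element in document order. `PathIndexer`, `learn_rule` and `apply_rule` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PathIndexer(HTMLParser):
    """Record a positional path such as 'html[1]/body[1]/table[2]' for every element."""

    def __init__(self):
        super().__init__()
        self.paths = []      # path of each element, in document order
        self.text = {}       # path -> all text found inside that element
        self._stack = []     # paths of the currently open elements
        self._counts = [{}]  # per open element: how many children of each tag so far

    def handle_starttag(self, tag, attrs):
        counts = self._counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        parent = self._stack[-1] + "/" if self._stack else ""
        path = f"{parent}{tag}[{counts[tag]}]"
        self.paths.append(path)
        self.text[path] = ""
        if tag not in VOID:
            self._stack.append(path)
            self._counts.append({})

    def handle_endtag(self, tag):
        # Assumes tags are properly nested; real-world pages may need repair first.
        if tag not in VOID and self._stack:
            self._stack.pop()
            self._counts.pop()

    def handle_data(self, data):
        for path in self._stack:
            self.text[path] += data


def _index(html):
    indexer = PathIndexer()
    indexer.feed(html)
    return indexer


def learn_rule(example_html, selected_position):
    """Turn the user's selection on the example page into a path rule."""
    return _index(example_html).paths[selected_position]


def apply_rule(html, rule):
    """Return the whitespace-normalized text at the learned path, or None if absent."""
    text = _index(html).text.get(rule)
    return None if text is None else " ".join(text.split())
```

A purely positional path is brittle when page layouts shift, so in practice it would likely be combined with attribute checks and learned regex patterns as described above.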
If you have strong experience in some of the above coupled with solid GUI development, please respond.
PHP Simple HTML DOM Parser, or a similar DOM parser, can be used to tag elements for drag and drop.
Other related reading:
1. Ontology learning and population: bridging the gap between text ... (Buitelaar, Philipp Cimiano, 2008, 273 pages)
"However, it relies on a set of manually written regular expressions, ... critical requirement of the method is the availability of sound core ontologies, ..."
2. Ontology learning for the semantic Web (Maedche, 2002, 244 pages)
"Figure 7.12 depicts the view of pattern engineering that allows to development and debugging of regular expression patterns for ontology learning. ..."