Visual web scraping framework in Delphi.
I would like a generic scraping framework written in Delphi, which I can use as a basis for building web scrapers for various websites.
An implementation for a specified website will be required to demonstrate the functionality of the framework.
Framework functionality must include:
* full programmatic navigation of a website (via visual recognition of components, which I specify programmatically through the framework)
* handling/navigation of pop-up windows, Java and otherwise. (visual)
* emulation of website clicks/actions. (performing these within the application, rather than emulating clicks at the operating-system level, would be preferred)
* ability to extract text data from a visual area. (i.e. visual translation of the text)
* non-visual extraction of text data would be useful too. (if it can be done in a simple way, e.g. highlight and copy to clipboard)
* a simple framework to store the data, which reflects the structure of the data on the web source.
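To make the "visual" extraction requirement concrete: one possible starting point, assuming the target website is hosted in an embedded browser control inside the application, is to grab a specified screen area into a bitmap that an OCR step can then translate. This is a sketch only; the window handle and region would come from the framework's area mappings.

```delphi
uses
  Winapi.Windows, Vcl.Graphics;

{ Sketch only: captures a rectangular area of a window (e.g. an embedded
  browser control hosting the target website) into a bitmap.  The bitmap
  could then be handed to an OCR step to "visually translate" the text.
  AWnd and ARegion are assumed to be supplied by the framework. }
function GrabVisualArea(AWnd: HWND; const ARegion: TRect): TBitmap;
var
  SrcDC: HDC;
  W, H: Integer;
begin
  W := ARegion.Right - ARegion.Left;
  H := ARegion.Bottom - ARegion.Top;
  Result := TBitmap.Create;
  Result.SetSize(W, H);
  SrcDC := GetWindowDC(AWnd);
  try
    BitBlt(Result.Canvas.Handle, 0, 0, W, H,
           SrcDC, ARegion.Left, ARegion.Top, SRCCOPY);
  finally
    ReleaseDC(AWnd, SrcDC);
  end;
end;
```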
More description will be added soon.
Additional Project Description:
09/26/2013 at 23:30 CLST
- solution must be in Delphi.
- I expect the developer to make use of available Delphi libraries.
- I am open to ideas and suggestions from the developer, but want to keep it simple.
Objective: to build a visual web scraper framework in Delphi, so that web scraper applications can be built using the framework.
I want the ability to build the web scraper applications myself, though if I am happy with the project, the likelihood of future work is high.
**To clarify, by "visual" I mean: text data is extracted "visually", i.e. translated from a visual image (a JPG screen grab or equivalent). Website navigation is also to be "visual".
- working Delphi source code of the framework.
- compatible with Delphi XE4.
- extracts text data visually**.
- navigates the website visually**.
- emulates basic website interaction, i.e. mouse clicks, typing, etc.
- parts of the website to click or type into are identified visually**.
- handles navigation of pop-up windows visually**, Java/Flash and otherwise.
- it is important that the design maximises scraping speed; multithreading would be good if you have experience with it.
- the design must also aim to be efficient with computing resources.
- working Delphi source code of a scraper for a given website, built using the framework (to demonstrate functionality).
- I suggest the framework allow you to specify areas where text data, a navigation node, or an interaction area exists.
- I suggest a navigation tree (or equivalent) be generated dynamically, where any data etc. can be related to the relevant navigation node.
- the framework allows all data extractions, navigations, and interactions to be saved and loaded. (to/from a mapping or equivalent)
- any saved mappings/commands can be replayed programmatically to emulate website use.
- a method to test that the visual area for a mapping/command is correct.
- the framework should compile to a working executable, in which you can create/save new mappings/commands through the framework by manually clicking and interacting with a website.
- there must be a visual representation of an individual mapping/command (e.g. highlighting the specified area).
- navigation includes scrolling.
- emulation of website clicks/actions should be done within the application, rather than by emulating clicks at the operating-system level.
- data should be stored according to where it exists on the website, i.e. the data structure must be created dynamically as the website is navigated.
- store a history of extracted data in working memory, with timestamps.
- provide a simple way to view the history data while the scraper app is running, e.g. using a string grid or equivalent.
- I suggest (for example) using the generic TList<T> and TObjectList<T>, so that the navigation and storage of data can be dynamic and generic. Data, navigation, and interaction nodes could each have their own class, inheriting from a generic node class, and sit as objects in the navigation tree's lists.
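As a rough illustration of that suggestion, the node hierarchy might look something like the following sketch. All class and field names here are my own illustrative assumptions, not requirements; the developer is free to design this differently.

```delphi
uses
  System.Types, System.Generics.Collections;

type
  { Generic base node: every mapped element on the site is a node with a
    visual area and child nodes, forming the navigation tree. }
  TScrapeNode = class
  private
    FName: string;
    FArea: TRect;  // the visual area on the page this node maps to
    FChildren: TObjectList<TScrapeNode>;
  public
    constructor Create(const AName: string; const AArea: TRect);
    destructor Destroy; override;
    property Name: string read FName;
    property Children: TObjectList<TScrapeNode> read FChildren;
  end;

  { Specialised nodes inherit from the generic node class. }
  TDataNode = class(TScrapeNode)         // holds extracted text
  public
    Value: string;
    ExtractedAt: TDateTime;              // timestamp for the history
  end;

  TNavigationNode = class(TScrapeNode);  // a click that changes the page
  TInteractionNode = class(TScrapeNode); // typing, scrolling, etc.

constructor TScrapeNode.Create(const AName: string; const AArea: TRect);
begin
  inherited Create;
  FName := AName;
  FArea := AArea;
  FChildren := TObjectList<TScrapeNode>.Create(True); // owns child nodes
end;

destructor TScrapeNode.Destroy;
begin
  FChildren.Free;
  inherited;
end;
```

With TObjectList<TScrapeNode> owning its children, freeing the root node frees the whole tree, and the tree's shape mirrors the structure of the data on the web source as it is discovered.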
working website scraper source code:
- built using the framework.
- I will give you a specific website to base this on.
- I suggest it be built as an extension of the framework application, where a new unit contains the code specific to the website; i.e. it is essentially the same application, where the core framework saves and executes commands, and stores the data.
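To show what I mean by "a new unit contains the code specific to the website", here is a hypothetical sketch. The unit name, the TScraperFramework class, and the LoadMapping/RunCommand/ExtractData methods are all invented for illustration; they are not an existing API, just the shape I have in mind.

```delphi
unit ExampleSiteScraper;

{ Illustrative only: a website-specific unit sitting on top of the core
  framework.  ScraperFramework, TScraperFramework, LoadMapping,
  RunCommand and ExtractData are assumed names, not an existing API. }

interface

uses
  ScraperFramework; // hypothetical core framework unit

type
  TExampleSiteScraper = class
  public
    procedure Run(AFramework: TScraperFramework);
  end;

implementation

procedure TExampleSiteScraper.Run(AFramework: TScraperFramework);
begin
  // Replay mappings recorded earlier through the framework UI:
  AFramework.LoadMapping('ExampleSite.mapping');
  AFramework.RunCommand('login');        // visual click on the login button
  AFramework.RunCommand('open-results'); // navigate to the data page
  AFramework.ExtractData('price-area');  // visual text extraction
end;

end.
```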