I have started a small Java project in Eclipse that uses the Crawler4j web crawler ([login to view URL]). The crawler stores its data in Berkeley DB ([login to view URL]). I don't have time to finish it. At this point the project compiles, runs, and creates the database, but I have not verified what actually gets saved to it. The first task is to add a class that reads the Berkeley DB and prints out its contents, so I can confirm the crawler is collecting the data I want.
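For the database-dump task, a minimal sketch along these lines might work, assuming the crawl storage folder is a standard Berkeley DB Java Edition environment (which is what Crawler4j uses). The `DbDump` class name and the `hexDump` helper are illustrative, not part of either library, and the raw entries will be Crawler4j's own serialized records:

```java
import java.io.File;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class DbDump {

    // Illustrative helper: render raw bytes as hex for inspection.
    static String hexDump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Open the crawl storage folder read-only (path passed as an argument).
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);
        Environment env = new Environment(new File(args[0]), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);

        // List whatever databases Crawler4j actually created, then dump each one.
        for (String name : env.getDatabaseNames()) {
            System.out.println("=== database: " + name + " ===");
            Database db = env.openDatabase(null, name, dbConfig);
            Cursor cursor = db.openCursor(null, null);
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry data = new DatabaseEntry();
            while (cursor.getNext(key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(hexDump(key.getData()) + " -> " + hexDump(data.getData()));
            }
            cursor.close();
            db.close();
        }
        env.close();
    }
}
```

A raw hex dump is enough to confirm the crawler is writing records; decoding the values into readable fields would need Crawler4j's own serialization classes.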
The second part of this project will be to add methods to the custom crawler class I created so that different types of data can be extracted.
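A hedged sketch of the kind of class meant here, built on Crawler4j's standard `WebCrawler` extension points (`shouldVisit` and `visit`). The class name, the binary-extension filter, and the exact `shouldVisit` signature in the project's Crawler4j version are assumptions:

```java
import java.util.Set;
import java.util.regex.Pattern;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Illustrative filter: skip common binary/static resources.
    private static final Pattern BINARY =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|pdf|zip)$");

    // Pure helper so the filtering rule can be tested on its own.
    static boolean isCrawlable(String url) {
        return !BINARY.matcher(url.toLowerCase()).matches();
    }

    @Override
    public boolean shouldVisit(WebURL url) {
        // Note: newer Crawler4j versions pass a referring Page argument as well.
        return isCrawlable(url.getURL());
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Separate extraction methods could each pull a different kind of data:
            String text = html.getText();               // visible page text
            Set<WebURL> links = html.getOutgoingUrls(); // discovered links
            System.out.println(page.getWebURL().getURL()
                    + " -> " + text.length() + " chars, " + links.size() + " links");
        }
    }
}
```

Each "different type of data" could then become its own method called from `visit` (e.g. one for text, one for links, one for HTML), keeping the extraction logic separate from the crawl control.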
The last part of this project is to be able to feed in a list of URLs to crawl.
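For the URL-list part, reading the seeds from a plain text file (one URL per line) could look like the sketch below. The file format, the `readSeeds` helper, and the skipping of blank lines and `#` comments are all assumptions; the commented `controller.addSeed(...)` call marks where Crawler4j's `CrawlController` would consume each entry:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SeedList {

    // Read one URL per line; skip blank lines and '#' comment lines.
    static List<String> readSeeds(Path file) throws IOException {
        List<String> seeds = new ArrayList<String>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                seeds.add(trimmed);
            }
        }
        return seeds;
    }

    public static void main(String[] args) throws IOException {
        for (String url : readSeeds(Paths.get(args[0]))) {
            // Hand each seed to the crawler before starting it, e.g.:
            // controller.addSeed(url);
            System.out.println("seed: " + url);
        }
    }
}
```

This uses only `java.nio.file`, which is available in Java 1.7, so it fits the project's stated toolchain.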
You can review the Crawler4j code on its website at [login to view URL]. You can also go to the Berkeley DB site and review the basics of that database if you are not familiar with it. The project opens in Eclipse and is configured for Eclipse Juno (3.8+), Java 1.7, and Maven 3.0.4.