I am looking to screen scrape a specific site [url removed, login to view] to collect data on registered sex offenders. The criteria for the search URL to return the records I am interested in are as follows: [url removed, login to view];link=doSearch&commaSeparatedOffenderStatus=1,6,7,8,9&stateStatus=1&offenderType=3
However, hitting that URL directly redirects you back to the homepage unless you already have an active session on the site. I suspect this is the first tricky spot, as a session (or something similar) needs to be established before the scraper can request the search URL.
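A minimal sketch of that session step in Python (the brief below mentions PHP/cURL, but the same idea applies in any language): visit the homepage first so the server issues a session cookie, then request the search URL with that cookie attached. All URLs here are placeholders, since the real ones were removed from the posting.

```python
# Sketch of the session handling. BASE_URL and the search path are
# hypothetical stand-ins for the removed URLs; only the query
# parameters come from the posting.
import http.cookiejar
import urllib.parse
import urllib.request

BASE_URL = "https://example.invalid"  # placeholder for the real host

# One opener shared across requests so cookies persist like a browser session.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def build_search_url() -> str:
    """Build the search URL with the criteria from the posting."""
    params = {
        "link": "doSearch",
        "commaSeparatedOffenderStatus": "1,6,7,8,9",
        "stateStatus": "1",
        "offenderType": "3",
    }
    return BASE_URL + "/search?" + urllib.parse.urlencode(params)

def fetch_results() -> str:
    # Homepage first (sets the session cookie), then the search URL.
    opener.open(BASE_URL).read()
    return opener.open(build_search_url()).read().decode("utf-8", "replace")

if __name__ == "__main__":
    print(fetch_results()[:500])
```

Whether a plain homepage hit is enough, or some hidden form field must also be submitted, would need to be confirmed against the live site.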
Once you do get the results page, you will notice that all the IDs for the results exist in a hidden field: <input type="hidden" name="commaSeparatedPersonIdsALL". I anticipate using those IDs to build the URLs for the next part of the scrape, where each offender's record is retrieved.
From these IDs, the URL to the respective record can be formed: [url removed, login to view], using the ID as the personID parameter.
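Pulling the IDs out of that hidden field and turning them into record URLs could be sketched like this. The record URL pattern is a placeholder for the removed URL; only the field name `commaSeparatedPersonIdsALL` and the `personID` parameter come from the posting.

```python
# Sketch: extract the comma-separated IDs from the hidden input and
# build one record URL per ID. RECORD_URL is hypothetical.
import re

RECORD_URL = "https://example.invalid/offender?personID={pid}"  # placeholder

def extract_person_ids(html: str) -> list[str]:
    """Find the hidden commaSeparatedPersonIdsALL field and split its value."""
    m = re.search(
        r'name="commaSeparatedPersonIdsALL"[^>]*value="([^"]*)"', html)
    if not m:
        return []
    return [pid for pid in m.group(1).split(",") if pid.strip()]

def record_urls(html: str) -> list[str]:
    return [RECORD_URL.format(pid=pid) for pid in extract_person_ids(html)]
```

A regex is fine for one known field like this; a full HTML parser would be the safer choice if more of the page needs parsing.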
From this form I would like the following data scraped, including a URL to the image, and combined into an XML feed which will later be imported into our database (the DB import is not part of this project).
From the right of the photo:
Designation: Sexual Offender
Name: Samuel E Ackerson
Status: Released - Required to Register
Department of Corrections #: D93831
Date of Birth: 05/28/1975
Race: White
Weight: 153 lbs
Date Of Photo: 11/03/2009
Scars, Marks & Tattoos
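Combined into XML, a record with the fields above might look like the sketch below, using only the Python standard library. The tag names, the example personID, and the photo URL are my assumptions, not a fixed schema; adjust to whatever the DB import expects.

```python
# Minimal sketch of one <offender> record in the feed. Tag names are
# illustrative; the personID attribute is kept per the brief.
import xml.etree.ElementTree as ET

def offender_record(person_id: str, fields: dict[str, str]) -> ET.Element:
    rec = ET.Element("offender", personID=person_id)
    for tag, value in fields.items():
        ET.SubElement(rec, tag).text = value
    return rec

feed = ET.Element("offenders")
feed.append(offender_record("12345", {  # hypothetical personID
    "designation": "Sexual Offender",
    "name": "Samuel E Ackerson",
    "status": "Released - Required to Register",
    "doc_number": "D93831",
    "date_of_birth": "05/28/1975",
    "race": "White",
    "weight": "153 lbs",
    "date_of_photo": "11/03/2009",
    "photo_url": "https://example.invalid/photo.jpg",  # placeholder
}))
xml_text = ET.tostring(feed, encoding="unicode")
```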
From Address Information I would like the first Address and the Address Source Information. I would also want longitude and latitude extracted from the map link for the address being imported; these will be stored in the DB on import for geocoding on a map.
From Crime Information - Qualifying Offenses I would like all the information brought into the feed as a table, using the same headers as the page but without color or formatting.
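One way to carry that table into the feed with the page's own headers but no styling is sketched below. The header and row values in the test are illustrative only, not taken from the real page.

```python
# Sketch: turn the Qualifying Offenses table into XML, reusing each
# page header as a tag name (normalized to be XML-safe).
import xml.etree.ElementTree as ET

def offenses_table(headers: list[str], rows: list[list[str]]) -> ET.Element:
    table = ET.Element("qualifying_offenses")
    for row in rows:
        row_el = ET.SubElement(table, "offense")
        for header, cell in zip(headers, row):
            tag = header.lower().replace(" ", "_")
            ET.SubElement(row_el, tag).text = cell
    return table
```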
Again, this data should all be produced as an XML file that I will later use to import into the DB. The XML file should be saved on each run when completed and named with a time/date stamp. The process should be set up so it can be run via the Windows Task Scheduler, so maybe PHP cURL from the command line or something similar... not my area of expertise.
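The per-run output could be as simple as the sketch below: write the feed with a date/time stamp in the filename so each run leaves its own XML file behind. The filename pattern is my choice, not a requirement from the brief.

```python
# Sketch: save the feed to a timestamped file, e.g.
# offender_feed_2024-01-31_143000.xml (pattern is illustrative).
from datetime import datetime
from pathlib import Path

def write_feed(xml_text: str, out_dir: str = ".") -> Path:
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    path = Path(out_dir) / f"offender_feed_{stamp}.xml"
    path.write_text(xml_text, encoding="utf-8")
    return path
```

A Windows Task Scheduler action would then just invoke the script on a schedule (e.g. `python scrape.py`, or the PHP/cURL equivalent if that route is taken).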
Also note that I will need the personID in the XML output for each record.
Hello, I am a scraping expert. I have tested the site; it uses a session, but that is no problem for me. Thank you! (As we agreed, I have added the plugin revision part and added $70 to the bid.)