We need data from one website to be dumped into a .csv file. The site has a table, and each row of that table contains a hyperlink. We need a script that will follow the hyperlink in each row and capture the data on the linked page, most of which is in table format. The data on the linked page has fairly well organized HTML (but we've seen better).
There are around 2,000 rows in the original table, which means 2,000 hyperlinks. The hyperlinked pages vary in the amount of information that they contain (some have more entries than others), but all of the information is in tables on the hyperlinked pages, so it should not be too complicated. All in all, we will have 38,500 rows (2,000 table rows x anywhere from 1 to 100 or so data points on each hyperlinked page) at the end of the project.
We have limited experience writing regular expressions, and if we had more time and fewer obligations, we could do this ourselves. However, this project is time sensitive: we need the data in 5 days. We suspect that someone who has written a lot of scraping code could complete this task in an hour or two.
Finally, we want the script that you write to be usable and adaptable for future rounds of data gathering. We prefer that you use either regular expressions in R or Python. The finished product should include helpful comments for our future use.
Edited to say that we STRONGLY prefer Python or R. We are getting a lot of responses from folks who do PHP, but we don't want to have to learn another language in order to adapt this code in the future. Thanks!
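For reference, the workflow described above (read the index table, follow the hyperlink in each row, collect the table rows on each linked page, write everything to a .csv file) could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not a finished deliverable: it assumes the links sit inside table cells, that the tables are not nested, and that the pages need no login or JavaScript. It uses the standard library's HTML parser instead of regular expressions, and the names TableParser, parse_tables, fetch_page, scrape and write_csv are illustrative, not taken from the site or from any existing script.

```python
import csv
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TableParser(HTMLParser):
    """Collect table rows as lists of cell text, plus links found inside cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per table row
        self.links = []    # href of every <a> seen inside a table cell
        self._row = None   # cells of the row being read (None outside a row)
        self._cell = None  # text pieces of the cell being read (None outside a cell)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def parse_tables(html):
    """Parse an HTML string and return the filled-in TableParser."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser


def fetch_page(url):
    """Download a page and return it as text."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape(index_url, fetch=fetch_page, delay=1.0):
    """Yield [linked_page_url, cell, cell, ...] for each table row on each linked page."""
    index = parse_tables(fetch(index_url))
    for href in index.links:
        page_url = urljoin(index_url, href)  # resolve relative links
        for row in parse_tables(fetch(page_url)).rows:
            yield [page_url] + row
        time.sleep(delay)  # pause between requests to go easy on the server


def write_csv(rows, out_path):
    """Write the scraped rows to a .csv file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

With a real index page address in place of the placeholder, a run would look like write_csv(scrape("https://example.com/index.html"), "output.csv"). Each output row starts with the address of the linked page it came from, so rows can be traced back to the original table. Header rows on the linked pages are kept as ordinary rows and would need filtering if they are not wanted.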
26 freelancers are bidding on average $85 for this job
Hello Sir, I am entirely capable. Simply provide me the URL and the pieces of data you need and let's get started. I requested two days' time just to keep a buffer; most probably I will deliver in one day. Thanks