We have a web based reporting system that we'd like to scrape. The system is built and hosted by another company, so we'd like to build a Django application that can do this.
So far I have built a quick proof of concept PHP script using the PHP cURL bindings to login and run these reports and return a print out of the results. So I have already uncovered some of the non standard things this web app does (like actually returning a response after returning a 301 header) and proven that we can automate pulling reports. I have also built some Django models for this project to get signoff from my stakeholders and also to communicate to the person doing the work what we want at the end. My proof of conecpt is just that. I did it to make sure that we could do what we needed to do.
So what you will need to do.
1. Use cURL for Python to do what I have done in PHP and build an object structure in Python to do what I have done structurally.
2. Take the response from the current web reporting system which is basically a HTML table and store the results in a MySQL database. I have been advised that BeautifulSoup is probably a good tool for this. See [url removed, login to view] The only change that I'd make to how I currently do things is to chunk the reporting time frame down to a month at a time. So for example if we run a report that spans 2 years we actually run 24 reports against the reporting server and log each piece of data to the database and then return the whole 24 month period to the user.
3. Once the report is run it needs to be emailed to the report owner email address with an excel attachment.
4. Make these reports able to be executed as Django Admin tasks
5. Make a django admin command to import all customers into the system from the reporting system. There are probably < 1000 customers.
Notes this application must run under:
Apache 2x running mod_wsgi
Build using buildout (sample django app already does this)
I have included the sample [url removed, login to view] that I have built. It is pretty much complete. I'm happy to not use the generic relations if they prove difficult to implement.
Brief description of models.
reportingUserAccount: These are the user accounts that we use to connect to our reporting server.
These two tables are not complete but are just the fields that we pass up to the reporting server to get it to return data to us. I have a list of these fields I just have not got to making a DJango model for them yet. There are currently two types of reports that are run product and sales lists.
These tables specify the report jobs, where the data is sent to and which user we run the reports as. Reflecting on this it might be better to be modelled in one model.
customer: This model stores the name and customer id of the customers in the remote reporting system.
reportingRun: Every time a report is run there is an entry in this table.
These tables store the data returned from the remote reporting system. Because the data in the remote system changes over time we may store the same data across multiple runs. This is not going to be an issue.
12 freelancers are bidding on average $3581 for this job
Hi, This is Azhagu Selvan from India. I have worked on a handful number of production ready django products. I also have hands-on experience on doing scraping with Python.
we have a team of 25 members with expertise in their profession. We have made a similar kind of projects. For further information please view your PMB. Ready to work with you Regards.
dear sir, we are a chinese team and so glad to know your project at here. we have 25 programmers and 14 web designers works at the same office, completed a lots of web site using Django.