Completed

URL Scraper and Processor

This project was successfully completed by mhmhz for $300 USD in 3 days.

Project Budget: $30-$250 USD
Completed In: 3 days
Total Bids: 15
Project Description

I need a tool that uses a local database to create and manage campaigns. I need to be able to add new campaigns and select campaigns that have already been created. The tool must be able to accept and use proxies, both private and public, in either ip:port or user:password:ip:port format.
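
As a rough illustration of the campaign storage described above, here is a minimal sketch that assumes SQLite as the local database. The table layout and the function names (`open_db`, `add_campaign`, `list_campaigns`) are hypothetical and not part of the request.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the local database; ':memory:' is only for demos."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS campaigns ("
        "id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
    )
    return conn

def add_campaign(conn, name):
    """Create the campaign if it is new and return its id either way."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR IGNORE INTO campaigns (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM campaigns WHERE name = ?", (name,)
    ).fetchone()
    return row[0]

def list_campaigns(conn):
    """Return existing campaign names, e.g. to fill a 'select campaign' list."""
    return [r[0] for r in conn.execute("SELECT name FROM campaigns ORDER BY name")]
```

Keywords, scraped URLs, and proxies could then live in further tables keyed by the campaign id.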

The tool should test added proxies to see if they're working with Google, and highlight the ones that aren't working in red. It also needs a one-click option to delete all failed proxies.
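
The two proxy formats and the "delete all failed proxies" action could be sketched as follows. The `Proxy` type, `parse_proxy`, and `drop_failed` are hypothetical names, and the actual Google check is left as a caller-supplied function because the request does not say how it should be done. The sketch also assumes passwords contain no colon.

```python
from typing import Callable, List, NamedTuple, Optional

class Proxy(NamedTuple):
    host: str
    port: int
    user: Optional[str] = None
    password: Optional[str] = None

def parse_proxy(line: str) -> Proxy:
    """Parse 'ip:port' or 'user:password:ip:port' into a Proxy."""
    parts = line.strip().split(":")
    if len(parts) == 2:
        host, port = parts
        return Proxy(host, int(port))
    if len(parts) == 4:
        user, password, host, port = parts
        return Proxy(host, int(port), user, password)
    raise ValueError(f"unrecognised proxy format: {line!r}")

def drop_failed(proxies: List[Proxy],
                is_working: Callable[[Proxy], bool]) -> List[Proxy]:
    """Keep only the proxies for which the supplied check returns True."""
    return [p for p in proxies if is_working(p)]
```

In a GUI, the same `is_working` result would also decide which rows are highlighted in red.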

Once a campaign is selected or created, I need the tool to:

1. Accept an unlimited number of keywords as input (expect anywhere from 1 to 10,000 keywords)
2. Go to Google, search for each keyword, and scrape the top 1,000 results
3. Move to the next keyword and scrape its results, working down the list of all keywords. Each time it runs a new search, the tool should move to the next proxy.
4. Keep track of the keywords that have been successfully scraped and the keywords that failed scraping. Show me which keywords failed and which succeeded, and allow me to "retry failed keywords"
5. Return all results and save them in the database for the current campaign.
6. After finishing, I need to be able to view all results in a table in the software
7. Click a button that will go and check the PR (Page Rank) of each URL (not domain, but specific URL) from the scraped results, and sort them by highest PR to lowest
8. Click another button that will filter/delete all pages below PR 5
9. Take this list and run a link-check function, where the software visits each page in the URL list created above and extracts all of the links that each of these URLs points to.
10. Check the status of each of these links that it found to see which ones return "no such host" errors (not 404 - page can't be found -- I'm looking for sites that are no longer live at all).
11. Save all of these "no such host" results to database and clear everything else, and show the results in a table in the software
12. Click a button to clean up these URLs - Trim them to root, remove subdomains, and delete "http://" and "https://" and "www." so that all that remains is a list of "no such host" domains in the following format:
[url removed, login to view]
[url removed, login to view]
[url removed, login to view]
etc.
13. Click a button that will run up to 3000-5000 domains through [url removed, login to view]'s bulk checker ([url removed, login to view]), and return results stating which ones are available for purchase and which ones are not.
14. Export the "available domains" to a separate area where the software can run them through the free SEOMoz API to check the Domain Authority and Page Authority for each domain, and return the data in a table.
15. Automatically save as it goes, so that if the software crashes it can pick up where it left off when I click "Start" or "Resume". Also include a save button to save work when finished.
16. A separate area where I can add domains to a wishlist (by clicking a "+" image next to each domain in the table from step 14), where each available domain and its SEOMoz data are listed in a table.
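
The clean-up in step 12 might look roughly like the sketch below. The function name `clean_domain` is hypothetical. Note the loud assumption: it treats the last two labels of the hostname as the root domain, which is wrong for multi-part suffixes such as `.co.uk`; a real tool would need a public-suffix list for those.

```python
from urllib.parse import urlsplit

def clean_domain(url: str) -> str:
    """Strip scheme, 'www.', subdomains, and path, leaving a bare domain.

    ASSUMPTION: the root domain is the last two dot-separated labels.
    """
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    return ".".join(labels[-2:])

def clean_all(urls):
    """Clean every URL and drop duplicates, keeping a stable sorted order."""
    return sorted({clean_domain(u) for u in urls if clean_domain(u)})
```

Deduplicating at this point keeps the list sent to the bulk availability checker in step 13 as short as possible.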

Steps 1-8 are basically ScrapeBox functions. If you've used the ScrapeBox software, this will all make sense to you.
Steps 9-11 are basically Xenu's Link Sleuth type features (Xenu is free and you can try it out to see what I mean; Xenu returns error 12007 for the type of results I'm looking for).
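
One way to tell a "no such host" result apart from a 404 is to attempt only the DNS lookup, without fetching the page at all. The sketch below assumes that a failed lookup is a good enough signal; the function name `is_no_such_host` and the injectable `resolver` parameter are hypothetical, and a temporary DNS outage would also show up as a failure here, so a retry may be worthwhile in practice.

```python
import socket

def is_no_such_host(hostname, resolver=socket.getaddrinfo):
    """Return True when the hostname cannot be resolved at all.

    A site that resolves but answers 404 returns False, because no HTTP
    request is made here; only name resolution is attempted.
    """
    try:
        resolver(hostname, None)
    except socket.gaierror:
        return True
    return False
```

Passing a different `resolver` makes it possible to exercise the logic without network access.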

Also, once the tool hits the end of the proxy list, it needs to cycle back to the top and go through the proxies again. A random delay of 20-60 seconds between Google searches needs to be included. Multithreading is a must to speed up the process (for all steps). I need to be able to plug in my SEOMoz account data (Member ID and secret key). I also need an option to set the number of threads in the settings area (the same place where proxies and SEOMoz account info can be added).
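
The proxy cycling, random delay, and configurable thread count could fit together roughly as below. This is a sketch under assumptions: `search` stands in for whatever function performs one Google query, the names `assign_proxies`, `random_delay`, and `run_all` are hypothetical, and the delay is applied per worker thread after each search, which is only one possible reading of the requirement.

```python
import itertools
import random
import time
from concurrent.futures import ThreadPoolExecutor

def assign_proxies(keywords, proxies):
    """Pair each keyword with the next proxy, wrapping around at the end."""
    return list(zip(keywords, itertools.cycle(proxies)))

def random_delay(low=20.0, high=60.0, sleep=time.sleep):
    """Sleep for a random 20-60 seconds and return the chosen delay."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay

def run_all(jobs, search, threads=4, sleep=time.sleep):
    """Run search(keyword, proxy) for every job on a pool of worker threads."""
    def worker(job):
        keyword, proxy = job
        result = search(keyword, proxy)
        random_delay(sleep=sleep)  # pause this thread before its next job
        return keyword, result

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, jobs))  # results keep job order
```

The `threads` argument maps onto the thread-count setting, and the `sleep` parameter exists only so the delay can be skipped when trying the code out.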

Please message me with any further questions!

Thanks for your time... Also I will need to add more to this tool in the very near future, and would be happy to pay much more to add some more features after the basic tool is created.
