262128 Email extractor

Status: In Progress
Bids: 0
Avg Bid (USD): N/A
Project Budget (USD): N/A

Project Description:
I need a script (Perl or PHP) that runs on a server and extracts emails either from a web site that is a public-facing interface to a database (http:// and https://) or from web sites containing information I want. The script also needs to be compiled as a stand-alone .exe program for Windows Server 2003 and Windows XP. This shouldn't be a big issue for people who already have data-scraping/extraction tools, since the script does just that: download pages, grab emails from them, and save them to an Excel or Access file.

- Extract/capture the email address and the email owner's details, if available: for US addresses, given name, family name, organization name, city, state, postcode, and phone number; for non-US addresses, given name, family name, organization name, city and state/province, country name, postcode, and phone number; plus the profile URL/ID from a specified database and domain folder reached through URL addresses. For example, each of the addresses below links to a person's profile in a database or domain folder, including name, email address, organization name, location, phone number, etc. Each address has its own profile format. I will describe the wanted databases in detail after you win this bid. We need options to manually select which items to collect. For example, sometimes we may want to collect nothing but emails; at other times we may want names, emails, and organization names (but nothing else), depending on the URL being visited (see the options sketch just below).
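
To make the selectable-fields idea concrete, here is a minimal PHP sketch of how such per-job options might look; the field names and sample values are illustrative assumptions, not a fixed spec:

    <?php
    // Hypothetical per-job options: tick only the fields to collect.
    $fields = array(
        'email'        => true,
        'given_name'   => true,
        'organization' => true,
        'city'         => false,
        'phone'        => false,
    );
    // Example profile as the scraper might have parsed it.
    $rawProfile = array(
        'email'        => 'jdoe@domain.org',
        'given_name'   => 'Jane',
        'organization' => 'Example Org',
        'city'         => 'Boston',
        'phone'        => '555-0100',
    );
    // Keep only the ticked fields for output.
    $profile = array_intersect_key($rawProfile, array_filter($fields));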

- Once we input a person's database address such as those below, the program should stay within that database and loop through all the people in it by increasing and decreasing the ID number (multi-threaded), until it finishes or is stopped manually. The program should periodically save its results so that an unexpected outage or error does not lead to data loss. Note that the addresses below all contain the string "ID=" or "id=". The program should automatically change the number right after "ID=" or "id=", retrieve the wanted information, save it to an Access or Excel file, and loop to the next ID until the database has been examined fully. Some IDs will contain no information; the program should simply move on to the next one. A sketch of this ID loop follows the example URLs.

http://www.domain.org/index.cfm?pagename=app_memberDirectory&redirect=MemberDirectoryDetail.cfm&ID=49312

https://secure.domain.org/xxxxx/directory/(S(21ixcpaitctivy45t1cyjxbc))/MemberInfo.aspx?DirID=79053

http://subdomain.domain.edu/WhitePagesPublic.asp?task=showperson&id=178271374180279376174273&a=hs&r=83&kw=

http://subdomain.domain.edu/WhitePagesPublic.asp?task=showperson&id=176271376179277377172279&a=hs&r=83&kw=
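
As one way to read the ID-loop requirement, here is a minimal PHP sketch; the URL pattern is taken from the first example above, while the ID range, CSV output, and email regex are assumptions for illustration:

    <?php
    // Sketch: walk IDs upward, fetch each profile page, pull out emails,
    // and append to a CSV with periodic flushes to survive crashes.
    $base = 'http://www.domain.org/index.cfm?pagename=app_memberDirectory'
          . '&redirect=MemberDirectoryDetail.cfm&ID=';
    $out  = fopen('results.csv', 'a');

    for ($id = 49000; $id <= 50000; $id++) {         // ID range is an assumption
        $html = @file_get_contents($base . $id);
        if ($html === false) continue;               // empty/missing ID: skip to next

        if (preg_match_all('/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i', $html, $m)) {
            foreach (array_unique($m[0]) as $email) {
                fputcsv($out, array($id, $email));
            }
            fflush($out);                            // periodic save against data loss
        }
    }
    fclose($out);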


- Extract emails from a folder or subfolder of a domain, e.g. from all of domain.com, or only from domain.com/folder and below, and not from the root. Sometimes an email address is embedded under a name; in that case, collect the name as well as the email from the embedded link (see the mailto sketch below). For example: http://www.domain.org/aids/faculty.asp
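
For the embedded-link case, a common page pattern is a mailto: link whose visible text is the person's name. A hedged PHP sketch of pulling both out (the page URL is just the example above):

    <?php
    // Sketch: capture the name shown as link text together with the
    // mailto: address it wraps.
    $html = file_get_contents('http://www.domain.org/aids/faculty.asp');
    preg_match_all('/<a[^>]+href="mailto:([^"?]+)[^"]*"[^>]*>(.*?)<\/a>/is',
                   $html, $m, PREG_SET_ORDER);
    foreach ($m as $hit) {
        $email = trim($hit[1]);
        $name  = trim(strip_tags($hit[2]));
        echo "$name <$email>\n";
    }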

- Crawl only pages within the specified URL, or within the folder domain.com/folder, to a maximum crawl depth of 7-10 links. Capture emails that cannot be copied manually (a crawl sketch follows).
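
One plausible shape for such a folder-restricted, depth-limited crawl in PHP (breadth-first, depth capped at 7 here; the start folder and the naive link resolution are assumptions):

    <?php
    // Sketch: breadth-first crawl locked to one folder, depth capped at 7.
    $root  = 'http://www.domain.org/folder/';        // example start folder
    $queue = array(array($root, 0));                 // pairs of (url, depth)
    $seen  = array($root => true);

    while ($queue) {
        list($url, $depth) = array_shift($queue);
        $html = @file_get_contents($url);
        if ($html === false) continue;               // dead page: skip

        // ... run the email regex on $html, as in the ID-loop sketch ...

        if ($depth >= 7) continue;                   // depth cap: stop following links
        preg_match_all('/href="([^"#]+)"/i', $html, $m);
        foreach ($m[1] as $link) {
            // naive absolute-URL resolution, good enough for a sketch
            $abs = (strpos($link, 'http') === 0) ? $link : $root . ltrim($link, '/');
            if (strpos($abs, $root) === 0 && empty($seen[$abs])) {
                $seen[$abs] = true;                  // stay inside the folder only
                $queue[] = array($abs, $depth + 1);
            }
        }
    }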

- Multithreaded email extraction: connect to URLs in multiple threads for faster speed (see the parallel-fetch sketch below).
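
PHP has no lightweight threads, but the usual way to get the same effect, fetching many URLs at once, is the curl_multi family; a minimal sketch:

    <?php
    // Sketch: fetch a batch of URLs in parallel with curl_multi.
    $urls = array(
        'http://www.domain.org/page1.asp',           // example URLs
        'http://www.domain.org/page2.asp',
    );
    $mh  = curl_multi_init();
    $chs = array();

    foreach ($urls as $u) {
        $ch = curl_init($u);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $chs[] = $ch;
    }

    do {                                             // drive all transfers together
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);                      // wait for activity
    } while ($running > 0);

    foreach ($chs as $ch) {
        $html = curl_multi_getcontent($ch);
        // ... run the email regex on $html here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);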

- Automatically delete duplicate emails at the end of the job.
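
End-of-job deduplication can be as simple as case-normalizing and filtering, e.g.:

    <?php
    // Sketch: drop duplicate emails, ignoring case differences.
    $emails = array('A@b.org', 'a@b.org', 'c@d.org');
    $unique = array_values(array_unique(array_map('strtolower', $emails)));
    // $unique is now array('a@b.org', 'c@d.org')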

- Delete all emails (if we tick the option) that were extracted from a given URL.

- Authentication support: if the site is a forum where members must enter a user/password, the script should allow entering those credentials and get identified (see the login sketch below).
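
Forum logins are commonly handled by POSTing the credentials once and reusing the session cookie on later requests. A PHP sketch with curl; the login URL and form field names are assumptions, since every forum differs:

    <?php
    // Sketch: log in once, keep the session cookie for later requests.
    $jar = tempnam(sys_get_temp_dir(), 'cookies');

    $ch = curl_init('http://forum.domain.org/login.php');   // hypothetical login URL
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'username' => 'user',                                // field names vary per forum
        'password' => 'pass',
    )));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);               // save the session cookie
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    // Later requests reuse the jar and are treated as logged in.
    $ch = curl_init('http://forum.domain.org/memberlist.php');
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);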

- Add an unlimited number of different URLs to a queue. We should be able to add any URL or job so that it starts automatically when the last job has ended, and to add a new job while a crawl is running in the background.

- Possibility of pausing, stopping, or deleting a crawl job; if crawling is stopped manually, the information gathered up to that point must still be saved.

- Keep a list of all queues and extractions done, with day/time started, day/time finished, and the number of emails extracted after deduplication; in other words, a log of everything done (a possible queue/log table is sketched below).
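
If the PHP + MySQL route is chosen, the queue and the log could share a single jobs table; a possible schema, with column names that are assumptions:

    <?php
    // Sketch: one table doubling as the job queue and the run log.
    $db = new mysqli('localhost', 'user', 'pass', 'extractor');
    $db->query("
        CREATE TABLE IF NOT EXISTS jobs (
            id          INT AUTO_INCREMENT PRIMARY KEY,
            url         TEXT NOT NULL,
            status      ENUM('queued','running','paused','done','deleted')
                        DEFAULT 'queued',
            started_at  DATETIME NULL,
            finished_at DATETIME NULL,
            email_count INT NULL -- count after deduplication
        )
    ");
    // A worker picks the oldest 'queued' row, marks it 'running', and on
    // completion stamps finished_at and email_count, so the same table is
    // also the history log asked for above.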

- Password-protected area to enter the online application.

- Export to Access 2003 and/or Excel 2003 after the crawl has been done.

- Language: PHP + MySQL, or PHP, or Perl, or whatever language lets you do this.

Maybe you can use a ready-made PHP email-extractor script as the foundation for this one, starting with something like zubrag.com/scripts/email-extractor.php or similar.

Looking forward to your PM and a reasonable bid. I've been into outsourcing for quite some time now, so I know the business. I've commissioned quite a few extractions, and I expect those scripts could be adapted for this job; if you have one, I'm sure you can get this done as well. If you have done something similar, or if you can do this easily, send me the specs of what you've done, what it does, and how you'd do this for me.

Skills required:
Anything Goes, C Programming, Java, Microsoft Access, Perl, PHP
About the employer: Verified