69979 extractor

In Progress

I have 2 sample programs that were never completed: one written in C++ and one in Java. I can provide both as examples. The job can be done from scratch or by finishing the code in either one of these. The C++ one is the more complete of the two. The price would vary depending on whether you were just finishing this code or had to start over. Some of the live links below may be dead, because this is an old writeup of mine; I will add new links as soon as I get some bids.

Here is the project. The site would not allow my HTML tags, so I modified them, but I think you should still get the picture.

MULTIPLE STEP MULTITHREADED DATA EXTRACTOR / POST/GET METHOD - WIN32 APPLICATION

Overview

The extractor must be able to visit given/loaded/generated URLs and extract *strings (sometimes for use in the next step) from the response and/or save the complete page, based on *rules. If only a small string was extracted, it is appended to the same dat/xls/csv/txt file. Complete pages are saved as individual files in a directory (file001, file002, file003, etc.), creating subdirectories as the number of files increases, i.e. a new directory within the main result directory every 10,000 saves.

It must support redirects (302 responses and others) and process JS, ASP and PHP pages. It must be able to access SSL pages, support logins, accept/reject cookies, use keep-alive, and also be able to reconnect as if you had closed your browser and opened a new one (probably just deleting cookies/clearing the cache on every request and reconnecting). It must support the use of SOCKS proxies.

*Strings - First there should be a set of predefined strings, e.g. HTML body, form values, mailto tags, table values, etc.
It should then be capable of creating my own custom strings. Example 1: I could tell it to extract all data between [open html string[ AND [close html string[ in the response. Or there could be multiple strings: string 1 would start at name="firstname"value= and extract all data from that starting string to the ending string, which would be whatever came after the data I wanted (it could be / or [ or a space, line break, line feed or return character). String 2 would start at name="lastname"value= and again take all data up to the ending string. String 3, etc., until I had configured the starting/ending tags for all the parts I wished to extract.

*Rules - Page saving rules: do not save / save only if the page includes [userdefined[, or the page does not include [userdefined[, or the response size is less than / greater than / equal to [userdefined[.

AUTOSAVE - I would like to always save, in another defined file, the URL and variables that the extractor is pulling, and all usernames that created a *match, or all number / number-letter combos that created a match.

*match - finding what it was looking for: file saved, data extracted, etc.

UrlLists - loaded lists

Must be capable of accepting large file lists, i.e. point to the file list and grab lines as needed, as opposed to loading the entire list into memory. These could be lists of usernames, or a dictionary list to run against the site, etc.

UrlLists - program generated

Must be able to create fixed variables, i.e. the main part of the domain, e.g. [url removed, login to view] [fixed doesn't change[
Must be able to create number sequences, e.g. 1000, 1001, 1002, 1003, by range/increment amount.
Must be able to create letter combinations, e.g. every 2-letter combo, plus additional changing/rotating/incrementing parts, with *exceptions, e.g.:
[url removed, login to view][VARIABLE1[[VARIABLE2[
Variable 1 might create every possible 2-letter combination, while variable 2 might increment by [userdefined[ however many numbers, starting at, say, 1000.

*exceptions - In the case above there could be exceptions that would say: do not change the letter combo unless a page/string is found. E.g. aa1000, aa1001, aa1002, aa1003: let's say that here it finds what it was looking for, so it resets the number to 1000 and increments the letter combo by one letter, i.e. now it would be ab1000, ab1001, ab1002, etc.

In some cases there would be no exceptions, but we would still need to tell the program how to generate the URL sections, e.g.: do not change the number until all letter combos have been tried; or run through the range of numbers before changing the letter combo; or randomly choose 2-letter combos until all have been tried; or randomly choose numbers and keep the letter combo the same.

In some letter/number cases there is only 1 "match", so to speak, meaning that it found what it was looking for, ran the extraction as programmed, and now moves on. E.g. aa1000 through cc1000 did not contain anything that it was looking for to make the extraction happen, but when it got to cd1000 it found it. In these cases I would have already tested the sites and found that there is only 1 possible match among the 2-letter combos for each number, meaning that there would be no other 1000 site besides the cd1000 site that it found, so the letter combo would go back to "aa" and the number would increment by 1.

MULTI-STEP EXTRACTIONS INVOLVING GET/POST/LOGIN

First off, there may be a login/password required at some of these sites. This is usually accomplished with the test feature I have listed below, but there are exceptions, so we will need an ONLY AS NEEDED STEP: only log in if you find / do not find whatever we specify, e.g. if the page is less than ? bytes, or there is a 404 error or a forbidden error, etc.
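
As a rough illustration of the first *exceptions rule in the generated-URL section (aa1000, aa1001, ... then ab1000 after a match), a generator could look something like the sketch below. It is not taken from either sample program. The struct name, its fields, and the choice to move on to the next letter combo when the number range runs out without a match are all assumptions made for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical generator for the variable part of a URL, e.g. "aa1000".
// All names here are illustrative; nothing is taken from the sample programs.
struct ComboGenerator {
    std::string letters;  // current letter combo, starts at "aa...a"
    int first;            // first number in the range, e.g. 1000
    int last;             // last number in the range (inclusive)
    int step;             // user-defined increment
    int number;           // current number

    ComboGenerator(int width, int first_, int last_, int step_)
        : letters(static_cast<std::string::size_type>(width), 'a'),
          first(first_), last(last_), step(step_), number(first_) {
        assert(step_ > 0 && first_ <= last_);
    }

    // The string to append to the fixed part of the URL.
    std::string current() const { return letters + std::to_string(number); }

    // Advance the letters like an odometer: "aa" -> "ab" -> ... -> "az" -> "ba".
    // Returns false once the combo has wrapped around past "zz...z".
    bool nextLetters() {
        for (int i = static_cast<int>(letters.size()) - 1; i >= 0; --i) {
            if (letters[i] != 'z') { ++letters[i]; return true; }
            letters[i] = 'a';
        }
        return false;
    }

    // Keep the letters and count upward. On a match, reset the number and
    // move to the next letter combo. Assumption: an exhausted number range
    // is treated the same way as a match.
    // Returns false when there is nothing left to try.
    bool advance(bool matched) {
        if (matched || number + step > last) {
            number = first;
            return nextLetters();
        }
        number += step;
        return true;
    }
};
```

A worker would call current() to build the next URL, fetch it, and then call advance() with whether the page/string was found. The other orderings described above (numbers first, random selection, or resetting to "aa" and incrementing the number after a match) would need their own advance rules.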
In some cases the main function of the extractor is not the extraction. Sometimes I will need to post data to a page that I do not have the link for yet. On the page that we do know the link for, there may, for example, be an ID number present that would be an extracted string used in the next step.

Example: I extract from a site [url removed, login to view]. On the page it has a contact-me link that could be something like [url removed, login to view]. On that contact page there would be a form with variables. The variables could be whatever: my name / my phone / my email / my message. The extraction would be the ID number, so we would look for maybe "cgi?id=", extract everything after that up to, maybe, a space, and use that extracted string in the next step. The next step/request would be a post to whatever the fixed URL is plus the variable it just extracted, i.e. post *DATA to [url removed, login to view]

*DATA - The data for this particular case would be name/phone/email/message. They will all be the same in this case, so they can be fixed values just like the URL, but there needs to be a feature to load CSV/comma-delimited text for these entries, e.g. maybe I want to rotate through a list of email addresses or a few different messages. Maybe I want to post 2 messages to the same person; that would most likely be a step 3 of the extractor.

AUTOSAVE - Again, I would want to save the URL and the list of usernames/ID numbers for each successful post/get.

TEST FEATURE

Will need to include a built-in browser for testing purposes and to log in to sites that require a login. When using keep-alive, this initial login is acceptable for most sites unless they time out; then we would use the AS NEEDED step mentioned above.

LIVE EXAMPLES

[url removed, login to view] - simple one, just numbers
[url removed, login to view][VARIABLE[ number generating, sequential
[url removed, login to view] ractorPassword= another easy one, usernames loaded from files
[url removed, login to view][VARIABLE[&EntryPassword=danger&Admin=&ContractorUserName =&ContractorPassword=
[url removed, login to view]+rm65565 letter/number combos, has a 302 redirect too
[url removed, login to view]+[VARIABLE1[[VARIABLE2[

2-step example:
[url removed, login to view] verify KM4809, to use in the next step
then post to [url removed, login to view]
data to post: pagename=KM4809&user=KM4809&uemail=&uphone=&problem=

Another 2-step example, SSL this time. It is easy though: all numbers, all in order. It is really only a 1-step, but we will use 2 just to verify.
[url removed, login to view] if the page has [string[, move on
then post to [url removed, login to view]
with data: RepEmail=&EmailAddr=&Subject=&Body=Dear+Barb+Williams%2C+

I will get you some login examples, but this should get you started.
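
To make the custom-string and 2-step ideas above a bit more concrete, here is a minimal sketch (C++17 assumed, for std::optional) of extracting the text that follows a starting string up to an ending character, and then building the post data for the KM4809-style example. The function names are hypothetical, and treating the ending string as "the first of a set of single characters" is an assumption; a real version would probably also allow multi-character ending strings.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper: returns the text that follows `start` in `page`, up to
// (not including) the first character found in `endChars`, or up to the end
// of the page if none of them occurs. Returns nullopt when `start` is absent.
std::optional<std::string> extractBetween(const std::string& page,
                                          const std::string& start,
                                          const std::string& endChars) {
    assert(!start.empty());
    std::string::size_type begin = page.find(start);
    if (begin == std::string::npos) return std::nullopt;
    begin += start.size();
    std::string::size_type end = page.find_first_of(endChars, begin);
    if (end == std::string::npos) return page.substr(begin);
    return page.substr(begin, end - begin);
}

// Hypothetical helper for step 2: fills the extracted ID into the fixed post
// data from the 2-step example; the other fields are left empty as in the
// write-up.
std::string buildPostData(const std::string& id) {
    return "pagename=" + id + "&user=" + id + "&uemail=&uphone=&problem=";
}
```

Step 1 would call extractBetween() on the fetched page with "cgi?id=" as the starting string and a space (plus quote and line-break characters, if wanted) as the ending characters. Step 2 would send the result of buildPostData() to the fixed post URL. The HTTP requests themselves are left out here.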

Skills: Anything Goes, C Programming, Java, Visual Basic


About the Employer:
( 6 reviews )

Project ID: #1817955

Awarded to:


Similar to wget, but with lots of rules, I guess. I would prefer C++ because it runs without any extra installations and its threading is very robust. Do you want it to run on all platforms?

$350 USD in 30 days
(0 Reviews)