Closed

69979 extractor

This project was awarded to pingpong for $350 USD.

Employer working
Project Budget: N/A
Total Bids: 1
Project Description

I have 2 sample programs that were never completed - one written in C++, one in Java - and can provide both as examples. The job can be done from scratch or by finishing up the coding on either one of these. The C++ one is the more complete of the 2. Price would vary depending on whether you were just finishing this code or had to start over.

Some of the live links below may be dead, because this is an old writeup of mine. I will add new links as soon as I get some bids.

Here is the project. The site would not allow my html tags, so I modified them; I think you should still get the picture.

MULTIPLE STEP MULTITHREADED DATA EXTRACTOR/ POST/GET METHOD - WIN32 APPLICATION

Overview

Extractor must be able to visit given/loaded/generated urls and extract *strings (sometimes for use in the next step) from the response and/or save the complete page *based on rules. A small extracted string gets appended to the same dat/xls/csv/txt file; complete pages get saved as individual files to a directory, file001..file002..file003..etc., creating subdirectories as the number of files increases, ie. create a new directory within the main result directory every 10,000 saves.
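As a rough illustration of the "new directory every 10,000 saves" layout, here is a minimal Python sketch. The function names, the `dir0001` style directory names, the zero-padding width and the `.html` extension are all assumptions made for the example; the brief only fixes the 10,000-per-directory idea.

```python
from pathlib import Path

def result_path(root, save_index, per_dir=10_000, ext=".html"):
    """Return the path for the Nth saved page (save_index starts at 1)."""
    # Saves 1..10000 land in dir0001, 10001..20000 in dir0002, and so on.
    dir_number = (save_index - 1) // per_dir + 1
    return Path(root) / f"dir{dir_number:04d}" / f"file{save_index:06d}{ext}"

def save_page(root, save_index, body: bytes):
    """Write one page to disk, creating its subdirectory on demand."""
    path = result_path(root, save_index)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path

first = result_path("results", 1)
rollover = result_path("results", 10_001)
```

Computing the directory from the save counter keeps the layout predictable without having to count the files already on disk.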
Must support redirects (302 and other responses) and process js, asp and php pages.
Must be able to access ssl pages, support logins, accept/reject cookies, use keep alive, and also be able to re-connect as if you closed your browser and reopened a new one... probably just deleting cookies/clearing the cache on every request and reconnecting.
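One way to picture the "closed your browser and reopened a new one" behaviour is a session object whose cookie jar can be thrown away at any time. The sketch below uses only the Python standard library; the class name `BrowserSession` is hypothetical. Note the limits of this approach: `urllib` follows redirects and handles https, but it does not reuse connections (no keep alive) and has no built-in SOCKS support, so those requirements would need a different HTTP stack.

```python
import http.cookiejar
import urllib.request

class BrowserSession:
    """Hypothetical helper: an HTTP client whose cookies can be wiped on demand."""

    def __init__(self):
        self.reset()

    def reset(self):
        # A brand new cookie jar and opener, like reopening the browser.
        self.cookies = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )

    def get(self, url, timeout=30):
        # Redirects (e.g. 302) are followed automatically by urllib.
        with self.opener.open(url, timeout=timeout) as response:
            return response.status, response.read()

session = BrowserSession()
first_jar = session.cookies
session.reset()  # forget every cookie before the next request
```

Calling `reset()` before every request gives the "new browser each time" mode; skipping it keeps the login cookies between requests.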

must support the use of socks proxies.

*Strings - first there should be a set of predefined strings, ie. html body / form values / mailto tags / table values etc... and then I should be capable of creating my own custom strings.

example 1 - could tell it we wanted to extract all data between [open html string[ AND [close html string[ in the response, or between multiple strings:

- string 1 would start at name="firstname"value= and extract all data from that starting string to the ending string, which would be whatever came after the data I wanted - could be / or [ or a space or line break or line feed or return character.
- string 2 would start at name="lastname"value= - again all data till the ending string.
- string 3 etc.... until I had configured all starting/ending tags for all parts I wished to extract.
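The start string / end string idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `extract_between`, the sample page text and the exact list of end markers are made up for the example.

```python
def extract_between(text, start, end_markers):
    """Return every value found after `start`, cut at the nearest end marker."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        # The value ends at whichever end marker shows up first.
        ends = [i for i in (text.find(m, begin) for m in end_markers) if i != -1]
        stop = min(ends) if ends else len(text)
        results.append(text[begin:stop])
        pos = stop
    return results

# Hypothetical response body and per-field starting strings.
page = 'name="firstname"value=John name="lastname"value=Smith/\n'
rules = {
    "firstname": 'name="firstname"value=',
    "lastname": 'name="lastname"value=',
}
END_MARKERS = ["/", "[", " ", "\r", "\n"]

extracted = {
    field: extract_between(page, start, END_MARKERS)
    for field, start in rules.items()
}
```

Each configured field gets its own starting string, while the end markers are shared; a real tool would probably let the user set end markers per field as well.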

*Rules - Page saving rules - do not save / save only if:
- page includes [userdefined[ or page does not include [userdefined[
- response size is less than / greater than / equal to [userdefined[
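The page saving rules could be modelled as a small rule object that says yes or no for each response. The sketch below is in Python; the class and field names are hypothetical, and it assumes the size rule means less than / greater than / equal to a user-defined byte count.

```python
import operator
from dataclasses import dataclass
from typing import Optional

_COMPARE = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

@dataclass
class SaveRule:
    must_include: Optional[str] = None      # save only if the page includes this
    must_not_include: Optional[str] = None  # do not save if the page includes this
    size_op: Optional[str] = None           # "<", ">" or "="
    size_bytes: int = 0                     # user-defined size to compare against

    def should_save(self, body: str) -> bool:
        if self.must_include is not None and self.must_include not in body:
            return False
        if self.must_not_include is not None and self.must_not_include in body:
            return False
        if self.size_op is not None:
            size = len(body.encode("utf-8"))
            if not _COMPARE[self.size_op](size, self.size_bytes):
                return False
        return True

rule = SaveRule(must_include="Welcome", must_not_include="Invalid",
                size_op=">", size_bytes=10)
```

All configured conditions must hold for a page to be saved; leaving a field unset simply disables that check.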

AUTOSAVE - I would like to always save, in another defined file, the url and variables that the extractor is pulling, and all usernames that created a match or all number/number-letter combos that created a match.
*match - finding what it was looking for - file saved - data extracted....


UrlLists - loaded lists
Must be capable of accepting large list files, ie. point to the list file and grab lines as needed, as opposed to loading the entire list in memory. These could be lists of usernames or a dictionary list to run against the site...etc....
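Grabbing lines as needed, rather than loading the whole list, maps naturally onto a generator. A minimal Python sketch follows; the function name and the temporary demo file are assumptions for illustration only.

```python
import os
import tempfile
from typing import Iterator

def iter_list_file(path, encoding="utf-8") -> Iterator[str]:
    """Yield non-empty lines from a list file one at a time."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:  # the file object is read lazily, line by line
            entry = line.strip()
            if entry:
                yield entry

# Demo with a tiny temporary file standing in for a huge username list.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alice\n\nbob\n")
names = list(iter_list_file(tmp.name))
os.remove(tmp.name)
```

Because only one line is held at a time, the same code works for a list of ten usernames or a dictionary file of several gigabytes.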

UrlLists - program generated
Must be able to create fixed variables, ie. the main part of the domain, ie. [url removed, login to view] [fixed doesn't change[
Must be able to create number sequences, ie. 1000..1001...1002....1003 - by range/increment amount
Must be able to create letter combinations - ie. every 2 letter combo

plus additional changing/rotating/incrementing parts with *exceptions, ie. [url removed, login to view][VARIABLE1[[VARIABLE2[
variable 1 might create every possible 2 letter combination, while variable 2 may increment by [userdefined[ however many numbers, starting at say 1000
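A fixed part plus generated variables can be sketched with Python generators. Everything named here is an assumption for the example: the helper names, and especially the base url, which is a placeholder because the real one was removed from the posting.

```python
import itertools
import string

def letter_combos(length=2, alphabet=string.ascii_lowercase):
    """Yield aa, ab, ac ... zz for length=2."""
    for combo in itertools.product(alphabet, repeat=length):
        yield "".join(combo)

def number_sequence(start, stop, step=1):
    """Inclusive number range, e.g. 1000..1003 by an increment of 1."""
    return range(start, stop + 1, step)

def generate_urls(fixed, letters, numbers):
    """Fixed part + VARIABLE1 (letters) + VARIABLE2 (numbers)."""
    for combo in letters:
        for number in numbers:  # run the whole number range per letter combo
            yield f"{fixed}{combo}{number}"

BASE = "http://example.com/view?id="  # placeholder, not a url from the brief
urls = list(itertools.islice(
    generate_urls(BASE, letter_combos(), number_sequence(1000, 1002)), 4))
```

Since the urls are generated lazily, even very large letter/number spaces never have to be held in memory at once.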

*exceptions
in the case above there could be exceptions that would say do not change the letter combo unless page found/string found, ie.... aa1000...aa1001...aa1002...aa1003 - let's say here that it finds what it was looking for, so it resets the number set to 1000 and increments the letter combo by one letter, ie. now it would be ab1000..ab1001..ab1002 etc...
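The exception described above (keep the letter combo, count upward, and jump to the next combo as soon as something is found) could look roughly like this in Python. The function name, the `is_match` callback and the example url are hypothetical, and the sketch assumes that a combo is also abandoned when its number range runs out without a match, which the brief does not spell out.

```python
import itertools
import string

def crawl_with_exception(fixed, is_match, start=1000, stop=1999, step=1,
                         alphabet=string.ascii_lowercase):
    """Return the urls tried, moving to the next letter combo after each match."""
    tried = []
    for letters in itertools.product(alphabet, repeat=2):
        combo = "".join(letters)
        for number in range(start, stop + 1, step):
            candidate = f"{fixed}{combo}{number}"
            tried.append(candidate)
            if is_match(candidate):
                break  # match: numbers reset to `start`, letters advance
    return tried

# Pretend only ...aa1003 contains what we are looking for.
tried = crawl_with_exception(
    "http://example.com/",            # placeholder url
    lambda url: url.endswith("aa1003"),
    stop=1005,
    alphabet="ab",                    # shortened alphabet to keep the demo small
)
```

In a real run `is_match` would fetch the page and apply the configured string or page rules.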

in some cases there would be no exceptions, but we would still need to tell the program how to generate the url sections, ie.:
- do not change the number until all letter combos have been tried
- run through the range of numbers before you change the letter combo
- randomly choose 2 letter combos until all have been tried
- randomly choose numbers and keep the letter combos the same

in some letter/number cases there is only 1 "match" so to speak - meaning that it found what it was looking for, so it ran the extraction as programmed...now move on. ie. aa1000 through cc1000 did not contain anything that it was looking for to make the extraction happen, but when it got to cd1000 it found it. In these cases I would have already tested the sites and found that there is only 1 possible match among the 2 letter combos for each number, meaning that there would be no other 1000 site besides the cd1000 site that it found - so the letter combos would go back to "aa" and the number would increment by 1.
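The "only 1 match per number" case flips the loops around: the number is the outer loop and the letter combos restart at "aa" after every hit. A minimal Python sketch, with hypothetical names and a made-up set of known matches standing in for real page checks:

```python
import itertools
import string

def crawl_one_match_per_number(fixed, is_match, numbers,
                               alphabet=string.ascii_lowercase):
    """For each number, try letter combos from 'aa' until the single match is found."""
    hits = []
    attempts = 0
    for number in numbers:
        for letters in itertools.product(alphabet, repeat=2):
            candidate = f"{fixed}{''.join(letters)}{number}"
            attempts += 1
            if is_match(candidate):
                hits.append(candidate)
                break  # no other combo can match this number, move on
    return hits, attempts

known = {"cd1000", "ab1001"}  # invented matches for the demo
hits, attempts = crawl_one_match_per_number(
    "", lambda url: url in known, range(1000, 1002), alphabet="abcd")
```

Stopping at the first hit avoids requesting the remaining combos for that number, which is the whole saving in this mode.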

MULTI-STEP EXTRACTIONS INVOLVING GET/POST/LOGIN

first off there may be a login/password required at some of these sites - usually accomplished with the test feature I have listed below - but there are exceptions, so we will need an ONLY AS NEEDED STEP, which would be: only login if you find/do not find whatever we would specify - if the page is less than ? bytes, or 404 error, or forbidden error...etc...
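The ONLY AS NEEDED login check boils down to a yes/no decision made on each response. A rough Python sketch follows; the class name, field names and sample texts are assumptions, and any one trigger firing is treated as "login now".

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LoginTrigger:
    """Conditions under which the optional login step should run."""
    min_bytes: int = 0                            # login if the page is smaller than this
    error_statuses: Tuple[int, ...] = (403, 404)  # forbidden / not found
    login_if_found: Optional[str] = None          # e.g. text of a login prompt
    login_if_missing: Optional[str] = None        # e.g. text only shown when logged in

    def login_needed(self, status: int, body: str) -> bool:
        if status in self.error_statuses:
            return True
        if len(body.encode("utf-8")) < self.min_bytes:
            return True
        if self.login_if_found is not None and self.login_if_found in body:
            return True
        if self.login_if_missing is not None and self.login_if_missing not in body:
            return True
        return False

trigger = LoginTrigger(min_bytes=20, login_if_found="Please log in")
```

The extractor would call `login_needed` after each request and run the login step, then retry the request, only when it returns true.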

in some cases the main function of the extractor is not the extraction. Sometimes I will need to post data to a page that I do not have the link for yet. On the page that we do know the link for, there may for example be an id number present that would be an extracted string and used in the next step. Example: I extract from a site [url removed, login to view] - on the page it has a contactme link that could be something like [url removed, login to view]. Now on that contact page there would be a form with variables. The variables could be whatever - my name / my phone / my email / my message. The extraction would be the id number, so we would look for maybe "cgi?id=", extract everything after that till a space maybe, and use that extracted string in the next step. The next step/request would be a post to whatever the fixed url is plus the variable it just extracted, ie. post *DATA to [url removed, login to view]
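The two steps (pull the id out of the first page, then build the post for the second) can be sketched without touching the network. This Python example is illustrative only: the helper names, the sample page text, the form field names and the `example.com` url are placeholders, since the real urls were removed from the posting.

```python
from urllib.parse import urlencode

def extract_id(page, marker="cgi?id=", terminators=(" ", "\r", "\n")):
    """Return the text after `marker` up to the first terminator, or None."""
    begin = page.find(marker)
    if begin == -1:
        return None
    begin += len(marker)
    ends = [i for i in (page.find(t, begin) for t in terminators) if i != -1]
    stop = min(ends) if ends else len(page)
    return page[begin:stop]

def build_post(fixed_url, extracted_id, form):
    """Step 2: fixed url + extracted variable, with url-encoded form data."""
    return fixed_url + extracted_id, urlencode(form).encode("ascii")

step1_page = "Questions? See contact.cgi?id=48213 for the contact form."
contact_id = extract_id(step1_page)
post_url, post_body = build_post(
    "http://example.com/contact.cgi?id=",  # placeholder fixed url
    contact_id,
    {"name": "J Doe", "email": "me@example.com"},
)
```

The resulting url and body would then be handed to whatever HTTP client performs the actual post.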

*DATA
the data for this particular case would be name/phone/email/message...now they will all be the same in this case, so they can be fixed values just like the url....but there needs to be a feature to load csv/comma delimited text for these entries, ie. maybe I want to rotate through a list of email addresses or a few different messages... maybe I want to post 2 messages to the same person - that would most likely be a step 3 of the extractor.
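Rotating through csv entries for the post data could work along these lines. The Python sketch below assumes a csv with a header row; the function name, column names and sample addresses are invented for the example.

```python
import csv
import io
import itertools

def rotating_rows(handle):
    """Load all rows from a CSV file object and cycle through them forever."""
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("CSV list is empty")
    return itertools.cycle(rows)

# In-memory stand-in for a csv file of post values.
sample = io.StringIO(
    "email,message\n"
    "first@example.com,Hello\n"
    "second@example.com,Hi there\n"
)
rotation = rotating_rows(sample)
picked = [next(rotation)["email"] for _ in range(3)]
```

Each post takes the next row, and the list wraps around to the first row once it is exhausted.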

AUTOSAVE - again would want to save the URL and the list of usernames/id numbers for each successful post/get.

TEST FEATURE

Will need to include a built-in browser for testing purposes and to login to sites that require login. When using keep alive this initial login is acceptable for most sites unless they time out - then we would use the AS NEEDED step mentioned above.

LIVE EXAMPLES


[url removed, login to view] - simple one, just numbers: [url removed, login to view][VARIABLE[ with numbers generated sequentially

[url removed, login to view]

ractorPassword=

another easy one... usernames load from files.
[url removed, login to view][VARIABLE[&EntryPassword=danger&Admin=&ContractorUserName=&ContractorPassword=

[url removed, login to view]+rm65565 letter/number combos - has a 302 redirect too...

[url removed, login to view]+[VARIABLE1[[VARIABLE2[

2 step example

[url removed, login to view] verify KM4809 - to use in next step

then post to [url removed, login to view]
data to post pagename=KM4809&user=KM4809&uemail=&uphone=&problem=

another 2 step, ssl this time - easy though, all numbers, all in order....really only a 1 step, but we will use 2 just to verify
[url removed, login to view] if page has [string[ move on....
then post to [url removed, login to view]
with data RepEmail=&EmailAddr=&Subject=&Body=Dear+Barb+Williams%2C+

I will get you some login examples....but this should get you started....
