There are three phases to this project: 1) Collect the data. 2) Process the data and build data cleansing rules into a script. 3) Integrate those rules into a separate Perl script.
I will provide a list of domains and email servers as an input file. The task is to connect to each server and capture its SMTP banner. Once the banners are collected, I will review the data and determine which pieces should be parsed by a set of rules (regexes).
Here is an example:
$ telnet [url removed, login to view] 25
Connected to q3email.securesites.net.
Escape character is '^]'.
220 [url removed, login to view] ESMTP Sendmail 8.14.5/8.13.6; Wed, 8 May 2013 21:03:25 GMT
This is the banner I want to collect in the first phase: "220 [url removed, login to view] ESMTP Sendmail 8.14.5/8.13.6; Wed, 8 May 2013 21:03:25 GMT".
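As a rough illustration of the collection step, here is a minimal sketch in Python (the final integration is meant to be Perl, so treat this only as a prototype). The function names `read_banner` and `grab_banner`, the 10-second timeout, and the 4096-byte cap are assumptions made for the example, not requirements. It reads the greeting until the last line no longer has the "220-" continuation form, since some servers send a multi-line greeting.

```python
import socket


def read_banner(sock, max_bytes=4096):
    """Read the SMTP greeting from an already connected socket.

    Keeps reading until a complete line arrives whose 4th character is
    not '-', because "220-..." marks a continuation line and "220 ..."
    marks the final line of the greeting.
    """
    data = b""
    while len(data) < max_bytes:
        chunk = sock.recv(1024)
        if not chunk:  # peer closed the connection
            break
        data += chunk
        complete = data.split(b"\n")[:-1]  # lines already terminated by \n
        if complete and complete[-1][3:4] != b"-":
            break
    return data.decode("latin-1").strip()


def grab_banner(host, port=25, timeout=10.0):
    """Connect to host:port and return the banner, or None on any network error."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return read_banner(sock)
    except OSError:  # covers DNS failures, refused connections and timeouts
        return None
```

A run over the input file would call `grab_banner` once per mail server and write the host and the raw banner to a results file. For ~50,000 hosts, some form of concurrency and a short timeout would probably be needed, which this sketch leaves out.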
Once we have a representative sample (~50,000 or so), we will review the data and determine the bits we want to parse. "Sendmail" is definitely the big one. "Sendmail 8.14.5/8.13.6" is another possibility: if we can accurately parse the version, then we will. These decisions will be driven by what the data reveals.
Once the parsing is complete, the banner collection and parsing will be integrated into a Perl script that takes an input file of domains and collects the mail servers. The task will be to also collect the banner, parse it, and append the parsed elements to the output of my other script.
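To give a sense of what such rules might look like, here is a hedged Python sketch that pulls the software name and version out of a banner shaped like the Sendmail example. The two patterns and the `parse_banner` name are assumptions for illustration only. The real rules will come from reviewing the collected data, and banners that do not follow the "host ESMTP Software Version" layout (for example, ones that put a queue ID after ESMTP) would need their own rules.

```python
import re

# "220 host ESMTP <rest>" or "220-host ESMTP <rest>" on the first line.
BANNER_RE = re.compile(
    r"^(?P<code>\d{3})[ -](?P<host>\S+)\s+(?P<proto>E?SMTP)\b\s*(?P<rest>.*)$"
)
# A software token, optionally followed by a version that starts with a digit.
SOFTWARE_RE = re.compile(
    r"^(?P<software>[A-Za-z][\w.-]*)(?:\s+(?P<version>\d[\w./-]*))?"
)


def parse_banner(banner):
    """Return a dict with code, host, software and version (None when not found)."""
    fields = {"code": None, "host": None, "software": None, "version": None}
    text = banner.strip()
    first_line = text.splitlines()[0] if text else ""
    m = BANNER_RE.match(first_line)
    if not m:
        return fields
    fields["code"] = m.group("code")
    fields["host"] = m.group("host")
    s = SOFTWARE_RE.match(m.group("rest"))
    if s:
        fields["software"] = s.group("software")
        fields["version"] = s.group("version")
    return fields
```

Keeping the software match separate from the version match means a banner with a recognizable product but no usable version still yields the product name.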
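The appending step could look roughly like the Python sketch below. The output format of the other script is not specified here, so the tab separator, the column order in `FIELD_ORDER`, and the `append_fields` name are all assumptions. The point is only that each existing output line keeps its content and gains a fixed number of extra columns, left empty when nothing could be parsed.

```python
# Hypothetical column order for the parsed elements.
FIELD_ORDER = ("software", "version")


def append_fields(line, fields, sep="\t"):
    """Append parsed banner fields to one existing output line.

    Missing or None values become empty columns so every line ends up
    with the same number of fields.
    """
    extra = [fields.get(name) or "" for name in FIELD_ORDER]
    return sep.join([line.rstrip("\r\n")] + extra)
```

For example, a line holding a domain and its mail server, combined with a parsed Sendmail banner, would come out as the original two columns followed by the software and version columns.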
Looking forward to working with you.