I generate a CSV file every week and would like a script that compares the new file with the previous one and produces a report. I know that diff and join can do part of this, but I want the script to take command-line options that specify which columns from each file to compare; all other columns are ignored. The key for joining the two files is a standardized domain name. Each file has a header row with field descriptors, and file sizes range from thousands to hundreds of thousands of rows. The report would show how the specified columns changed from one file to the other. The script would also allow the reported changes to be limited to a specific list of keywords, with any other changes left out. If no keyword list is given, all changes in the specified columns are reported.
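To make the request more concrete, here is a rough Python sketch of the comparison I have in mind. It is only an illustration, not a required design. The option names (`--key`, `--compare`, `--keywords`) and the function names are placeholders I made up. It assumes a change matches a keyword when the old or new value contains it (case-insensitive), and that domains present in only one file are skipped.

```python
import argparse
import csv


def load_rows(lines, key, columns):
    """Index CSV rows by lower-cased domain, keeping only the chosen columns.

    `lines` is any iterable of text lines, e.g. a file opened with newline="".
    """
    return {
        row[key].strip().lower(): {col: row[col] for col in columns}
        for row in csv.DictReader(lines)
    }


def parse_pair(spec):
    """Turn 'OLD:NEW' (or just 'NAME' for both files) into a column pair."""
    old_col, _, new_col = spec.partition(":")
    return old_col, new_col or old_col


def find_changes(old, new, pairs, keywords=None):
    """Return (domain, old_col, old_val, new_col, new_val) per changed column."""
    wanted = [k.lower() for k in keywords or []]
    changes = []
    # Assumption: only domains present in both files are compared.
    for domain in sorted(old.keys() & new.keys()):
        for old_col, new_col in pairs:
            before, after = old[domain][old_col], new[domain][new_col]
            if before == after:
                continue
            text = f"{before} {after}".lower()
            # Assumption: with a keyword list, keep only changes mentioning one.
            if wanted and not any(k in text for k in wanted):
                continue
            changes.append((domain, old_col, before, new_col, after))
    return changes


def parse_args(argv=None):
    """Hypothetical command line: old.csv new.csv --compare webserver:webserver"""
    parser = argparse.ArgumentParser(description="Compare selected CSV columns.")
    parser.add_argument("old_file")
    parser.add_argument("new_file")
    parser.add_argument("--key", default="domain")
    parser.add_argument("--compare", action="append", required=True,
                        metavar="OLD[:NEW]")
    parser.add_argument("--keywords", default="",
                        help="comma-separated list; empty means report all changes")
    return parser.parse_args(argv)
```

Loading both files into dictionaries keyed by domain keeps the lookup simple, and a few hundred thousand rows of selected columns should fit comfortably in memory on an ordinary machine, though I have not measured this.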
Report details would include:
File Names (1st and 2nd)
DateStamp at time of report generation
Row counts for both files
Field Descriptors selected for change report
Total # of changes detected
Keywords the report was limited to (if any)
Then a list of the actual changes in a pipe-delimited output file. If multiple changes are found for a single domain, each change gets its own row and the domain is printed each time. So the report would look like:
Domain|File1_webserver|File2_webserver
[url removed, login to view]|apache|IIS
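As a rough illustration of the report layout described above, here is a small Python sketch. The function name `write_report`, the wording of the summary lines, and the sample values (including `example.com`) are placeholders of mine. Writing a separate header row for each compared column is just one possible way to handle several columns.

```python
import csv
import io
from datetime import datetime


def write_report(out, old_name, new_name, old_count, new_count,
                 fields, keywords, changes):
    """Write the summary lines, then pipe-delimited change rows.

    `changes` is a list of (domain, field, old_value, new_value) tuples.
    """
    out.write(f"Files: {old_name} | {new_name}\n")
    out.write(f"Generated: {datetime.now().isoformat(timespec='seconds')}\n")
    out.write(f"Row counts: {old_count} | {new_count}\n")
    out.write(f"Fields compared: {', '.join(fields)}\n")
    out.write(f"Total changes: {len(changes)}\n")
    out.write(f"Keywords: {', '.join(keywords) if keywords else 'all changes'}\n")

    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for field in fields:
        rows = [c for c in changes if c[1] == field]
        if not rows:
            continue
        # Assumption: one header row per compared column.
        writer.writerow(["Domain", f"File1_{field}", f"File2_{field}"])
        for domain, _, before, after in rows:
            # The domain is repeated on every row, one row per change.
            writer.writerow([domain, before, after])


# Example with made-up values; a real run would pass an open output file.
sample = io.StringIO()
write_report(sample, "old.csv", "new.csv", 2, 2, ["webserver"], [],
             [("example.com", "webserver", "apache", "IIS")])
print(sample.getvalue())
```

Using `csv.writer` with `delimiter="|"` means a value that itself contains a pipe gets quoted rather than breaking the row.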
I look forward to hearing from you.