I want to do a simple text mining task on a large number of files using Python.
The files are stored on a few large network shares and add up to about a million files, covering roughly 100 different filetypes. The "text and document" filetypes that are considered extra interesting are Microsoft Office files, PDF and text (doc, docx, xls, xlsx, xlsm, ppt, pptx, pdf, txt, html, xml, etc.). There are also some binary files, movies and others that are considered less interesting. I have a list of about 100 interesting words in a text file (.txt), one word per row. I want to identify all files that contain one or more of these words in their filename, path or file contents. I would like this task solved in Python. I am not an experienced Python programmer, so the code should be well written, well annotated and easy to modify. Communication and code should be in English. The code should preferably work on Windows, macOS and Linux.
I would like
1) A script to list all the files (not folders) on a network share. Number the list, one file per row. List interesting file information, with columns separated by semicolons (;). Something like this:
File Counter; Full path; creation date; modification date; file owner; filetype; etc.
1; C:/mypath/myfile1.txt; date; date; owner; text document
2; C:/mypath/myfile2.doc; date; date; owner; Microsoft Word
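To make the requirement concrete, here is a minimal cross-platform sketch of script 1. The mount point and output filename are placeholders; file-owner lookup is platform-specific (pwd on Unix, win32security on Windows) and is omitted here. Note that st_ctime is the creation time on Windows but the metadata-change time on Unix.

```python
import csv
import datetime
import os
from pathlib import Path

SHARE_ROOT = Path("/mnt/share")      # hypothetical mount point - adjust for your share
OUTPUT_CSV = "file_inventory.csv"    # hypothetical output filename

def iso(ts: float) -> str:
    """Format a POSIX timestamp as a readable date string."""
    return datetime.datetime.fromtimestamp(ts).isoformat(sep=" ", timespec="seconds")

def inventory(root: Path, out_path: str) -> int:
    """Walk `root`, writing one semicolon-separated row per file. Returns the file count."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter=";")
        writer.writerow(["File Counter", "Full path", "Creation date",
                         "Modification date", "Filetype"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = Path(dirpath) / name
                try:
                    st = full.stat()
                except OSError:
                    continue  # unreadable entry: skip it, do not abort the walk
                count += 1
                writer.writerow([count, str(full), iso(st.st_ctime),
                                 iso(st.st_mtime), full.suffix.lower().lstrip(".")])
    return count
```

Mapping the suffix to a friendly type name ("Microsoft Word" instead of "doc") would be a simple lookup dictionary on top of the last column.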
2) One or several scripts to scan the names and contents of the files from script 1), based on the file of interesting words. "Document and text filetypes" (see above) shall be scanned for both their content and their full path (filename + path). Other files do not need a content scan but must still be scanned by full path. The script(s) shall report the action and result for each file (name scan: yes/no/error; content scan: yes/no/error). If an error occurs with a file (e.g. while reading or parsing), this must be stated in that file's result but must not interrupt the scan. The number of matches in the content and which words matched shall be reported. The number of unique matched words from the word list (i.e. 0 to about 100) shall also be reported for each file.
Final output should be a list like in 1) but with additional columns, e.g.
Path Scan; Content Scan; #Matches; Words matched; #Unique matches
(Yes/No/Error); (Yes/No/Error); Integer; Monkey, Bananas; 0-100
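The per-file scan of script 2 might be sketched like this. Only plain-text filetypes are content-scanned in this sketch; Office and PDF parsing would plug in via libraries such as python-docx or pdfminer.six. It assumes "Yes" means the scan was performed successfully, and matching is case-insensitive substring matching; both are assumptions to confirm with the client.

```python
from collections import Counter
from pathlib import Path

# Suffixes content-scanned in this sketch; extend with parsers for doc/docx/pdf/...
TEXT_SUFFIXES = {".txt", ".html", ".xml", ".csv"}

def load_words(words_file):
    """Read the keyword list: one word per line, blank lines ignored."""
    with open(words_file, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

def count_matches(text, words):
    """Case-insensitive substring count for every keyword; zero-count words omitted."""
    lowered = text.lower()
    counts = Counter()
    for w in words:
        n = lowered.count(w.lower())
        if n:
            counts[w] = n
    return counts

def scan_file(path, words):
    """Scan one file; returns a dict mirroring the report columns.
    Errors are recorded per file and never stop the run."""
    row = {"path_scan": "Yes", "content_scan": "No",
           "matches": 0, "words_matched": [], "unique": 0}
    path = Path(path)
    counts = count_matches(str(path), words)    # every file: scan the full path
    if path.suffix.lower() in TEXT_SUFFIXES:    # document/text files: also scan contents
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
            counts += count_matches(text, words)
            row["content_scan"] = "Yes"
        except OSError:
            row["content_scan"] = "Error"       # recorded in the row, run continues
    row["matches"] = sum(counts.values())
    row["words_matched"] = sorted(counts)
    row["unique"] = len(counts)
    return row
```

Joining these rows onto the inventory from script 1 then yields the final semicolon-separated report.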
Provide me with a way to run this search as easily and quickly as possible. (There are a lot of files and speed is important.) I don't want to wait two weeks for the search, and I don't want to find out that the script ran into an error after three hours and stopped.
Bidders with recommendations, high evaluations and a strong background in Python and text mining are preferred.