We need a simple search environment that lets casual users search for key words and phrases across sets of data files. The search interface will be web-based and hosted on a Windows server. The search will be built on the Lucene open-source search engine.
The datasets will be the files and email from computers that have had their disks copied. These datasets will be held securely on an in-house server.
The search needs to be able to look at both file names and file contents, and return the following information:
• File name and location
• File type
• Extract of the file showing search terms highlighted
• Link to open the full file
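To illustrate the result fields listed above, here is a minimal sketch of one result record and of an extract with the search term highlighted. It is written in Python purely for brevity and says nothing about the implementation language. The names `SearchResult` and `make_extract`, the `**` highlight markers and the 30-character context window are all assumptions made for this example. In the real system the extract would be expected to come from the search engine's own highlighting support.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One row of the result list (hypothetical field names)."""
    file_name: str
    location: str
    file_type: str
    extract: str
    link: str


def make_extract(text: str, term: str, context: int = 30) -> str:
    """Return a window of text around the first match of `term`.

    Every case-insensitive match inside the window is wrapped in
    ** markers. Returns an empty string when the term does not occur.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return ""
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    window = text[start:end]
    # Keep the original casing of each match and only add the markers.
    return pattern.sub(lambda m: "**" + m.group(0) + "**", window)
```

For example, `make_extract("Invoice for ABC Ltd", "abc", context=4)` gives `for **ABC** Ltd`. A web front end would replace the `**` markers with whatever highlight markup it uses.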
The types of file that need to be searched are:
• MS Office files of all versions: Word, Excel, PowerPoint
• OpenOffice files
• XML files
• HTML files
• PDF files
• Email stores, including those for Outlook, Outlook Express and Thunderbird
• Nice to have: Access databases
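As a rough illustration of how the "File type" result field could be derived for the formats above, the sketch below maps common file extensions to a display label. The table `FILE_TYPES`, the function `classify` and the label wording are assumptions for this example, and the extension list is not exhaustive. Thunderbird is left out on purpose: its mail folders are usually mbox files with no extension, so they would need to be recognised by location or content rather than by name.

```python
from pathlib import PureWindowsPath

# Hypothetical extension-to-label table; extend as further formats are needed.
FILE_TYPES = {
    ".doc": "MS Word", ".docx": "MS Word",
    ".xls": "MS Excel", ".xlsx": "MS Excel",
    ".ppt": "MS PowerPoint", ".pptx": "MS PowerPoint",
    ".odt": "OpenOffice Writer",
    ".ods": "OpenOffice Calc",
    ".odp": "OpenOffice Impress",
    ".xml": "XML",
    ".html": "HTML", ".htm": "HTML",
    ".pdf": "PDF",
    ".pst": "Outlook store",
    ".dbx": "Outlook Express store",
    ".mdb": "Access database", ".accdb": "Access database",
}


def classify(path: str) -> str:
    """Return a display label for a file, based only on its extension."""
    suffix = PureWindowsPath(path).suffix.lower()
    return FILE_TYPES.get(suffix, "unknown")
```

Extension matching is only a first pass. Files copied from real disks are often misnamed, so the indexer may also need to inspect file contents before choosing a parser.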
There needs to be a way for the searcher to choose which datasets to search: a single dataset, several datasets or all of them.
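The dataset selection could behave along the lines of the sketch below. It is only an illustration: the in-memory `index` dictionary stands in for the real search index, a plain substring test stands in for the real query, and the names `select_datasets` and `search` are hypothetical.

```python
def select_datasets(available, requested=None):
    """Return the dataset names to search.

    `requested=None` means "all datasets". Unknown names are rejected
    rather than silently ignored.
    """
    if requested is None:
        return list(available)
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"unknown datasets: {unknown}")
    # Preserve the order in which the datasets are stored.
    return [name for name in available if name in requested]


def search(index, term, datasets):
    """Return (dataset, path) pairs whose text contains `term`.

    `index` maps dataset name -> {file path: extracted text}; this is a
    stand-in for the real index, not how Lucene stores documents.
    """
    needle = term.lower()
    return [
        (dataset, path)
        for dataset in datasets
        for path, text in index[dataset].items()
        if needle in text.lower()
    ]
```

With `index = {"pc01": {"a.txt": "Meeting with ABC"}, "pc02": {"b.txt": "abc invoice"}}`, calling `search(index, "abc", select_datasets(index, ["pc02"]))` returns only the match from `pc02`.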
The searcher needs to be able to tag files from the search results for easy retrieval later, as well as easy copying to other media, printing, emailing etc.
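A minimal sketch of the tagging idea follows, assuming tags are free-text labels attached to file paths and that "copying to other media" means copying the tagged files into a chosen folder. The class `TagStore` and its methods are hypothetical, tags are held in memory only, and printing and emailing are not covered.

```python
import shutil
from pathlib import Path


class TagStore:
    """Keeps a set of file paths for each tag (in memory only)."""

    def __init__(self):
        self._tags = {}  # tag name -> set of Path objects

    def tag(self, tag, path):
        self._tags.setdefault(tag, set()).add(Path(path))

    def untag(self, tag, path):
        self._tags.get(tag, set()).discard(Path(path))

    def files(self, tag):
        """Return the tagged paths in sorted order ([] for an unused tag)."""
        return sorted(self._tags.get(tag, set()))

    def export(self, tag, destination):
        """Copy every file carrying `tag` into `destination`.

        Assumption: files are copied flat by name, so two tagged files
        with the same name would overwrite each other. A real export
        would need to keep the dataset and folder structure.
        """
        dest = Path(destination)
        dest.mkdir(parents=True, exist_ok=True)
        copied = []
        for src in self.files(tag):
            target = dest / src.name
            shutil.copy2(src, target)  # copy2 also preserves timestamps
            copied.append(target)
        return copied
```

In the delivered system the tags would need to be stored persistently, probably per user, so that they survive between sessions.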
Besides search, there needs to be a mechanism for browsing through the files in the datasets in a Windows Explorer-style environment.
As an example of use: imagine we have copied the entire hard disk from 20 different computers, and all of that data is stored on our server as one dataset per computer. A user wants to search for all information about a customer called "abc". He should be able to enter "abc" into the search box and see all documents and emails containing that string. He should be able to narrow his search by enabling or disabling individual datasets. He should be able to tag certain results and export the tagged files or emails to another location.
This is a new project, and it is very likely that there will be follow-on work to add features as we use it.
The deliverable must include all source code, build files, etc., and full instructions on how to install and configure the software.