Project Description:
We are looking to develop a web application for resume parsing written in Java/Spring. If you want to use something else please let us know.
This parser will be used to parse thousands of UNSTRUCTURED resumes in HTML, Word (DOC, DOCX), RTF, plain text and PDF formats.
Input: Resume files in the following formats: Word, PDF, text, TIF, HTML
Output: One XML file per resume, in which every word from the resume is placed in the correct XML tag.
The parser needs to be able to extract the following data from the resumes:
. first name
. last name
. address
. city
. state/province
. zip code
. country
. citizenship/immigration status
. email address
. resume job category
. resume title
. career objective or background
. years of professional experience
. employment history
. education history
. licenses and certifications
. foreign languages
. references
. skills keywords
. publications
. security clearances
Output of the parser should be an XML-tagged file, one XML file for each parsed resume. The output file name should be the same as the input file name, with the extension changing from [url removed, login to view] to [url removed, login to view]
All of the parsed fields will be uploaded into a MySQL database. The parser is required to do the database insertion as part of the parsing process.
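To illustrate the naming rule and the one-XML-file-per-resume output, here is a minimal Java sketch built on the JDK's DOM and Transformer APIs. The `.xml` output extension, the `resume` root element, the class name `ResumeXmlWriter` and the tag names are assumptions made for illustration; the actual tag set would follow the field list above.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: names and tags are illustrative, not part of the spec.
class ResumeXmlWriter {

    // Keep the base name and swap the extension for ".xml" (assumed target extension).
    static String outputNameFor(String inputName) {
        int dot = inputName.lastIndexOf('.');
        String base = (dot > 0) ? inputName.substring(0, dot) : inputName;
        return base + ".xml";
    }

    // Build <resume><field>value</field>...</resume>; the DOM API escapes text content.
    static String toXml(Map<String, String> fields) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("resume");
            doc.appendChild(root);
            for (Map.Entry<String, String> field : fields.entrySet()) {
                Element child = doc.createElement(field.getKey());
                child.setTextContent(field.getValue());
                root.appendChild(child);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize resume to XML", e);
        }
    }
}
```

Building the document through DOM rather than string concatenation means characters such as `&` in a resume are escaped automatically.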
We will supply a sample set of resumes, as many as you need to be successful.
Resumes are unstructured, so formats and content vary widely. The ability to score the parsing performance would be beneficial. It would be helpful to have a parsing report (e.g. a log file produced by the application) that indicates which resumes the parser thinks it did poorly on, so we can manually revisit the parsed resumes that have the highest probability of containing parsing errors.
We need to be able to integrate the web application parser with our existing PHP website.
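One simple way to score parsing performance is the fraction of expected fields that came back non-empty, with low-scoring resumes listed first in the report. The sketch below assumes a hypothetical `ParseScorer` class, an illustrative subset of expected fields, and a caller-chosen threshold; a real scorer would likely weight fields and validate their content.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical scorer: field names and the scoring rule are assumptions.
class ParseScorer {

    // Illustrative subset of the fields the parser is expected to extract.
    static final List<String> EXPECTED =
            Arrays.asList("firstName", "lastName", "email", "employmentHistory", "educationHistory");

    // Score = share of expected fields that are present and non-blank (0.0 to 1.0).
    static double score(Map<String, String> parsed) {
        int found = 0;
        for (String key : EXPECTED) {
            String value = parsed.get(key);
            if (value != null && !value.trim().isEmpty()) {
                found++;
            }
        }
        return (double) found / EXPECTED.size();
    }

    // One report line per resume scoring below the threshold, worst first.
    static List<String> lowConfidenceReport(Map<String, Map<String, String>> parsedByFile,
                                            double threshold) {
        List<Map.Entry<String, Double>> flagged = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : parsedByFile.entrySet()) {
            double s = score(entry.getValue());
            if (s < threshold) {
                flagged.add(new AbstractMap.SimpleEntry<String, Double>(entry.getKey(), s));
            }
        }
        flagged.sort(Map.Entry.comparingByValue());
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Double> entry : flagged) {
            lines.add(String.format(Locale.ROOT, "%s\t%.2f", entry.getKey(), entry.getValue()));
        }
        return lines;
    }
}
```

The report lines could be written to the log file so that reviewers start with the resumes most likely to contain errors.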
The application should contain at least 2 main modules:
[url removed, login to view] converter – Each file format will be translated by this module to text format
[url removed, login to view] engine – This engine should receive a text file and return an XML file
The separation is needed in order to allow additional file formats in the future.
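The two-module separation could be expressed as a pair of small interfaces, with converters registered per file extension so that new formats can be added without touching the engine. All type and method names below (`TextConverter`, `ParsingEngine`, `ConverterRegistry`, `ResumePipeline`) are hypothetical, and only a trivial plain-text converter is shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Module 1 contract: turn raw file bytes into plain text.
interface TextConverter {
    String toText(byte[] content);
}

// Module 2 contract: turn plain text into an XML string.
interface ParsingEngine {
    String toXml(String text);
}

// Trivial converter for .txt input; assumes UTF-8 encoded files.
class PlainTextConverter implements TextConverter {
    public String toText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

// Looks up the converter by file extension, so adding a format is one register() call.
class ConverterRegistry {
    private final Map<String, TextConverter> byExtension = new HashMap<>();

    void register(String extension, TextConverter converter) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), converter);
    }

    String convert(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = (dot >= 0) ? fileName.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
        TextConverter converter = byExtension.get(ext);
        if (converter == null) {
            throw new IllegalArgumentException("No converter registered for: " + fileName);
        }
        return converter.toText(content);
    }
}

// Wires the two modules together: file bytes -> text -> XML.
class ResumePipeline {
    private final ConverterRegistry registry;
    private final ParsingEngine engine;

    ResumePipeline(ConverterRegistry registry, ParsingEngine engine) {
        this.registry = registry;
        this.engine = engine;
    }

    String parse(String fileName, byte[] content) {
        return engine.toXml(registry.convert(fileName, content));
    }
}
```

Because the engine only ever sees text, supporting a new format later means writing one more `TextConverter` and registering it.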
Passing acceptance testing with several resumes will be required at project completion.
I expect there will be a lot more questions so feel free to ask.
Skills required:
Data Processing, Java, Research, Software Architecture, XML
I have read your project requirements and understand what you need from the resume parser. I have a lot of past experience with data processing projects, so you can rely on me.
I would structure this project as follows:
Module 1: Convert the different file formats (.doc, .rtf, .pdf, etc.) to .txt format.
Module 2: Parse the .txt contents and write them to an .xml file using SAX or DOM (whichever suits best).
This module will also have a logger to flag error-prone parsed entries. We can also keep track of successful entries.
Module 3: Insert the .xml data (or tags) into the respective tables of the MySQL database. We can also track successful and unsuccessful database transactions.
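A minimal JDBC sketch of the insertion step might look like the following. The class name `ResumeDao`, the table name and the column names are hypothetical, since the actual schema is not specified. Values are bound through a `PreparedStatement`, and the boolean result gives the caller something to log for each transaction.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;

// Hypothetical DAO: table and column names must come from a fixed, trusted list.
class ResumeDao {

    // Builds e.g. "INSERT INTO resume (first_name, email) VALUES (?, ?)".
    static String insertSql(String table, Collection<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Returns true when exactly one row was inserted; false on any SQL error.
    static boolean insert(Connection conn, String table, Map<String, String> fields) {
        try (PreparedStatement ps = conn.prepareStatement(insertSql(table, fields.keySet()))) {
            int index = 1;
            for (String value : fields.values()) {
                ps.setString(index++, value); // values are bound, never concatenated
            }
            return ps.executeUpdate() == 1;
        } catch (SQLException e) {
            System.err.println("Insert failed: " + e.getMessage());
            return false;
        }
    }
}
```

Passing the fields as a `LinkedHashMap` keeps the column order and the bound values aligned.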
The above is my understanding of your requirements.
I also have a few questions:
Are you expecting a GUI for this application?
When you say you want to integrate it with PHP, which part of the application would you like to integrate (the starting point or the ending point)? In other words, will this app get its input from PHP, or will it provide output to PHP?