Regex for international dates and company entity types
- Status: Open
- Prize: $900
- Entries Received: 28
I stress that this project involves as much RESEARCH as code-writing. The different formatting used internationally is vital to get right first; the coding that follows is easy.
The project is to write a script containing a series of regex scans that extract metadata from plain-text documents. The purpose is to scan a big-data archive of documents and retrieve specific metadata such as dates and company names.
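As a rough illustration of the kind of scan meant here, the following Python sketch finds company names by locating an entity-type suffix and then walking back over the capitalised words in front of it. The suffix list, the function name `find_companies`, the 80-character look-back window, and the five-word limit are all assumptions made for the example; the list is deliberately incomplete and only covers Latin-script forms where the entity type follows the name.

```python
import re

# Assumed, deliberately incomplete sample of entity-type suffixes.
ENTITY_SUFFIXES = ["Pty Ltd", "Limited", "Ltd", "Inc", "LLC", "PLC",
                   "GmbH", "AG", "S.A.", "B.V.", "S.r.l."]

# Longest suffixes first so "Pty Ltd" wins over plain "Ltd".
# Compiled once at import time, since execution speed matters.
SUFFIX_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(s) for s in sorted(ENTITY_SUFFIXES, key=len, reverse=True))
    + r")(?!\w)"
)

WINDOW = 80     # assumed look-back distance in characters
MAX_WORDS = 5   # assumed maximum number of words in a company name


def find_companies(text):
    """Return a list of {"name", "entity_type"} dicts found in text."""
    results = []
    for m in SUFFIX_RE.finditer(text):
        start = max(0, m.start() - WINDOW)
        words = text[start:m.start()].split()
        if start > 0 and words:
            words = words[1:]  # first token may have been cut mid-word
        name = []
        # str.isupper() is Unicode-aware, so names like "Élan" also count.
        while words and words[-1][0].isupper() and len(name) < MAX_WORDS:
            name.insert(0, words.pop())
        if name:
            results.append({"name": " ".join(name), "entity_type": m.group(0)})
    return results


sample = ("Agreement between Müller Maschinenbau GmbH and "
          "Acme Widgets Pty Ltd, signed today.")
companies = find_companies(sample)
```

Searching for the suffix first and only then inspecting a short window keeps the work per document close to a single pass, which is one way to respect the speed requirement.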
Script execution speed is essential. Ideally the script is written in Perl, but Python or PHP is okay.
This must work with international metadata! Please do not expect your entry to win without meeting this requirement. The plain-text source will be in UTF-8 to cope with international characters.
The script can return its results in any practical format, as long as that format can be imported, e.g. JSON or a serialized array.
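For example, a JSON return value could be produced along these lines in Python. The helper name `to_json` and the field names `dates`, `companies`, `name` and `entity_type` are hypothetical; any importable layout would do.

```python
import json


def to_json(metadata):
    """Serialise extracted metadata to a JSON string."""
    # ensure_ascii=False keeps international characters readable
    # instead of turning them into \uXXXX escapes.
    return json.dumps(metadata, ensure_ascii=False, sort_keys=True)


# Hypothetical record layout, for illustration only.
record = {
    "dates": ["2021-08-07"],
    "companies": [{"name": "Müller Maschinenbau", "entity_type": "GmbH"}],
}
payload = to_json(record)
restored = json.loads(payload)
```

The importing side can then read the payload back with `json.loads` or the equivalent in Perl or PHP.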
The document describes script sections, which means you should create one include file that can be included in a different project, and demonstrate how to call its functions. There would be at least 4 main functions.
To make good use of your time, I'd suggest you first research the international date formats and international company types, and send me a document privately. Obviously don't include what is already in Wiki. I can then give you feedback on whether it is comprehensive. Once you have that feedback, it becomes worthwhile to code.