This project is to determine the feasibility of a larger project involving speech recognition. We want to determine whether it is feasible to build our own "speech analytics" capability using an open source project.
We are a call recording company that records calls for call centers. Many of our clients are interested in the ability to convert the recorded conversation into words that can be searched, counted, summarized and reported. This requires large vocabulary, speaker-independent speech recognition, either in the form of phoneme recognition or word recognition. There are a number of commercial software capabilities to do this, but we want to see if we can build the same ourselves.
There are two open source projects we are interested in working with: Julius and Sphinx. You should only bid on this project if you are familiar with those or something very similar (e.g. RWTH ASR).
The deliverable for this project is a design that shows how to use the available software (including supporting software to be created) to do the following:
1) Build an acoustic model for the call center's recordings. Assume we will have a large sample of audio recorded and can have it transcribed as needed.
2) Take each individual recording and process it through a recognition engine. We are assuming the use of stereo recordings, so the speech of each party on the call can be processed separately.
3) Generate a set of words or phonemes from the processing and store in a file that will be used for searching and reporting.
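As an illustration of the stereo assumption in step 2, here is a minimal sketch of splitting one recording into two single-party files before recognition. It assumes uncompressed PCM WAV input and uses only the Python standard library; the function name `split_stereo` is hypothetical, and which channel carries which party depends on the recorder.

```python
import wave


def split_stereo(path, left_path, right_path):
    """Split a stereo PCM WAV file into two mono WAV files (hypothetical helper)."""
    with wave.open(path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo recording")
        width = src.getsampwidth()   # bytes per sample, e.g. 2 for 16-bit audio
        rate = src.getframerate()    # samples per second, e.g. 8000 for telephony
        frames = src.readframes(src.getnframes())

    # Stereo PCM frames are interleaved: left sample, right sample, left, right, ...
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    # Write each side as its own mono file with the same sample format.
    for out_path, data in ((left_path, left), (right_path, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
```

Each mono file can then be passed to the recognition engine on its own, which keeps the two parties' output separate from the start.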
Depending on the response, we may award the project to multiple bidders to determine the best design. The bidder with the best design is most likely to then be awarded the project to develop the full system.
Do not bid unless you have a good understanding of speech recognition technology and the open source projects listed above. You will be wasting your time and mine if you bid and do not have expertise in this area.
After researching the speech recognition field, we have decided to modify the requirements of this job. Here are the revised specifications:
1) Instead of words as outputs, we will want only phonemes. This will require the ability to process recorded audio and convert it into US English phonemes. Each phoneme should have a time stamp indicating the relative time, in seconds, of the phoneme in the recording (start = 0).
2) The process should have the ability to handle audio files that have dual inputs or sides. These are recordings of telephone conversations, so each phoneme would also indicate the side of the conversation in which the phoneme appears.
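To make the revised output concrete, here is a minimal sketch of one possible record layout: one line per phoneme, carrying the time offset in seconds, the side of the call, and the phoneme symbol. Everything in it is an assumption for illustration: the names `PhonemeEvent`, `merge_sides`, `to_lines` and `find_sequence`, the side labels "A" and "B", the tab-separated layout, and the ARPAbet-style symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeEvent:
    start: float   # seconds from the start of the recording (start = 0)
    side: str      # which side of the call, e.g. "A" or "B" (assumed labels)
    phoneme: str   # phoneme symbol, e.g. ARPAbet "HH" (assumed symbol set)


def merge_sides(side_a, side_b):
    """Combine both sides of a call into one list ordered by time."""
    return sorted(list(side_a) + list(side_b), key=lambda e: e.start)


def to_lines(events):
    """Serialize events as tab-separated lines: time, side, phoneme."""
    return [f"{e.start:.2f}\t{e.side}\t{e.phoneme}" for e in events]


def find_sequence(events, side, target):
    """Return the start time of each occurrence of a phoneme sequence on one side."""
    seq = [e for e in events if e.side == side]
    n = len(target)
    hits = []
    for i in range(len(seq) - n + 1):
        if [e.phoneme for e in seq[i:i + n]] == list(target):
            hits.append(seq[i].start)
    return hits
```

Keeping the side on every record means a search can be limited to one party, for example `find_sequence(events, "A", ["HH", "AY"])`, while the merged file still preserves the order of the conversation.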
Because the phonemes no longer need to be converted into words, the processing speed must be several times faster than real time. It should take less than 10 seconds to process a 60-second recording, that is, more than six times faster than real time.
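The speed requirement can be expressed as a real-time factor (processing time divided by audio duration): less than 10 seconds for 60 seconds of audio means a factor below 10/60, roughly 0.167. The sketch below shows one way a bidder might check a design against that target; the function names are hypothetical, and `process` stands for whatever recognition call the design ends up using.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def meets_target(processing_seconds, audio_seconds, max_rtf=10 / 60):
    """True if processing took less than 10 s per 60 s of audio (the stated target)."""
    return real_time_factor(processing_seconds, audio_seconds) < max_rtf


def timed(process, *args):
    """Run a recognition callable (placeholder) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = process(*args)
    return result, time.perf_counter() - start
```

For example, `meets_target(9.0, 60.0)` passes while `meets_target(12.0, 60.0)` does not. With stereo recordings, the time for both sides has to fit within the same budget.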
15 freelancers are bidding on average $3920 for this job
Dear Sir, we are a group of experts in the fields of pattern matching and machine learning. We can assure you that we will deliver the best quality product on time. Thank you.