We would like to analyse which words are most popular in estate agents' property descriptions. Please can you implement a Perl application for this.
The data will be recorded in a MySQL table, site_house. site_house will include two columns: first_house_loader_id and full_desc. I’ll supply some sample data later.
The DBI/DBD::mysql libraries should be used to connect to the database. Please implement a function connecttodb() (returns $dbh) which I can later override with our existing function. The MySQL user/password/db should be hard-coded within connecttodb().
The application should iterate through all the first_house_loader_ids and, for each one, choose the longest available full_desc. In some cases no full_desc is set for a first_house_loader_id, in which case that first_house_loader_id should be ignored and not counted in any stats.
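The selection rule above can be pinned down with a small sketch. It is written in Python purely to illustrate the logic (the deliverable itself should be Perl), and it assumes the rows have already been fetched as (first_house_loader_id, full_desc) pairs. The function name longest_desc_per_id and the sample rows are made up for illustration. It also assumes that NULL and whitespace-only values both count as "not set".

```python
def longest_desc_per_id(rows):
    """Return {first_house_loader_id: longest full_desc}.

    rows: iterable of (first_house_loader_id, full_desc) tuples.
    Ids whose descriptions are all missing or blank are left out,
    so they are never counted in the stats.
    """
    best = {}
    for loader_id, desc in rows:
        # Assumption: NULL (None) and whitespace-only both mean "not set".
        if desc is None or not desc.strip():
            continue
        if loader_id not in best or len(desc) > len(best[loader_id]):
            best[loader_id] = desc
    return best


# Hypothetical sample rows, not real data.
rows = [
    (1, "Short."),
    (1, "A much longer description of the same house."),
    (2, None),
    (2, ""),
    (3, "Only one description."),
]
chosen = longest_desc_per_id(rows)
```

Here id 2 is dropped entirely, so the number of full_descs tested would be 2, not 3.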
full_desc may contain HTML. We need to ensure we convert from HTML to text, including converting HTML special characters (entities such as &amp;) to text. Please implement a function htmltotext() which I can later override with our existing function.
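As a rough illustration of what htmltotext() is expected to do, here is a minimal sketch in Python using only the standard library (the real function should be Perl and will later be swapped for our existing one). Replacing every tag with a space is an assumption made here so that words separated only by markup, such as a line break tag, do not run together.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content; tags are replaced by a single space."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def htmltotext(markup):
    """Input: a string that may contain HTML. Output: plain text."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any trailing text still buffered
    text = "".join(parser.parts)
    # Collapse runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", text).strip()


plain = htmltotext("<p>Two bedrooms &amp; a garden</p><br>Close to shops")
```

One trade-off of this approach: inline tags in the middle of a word (for example bold markup around half a word) would split that word in two.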
The app should remove:
1) characters that aren't part of words - but may "connect" words together without a space character, e.g. (),.^!:;*+-/"@_\?
2) all 1- and 2-letter words (e.g. a, in, an, etc.) and all digits-only words (e.g. phone numbers).
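The two removal rules above could look roughly like this. The sketch is in Python for illustration only, and the function name extract_words is hypothetical. Lower-casing the text and returning a set (so each word counts once per description) are assumptions that go beyond the two rules as stated.

```python
import re

# The connecting characters listed in rule 1: ( ) , . ^ ! : ; * + - / " @ _ \ ?
CONNECTORS = re.compile(r"""[(),.^!:;*+\-/"@_\\?]""")


def extract_words(text):
    """Input: plain-text description. Output: set of words kept for counting."""
    # Rule 1: turn connecting characters into spaces so that
    # "house/garden" becomes two separate words.
    # Assumption: case is ignored, so "Garden" and "garden" are one word.
    cleaned = CONNECTORS.sub(" ", text.lower())
    words = set()
    for token in cleaned.split():
        if len(token) <= 2:      # Rule 2: drop 1- and 2-letter words
            continue
        if token.isdigit():      # Rule 2: drop digits-only words
            continue
        words.add(token)
    return words


sample = extract_words(
    "Call 01234 567890: a lovely semi-detached house/garden in town!"
)
```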
The application should output the results to a file in ~/logs/top_description_words/
The application should report on the top 1000 [configurable] most common words. For each word it should report on the number of full_descs (tested) in which the word appears. The application should report on how many full_descs were tested.
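To make the counting rule concrete: each word is counted once per full_desc, however often it occurs within that description. A minimal Python sketch follows (illustrative only; TOP_N and top_words are hypothetical names, and breaking ties alphabetically is an assumption).

```python
from collections import Counter

TOP_N = 1000  # configurable number of words to report


def top_words(word_sets, top_n=TOP_N):
    """Input: one collection of words per tested full_desc.

    Output: (list of (word, number_of_descriptions) pairs, most common
    first, and the number of full_descs tested).
    """
    doc_freq = Counter()
    tested = 0
    for words in word_sets:
        tested += 1
        # set() guarantees a word counts at most once per description.
        doc_freq.update(set(words))
    # Assumption: ties are broken alphabetically for a stable report.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n], tested


# Hypothetical word sets for three descriptions.
report, tested = top_words(
    [{"garden", "kitchen"}, {"garden"}, {"garden", "garage", "kitchen"}],
    top_n=2,
)
```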
After the first run, we may find that we want to group together singulars/plurals or synonyms. So the application needs a feature where we can hard-code synonyms in a hash, count them as one word, and then report on the synonymous values at the end, e.g. the first run's output might include
Having tied these together, the second run's output might include
(92 is less than 66+54 because some descriptions will have contained both words)
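The synonym feature and the reason the combined count is lower than the sum can be sketched as follows. This is Python for illustration only; in the Perl application the mapping would be a hard-coded hash. The words in SYNONYMS are invented examples, not the ones from the runs described above, and count_with_synonyms is a hypothetical name.

```python
from collections import Counter, defaultdict

# Hypothetical mapping: variant word -> word it should be counted as.
SYNONYMS = {"gardens": "garden", "yard": "garden"}


def count_with_synonyms(word_sets, synonyms=SYNONYMS):
    """Input: one collection of words per tested full_desc.

    Output: (Counter of descriptions per canonical word,
             dict of canonical word -> sorted variants actually seen).
    """
    doc_freq = Counter()
    merged = defaultdict(set)
    for words in word_sets:
        canonical = set()
        for word in words:
            target = synonyms.get(word, word)
            canonical.add(target)
            if target != word:
                merged[target].add(word)
        # A description containing both "garden" and "gardens"
        # still contributes only 1 to "garden".
        doc_freq.update(canonical)
    return doc_freq, {key: sorted(vals) for key, vals in merged.items()}


doc_freq, merged = count_with_synonyms(
    [{"garden", "gardens"}, {"gardens"}, {"yard", "kitchen"}]
)
```

Counted separately, the three variants in this toy data would appear in 1, 2 and 1 descriptions (total 4), but the grouped count is 3 because the first description contains two of them.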
The code needs to be well commented and well laid out to demonstrate that the coder is skilful enough to help with further projects. The app should include header comments explaining the aim and outline mechanism of the application. Each subroutine should come with comments explaining inputs and outputs.