Write a script called docdistances that will calculate distances between pairs of text documents. These distances will be based on a vanilla version of term frequency–inverse document frequency (tf-idf). Your script will calculate the distances between 6 documents: 3 documents are synopses of fairy tales (Red Riding Hood, The Princess and the Pea, and Cinderella); the other 3 documents are the abstracts of papers related to protein function prediction (identified as CAFA1, CAFA2 and CAFA3). You will find these documents on the Moodle page (the file names are: [url removed, login to view], [url removed, login to view], [url removed, login to view], [url removed, login to view], [url removed, login to view], [url removed, login to view]).
Your script will:
1. For each document, calculate its tf-idf vector.
The tf-idf vector of a document is a vector whose length is equal to the total number of different terms (words) which are present in the corpus (in this case, the corpus is the entire set of 6 documents). Each term is assigned a specific element of the vector, which is in the same position for the tf-idf vector of every document. For a given document d, the vector element corresponding to term t is calculated as the product of 2 values:
a) Term frequency: the number of times that term t appears in document d
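b) Inverse document frequency: the log base 10 of the inverse of the fraction of documents that contain the term, i.e. idf(t) = log10(N / n_t), where N is the total number of documents in the corpus and n_t is the number of documents that contain term t.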
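
A minimal Python sketch of the tf-idf calculation described above may help make the definition concrete. The assignment does not name a language or a tokenization rule, so Python, the function names (tokenize, tfidf_vectors) and the choice to lowercase the text and split on non-alphanumeric characters are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: terms are lowercase runs of letters/digits.
    # The assignment does not specify how words should be split.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Compute tf-idf vectors for a corpus.

    docs: dict mapping a document name to its text.
    Returns (vocab, vectors), where vocab is the sorted list of distinct
    terms in the corpus and vectors maps each document name to a list of
    floats, one per term, in the same order as vocab.
    """
    # a) Term frequency: raw count of each term in each document.
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

    # Every document shares the same term-to-position mapping.
    vocab = sorted(set().union(*counts.values()))

    # b) Inverse document frequency: log10(N / n_t).
    n_docs = len(docs)
    idf = {}
    for term in vocab:
        n_t = sum(1 for c in counts.values() if term in c)
        idf[term] = math.log10(n_docs / n_t)

    # Vector element = tf * idf (a Counter returns 0 for absent terms).
    vectors = {
        name: [c[term] * idf[term] for term in vocab]
        for name, c in counts.items()
    }
    return vocab, vectors


if __name__ == "__main__":
    # Toy corpus, not the assignment's documents.
    toy = {"a": "the cat sat", "b": "the dog sat sat"}
    vocab, vectors = tfidf_vectors(toy)
    print(vocab)
    print(vectors)
```

Note that a term appearing in every document gets idf = log10(1) = 0, so it contributes nothing to any vector under this definition.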