Section I (10 points)
Select a data set that contains a text corpus (text data similar to what you have seen in Assignments 4 and 5) in xls, xlsx, or csv format.
You can collect data from online sources or use web scraping techniques.
If you need help collecting data that interests you, contact me. I can assist you in downloading it.
The data set must contain at least 6,000 and at most 10,000 records.
Clean the data.
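The cleaning step and the record-count requirement can be sanity-checked before importing the file into the analysis tool. The following is a minimal Python sketch, not a required part of the assignment. It assumes a hypothetical column named "text" holds the corpus, and it treats "cleaning" as collapsing whitespace, dropping empty records, and dropping case-insensitive duplicates; your data may need different rules.

```python
import csv
import io

MIN_RECORDS = 6000   # lower bound from the assignment
MAX_RECORDS = 10000  # upper bound from the assignment


def clean_rows(rows, text_field="text"):
    """Collapse whitespace, drop empty texts, drop case-insensitive duplicates.

    The column name "text" is a hypothetical default, not part of the assignment.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join((row.get(text_field) or "").split())
        if not text:
            continue  # skip records with no usable text
        key = text.lower()
        if key in seen:
            continue  # skip duplicates that differ only in letter case
        seen.add(key)
        cleaned.append({**row, text_field: text})
    return cleaned


def within_size_limits(n_records):
    """Return True if the record count satisfies the assignment's bounds."""
    return MIN_RECORDS <= n_records <= MAX_RECORDS


# Tiny in-memory CSV standing in for a real file (illustrative data only).
sample = io.StringIO(
    "id,text\n"
    "1,  Great   product \n"
    "2,\n"
    "3,great product\n"
    "4,Arrived late\n"
)
cleaned = clean_rows(list(csv.DictReader(sample)))
print(len(cleaned), within_size_limits(len(cleaned)))
```

For a real file, replace the in-memory sample with an opened csv file and check the size limits after cleaning, since cleaning can push the count below the minimum.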
Section II (30 points)
Graduates: Perform the following analyses on your cleaned data using RapidMiner.
Correlation analysis - Ideally, the data should combine numerical and text attributes.
k-means Cluster analysis
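The analyses themselves are to be done in RapidMiner. Purely to illustrate what the two techniques compute, here is a small Python sketch under stated assumptions: the text attribute is turned into a number by counting words (one possible choice among many), the records are made-up examples, and the k-means routine is a plain Lloyd's-algorithm version seeded with the first k points, which is not necessarily how RapidMiner initializes its clusters.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm); the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Illustrative records: (review text, numeric rating). Not real data.
records = [
    ("bad", 1.0),
    ("not good", 2.0),
    ("works fine and arrived right on time", 4.0),
    ("excellent build quality and really fast shipping overall", 5.0),
]
word_counts = [float(len(text.split())) for text, _ in records]
ratings = [rating for _, rating in records]

r = pearson(word_counts, ratings)
labels, centroids = kmeans(list(zip(word_counts, ratings)), k=2)
print(round(r, 3), labels, centroids)
```

On a real data set the two attributes usually sit on different scales, so normalizing them before clustering is worth considering; the sketch skips that step to stay short.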
Section III (20 points)
Prepare a short report on the results of your analysis. Make sure the report includes the title of your project and highlights the analysis and the operators you used. You may use the following as a guide for your report.
About data and source
The goal of the analysis (what did you want to find out?)
Data mining technique used and final result diagram
Conclusion of the analysis (what did you find out?)
Section IV (40 points)
Please use the guide below to prepare your presentation.
1. Introduce your topic and data. Why did you select that particular data set?
2. How did you collect the data and did you perform any cleaning?
3. The problem or questions you are trying to answer from the data, and your initial ideas for analyzing the data and gathering insights.
4. Describe your findings for each of the mining techniques: correlation, association, and clustering.
6. Present your analysis/results.
Final design diagram
Correlation/Association and Clustering results with the related result diagrams.
7. Conclusion and experience working on the analysis.
Word/PDF of the report
Link to the video recording in the comments of the submission.