# ML - Data Analysis and Classification

₹2500-3000 INR

Paid on delivery

Overview of the Task:

Each of the original test sheets contains many data sets, each with 49 numbers. Each data set is a column, and in every column 7 of the 49 numbers are marked (in bold red) as process numbers. The last (rightmost) column is the target data set for prediction; all other columns are data sets used for training the model. The project's ultimate objective is to predict the 7 process numbers of that last column using machine learning models. We use as many as 5 different types of ML model to predict these 7 pattern numbers from the target data set, i.e. the last column of each test sheet.
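As a minimal sketch of this framing (using synthetic stand-in data, since the real test sheets are Excel files), each column is a data set of 49 numbers, and predicting the 7 process numbers can be posed as flagging 7 of the 49 positions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_train_cols = 49, 6   # 49 numbers per data set; 6 training columns (illustrative)
# Synthetic stand-in for a test sheet: training columns plus one target column.
sheet = rng.integers(1, 100, size=(n_rows, n_train_cols + 1))

X = sheet[:, :-1]              # all columns except the last are training data sets
y = np.zeros(n_rows, dtype=int)
y[rng.choice(n_rows, size=7, replace=False)] = 1   # 7 of 49 positions flagged as process numbers

print(X.shape, int(y.sum()))   # (49, 6) 7
```

The column count, value range, and labelling here are assumptions for illustration; the actual sheets define which 7 numbers are the bold-red process numbers.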

During this prediction work we made certain observations. We need expert data scientists to develop methods or approaches that address those observations and improve the prediction accuracy.

This task named “Data analysis and classification” is for that objective.

We have predicted the 7 process numbers of approximately 50 data sets using these 5 ML models at various test sizes. The prediction results are collected in the Excel workbook named "Comparison of prediction results of 50 data sets". How to read and understand this workbook is explained below:

1) The workbook has 50 sheets, named 388 (leftmost) through 438 (rightmost). Data has so far been filled in up to sheet 431, a total of 44 data sets; the remaining sheets will be filled in due course as the data becomes available.

2) The sheet names, 388 to 438, are the numbers of the data sets. Each of these numbers is also the name of the corresponding target data set, the rightmost column of each test sheet.

3) One data set can have up to 6 or 7 test sheets, named 388-1, 388-1A, 388-1B, 388-2, … up to 388-5. Each test sheet has a varying number of training data sets plus one target data set. The number of data sets in each test sheet is stated in the test sheet's name.

4) A test sheet's name starts with the number of the target data set (target column) whose 7 numbers we have to predict.

5) Each of the 50 sheets of the workbook lists 9 numbers predicted by each of the different ML models. The models used were: RF (Random Forest classifier), SVML (SVM with linear kernel), SVMR (SVM with RBF kernel), SVMP (SVM with polynomial kernel) and NB (Naive Bayes classifier).

6) The actual 7 values (pattern numbers) are given in the coloured cells at the top left of each sheet. Wherever these numbers occur in the prediction results, they are highlighted in the corresponding colours.

7) You may also notice entries such as 388-1, 388-2, 388-3, 388-4, etc. These are different variations of the test sheets for the data set numbered 388; in each of these 5 to 7 test sheets, 388 is the target column. We therefore make predictions using each of these test sheets of various sizes.

8) Finally, we noticed that changing the test size in the train-test split gives better results, so we have also tested each model at different test sizes: 0.2, 0.3, 0.4, 0.5 and 0.6. These test size values are given in brackets against each test sheet name.

9) At the top left of each sheet you will also see a 'Result type'. This describes a data-manipulation criterion: 'No column removal' means no columns are removed from the test sheet, 'Two column removal' means the first two columns (the first two training data sets) are removed, 'Four column removal' means the first four are removed, and so on. Column removal improved prediction accuracy slightly, so please be on the lookout for this variable.
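The experiment grid described above (5 models, each evaluated at test sizes 0.2 through 0.6) can be sketched as follows. This is a hedged illustration on synthetic stand-in data, assuming scikit-learn; the real runs use the workbook's test sheets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(49, 6))                   # stand-in for a test sheet's training columns
y = np.zeros(49, dtype=int)
y[rng.choice(49, size=7, replace=False)] = 1   # 7 process numbers out of 49

# The 5 model types named in the brief (abbreviations as used in the workbook).
models = {
    "RF":   RandomForestClassifier(random_state=0),
    "SVML": SVC(kernel="linear"),
    "SVMR": SVC(kernel="rbf"),
    "SVMP": SVC(kernel="poly"),
    "NB":   GaussianNB(),
}

results = {}
for test_size in (0.2, 0.3, 0.4, 0.5, 0.6):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y)
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        results[(name, test_size)] = accuracy_score(y_te, model.predict(X_te))

print(len(results))   # 25 (5 models x 5 test sizes)
```

The `stratify=y` argument keeps the 7-in-49 class ratio similar across the train and test splits; whether the original runs stratified is an assumption here.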

The Task:

A. First look through the various predictions on each sheet (there are 150 predictions per sheet) and count and tabulate the facts available there, such as:

a) How many of the pattern numbers have occurred in each type of prediction?

b) Which type of prediction has the highest number of correct pattern numbers?

c) Which type of prediction gives consistent results, i.e. a similar number of correct numbers repeatedly?

d) Variations in Dataset: Explore the variations of the same dataset (e.g., 388-1, 388-2) and note any significant differences in prediction accuracy.

e) Effect of Test Sizes: Investigate the impact of different test sizes (0.2, 0.3, 0.4, 0.5, 0.6) on prediction accuracy for each model.

f) Influence of 'Result Type': Assess how different 'Result Types' affect the accuracy, especially whether column removal enhances or hinders the predictions.

And so on….

All such observations and facts will help us determine which type of model, at which test size value, performs best.

B. Analyse each test sheet in detail, using the standard metrics of data science, to determine the characteristics of a test sheet or target data set that give the best prediction results.

a) Prediction Accuracy: Calculate the overall accuracy of predictions for each test sheet. This involves assessing the ratio of correct predictions to the total number of predictions.

b) Precision, Recall, and F1 Score: Break down the performance using precision, recall, and F1 score metrics. Precision measures the accuracy of positive predictions, recall assesses the ability to capture all positive instances, and F1 score combines both metrics.

c) Feature Importance: If applicable, analyze the importance of features in the prediction. This is particularly relevant if certain columns or variables significantly influence the model's performance. You may use the SHAP graphs generated using interpretML to achieve this.

d) Hyperparameter Tuning: Explore the impact of hyperparameter tuning on model performance. Assess how adjustments to parameters influence the predictive accuracy.
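The accuracy, precision, recall, and F1 metrics in points (a) and (b) can be computed per test sheet as sketched below. The ground-truth and prediction vectors here are hypothetical stand-ins (1 marks a pattern number, 0 a non-pattern number across the 49 positions), assuming scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels for one data set: 7 pattern numbers among 49 positions.
y_true = [1] * 7 + [0] * 42
# Hypothetical prediction: 5 true positives, 2 false negatives, 2 false positives.
y_pred = [1] * 5 + [0] * 2 + [1] * 2 + [0] * 40

print(accuracy_score(y_true, y_pred))    # (5 TP + 40 TN) / 49, about 0.918
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 5/7
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 5/7
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```

Note that with only 7 positives in 49, accuracy alone is inflated by the many easy negatives, which is why precision, recall and F1 are asked for alongside it.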

C. Analyse each data set (each data set is one column of 49 numbers) in detail, using metrics that can be derived from the data set itself without reference to the prediction results.

a) Descriptive Statistics: Compute basic descriptive statistics such as mean, median, standard deviation, minimum, and maximum values. This provides an initial understanding of the central tendency and variability of the dataset.

b) Data Distribution: Visualize the distribution of the dataset using histograms, box plots, or kernel density plots. This helps identify any skewness, outliers, or patterns within the data.
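A minimal sketch of points (a) and (b), assuming NumPy and a synthetic 49-number column in place of a real data set (binned counts stand in for a histogram plot):

```python
import numpy as np

rng = np.random.default_rng(1)
column = rng.integers(1, 100, size=49)   # one data set: 49 numbers (synthetic stand-in)

stats = {
    "mean":   float(np.mean(column)),
    "median": float(np.median(column)),
    "std":    float(np.std(column, ddof=1)),   # sample standard deviation
    "min":    int(column.min()),
    "max":    int(column.max()),
}
# Pearson's second skewness coefficient: a quick, plot-free skewness check.
stats["skew"] = 3 * (stats["mean"] - stats["median"]) / stats["std"]

# Coarse view of the distribution (7 bins); a histogram/box plot would visualise this.
counts, edges = np.histogram(column, bins=7)

print(stats)
print(counts)
```

The bin count and value range are arbitrary choices for the sketch; on the real data sets, box plots or kernel density plots would expose outliers more directly.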

The objective of this analysis and expected results:

After this detailed study and analysis, we expect to gain the following abilities/knowledge:

I) Be able to classify or categorise the Test Sheets into categories or classes like:

a) Most friendly with SVM linear with ----test size.

b) Needs removal or addition of data sets so that the various metric values improve enough to give better prediction results.

c) ……

d) …..

II) Be able to classify or categorise individual data sets into categories or classes like:

a) Most friendly with SVM linear with ----test size.

b) Needs removal or addition of data sets so that the various metric values improve enough to give better prediction results.

c) ….

d) ……

III) Be able to remove or add training data sets in a test sheet to obtain the highest possible number of correct predictions for each type of prediction model and test size.

IV) Any other corrective actions that help us achieve high prediction accuracy.

Plan of Action

To ensure precise predictions from these models, we have to compute a few metrics that depict the efficiency of each model. These metrics are listed below with details:

Accuracy: the proportion of correctly classified occurrences, as defined by the pattern set. Count the predictions that match the pattern set and compute the proportion; its complement gives the error rate. We know the threshold and use it to interpret the results.

Confusion Matrix: accuracy alone is not enough to judge a model's efficiency, so conduct an in-depth analysis using the underlying information. The confusion matrix gives the true negatives, true positives, false negatives and false positives. These measures help us understand the variations and whether we can rely on a particular model.

Sensitivity and Specificity: these measures show how many true positives (predictions) are correctly identified as pattern numbers, and likewise how many numbers are correctly identified as non-pattern numbers.
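The confusion matrix, sensitivity, and specificity described above can be computed as follows. This sketch reuses hypothetical labels (1 = pattern number, 0 = non-pattern number over 49 positions) and assumes scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 5 true positives, 2 false negatives, 2 false positives.
y_true = [1] * 7 + [0] * 42
y_pred = [1] * 5 + [0] * 2 + [1] * 2 + [0] * 40

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate: pattern numbers correctly found
specificity = tn / (tn + fp)   # true negative rate: non-pattern numbers correctly found

print(tn, fp, fn, tp)                              # 40 2 2 5
print(round(sensitivity, 3), round(specificity, 3))  # 0.714 0.952
```

Sensitivity here is the same quantity as recall; specificity complements it by checking the model is not simply flagging many positions as pattern numbers.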

Project ID: #37827443

### About the project

## 19 freelancers are bidding on average ₹3024 for this job
