Evaluation methods in machine learning are used to assess the accuracy of predictions made by a model. There are many different evaluation metrics, but some common ones include accuracy, precision, recall, and F1 score.
To choose the appropriate evaluation metric for your task, you need to understand the trade-offs between them. For example, accuracy is very intuitive and easy to understand, but it can be misleading if there is a class imbalance in your data (i.e., one class is much more frequent than another). In that case, you might want to use precision or recall instead.
Another consideration when choosing an evaluation metric is whether you care more about false positives or false negatives. For example, if you are building a model to detect fraudsters on a website, you probably care more about false negatives (i.e., fraudsters that are not detected) than false positives (i.e., non-fraudsters that are incorrectly flagged as fraudsters). In that case, you would want to use a metric like recall, which directly penalizes false negatives, as the sketch below illustrates.
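To make the class-imbalance point concrete, here is a minimal sketch using toy fraud labels invented for illustration and scikit-learn's metric helpers: a model that never flags anyone still scores 95% accuracy while catching zero fraudsters.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy fraud-detection labels: 1 = fraudster, 0 = legitimate user (illustrative only).
y_true = [0] * 95 + [1] * 5    # 5% of users are fraudsters (class imbalance)
y_pred = [0] * 100             # a "lazy" model that never flags anyone

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- every fraudster slips through
```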
Accuracy
Accuracy measures how often a model's predictions are correct. It is the percentage of correct predictions made by the model, usually reported on held-out (unseen) data.
There are two types of accuracy: Classification Accuracy and Regression Accuracy.
Classification Accuracy is the percentage of correctly classified examples out of all the examples in the dataset.
$$ \text{Classification Accuracy} = \frac{\text{Number of correctly classified examples}}{\text{Total number of examples}}$$
Regression Accuracy is the percentage of predicted values that fall within a certain range around the actual value.
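As a sketch of how these two notions of accuracy might be computed, the snippet below implements the classification formula above and a tolerance-based regression accuracy; the ±10% tolerance is an arbitrary choice for illustration, not a standard value.

```python
import numpy as np

def classification_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class matches the true class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def regression_accuracy(y_true, y_pred, tolerance=0.10):
    """Fraction of predictions within +/- tolerance of the actual value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) <= tolerance * np.abs(y_true))

print(classification_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
print(regression_accuracy([100, 200, 300], [95, 230, 301]))  # 2 of 3 within 10% -> ~0.67
```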
Precision

In machine learning, precision is a measure of the quality of a model's positive predictions. Precision measures how many of the examples the model labels as positive are actually positive. It is important to note that precision is different from recall: while recall measures the percentage of relevant items that are retrieved, precision measures the percentage of retrieved items that are relevant.
A high precision score means that when the model predicts the positive class, it is usually correct. A low precision score means that when the model predicts the positive class, it is often wrong.
Precision is typically emphasized in classification tasks such as spam filtering and medical diagnosis, where we want to keep false positives to a minimum so as not to cause undue harm (e.g., mistakenly diagnosing a healthy person with a disease, or sending a legitimate email to the spam folder).
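Here is a small sketch of precision on toy spam-filter labels invented for illustration, computed both from the TP/FP counts and with scikit-learn's precision_score for comparison.

```python
from sklearn.metrics import precision_score

# Toy spam-filter labels: 1 = spam, 0 = not spam (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Precision = TP / (TP + FP): of the emails flagged as spam, how many really are spam?
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
print(tp / (tp + fp))                   # 0.75
print(precision_score(y_true, y_pred))  # same value via scikit-learn
```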
Recall

There are several ways to compute recall, but the most common is to simply take the number of true positives (examples that were correctly classified as positive) and divide it by the total number of positive examples. This can be written as:
Recall = TP / (TP + FN)
where TP is the number of true positives and FN is the number of false negatives.
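Using the same toy labels as in the precision example, here is a minimal sketch of the recall formula above, cross-checked against scikit-learn's recall_score.

```python
from sklearn.metrics import recall_score

# Same toy labels as in the precision example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Recall = TP / (TP + FN): of all actual positives, how many did the model catch?
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # same value via scikit-learn
```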
F1 score

The F1 score is a measure of a classifier’s accuracy. It is the harmonic mean of the precision and recall, where precision is the number of true positives divided by the sum of true positives and false positives, and recall is the number of true positives divided by the sum of true positives and false negatives. The F1 score ranges from 0 to 1, with higher values indicating better accuracy.
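As a quick illustration, the sketch below computes the F1 score as the harmonic mean of the precision and recall values from the toy example above, and confirms the result with scikit-learn's f1_score.

```python
from sklearn.metrics import f1_score

# Harmonic mean of the precision and recall obtained in the toy examples above.
precision, recall = 0.75, 0.75
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.75

# The same result computed directly from the toy labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(f1_score(y_true, y_pred))  # 0.75
```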
Precision-Recall or PR curve

A precision-recall curve (PR curve) is a graphical representation of a classifier's precision and recall at different decision thresholds. It is typically used in binary classification to study the trade-off between precision and recall, and it is especially informative when the classes are imbalanced.
The PR curve is a tool to help you understand how your classifier is performing, and can be used to compare different classifiers. The x-axis represents recall, and the y-axis represents precision. A perfect classifier would have a PR curve that reaches 1.0 on both axes.
The PR curve is created by varying the threshold for what counts as a positive prediction from 0 to 1, and computing precision and recall at each threshold. Precision is defined as TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives. Recall is defined as TP / (TP + FN), where FN is the number of false negatives.
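The sketch below shows how the points of a PR curve could be generated with scikit-learn's precision_recall_curve, on toy labels and scores invented for illustration; plotting recall (x) against precision (y) would draw the curve itself.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground-truth labels and predicted probabilities (illustrative only).
y_true   = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# precision_recall_curve sweeps the decision threshold and returns the
# precision/recall pair at each one.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```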
ROC (Receiver Operating Characteristics) curve
What is a ROC Curve?
A ROC curve is a graphical representation of the performance of a binary classification model. It plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. The TPR is also known as sensitivity or recall, and the FPR is also known as the fall-out or 1-specificity.
A perfect classifier would have a TPR of 1 and an FPR of 0, meaning that it would correctly identify all positive examples and never incorrectly classify a negative example as positive. In reality, no classifier is perfect and there is always a trade-off between TPR and FPR. A model with a high TPR and a low FPR is said to have good discrimination ability, meaning it can accurately distinguish between positive and negative examples.
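As a minimal sketch, the snippet below computes the FPR/TPR pairs of a ROC curve and the area under it (ROC AUC) with scikit-learn, again on toy labels and scores invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground-truth labels and predicted probabilities (illustrative only).
y_true   = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# roc_curve sweeps the decision threshold and returns the FPR/TPR pair at each
# one; roc_auc_score summarizes the whole curve as a single area-under-curve value.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(roc_auc_score(y_true, y_scores))  # 0.875 on this toy data
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```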
Why Use a ROC Curve?
ROC curves are useful for comparing different binary classification models, and for studying a single model's behavior before committing to a specific decision threshold. They can also be used to tune that threshold to find an optimal balance between TPR and FPR. For example, in medical diagnosis you may want to maximize sensitivity (TPR) so that all sick patients are correctly identified, even if this means some healthy patients will be incorrectly flagged as well. On the other hand, in security applications you may want to maximize specificity (1-FPR) so that unauthorized users are kept out, even if this means some legitimate users will occasionally be denied access.
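One way the ROC curve could be used for threshold tuning is sketched below: it picks the threshold that maximizes TPR - FPR (Youden's J statistic), which is just one common heuristic; a real application would weight the two error types according to their actual costs.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Reusing the toy labels and scores from the previous sketch.
y_true   = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Pick the threshold that maximizes TPR - FPR (Youden's J statistic).
best = np.argmax(tpr - fpr)
print(f"threshold={thresholds[best]:.2f}  TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```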