Performance Evaluation Metrics in Machine Learning

Ishita Gupta
6 min read · Apr 11, 2021

Evaluation of a machine learning algorithm is an essential process. The whole idea of building a machine learning model rests on a feedback loop: you build a model, get feedback, and improve it until you reach the desired accuracy and results. The goal of a good analyst or data scientist is not just to build a model but to build one that gives the best results, and achieving that means working on the robustness of the model. Receiving appropriate feedback and acting on it determines the growth curve of the model, and that is where evaluation comes in: it describes the performance of the model.

Evaluation metrics are, simply put, ways of evaluating a model. Many metrics are available, and they must be carefully understood and chosen, because the measured performance of the algorithm, and the weight given to different aspects of the result, depend entirely on your choice. You may wonder why so many different metrics exist; the answer is that each one measures performance in a different way, and a model may score well on one metric yet perform poorly on another. Also, some metrics evaluate classification problems better, while others suit regression problems. Each metric has its own significance, and understanding them together and applying them appropriately is what will make your model the best it can be.

Here we will talk about 12 such evaluation metrics:

1. Confusion Matrix
2. Classification Accuracy
3. Precision
4. Recall
5. Specificity
6. F1 Score
7. ROC
8. AUC
9. Log Loss
10. Mean Absolute Error
11. Mean Square Error
12. Root Mean Squared Error

1. Confusion Matrix

As the name suggests, a confusion matrix is a matrix that describes the performance of a model. Its size is n × n, where n is the number of classes, and it is usually used for classification problems. It has two dimensions, actual and predicted, and in the binary case its cells fall into four categories (a short code sketch follows the list):

True positive: the case where the algorithm predicted positive and the actual result was also positive.

True negative: the case where the algorithm predicted negative and the actual result was also negative.

False positive: the case where the algorithm predicted positive but the actual result was negative.

False negative: the case where the algorithm predicted negative but the actual result was positive.
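
As a quick illustration, here is a minimal sketch using scikit-learn's confusion_matrix; the label arrays are hypothetical:

```python
# Hypothetical labels, for illustration only.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model's predictions

# Rows are actual classes, columns are predicted classes.
# With labels=[0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)              # [[3 1]
                       #  [1 3]]
print(tn, fp, fn, tp)  # 3 1 1 3
```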

2. Classification Accuracy

It is defined as the number of correct predictions divided by the total number of input samples:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

It is the most popular metric for classification problems, though it can be misleading when the classes are heavily imbalanced.
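
Reusing the hypothetical confusion-matrix counts from the sketch above makes this concrete:

```python
# Accuracy from the hypothetical counts above: (TP + TN) / total.
tp, tn, fp, fn = 3, 3, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```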

3. Precision

Precision is defined as the number of correct positive results divided by the number of positive results predicted by the model:

Precision = TP / (TP + FP)

For example, if we are building a predictive model for cancer, precision tells us what proportion of the patients predicted to have cancer actually have it.
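
A one-call sketch with scikit-learn's precision_score, reusing the hypothetical labels from the confusion-matrix example:

```python
# Precision: of all predicted positives, how many were actually positive?
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```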

4. Recall

Recall is defined as the number of correct positive results divided by the number of all samples that are actually positive:

Recall = TP / (TP + FN)

In the cancer model, recall tells us what proportion of the cases that should have been flagged as cancer-positive were actually predicted positive by the model.
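
The corresponding sketch with recall_score, on the same hypothetical labels:

```python
# Recall: of all actual positives, how many did the model catch?
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```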

5. Specificity

Specificity is defined as the true negative rate: the number of correct negative results divided by the number of all samples that are actually negative:

Specificity = TN / (TN + FP)
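
scikit-learn has no dedicated specificity function, but it can be read straight off the confusion matrix, as in this sketch (same hypothetical labels):

```python
# Specificity: of all actual negatives, how many were predicted negative?
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn / (tn + fp))  # 3 / (3 + 1) = 0.75
```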

6. F1 Score

The F1 score combines precision and recall into a single number for a classification problem: it is defined as their harmonic mean,

F1 = 2 × (Precision × Recall) / (Precision + Recall)

It judges both the precision and the robustness of the model, and it is high only when both components are high.
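
A sketch showing scikit-learn's f1_score and the harmonic-mean formula agreeing, with the 0.75 precision and recall values from the examples above:

```python
# F1 via scikit-learn, and the same value from the harmonic-mean formula.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))  # 0.75

precision = recall = 0.75  # from the precision and recall examples above
print(2 * precision * recall / (precision + recall))  # also 0.75
```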

7. ROC (Receiver Operating Characteristic) Curve

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection. The false positive rate is also known as the fall-out, 1 − specificity, or probability of false alarm.

ROC curves have the useful property that they are not affected by the proportion of positive to negative instances in the dataset.
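
A minimal sketch with scikit-learn's roc_curve; the scores stand in for hypothetical predicted probabilities of the positive class:

```python
# TPR and FPR at each threshold that roc_curve chooses to evaluate.
from sklearn.metrics import roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```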

8. AUC (Area Under the ROC Curve)

As the name suggests, AUC is the area under the ROC curve. The AUC-ROC metric tells us how capable the model is of distinguishing between the classes, aggregated over all possible classification thresholds. It lies in the range [0, 1]; the higher the AUC, the better the model, with 0.5 corresponding to random guessing. It is best suited to binary classification problems.
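
With scikit-learn, roc_auc_score computes this directly (same hypothetical scores as in the ROC sketch):

```python
# AUC: probability that a random positive is scored above a random negative.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

print(roc_auc_score(y_true, y_score))  # 0.9375 for these made-up scores
```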

9. Log Loss (Logarithmic Loss)

This metric is also called logistic regression loss or cross-entropy loss. It is the negative average of the log of the corrected predicted probabilities for each instance, where the "corrected" probability is the probability the model assigned to the actual class. It is defined on probability estimates, so it measures the performance of a classification model whose output is a probability between 0 and 1. For binary classification:

Log Loss = −(1/N) × Σ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

Log loss gradually declines as the predicted probabilities improve, so the lower the log loss, the better the model. However, there is no absolute threshold for a good log loss; it is use-case and application dependent.
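
The following sketch computes binary log loss by hand with NumPy, matching the formula above; the predicted probabilities are made up:

```python
# Negative average log of the probability assigned to the true class.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # P(class = 1)

log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(log_loss)  # lower is better
```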

10. Mean Absolute Error (MAE)

It is the simplest error metric used in regression problems. It is the average of the absolute differences between the predicted and actual values. In simple words, MAE gives us an idea of how wrong the predictions were, though not of their direction. Mathematically it is represented as:

MAE = (1/N) × Σ |yᵢ − ŷᵢ|
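
A minimal NumPy sketch, using made-up regression targets and predictions:

```python
# MAE: average absolute difference between actual and predicted values.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(np.mean(np.abs(y_true - y_pred)))  # 0.5
```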

11. Mean Square Error (MSE)

Mean square error is similar to MAE, the only difference being that MSE takes the average of the squares of the differences between the original and predicted values:

MSE = (1/N) × Σ (yᵢ − ŷᵢ)²

The advantage of MSE is that its gradient is easy to compute, whereas the absolute value in MAE is not differentiable at zero, which makes gradient-based optimization more awkward. Because the errors are squared, larger errors become more pronounced than smaller ones, so the model can focus more on the larger errors.
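
The same made-up values as in the MAE sketch, with the differences squared before averaging:

```python
# MSE: average squared difference between actual and predicted values.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(np.mean((y_true - y_pred) ** 2))  # 0.375
```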

12. Root Mean Squared Error (RMSE)

RMSE is the most popular evaluation metric for regression problems. It is a standard way to measure the error of a model in predicting quantitative data, and it rests on the assumption that the errors are unbiased and follow a normal distribution. The RMSE metric is given by:

RMSE = √( (1/N) × Σ (yᵢ − ŷᵢ)² )

where N is the total number of observations.
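
Since RMSE is just the square root of MSE, the sketch adds one function call (same made-up values):

```python
# RMSE: square root of the mean squared error.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(np.sqrt(np.mean((y_true - y_pred) ** 2)))  # ~0.612
```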
