Evaluation Metrics for Classification Problems You Should Know, with Python Implementation

  • Binary Classification
  • Multi-Class Classification
  • Multi-Label Classification
  • Imbalanced Classification

Binary Classification

  • Email spam detection (spam or not).
  • Review prediction (positive or negative).
  • Cancer prediction (yes or no).

Multi-Class Classification

  • Face classification.
  • Plant species classification.
  • Optical character recognition.

Multi-Label Classification

Imbalanced Classification

Warming up: The flow of a Machine Learning model

1. Confusion Matrix, Accuracy, Precision, and Recall:

A. Confusion Matrix:
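As a minimal sketch (the labels below are hypothetical, purely for illustration), a binary confusion matrix can be built with scikit-learn's `confusion_matrix`:

```python
# Minimal sketch: building a binary confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

# For binary labels [0, 1], rows are actual classes and columns are
# predicted classes, so the matrix is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Note that scikit-learn places true negatives in the top-left cell; some textbooks draw the matrix the other way around, so it is worth checking the orientation before reading off TP/FP counts.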

B. Accuracy:

When to use?

Caveats
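A quick sketch of computing accuracy, i.e. (TP + TN) / total predictions, again on hypothetical labels:

```python
# Minimal sketch: accuracy = (TP + TN) / total predictions.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)  # 6 of 8 predictions are correct
print(acc)
```

On a heavily imbalanced set the same number can be misleading: always predicting the majority class of a 95/5 split already scores 0.95 accuracy.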

C. Precision:

When to use?
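Precision asks: of everything the model flagged as positive, how much really is positive? A minimal sketch, on the same hypothetical labels:

```python
# Minimal sketch: precision = TP / (TP + FP).
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

prec = precision_score(y_true, y_pred)  # TP = 3, FP = 1 -> 3/4
print(prec)
```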

D. Recall:

When to use?

Case 1: COVID-19 test prediction (cost of FN > cost of FP)

  • the result of TP will be that residents who have COVID-19 are correctly diagnosed with COVID-19.
  • the result of TN will be that healthy residents are correctly identified as healthy.
  • the result of FP will be that actually healthy residents are wrongly predicted to have COVID-19.
  • the result of FN will be that residents who actually have COVID-19 are predicted to be healthy.

Case 2: Email spam/not-spam prediction (cost of FP > cost of FN)

  • the result of TP will be that spam emails are placed in the spam folder.
  • the result of TN will be that important emails are received.
  • the result of FP will be that important emails are placed in the spam folder.
  • the result of FN will be that spam emails are received.

Case 3: Cancer diagnosis (cost of FN > cost of FP)

  • the result of TP will be that the patient has cancer and the model predicted cancer (all good).
  • the result of TN will be that the model predicted no cancer and the patient actually has no cancer (again all good).
  • the result of FP will be that the patient has no cancer but the model diagnosed cancer (class 1).
  • the result of FN will be that the patient has cancer but the model diagnosed no cancer (class 0).
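In cases like these, where a false negative is the expensive mistake, recall (TP / (TP + FN)) is the number to watch. A minimal sketch on hypothetical labels:

```python
# Minimal sketch: recall = TP / (TP + FN),
# i.e. of all actual positives, how many did the model catch?
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

rec = recall_score(y_true, y_pred)  # TP = 3, FN = 1 -> 3/4
print(rec)
```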

Combining Precision and Recall — F1 Score
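The F1 score is the harmonic mean of precision and recall, so a low value of either one drags it down. A sketch verifying the formula against scikit-learn, on the same hypothetical labels:

```python
# Minimal sketch: F1 = 2 * P * R / (P + R), the harmonic mean.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # equals 2 * p * r / (p + r)
print(p, r, f1)
```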

2. ROC/AUC
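ROC works on predicted scores or probabilities rather than hard labels: the curve traces the true-positive rate against the false-positive rate across all thresholds, and AUC summarizes it as one number. A minimal sketch with hypothetical scores:

```python
# Minimal sketch: ROC curve and AUC from predicted scores.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]            # hypothetical actual labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)  # 1.0 = perfect, 0.5 = random
print(auc)
```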

3. Log Loss/Binary Crossentropy

When to Use?

Caveats
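Log loss averages the negative log of the probability the model assigned to the true class, so confident wrong predictions are punished heavily. A sketch comparing scikit-learn's `log_loss` with the hand computation, on hypothetical probabilities:

```python
# Minimal sketch: binary cross-entropy / log loss.
import math
from sklearn.metrics import log_loss

y_true = [1, 0, 1]        # hypothetical actual labels
y_prob = [0.9, 0.1, 0.8]  # hypothetical P(class = 1) per sample

ll = log_loss(y_true, y_prob)
# Same thing by hand: mean of -log(probability given to the true class).
manual = -(math.log(0.9) + math.log(1 - 0.1) + math.log(0.8)) / 3
print(ll, manual)
```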

4. Categorical Crossentropy

When to Use?

Caveats:
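Categorical cross-entropy is the same idea extended to multiple classes: each prediction is a probability distribution over the classes, and the loss is the mean negative log-probability of the true class. A sketch with hypothetical three-class predictions:

```python
# Minimal sketch: categorical cross-entropy over 3 classes.
import math
from sklearn.metrics import log_loss

y_true = [0, 2, 1]              # hypothetical integer class labels
y_prob = [[0.7, 0.2, 0.1],      # each row: hypothetical predicted
          [0.1, 0.3, 0.6],      # probability distribution over the
          [0.2, 0.6, 0.2]]      # classes 0, 1, 2

cce = log_loss(y_true, y_prob, labels=[0, 1, 2])
# By hand: mean of -log(probability assigned to the true class).
manual = -(math.log(0.7) + math.log(0.6) + math.log(0.6)) / 3
print(cce, manual)
```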

What about Multi-Class Problems?

Then how can you calculate Precision & Recall for problems with Multiple classes as labels?

Macro-Average Method:

  • [urgent, normal] = 10 means that 10 normal (actual label) mails have been classified as urgent.
  • [spam, urgent] = 3 means that 3 urgent (actual label) mails have been classified as spam.
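Macro averaging computes the metric separately for each class and then takes the unweighted mean, so every class counts equally regardless of size. Using the two-class counts listed in the micro-average example (TP1 = 12, FP1 = 9, FN1 = 3; TP2 = 50, FP2 = 23, FN2 = 9), a sketch:

```python
# Macro averaging: per-class precision/recall, then the plain mean.
tp1, fp1, fn1 = 12, 9, 3    # class 1 counts
tp2, fp2, fn2 = 50, 23, 9   # class 2 counts

p1 = tp1 / (tp1 + fp1)              # 12 / 21
p2 = tp2 / (tp2 + fp2)              # 50 / 73
macro_precision = (p1 + p2) / 2

r1 = tp1 / (tp1 + fn1)              # 12 / 15
r2 = tp2 / (tp2 + fn2)              # 50 / 59
macro_recall = (r1 + r2) / 2

print(round(macro_precision, 3), round(macro_recall, 3))
```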

Micro-Average Method:

True positive (TP1)  = 12
False positive (FP1) = 9
False negative (FN1) = 3
True positive (TP2)  = 50
False positive (FP2) = 23
False negative (FN2) = 9
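Micro averaging instead pools the TP/FP/FN counts across all classes first and computes the metric once, so large classes dominate the result. Using the counts above:

```python
# Micro averaging: pool counts across classes, then compute once.
tp1, fp1, fn1 = 12, 9, 3    # class 1 counts
tp2, fp2, fn2 = 50, 23, 9   # class 2 counts

micro_precision = (tp1 + tp2) / (tp1 + tp2 + fp1 + fp2)  # 62 / 94
micro_recall    = (tp1 + tp2) / (tp1 + tp2 + fn1 + fn2)  # 62 / 74
print(round(micro_precision, 3), round(micro_recall, 3))
```

Comparing the micro and macro numbers on the same counts is a quick way to see whether one class is skewing the aggregate metric.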

Summary

