Data Leakage

Overlap Leakage:

What is Overlap Leakage?

Overlap leakage occurs when information is unintentionally shared between the training and test sets used to build and evaluate a machine learning model.

Causes of overlap leakage:

Overlap leakage occurs when the same or highly similar data points are present in both the training and test sets, for example when sampling is applied to the full dataset before it is split. A model that is evaluated on data it has effectively already seen during training produces overly optimistic performance estimates and may not generalize well to new, unseen data.
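One way to make this concrete is to count how many test rows also appear verbatim in the training set. The following is a minimal sketch that assumes rows are small tuples of hashable values; the helper name count_overlapping_rows and the toy data are hypothetical and not part of any tool described here. It only catches exact duplicates, not "highly similar" rows.

```python
def count_overlapping_rows(train_rows, test_rows):
    """Return how many test rows also appear verbatim in the training rows."""
    train_set = {tuple(row) for row in train_rows}
    return sum(1 for row in test_rows if tuple(row) in train_set)


# Hypothetical toy data: each row is (feature_1, feature_2, label).
train = [(5.1, 3.5, 0), (4.9, 3.0, 0), (6.2, 2.9, 1), (6.2, 2.9, 1)]
test = [(6.2, 2.9, 1), (5.9, 3.0, 1)]

overlap = count_overlapping_rows(train, test)
print(f"{overlap} of {len(test)} test rows also appear in the training set")
```

A non-zero count suggests that the evaluation may be scoring the model on rows it was trained on.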

Solutions for overlap leakage:

Carefully managing the split between the training and test sets reduces the risk of overlap leakage and yields more reliable performance evaluations for the ML model. In particular, split the data first and apply sampling to the training set only.

Example of Overlap Leakage Code Displayed Below

In this example, sampling happens before the split: line 9 calls fit_resample(X, y) before train_test_split is called.

Example of How Quick Fix Would Be Performed

When you hover over and click the highlighted issue, a pop-up appears with the message "Potential overlap leakage associated with this variable". To apply the quick fix, click the "Swap splitting and sampling" button beneath it.

Fix Overlap Leakage Displayed Below

Once the quick fix is applied, the train_test_split call comes before the fit_resample call.