Overlap Leakage:
What is Overlap Leakage?
Overlap leakage occurs when the training and testing datasets used to build and evaluate a machine learning model unintentionally share information.
Causes of overlap leakage:
Overlap leakage occurs when the same or highly similar data points are present in both the training and testing sets. A common cause is resampling the data before it is split, which can place copies of one original data point on both sides of the split. When the model is trained on data that shares information with the test set, performance evaluations become overly optimistic and the model may not generalize well to new, unseen data.
Solutions for overlap leakage:
- Randomized Splitting: Use a randomized approach when splitting the dataset into training and testing sets, and split before any resampling step. Random assignment avoids ordering effects, but it does not remove duplicates on its own, so deduplicate first if the dataset contains identical records.
- Stratified Sampling: If the dataset has class imbalances, use stratified sampling to maintain the distribution of classes in both the training and testing sets. This can help prevent situations where certain classes are overrepresented or underrepresented in one of the sets.
- Temporal Splitting: If the data has a temporal dimension, split the dataset based on time. The training set should include data from earlier time periods, while the testing set should include data from later time periods. This helps simulate a more realistic scenario where the model needs to generalize to future, unseen data.
- Geographical Splitting: In some cases, especially in spatial data, geographical splitting can be useful. Ensure that instances from specific geographical regions are present in either the training or testing set but not in both.
Managing the split between the training and testing datasets in these ways reduces the risk of overlap leakage and yields more reliable performance evaluations for the ML model.
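The temporal and geographical strategies above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `records` data, the cutoff date, and the helper names `temporal_split` and `region_split` are all hypothetical and do not come from any library.

```python
from datetime import date

# Hypothetical records: (record_id, observation_date, region)
records = [
    (1, date(2023, 1, 5), "north"),
    (2, date(2023, 2, 9), "south"),
    (3, date(2023, 3, 14), "north"),
    (4, date(2023, 4, 20), "east"),
    (5, date(2023, 5, 2), "south"),
    (6, date(2023, 6, 11), "east"),
]

def temporal_split(rows, cutoff):
    """Train on rows strictly before the cutoff date, test on the rest."""
    train = [r for r in rows if r[1] < cutoff]
    test = [r for r in rows if r[1] >= cutoff]
    return train, test

def region_split(rows, test_regions):
    """Hold out whole regions so that no region appears on both sides."""
    train = [r for r in rows if r[2] not in test_regions]
    test = [r for r in rows if r[2] in test_regions]
    return train, test

time_train, time_test = temporal_split(records, cutoff=date(2023, 5, 1))
geo_train, geo_test = region_split(records, test_regions={"east"})

# Sanity checks: earlier data trains, later data tests; regions never overlap.
latest_train_date = max(r[1] for r in time_train)
earliest_test_date = min(r[1] for r in time_test)
train_regions = {r[2] for r in geo_train}
held_out_regions = {r[2] for r in geo_test}
```

Comparing `latest_train_date` with `earliest_test_date`, or intersecting `train_regions` with `held_out_regions`, is a cheap way to confirm that a split respects the intended boundary.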
Example of Overlap Leakage Code Displayed Below
In the example below, sampling happens before the split. As seen in the code, line 9 calls fit_resample(X,y) before train_test_split.
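The effect of this ordering can be reproduced without any libraries. In the minimal sketch below, `oversample_minority` and `random_split` are hypothetical stand-ins for `fit_resample` and `train_test_split`. They are simplified assumptions for illustration, not the real implementations, and the toy data is invented.

```python
import random

def oversample_minority(X, y, seed=0):
    """Hypothetical stand-in for a sampler's fit_resample(X, y): randomly
    duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    rows_by_class = {}
    for features, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in rows_by_class.values())
    X_out, y_out = [], []
    for label, rows in rows_by_class.items():
        extras = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extras)
        y_out.extend([label] * target)
    return X_out, y_out

def random_split(X, y, test_fraction=0.2, seed=0):
    """Hypothetical stand-in for train_test_split: shuffle, then hold out a fraction."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy imbalanced data: 40 majority rows and 12 minority rows, all unique.
X = [(i, 2 * i) for i in range(52)]
y = [0] * 40 + [1] * 12

# Leaky order: resample first, then split.
X_res, y_res = oversample_minority(X, y)
X_train, X_test, y_train, y_test = random_split(X_res, y_res)

# Duplicated minority rows can now sit on both sides of the split.
shared_rows = set(X_train) & set(X_test)
print(f"rows present in both train and test: {len(shared_rows)}")
```

Because the duplicates are created before the split, a random split will almost always scatter copies of the same minority row across both sets, so the reported count is typically greater than zero.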
Example of How Quick Fix Would Be Performed
When you hover over the highlighted issue and click on it, a pop-up appears that says "Potential overlap leakage associated with this variable". To apply the quick fix, click the "Swap splitting and sampling" button underneath it.
Fix Overlap Leakage Displayed Below
Once the quick fix is performed, the train_test_split line comes before the fit_resample line, so only the training data is resampled.
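A library-free sketch of the corrected order follows. The helpers `random_split` and `oversample_minority` are hypothetical stand-ins for `train_test_split` and `fit_resample`. They are simplified assumptions for illustration, and the toy data is invented.

```python
import random

def random_split(X, y, test_fraction=0.2, seed=0):
    """Hypothetical stand-in for train_test_split: shuffle, then hold out a fraction."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

def oversample_minority(X, y, seed=0):
    """Hypothetical stand-in for a sampler's fit_resample(X, y): randomly
    duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    rows_by_class = {}
    for features, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in rows_by_class.values())
    X_out, y_out = [], []
    for label, rows in rows_by_class.items():
        extras = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extras)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy imbalanced data: 40 majority rows and 12 minority rows, all unique.
X = [(i, 2 * i) for i in range(52)]
y = [0] * 40 + [1] * 12

# Corrected order: split first, then resample the training portion only.
X_train, X_test, y_train, y_test = random_split(X, y)
X_train_bal, y_train_bal = oversample_minority(X_train, y_train)

# The test set is untouched, so no duplicated row can reach it.
shared_rows = set(X_train_bal) & set(X_test)
print(f"rows present in both train and test: {len(shared_rows)}")
```

With this order the test set keeps its original rows and class distribution, and the duplicates created by resampling stay inside the training set.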