Data Leakage

Preprocessing Leakage

What is preprocessing leakage?

Preprocessing leakage occurs when information from the test set influences the preprocessing steps applied to the training set.

Causes of preprocessing leakage:

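Leakage arises when preprocessing steps such as scaling, imputation, or feature selection are fitted on the full dataset before it is split, so that statistics computed from test rows shape the transformed training data. This leads to overly optimistic performance estimates during model development and may result in poor generalization to new, unseen data.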

Solutions for preprocessing leakage:

Perform Preprocessing Independently: Fit preprocessing steps on the training set only, then apply the fitted steps to the test set.
Use Pipelines: Implement preprocessing steps within a pipeline so that each step is fitted on the training data and applied unchanged to the test data.
Handle Missing Values Appropriately: If missing values are imputed, use methods based solely on information from the training set. Avoid statistics computed over the full dataset or values derived from the test set.
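
The three solutions above can be sketched together as follows. This is a minimal illustration that assumes scikit-learn; the synthetic data, the choice of mean imputation, the scaler, and the LogisticRegression model are all assumptions made for the example, not part of the original material.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% missing values (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Split first, so the test set plays no part in any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Imputation and scaling live inside the pipeline, so calling fit()
# computes their statistics from the training rows only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)

# score() reuses the training-set statistics to transform the test set.
test_accuracy = pipe.score(X_test, y_test)
```

Because the imputer and scaler are pipeline steps, the same object can also be passed to cross-validation utilities, which refit every step inside each fold.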

Example of Preprocessing Leakage Code Displayed Below

In the example shown below, there is preprocessing leakage because feature selection is performed before the data is split into training and test sets.
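
A minimal sketch of this leaky pattern, assuming scikit-learn. The name gbc follows the gbc.fit(X_train, y_train) call mentioned in this section; the synthetic dataset, the use of SelectKBest as the feature selection step, and train_test_split as the split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# LEAKY: the selector is fitted on every row, including the rows
# that will later become the test set.
selector = SelectKBest(f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# The split happens only after feature selection has seen all the data.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.25, random_state=0
)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)

# This score may be optimistic, because test labels helped choose the features.
leaky_score = gbc.score(X_test, y_test)
```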

Example of How Quick Fix Would Be Performed

When you hover over and click the highlighted issue, a pop-up appears that says "Potential preprocessing leakage associated with this variable. See Line 12 which contains the source of the leakage". To apply the quick fix, click the button underneath it, "Perform split before feature selection".

Fix Preprocessing Leakage Displayed Below

In the fix, you can see that the split() line now comes before feature selection and before the gbc.fit(X_train,y_train) line, so feature selection uses only the training data.
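
A minimal sketch of the corrected ordering, assuming scikit-learn. The name gbc follows the gbc.fit(X_train, y_train) call mentioned above; the synthetic dataset, SelectKBest as the feature selection step, train_test_split as the split, and the X_train_raw / X_test_raw names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Split first: the test rows are set aside before any fitting happens.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the selector on the training rows only...
selector = SelectKBest(f_classif, k=5)
X_train = selector.fit_transform(X_train_raw, y_train)

# ...then apply the already-fitted selector to the test rows.
X_test = selector.transform(X_test_raw)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)

# The test set influenced neither the selected features nor the model.
clean_score = gbc.score(X_test, y_test)
```

The selector is fitted once, on the training rows, and the test rows are only ever passed to transform(), which is what keeps the evaluation honest.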