Data Leakage

Preprocessing Leakage

What is preprocessing leakage?

Preprocessing leakage occurs when information from the test set influences the preprocessing steps applied to the training set.

Causes of preprocessing leakage:

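Preprocessing leakage arises when a preprocessing step, such as feature selection, is fitted on the full dataset before the data is split into training and test sets. Because the step has then seen information from the test set, performance estimates during model development become overly optimistic, and the model may generalize poorly to new, unseen data.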

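Solutions for preprocessing leakage:

Split the data into training and test sets before any preprocessing step such as feature selection, and fit those steps on the training set only.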
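One common way to keep preprocessing confined to training data is to wrap it in a pipeline, so that it is refitted on the training portion of every split. The sketch below is illustrative only: it assumes scikit-learn, and the synthetic data, the SelectKBest selector, and the parameter values are assumptions rather than anything taken from the tool.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data for illustration only (assumption, not from the tool).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)

# Feature selection lives inside the pipeline, so each cross-validation
# fold fits the selector on that fold's training rows only.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("gbc", GradientBoostingClassifier(n_estimators=20, random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

With this arrangement the held-out rows of each fold never influence which features are selected.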

Example of Preprocessing Leakage Code Displayed Below

In this example, there is preprocessing leakage because feature selection is performed before the data is split into training and test sets.
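A minimal sketch of what code with this kind of leakage might look like, assuming scikit-learn. Only the name gbc and the idea of splitting come from this page. The synthetic data, the SelectKBest selector, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: pure-noise features, random labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

# LEAK: the selector is fitted on every row, including rows that
# will later end up in the test set.
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# The split happens only after feature selection has seen all the data.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=0
)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
leaky_score = gbc.score(X_test, y_test)
print(leaky_score)
```

Because the labels here are random, any test accuracy noticeably above chance would come from the selector having seen the test rows, not from real signal.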

Example of How Quick Fix Would Be Performed

When you hover over and click the highlighted issue, a pop-up appears that says "Potential preprocessing leakage associated with this variable. See Line 12 which contains the source of the leakage". To apply the quick fix, click the button underneath it, "Perform split before feature selection".

Fix Preprocessing Leakage Displayed Below

In the fix, the split() call now appears before the gbc.fit(X_train, y_train) line, so the data is split before feature selection is performed.
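A minimal sketch of what the corrected ordering might look like, again assuming scikit-learn. The synthetic data, the SelectKBest selector, and the variable names other than gbc, X_train, and y_train are assumptions for illustration, not the tool's actual output.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: pure-noise features, random labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

# Split first, so the test rows are set aside before any preprocessing.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the selector on the training rows only, then apply it to both sets.
selector = SelectKBest(f_classif, k=10)
X_train = selector.fit_transform(X_train_raw, y_train)
X_test = selector.transform(X_test_raw)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
clean_score = gbc.score(X_test, y_test)
print(clean_score)
```

Here the test rows play no part in choosing features, so the reported score is an honest estimate of performance on unseen data.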