Data Leakage

Multi-test Leakage

What is multi-test leakage?

Multi-test leakage occurs when information from multiple tests or experiments is unintentionally shared or reused in a way that compromises the validity or independence of those tests, for example when the same test set is reused across several evaluations.

Causes of multi-test leakage:

The potential for multi-test leakage arises when tokenization and padding are applied to the entire dataset before it is split into training, validation, and test sets.
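
The ordering problem described above can be illustrated with a minimal sketch. The toy corpus, the whitespace tokenizer, and the helper name build_vocab are assumptions made for illustration only; they are not taken from any particular library.

```python
# Hypothetical toy corpus; all names here are illustrative.
texts = ["the cat sat", "the dog ran", "a zebra slept", "the cat ran"]


def build_vocab(docs):
    """Map each distinct whitespace token to an integer id (0 is left for padding)."""
    vocab = {}
    for doc in docs:
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


# Leaky order: the vocabulary and padding length are computed on ALL rows first...
leaky_vocab = build_vocab(texts)
leaky_max_len = max(len(t.split()) for t in texts)

# ...and only afterwards is the data split.
train_texts, test_texts = texts[:2], texts[2:]

# Tokens that occur only in the test rows are already known to the preprocessing.
train_only_vocab = build_vocab(train_texts)
leaked_tokens = sorted(set(leaky_vocab) - set(train_only_vocab))
print(leaked_tokens)
```

In this sketch, leaked_tokens lists the tokens that the preprocessing could only have learned from the test rows.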

Solutions for multi-test leakage:

Tokenization and padding within each split:

Tokenization and padding in a pipeline:

Overall, applying tokenization and padding separately to each split ensures that information about the structure of the data (e.g., token sequences, padding lengths) is not shared between the training, validation, and test sets, which reduces the risk of multi-test leakage.
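
One way to read this advice, assumed here, is to split first, learn the vocabulary and padding length from the training split only, and then apply that fixed preprocessing to each split on its own. The following is a minimal sketch under that assumption; TextPreprocessor, PAD_ID, and OOV_ID are hypothetical names, not part of any existing library.

```python
PAD_ID, OOV_ID = 0, 1  # assumed id conventions for this sketch


class TextPreprocessor:
    """Hypothetical tokenizer and padder that learns its state from training data only."""

    def fit(self, docs):
        self.vocab = {}
        for doc in docs:
            for tok in doc.split():
                # Ids start at 2 so that PAD_ID and OOV_ID stay reserved.
                self.vocab.setdefault(tok, len(self.vocab) + 2)
        self.max_len = max(len(doc.split()) for doc in docs)
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            # Unseen tokens map to OOV_ID; longer rows are truncated to max_len.
            ids = [self.vocab.get(tok, OOV_ID) for tok in doc.split()][: self.max_len]
            rows.append(ids + [PAD_ID] * (self.max_len - len(ids)))
        return rows


texts = ["the cat sat", "the dog ran", "a zebra slept quietly", "the cat ran"]

# Split BEFORE any preprocessing state is learned.
train_texts, val_texts, test_texts = texts[:2], texts[2:3], texts[3:]

prep = TextPreprocessor().fit(train_texts)  # fitted on the training split only
X_train = prep.transform(train_texts)
X_val = prep.transform(val_texts)
X_test = prep.transform(test_texts)
```

Because fit only ever sees train_texts, tokens and sequence lengths that appear solely in the validation or test rows cannot influence the vocabulary or the padding length.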

Example of Multi-test Leakage Code Displayed Below

In the example shown below, X_test is used more than once (on line 14 and line 18), hence the multi-test leakage.
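
The reuse pattern can also be sketched in a few lines. This is an illustrative stand-in rather than the original example, so the line numbers mentioned above do not apply to it; the threshold "model", the data values, and the accuracy helper are all assumptions.

```python
def accuracy(threshold, X, y):
    """Fraction of points where (x > threshold) matches the label."""
    preds = [int(x > threshold) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Hypothetical held-out data.
X_test = [0.2, 0.4, 0.6, 0.8]
y_test = [0, 0, 1, 1]

# First evaluation: candidate thresholds are compared on the test set.
candidates = [0.3, 0.5, 0.7]
scores = {t: accuracy(t, X_test, y_test) for t in candidates}
best = max(scores, key=scores.get)

# Second evaluation: the chosen threshold is scored on the SAME test set,
# so the reported number is optimistically biased.
final_score = accuracy(best, X_test, y_test)
```

Here X_test both selects the threshold and produces the final score, which is the kind of reuse that gets flagged.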

Example of How Quick Fix Would Be Performed

The variable that is used more than once is highlighted. When you click on it, a pop-up appears that says "Potential multi-test leakage associated with this variable. Avoid reusing the same test data in separate evaluations". Below this message is a button labeled "Use new test data for each evaluation". Clicking this button performs the quick fix.

Fixed Multi-test Leakage Code Displayed Below

Once fixed, you can see that line 14 and line 18 now use different variable names in place of X_test: on line 14, X_test is renamed to X_test_0, and on line 18, X_test is renamed to X_test_1.
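
Renaming only helps if the two variables actually hold different data. The sketch below assumes that X_test_0 and X_test_1 are disjoint parts of the held-out data; the threshold "model", the data values, and the accuracy helper are hypothetical and serve only to illustrate the idea.

```python
def accuracy(threshold, X, y):
    """Fraction of points where (x > threshold) matches the label."""
    preds = [int(x > threshold) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Hypothetical held-out data, divided into two disjoint parts.
X_held = [0.2, 0.4, 0.6, 0.8, 0.1, 0.45, 0.55, 0.9]
y_held = [0, 0, 1, 1, 0, 1, 1, 1]
X_test_0, y_test_0 = X_held[:4], y_held[:4]
X_test_1, y_test_1 = X_held[4:], y_held[4:]

# First evaluation uses X_test_0 only.
candidates = [0.3, 0.5, 0.7]
scores = {t: accuracy(t, X_test_0, y_test_0) for t in candidates}
best = max(scores, key=scores.get)

# Second evaluation uses X_test_1, which played no part in the selection.
final_score = accuracy(best, X_test_1, y_test_1)
```

With the evaluations separated, the final score comes from data that had no influence on which threshold was chosen.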