KDD 2023 - Auto-Validate by-History

Dezhan Tu, University of California, Los Angeles

Data quality (DQ) issues often creep into recurring pipelines and harm downstream consumers, which makes data validation essential. Current state-of-the-art data-validation tools, such as Google TFDV and Amazon Deequ, largely rely on users to manually program DQ constraints. In this work, we propose Auto-Validate-by-History. Our key insight is that historical data from the past K executions of the same pipeline gives strong signals as to whether the (K+1)-th execution is anomalous. We compute a large collection of statistical metrics on the same column across the K past executions, and propose a principled optimization framework that auto-programs DQ constraints by judiciously selecting the set of metrics most suitable for each column, with statistical guarantees on precision while maximizing expected recall. Experimental results show that our method outperforms most baselines in terms of accuracy, efficiency, and explainability.
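To make the core idea concrete, the following is a minimal sketch of validating a new run against the history of one column. It is not the method from the talk: it uses a fixed set of three metrics rather than an optimized selection, and it swaps in a plain Chebyshev bound as the source of the false-alarm guarantee. The function names (`column_metrics`, `learn_bounds`, `validate`) and the `false_positive_budget` parameter are hypothetical, chosen only for illustration.

```python
import math
import statistics


def column_metrics(values):
    """Compute a few simple statistics for one column in one pipeline run."""
    n = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "row_count": float(n),
        "null_fraction": (n - len(non_null)) / n if n else 0.0,
        "distinct_count": float(len(set(non_null))),
    }


def learn_bounds(history, false_positive_budget=0.05):
    """Derive a [low, high] constraint per metric from K past runs.

    Assumption: Chebyshev's inequality, P(|X - mean| >= k * std) <= 1 / k**2,
    so k = 1 / sqrt(budget) caps the per-metric false-alarm rate at the budget.
    The mean and std are estimated from history, so the cap is approximate.
    """
    k = 1.0 / math.sqrt(false_positive_budget)
    bounds = {}
    for name in history[0]:
        series = [run[name] for run in history]
        mean = statistics.fmean(series)
        std = statistics.pstdev(series)
        bounds[name] = (mean - k * std, mean + k * std)
    return bounds


def validate(new_metrics, bounds):
    """Return the names of metrics whose value in the new run violates its bounds."""
    return [
        name
        for name, (low, high) in bounds.items()
        if not low <= new_metrics[name] <= high
    ]
```

In use, `column_metrics` would be called on the same column for each of the K past runs, `learn_bounds` on the resulting list, and `validate` on the metrics of the (K+1)-th run; a non-empty result names the metrics that look anomalous, which also serves as a simple explanation of the alert.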
