Dezhan Tu, University of California, Los Angeles
Data quality (DQ) issues often creep into recurring pipelines and negatively affect downstream consumers, which makes data validation essential. Current state-of-the-art data-validation tools, such as Google TFDV and Amazon Deequ, largely rely on users to manually program DQ constraints. In this work, we propose Auto-Validate-by-History. Our key insight is that historical data from the past K executions of the same pipeline gives a strong signal as to whether the (K+1)-th execution is anomalous. We compute a large collection of statistical metrics on the same column across the K past executions, and propose a principled optimization framework that auto-programs DQ constraints by selecting the set of metrics most suitable for each column, with statistical guarantees on precision, while maximizing expected recall. Experimental results show that our method outperforms most baselines in terms of accuracy, efficiency, and explainability.
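The core idea of validating the (K+1)-th execution against the K before it can be illustrated with a minimal sketch. This is not the optimization framework proposed in the paper: it checks a single metric instead of selecting among many, and it uses Chebyshev's inequality as a stand-in for the precision guarantee. The function names, the `max_false_positive_rate` parameter, and the row counts are all hypothetical, and estimating the mean and standard deviation from only K samples makes the bound approximate.

```python
import math
import statistics


def chebyshev_bounds(history, max_false_positive_rate=0.05):
    """Derive an acceptance interval for one metric from its K past values.

    Chebyshev's inequality gives P(|X - mu| >= k * sigma) <= 1 / k**2 for any
    distribution, so choosing k = 1 / sqrt(rate) caps the false-alarm rate.
    Assumption: the sample mean and stdev stand in for the true mu and sigma.
    """
    if len(history) < 2:
        raise ValueError("need at least two past executions")
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    k = 1.0 / math.sqrt(max_false_positive_rate)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history, new_value, max_false_positive_rate=0.05):
    """Flag the newest execution if its metric falls outside the interval."""
    low, high = chebyshev_bounds(history, max_false_positive_rate)
    return not (low <= new_value <= high)


# Hypothetical row counts of one column over K = 5 past executions.
row_counts = [1000, 1020, 980, 1010, 990]

normal_run = is_anomalous(row_counts, 1005)   # close to the historical mean
broken_run = is_anomalous(row_counts, 400)    # a sharp drop in row count
```

A tighter `max_false_positive_rate` widens the interval, trading recall for fewer false alarms; the framework described above makes this trade-off across many metrics at once rather than for one metric in isolation.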