University of Waterloo
Xu Chu
Georgia Institute of Technology
ACM Books #28
Copyright © 2019 by Association for Computing Machinery
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the Association for Computing Machinery is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
Data Cleaning
Ihab F. Ilyas
Xu Chu
ISBN: 978-1-4503-7152-0 hardcover
ISBN: 978-1-4503-7153-7 paperback
ISBN: 978-1-4503-7154-4 ePub
ISBN: 978-1-4503-7155-1 eBook
Series ISSN: 2374-6769 print 2374-6777 electronic
DOIs:
10.1145/3310205 Book
10.1145/3310205.3310206 Preface
10.1145/3310205.3310207 Chapter 1
10.1145/3310205.3310208 Chapter 2
10.1145/3310205.3310209 Chapter 3
10.1145/3310205.3310210 Chapter 4
10.1145/3310205.3310211 Chapter 5
10.1145/3310205.3310212 Chapter 6
10.1145/3310205.3310213 Chapter 7
10.1145/3310205.3310214 Chapter 8
10.1145/3310205.3310215 References/Index/Bios
A publication in the ACM Books series, #28
Editor in Chief: M. Tamer Özsu, University of Waterloo
This book was typeset in Arnhem Pro 10/14 and Flama using ZzTEX.
Cover photo: Jason Dorfman MIT / CSAIL
First Edition
To my family: Francis, Aida, Mirette, Andrew and Marina
To my wife Jianmei and my daughter Hannah
Contents
2.1 A Taxonomy of Outlier Detection Methods
2.2 Statistics-Based Outlier Detection
2.3 Distance-Based Outlier Detection
2.4 Model-Based Outlier Detection
2.5 Outlier Detection in High-Dimensional Data
3.2 Predicting Duplicate Pairs
3.3 Clustering
3.4 Blocking for Deduplication
3.5 Distributed Data Deduplication
3.6 Record Fusion and Entity Consolidation
3.7 Human-Involved Data Deduplication
3.8 Data Deduplication Tools
3.9 Conclusion
4.1 Syntactic Data Transformations
4.2 Semantic Data Transformations
4.3 ETL Tools
4.4 Conclusion
Chapter 5 Data Quality Rule Definition and Discovery
5.1 Functional Dependencies
5.2 Conditional Functional Dependencies
5.3 Denial Constraints
5.4 Other Types of Constraints
5.5 Conclusion
Chapter 6 Rule-Based Data Cleaning
6.1 Violation Detection
6.2 Error Repair
6.3 Conclusion
Chapter 7 Machine Learning and Probabilistic Data Cleaning
7.1 Machine