Ihab F. Ilyas

Data Cleaning


University of Waterloo

       Xu Chu

       Georgia Institute of Technology

       ACM Books #28

      Copyright © 2019 by Association for Computing Machinery

      All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher.

      Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the Association for Computing Machinery is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

       Data Cleaning

      Ihab F. Ilyas

      Xu Chu

       http://books.acm.org

      ISBN: 978-1-4503-7152-0 hardcover

      ISBN: 978-1-4503-7153-7 paperback

      ISBN: 978-1-4503-7154-4 ePub

      ISBN: 978-1-4503-7155-1 eBook

      Series ISSN: 2374-6769 print 2374-6777 electronic

      DOIs:

      10.1145/3310205 Book

      10.1145/3310205.3310206 Preface

      10.1145/3310205.3310207 Chapter 1

      10.1145/3310205.3310208 Chapter 2

      10.1145/3310205.3310209 Chapter 3

      10.1145/3310205.3310210 Chapter 4

      10.1145/3310205.3310211 Chapter 5

      10.1145/3310205.3310212 Chapter 6

      10.1145/3310205.3310213 Chapter 7

      10.1145/3310205.3310214 Chapter 8

      10.1145/3310205.3310215 References/Index/Bios

      A publication in the ACM Books series, #28

      Editor in Chief: M. Tamer Özsu, University of Waterloo

      This book was typeset in Arnhem Pro 10/14 and Flama using ZzTeX.

      Cover photo: Jason Dorfman MIT / CSAIL

      First Edition

       To my family: Francis, Aida, Mirette, Andrew and Marina

       To my wife Jianmei and my daughter Hannah

      Contents

       Preface

       Figure and Table Credits

       Chapter 1 Introduction

       1.1 Data Cleaning Workflow

       1.2 Book Scope

       Chapter 2 Outlier Detection

       2.1 A Taxonomy of Outlier Detection Methods

       2.2 Statistics-Based Outlier Detection

       2.3 Distance-Based Outlier Detection

       2.4 Model-Based Outlier Detection

       2.5 Outlier Detection in High-Dimensional Data

       2.6 Conclusion

       Chapter 3 Data Deduplication

       3.1 Similarity Metrics

       3.2 Predicting Duplicate Pairs

       3.3 Clustering

       3.4 Blocking for Deduplication

       3.5 Distributed Data Deduplication

       3.6 Record Fusion and Entity Consolidation

       3.7 Human-Involved Data Deduplication

       3.8 Data Deduplication Tools

       3.9 Conclusion

       Chapter 4 Data Transformation

       4.1 Syntactic Data Transformations

       4.2 Semantic Data Transformations

       4.3 ETL Tools

       4.4 Conclusion

       Chapter 5 Data Quality Rule Definition and Discovery

       5.1 Functional Dependencies

       5.2 Conditional Functional Dependencies

       5.3 Denial Constraints

       5.4 Other Types of Constraints

       5.5 Conclusion

       Chapter 6 Rule-Based Data Cleaning

       6.1 Violation Detection

       6.2 Error Repair

       6.3 Conclusion

       Chapter 7 Machine Learning and Probabilistic Data Cleaning

       7.1 Machine