DataCleaner Reference Documentation

Version 5.7.0

Copies of this document may be made for your own use and for distribution to others, provided that you do not charge any fee for such copies, and provided further that each copy contains this Copyright Notice, whether distributed in print or electronically.


Table of Contents

I. Introduction to DataCleaner
1. Background and concepts
What is data quality (DQ)?
What is data profiling?
What is data wrangling?
What is a datastore?
Composite datastore
What is data monitoring?
What is master data management (MDM)?
2. Getting started with DataCleaner desktop
Installing the desktop application
Connecting to your datastore
Adding components to the job
Wiring components together
Transformer output
Filter requirement
Output data streams
Executing jobs
Saving and opening jobs
Template jobs
Writing cleansed data to files
II. Analysis component reference
3. Transform
JavaScript transformer
Invoke child Analysis job
Apply classifier & Apply regression
Equals
Max rows
Not null
Union
4. Improve
Synonym lookup
Table lookup
5. Analyze
Boolean analyzer
Completeness analyzer
Character set distribution
Date gap analyzer
Date/time analyzer
Number analyzer
Pattern finder
Reference data matcher
Referential integrity
String analyzer
Unique key check
Value distribution
Value matcher
Weekday distribution
Machine Learning analyzers
6. Write
Create CSV file
Create Excel spreadsheet
Create staging table
Insert into table
Update table
III. Reference data
7. Dictionaries
8. Synonyms (a.k.a. Synonym catalogs)
Text file synonym catalog
Datastore synonym catalog
9. String patterns
IV. Configuration reference
10. Configuration file
XML schema
Datastores
Database (JDBC) connections
Comma-Separated Values (CSV) files
Fixed width value files
Excel spreadsheets
XML file datastores
ElasticSearch index
MongoDB databases
CouchDB databases
Composite datastore
Reference data
Dictionaries
Synonym catalogs
String patterns
Task runner
Storage provider
11. Analysis job files
XML schema
Source section
12. Logging
Logging configuration file
Default logging configuration
Modifying logging levels
Alternative logging outputs
13. Database drivers
Installing database drivers in DataCleaner desktop
V. Invoking DataCleaner jobs
14. Command-line interface
Executables
Usage scenarios
Executing an analysis job
Listing datastore contents and available components
Parameterizable jobs
Dynamically overriding configuration elements
15. Apache Hadoop and Spark interface
Hadoop deployment overview
Setting up Spark and DataCleaner environment
Upload configuration file to HDFS
Upload job file to HDFS
Upload executables to HDFS
Launching DataCleaner jobs using Spark
Using Hadoop in DataCleaner desktop
Configure Hadoop clusters
CSV datastores on HDFS
Limitations of the Hadoop interface
VI. Third party integrations
16. Pentaho integration
Configure DataCleaner in Pentaho Data Integration
Launch DataCleaner to profile Pentaho Data Integration steps
Run DataCleaner jobs in Pentaho Data Integration
VII. Developer's guide
17. Architecture
Data access
Processing framework
18. Executing jobs through code
Overview of steps and options
Step 1: Configuration
Step 2: Job
Step 3: Execution
Step 4: Result
19. Developer resources
Extension development tutorials
Building DataCleaner
20. Extension packaging
Annotated components
Single JAR file
Extension metadata XML
Component icons
21. Embedding DataCleaner

List of Tables

3.1. JavaScript variables
3.2. JavaScript data types
3.3. Machine learning transformer properties
5.1. Completeness analyzer properties
5.2. Pattern finder properties
5.3. Referential integrity properties
5.4. Unique key check properties
5.5. Value distribution properties