Processing framework

The way DataCleaner processes data differs slightly from most similar (ETL-like) tools: first, in the way multithreading is applied, and second, in the way DataCleaner may optimize the job graph at execution time.

Multithreading: The multithreading strategy in DataCleaner keeps blocking and buffering to a minimum and parallelism, and potentially also distribution, to a maximum. Most ETL-like tools apply a threading strategy where each component in a job has its own thread management as well as an input and an output buffer. In DataCleaner, thread management is arranged so that records are processed in parallel: each unit of work steps through the complete job graph in a single pass. This has a number of interesting traits:

  1. There is a high degree of automatic 'load balancing' among the components - fewer constraints and bottlenecks around the slowest components in the job.

  2. The system lends itself to highly distributed processing because statefulness is the exception rather than the rule.

  3. There is less waste in the form of buffers between the components of a job.

  4. One downside to this approach is that the order of the processed records cannot be guaranteed. This is only rarely required in the domain of data profiling and analysis, and when it is required there are technical workarounds to apply.
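
As a rough sketch of this record-at-a-time model, consider the example below. It is plain Java with made-up Row, Component and RecordAtATimeEngine types rather than DataCleaner's actual engine API, and a linear chain stands in for the full job graph; the point is that each record is submitted as one unit of work that traverses all components in a single pass, with no buffers between them:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative types - not DataCleaner's engine API. A real job is a
    // graph rather than a simple chain; a linear chain keeps the sketch short.
    interface Row {}

    interface Component {
        // Returns the (possibly transformed) row, or null if a filter discards it.
        Row process(Row row);
    }

    class RecordAtATimeEngine {

        private final List<Component> chain;
        private final ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        RecordAtATimeEngine(List<Component> chain) {
            this.chain = chain;
        }

        // Each incoming row becomes one unit of work that steps through the
        // whole job in a single pass - there are no per-component queues or
        // buffers in which records can pile up behind a slow component.
        void dispatch(Row row) {
            executor.submit(() -> {
                Row current = row;
                for (Component component : chain) {
                    current = component.process(current);
                    if (current == null) {
                        return; // filtered out; nothing further to do for this row
                    }
                }
            });
        }
    }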

Graph optimization: While a job graph (see wiring components together) shows a particular flow order, the engine may apply certain optimizations to it at runtime. Some components provide optimization strategies that involve changing the source query, so that the number (or content) of processed records changes. Naturally, such an optimization is only applied in cases where it does not impact other components in the job. The principle is sometimes also referred to as 'Push down optimization'.
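
The following sketch illustrates when such a rewrite is considered safe. The types and names are hypothetical, not DataCleaner's optimizer code; the idea is simply that a push-down is only allowed if every consumer depends on one and the same outcome of the filter:

    import java.util.Optional;

    // Illustrative sketch - the real engine models consumers and outcomes
    // differently. Filters expose their possible outcomes as an enum type E.
    interface OutcomeConsumer<E extends Enum<E>> {
        // The single filter outcome this downstream component (explicitly or
        // implicitly) requires, or empty if it consumes records regardless of
        // the filter's outcome.
        Optional<E> requiredOutcome();
    }

    class PushDownDecision {
        // A source-query rewrite is only safe when every consumer depends on
        // one and the same outcome of the filter; otherwise discarding records
        // at the source would starve components that still need them.
        static <E extends Enum<E>> Optional<E> pushableOutcome(Iterable<OutcomeConsumer<E>> consumers) {
            Optional<E> candidate = Optional.empty();
            for (OutcomeConsumer<E> consumer : consumers) {
                Optional<E> required = consumer.requiredOutcome();
                if (required.isEmpty()) {
                    return Optional.empty(); // this consumer needs all records
                }
                if (candidate.isEmpty()) {
                    candidate = required;
                } else if (!candidate.equals(required)) {
                    return Optional.empty(); // conflicting outcome requirements
                }
            }
            return candidate;
        }
    }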

An example of this is the 'Null check' filter: if a Null check is applied on a source column and all other components require either a NULL or a NOT_NULL outcome (whether explicitly or implicitly), then the 'Null check' filter may add a predicate to the source query to filter out all irrelevant records. For more information on this principle, please read the blog entry 'Push down query optimization in DataCleaner' by Kasper Sørensen.
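
A minimal sketch of the 'Null check' case could look as follows. The class and method names are illustrative, not DataCleaner's actual filter API, and the source query is modelled as a plain SQL string to keep the example self-contained:

    // Illustrative sketch of the 'Null check' push-down - not DataCleaner's
    // actual filter implementation.
    enum NullCheckOutcome { NULL, NOT_NULL }

    class NullCheckFilter {

        private final String columnName;

        NullCheckFilter(String columnName) {
            this.columnName = columnName;
        }

        // Normal row-at-a-time evaluation, used when push-down is not possible.
        NullCheckOutcome categorize(Object value) {
            return value == null ? NullCheckOutcome.NULL : NullCheckOutcome.NOT_NULL;
        }

        // Push-down hook: when every consumer requires a single outcome, the
        // engine can ask the filter to rewrite the source query instead, so
        // that irrelevant records are never read from the datastore at all.
        String optimizeQuery(String sourceQuery, NullCheckOutcome requiredOutcome) {
            String predicate = requiredOutcome == NullCheckOutcome.NOT_NULL
                    ? columnName + " IS NOT NULL"
                    : columnName + " IS NULL";
            String connector = sourceQuery.toUpperCase().contains(" WHERE ") ? " AND " : " WHERE ";
            return sourceQuery + connector + predicate;
        }
    }

If, for instance, every consumer requires the NOT_NULL outcome on an email column, a source query such as SELECT name, email FROM customers would be rewritten to SELECT name, email FROM customers WHERE email IS NOT NULL.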