Limitations of the Hadoop interface

While the Hadoop interface for DataCleaner allows DataCleaner jobs to be executed in a distributed fashion on the Hadoop platform, there are a few limitations:

  1. Datastore support

    Currently we support a limited set of source datastores from HDFS. CSV files are the primary source here. We require that files on HDFS are UTF-8 encoded and contain only single-line values (a pre-flight check for both requirements is sketched after this list).

  2. Non-distributable components

    A few components are by nature not distributable. If your job depends on these, DataCleaner will resort to executing the job on a single Spark executor, which may have a significant performance impact.

  3. Hadoop distributions without Namenode

    Some Hadoop distributions (such as MapR) have replaced the concept of a Namenode with something else. This is mostly fine, but it does mean that file paths containing the hostname and port of a Namenode will not work (see the path resolution sketch after this list).
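
The requirements in item 1 can be checked up front before submitting a job. Below is a minimal pre-flight sketch using the standard Hadoop FileSystem API; the namenode URI, file path and quote character are hypothetical placeholders. It fails if the file does not decode as UTF-8, and uses a simple heuristic (an odd number of quote characters on a line) to flag quoted values that span multiple lines.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URI;
  import java.nio.charset.CodingErrorAction;
  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Pre-flight check for a CSV file on HDFS: fails if the file is not valid
  // UTF-8 or if a quoted value appears to span multiple lines.
  public class HdfsCsvPreflight {

      public static void main(String[] args) throws Exception {
          URI hdfsUri = URI.create("hdfs://namenode.example.com:8020"); // hypothetical cluster
          String csvPath = "/datacleaner/input/customers.csv";          // hypothetical file
          char quote = '"';

          Configuration conf = new Configuration();
          try (FileSystem fs = FileSystem.get(hdfsUri, conf);
               BufferedReader reader = new BufferedReader(new InputStreamReader(
                       fs.open(new Path(csvPath)),
                       StandardCharsets.UTF_8.newDecoder()
                               .onMalformedInput(CodingErrorAction.REPORT)        // reject non-UTF-8 bytes
                               .onUnmappableCharacter(CodingErrorAction.REPORT)))) {

              String line;
              long lineNo = 0;
              while ((line = reader.readLine()) != null) {
                  lineNo++;
                  // An odd number of quote characters on one line suggests a quoted
                  // value that continues on the next line, i.e. a multi-line value.
                  long quotes = line.chars().filter(c -> c == quote).count();
                  if (quotes % 2 != 0) {
                      throw new IllegalStateException(
                              "Possible multi-line value starting at line " + lineNo);
                  }
              }
              System.out.println("OK: " + lineNo + " single-line, UTF-8 encoded records");
          }
          // A MalformedInputException thrown while reading means the file is not UTF-8.
      }
  }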
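
Item 3 is typically worked around by not hard-coding the Namenode in file paths at all. The sketch below, assuming a core-site.xml is available on the classpath, resolves a scheme-less path against whatever fs.defaultFS the cluster defines, so the same path works whether the default filesystem is hdfs:// (with a Namenode) or, for example, MapR's maprfs://.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Resolves a file path against the cluster's default filesystem instead of a
  // hard-coded "hdfs://<namenode-host>:<port>/..." URL.
  public class DefaultFsPathExample {

      public static void main(String[] args) throws Exception {
          // Picks up fs.defaultFS from the core-site.xml found on the classpath.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // A path without scheme, host or port (hypothetical location)...
          Path relative = new Path("/datacleaner/input/customers.csv");
          // ...is qualified against the default filesystem at runtime.
          Path qualified = relative.makeQualified(fs.getUri(), fs.getWorkingDirectory());

          System.out.println("Default filesystem: " + fs.getUri());
          System.out.println("Qualified path:     " + qualified);
      }
  }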