Chapter 15. Apache Hadoop and Spark interface

Abstract

DataCleaner supports big data processing on the Apache Hadoop platform. This chapter guides you through setting up the environment and running your first DataCleaner job on Hadoop.

Table of Contents

Hadoop deployment overview
Setting up the Spark and DataCleaner environment
Upload configuration file to HDFS
Upload job file to HDFS
Upload executables to HDFS
Launching DataCleaner jobs using Spark
Using Hadoop in DataCleaner desktop
Configuring Hadoop clusters
CSV datastores on HDFS
Limitations of the Hadoop interface