Launching DataCleaner jobs using Spark

Go to the Spark installation path to run the job. Use the following command line template:

			bin/spark-submit --class org.datacleaner.spark.Main --master yarn-cluster /path/to/DataCleaner-spark.jar \
			  /path/to/conf.xml /path/to/job_file.analysis.xml [/path/to/custom_properties.properties]
		
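
For example, assuming the configuration and job files have been uploaded to HDFS (the paths below are purely illustrative), the invocation could look like this:

			bin/spark-submit --class org.datacleaner.spark.Main --master yarn-cluster /path/to/DataCleaner-spark.jar \
			  hdfs:///datacleaner/conf.xml hdfs:///datacleaner/jobs/customers.analysis.xml
		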

A convenient way to organize the invocation is in a shell script like the one below, where each argument can be edited on its own line:

			#!/bin/sh
			SPARK_HOME=/path/to/apache-spark
			SPARK_MASTER=yarn-cluster
			DC_PRIMARY_JAR=/path/to/DataCleaner-spark.jar
			DC_EXTENSION_JARS=/path/to/extension1.jar,/path/to/extension2.jar
			DC_CONF_FILE=hdfs:///path/to/conf.xml
			DC_JOB_FILE=hdfs:///path/to/job_file.analysis.xml
			DC_PROPS=hdfs:///path/to/custom_properties.properties
			
			DC_COMMAND="$SPARK_HOME/bin/spark-submit"
			DC_COMMAND="$DC_COMMAND --class org.datacleaner.spark.Main"
			DC_COMMAND="$DC_COMMAND --master $SPARK_MASTER"
			
			echo "Using DataCleaner executable: $DC_PRIMARY_JAR"
			if [ "$DC_EXTENSION_JARS" != "" ]; then
			  echo "Adding extensions: $DC_EXTENSION_JARS"
			  DC_COMMAND="$DC_COMMAND --jars $DC_EXTENSION_JARS"
			fi

			DC_COMMAND="$DC_COMMAND $DC_PRIMARY_JAR $DC_CONF_FILE $DC_JOB_FILE $DC_PROPS"
			
			echo "Submitting DataCleaner job $DC_JOB_FILE to Spark $SPARK_MASTER"
			$DC_COMMAND
		
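
Saved as, for example, submit-dc-job.sh (an arbitrary name chosen here for illustration), the script can be made executable and run like any other shell script:

			chmod +x submit-dc-job.sh
			./submit-dc-job.sh
		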

The example shows that there are a few more parameters involved in invoking the job. Let's go through them:

  1. SPARK_MASTER represents the location of the Driver program; see the section on the Hadoop deployment overview.

  2. DC_EXTENSION_JARS allows you to add extra JAR files containing DataCleaner extensions.

  3. DC_PROPS is perhaps the most important one. It allows you to add a .properties file which can be used for a number of things (a sample file is sketched after this list):

    1. The special property datacleaner.result.hdfs.path, which allows you to specify the filename (on HDFS) where the analysis result (.analysis.result.dat) file is stored. It defaults to /datacleaner/results/[job name]-[timestamp].analysis.result.dat

    2. The special property datacleaner.result.hdfs.enabled, which can be either 'true' (the default) or 'false'. Setting it to 'false' disables result gathering for the DataCleaner job completely, which gives a significant increase in performance, but means no analyzer results are gathered or written. This is thus only relevant for ETL-style jobs whose purpose is to create, insert, update or delete records in other datastores or files.

    3. Properties to override configuration defaults.

    4. Properties to set job variables/parameters.
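
To illustrate, a minimal custom_properties.properties file could look like the sketch below. Only the two special properties named above are shown, with illustrative values; entries that override configuration defaults or set job variables would go in the same file, with key names that depend on your configuration and job:

			# Where on HDFS the analysis result file should be written
			datacleaner.result.hdfs.path=/datacleaner/results/customers.analysis.result.dat
			
			# Keep result gathering enabled (set to 'false' for ETL-style jobs)
			datacleaner.result.hdfs.enabled=true
		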