Launching DataCleaner jobs using Spark

Go to the Spark installation path to run the job. Use the following command line template:

			bin/spark-submit --class org.datacleaner.spark.Main --master yarn-cluster /path/to/DataCleaner-spark.jar \
			  /path/to/conf.xml /path/to/job_file.analysis.xml [/path/to/custom_properties.properties]
		
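
For example, assuming the configuration and job files have been uploaded to HDFS (the paths below are purely illustrative), the invocation could look like this:

			bin/spark-submit --class org.datacleaner.spark.Main --master yarn-cluster /path/to/DataCleaner-spark.jar \
			  hdfs:///datacleaner/conf.xml hdfs:///datacleaner/jobs/customers.analysis.xml
		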

A convenient way to organize the invocation is in a shell script like the one below, where each argument can be edited on its own line:

			#!/bin/sh
			SPARK_HOME=/path/to/apache-spark
			SPARK_MASTER=yarn-cluster
			DC_PRIMARY_JAR=/path/to/DataCleaner-spark.jar
			DC_EXTENSION_JARS=/path/to/extension1.jar,/path/to/extension2.jar
			DC_CONF_FILE=hdfs:///path/to/conf.xml
			DC_JOB_FILE=hdfs:///path/to/job_file.analysis.xml
			DC_PROPS=hdfs:///path/to/custom_properties.properties
			
			DC_COMMAND="$SPARK_HOME/bin/spark-submit"
			DC_COMMAND="$DC_COMMAND --class org.datacleaner.spark.Main"
			DC_COMMAND="$DC_COMMAND --master $SPARK_MASTER"
			
			echo "Using DataCleaner executable: $DC_PRIMARY_JAR"
			if [ "$DC_EXTENSION_JARS" != "" ]; then
			  echo "Adding extensions: $DC_EXTENSION_JARS"
			  DC_COMMAND="$DC_COMMAND --jars $DC_EXTENSION_JARS"
			fi

			DC_COMMAND="$DC_COMMAND $DC_PRIMARY_JAR $DC_CONF_FILE $DC_JOB_FILE $DC_PROPS"
			
			echo "Submitting DataCleaner job $DC_JOB_FILE to Spark $SPARK_MASTER"
			$DC_COMMAND
		
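
Saved as, for example, submit-dc-job.sh (an arbitrary name chosen here for illustration), the script can be made executable and run like any other shell script:

			chmod +x submit-dc-job.sh
			./submit-dc-job.sh
		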

The example shows that there are a few more parameters involved in invoking the job. Let's go through them:

  1. SPARK_MASTER represents the location of the Driver program; see the section on the Hadoop deployment overview.

  2. DC_EXTENSION_JARS allows you to add extra JAR files containing DataCleaner extensions.

  3. DC_PROPS is perhaps the most important one. It allows you to add a .properties file which can be used for a number of things (a sample file is sketched after this list):

    1. The special property datacleaner.result.hdfs.path, which allows you to specify the filename (on HDFS) where the analysis result (.analysis.result.dat) file is stored. It defaults to /datacleaner/results/[job name]-[timestamp].analysis.result.dat

    2. The special property datacleaner.result.hdfs.enabled, which can be either 'true' (the default) or 'false'. Setting it to 'false' disables result gathering for the DataCleaner job completely, which gives a significant increase in performance, but means no analyzer results are gathered or written. This is thus only relevant for ETL-style jobs whose purpose is to create, insert, update or delete records in other datastores or files.

    3. Properties to override configuration defaults.

    4. Properties to set job variables/parameters.
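
To illustrate, a minimal custom_properties.properties file could look like the sketch below. Only the two special properties named above are shown, with illustrative values; entries that override configuration defaults or set job variables would go in the same file, with key names that depend on your configuration and job:

			# Where on HDFS the analysis result file should be written
			datacleaner.result.hdfs.path=/datacleaner/results/customers.analysis.result.dat
			
			# Keep result gathering enabled (set to 'false' for ETL-style jobs)
			datacleaner.result.hdfs.enabled=true
		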