Go to the Spark installation directory to run the job, using the following command-line template:
bin/spark-submit --class org.datacleaner.spark.Main --master yarn-cluster /path/to/DataCleaner-spark.jar /path/to/conf.xml /path/to/job_file.analysis.xml [/path/to/custom_properties.properties]
A convenient way to organize the invocation is in a shell script like the one below, where every individual argument can be edited line by line:
#!/bin/sh
SPARK_HOME=/path/to/apache-spark
SPARK_MASTER=yarn-cluster
DC_PRIMARY_JAR=/path/to/DataCleaner-spark.jar
DC_EXTENSION_JARS=/path/to/extension1.jar,/path/to/extension2.jar
DC_CONF_FILE=hdfs:///path/to/conf.xml
DC_JOB_FILE=hdfs:///path/to/job_file.analysis.xml
DC_PROPS=hdfs:///path/to/custom_properties.properties

DC_COMMAND="$SPARK_HOME/bin/spark-submit"
DC_COMMAND="$DC_COMMAND --class org.datacleaner.spark.Main"
DC_COMMAND="$DC_COMMAND --master $SPARK_MASTER"

echo "Using DataCleaner executable: $DC_PRIMARY_JAR"
if [ "$DC_EXTENSION_JARS" != "" ]; then
  echo "Adding extensions: $DC_EXTENSION_JARS"
  DC_COMMAND="$DC_COMMAND --jars $DC_EXTENSION_JARS"
fi
DC_COMMAND="$DC_COMMAND $DC_PRIMARY_JAR $DC_CONF_FILE $DC_JOB_FILE $DC_PROPS"

echo "Submitting DataCleaner job $DC_JOB_FILE to Spark $SPARK_MASTER"
$DC_COMMAND
The example makes it clear that there are a few more parameters involved in invoking the job. Let's go through them:
SPARK_MASTER represents the location of the driver program; see the section on the Hadoop deployment overview.
DC_EXTENSION_JARS allows you to add extra JAR files containing DataCleaner extensions.
DC_PROPS is perhaps the most important one. It lets you supply a .properties file which can be used for a number of things:
The special property datacleaner.result.hdfs.path, which specifies the filename (on HDFS) where the analysis result (.analysis.result.dat) file is stored. It defaults to /datacleaner/results/[job name]-[timestamp].analysis.result.dat
The special property datacleaner.result.hdfs.enabled, which can be either 'true' (default) or 'false'. Setting this property to 'false' disables result gathering completely for the DataCleaner job, which gives a significant increase in performance, but means no analyzer results are gathered or written. This is therefore only relevant for ETL-style jobs whose purpose is to create, insert, update or delete records in other datastores or files.
Properties to override configuration defaults.
Properties to set job variables/parameters.
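Putting these options together, a custom_properties.properties file might look like the sketch below. The two datacleaner.result.hdfs.* keys are the special properties described above; the remaining keys are hypothetical illustrations, since the actual override and variable names depend on your conf.xml and job file:

```properties
# Store the analysis result at an explicit HDFS location instead of the
# default /datacleaner/results/[job name]-[timestamp].analysis.result.dat
datacleaner.result.hdfs.path=/datacleaner/results/my_job.analysis.result.dat

# Keep result gathering enabled; set to 'false' for ETL-style jobs
# that only write to other datastores or files
datacleaner.result.hdfs.enabled=true

# Hypothetical configuration override (the actual key depends on your conf.xml)
datastoreCatalog.mydatastore.filename=hdfs:///path/to/input.csv

# Hypothetical job variable/parameter (the actual name depends on your job file)
my.job.variable=some-value
```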