In order to work, Apache Spark requires one of the environment variables HADOOP_CONF_DIR or YARN_CONF_DIR to point to a directory containing your Hadoop/YARN configuration files such as core-site.xml, yarn-site.xml etc.
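The NameNode location that HDFS paths are resolved against is normally taken from the fs.defaultFS property in core-site.xml. A minimal, illustrative core-site.xml might look like the sketch below; the hostname and port are placeholders matching the example datastore further down and must be adjusted to your own cluster:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: point fs.defaultFS at your own NameNode -->
<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://bigdatavm:9000</value>
 </property>
</configuration>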
DataCleaner on Hadoop needs a regular DataCleaner configuration file (conf.xml). It's best to upload this to the Hadoop Distributed File System (HDFS). We recommend putting this file into the path /datacleaner/conf.xml. A simple example of a configuration file (conf.xml) with a CSV datastore based on an HDFS file or directory:
<?xml version="1.0" encoding="UTF-8"?>
<configuration xmlns="http://eobjects.org/analyzerbeans/configuration/1.0">
 <datastore-catalog>
  <csv-datastore name="mydata">
   <filename>hdfs://bigdatavm:9000/path/to/data.txt</filename>
   <multiline-values>false</multiline-values>
  </csv-datastore>
 </datastore-catalog>
</configuration>
Notice the filename, which is here specified with scheme, hostname and port:
<filename>hdfs://bigdatavm:9000/path/to/data.txt</filename>
This refers to the Hadoop NameNode's hostname and port.
It can also be specified more implicitly, without the hostname and port:
<filename>hdfs:///path/to/data.txt</filename>
Or even without the scheme, in which case the path is resolved against the default file system (fs.defaultFS) of your Hadoop configuration:
<filename>/path/to/data.txt</filename>
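As mentioned above, the CSV datastore can also be based on an HDFS directory rather than a single file. In that case the filename simply points to the directory; the datastore name and path in this sketch are purely illustrative:

<csv-datastore name="mydata_dir">
 <filename>hdfs:///path/to/data_directory/</filename>
 <multiline-values>false</multiline-values>
</csv-datastore>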
Upload the DataCleaner job you wish to run (a DataCleaner .analysis.xml job file) to HDFS. We recommend putting this file into a path such as /datacleaner/jobs/myjob.analysis.xml. The jobs can be built using the DataCleaner desktop UI, but do ensure that they match the configuration file that is also on HDFS (for example, that they reference datastores by the names defined there).
Example job file based on the above datastore:
<?xml version="1.0" encoding="UTF-8"?>
<job xmlns="http://eobjects.org/analyzerbeans/job/1.0">
 <source>
  <data-context ref="mydata" />
  <columns>
   <column id="col_country" path="country" />
   <column id="col_company" path="company" />
  </columns>
 </source>
 <analysis>
  <analyzer>
   <descriptor ref="Create CSV file"/>
   <properties>
    <property name="File" value="hdfs:///path/to/output.csv"/>
    <property name="Separator char" value=","/>
    <property name="Quote char" value="&quot;"/>
    <property name="Escape char" value="\"/>
    <property name="Include header" value="true"/>
    <property name="Encoding" value="UTF-8"/>
    <property name="Fields" value="[COUNTRY,COMPANY]"/>
    <property name="Overwrite file if exists" value="true"/>
   </properties>
   <input ref="col_country" name="Columns"/>
   <input ref="col_company" name="Columns"/>
  </analyzer>
 </analysis>
</job>
This particular job is very simple - it just copies data from A to B. Notes about the job file contents:
The job refers to mydata, which is the name of the CSV datastore defined in the configuration file.
There is another HDFS file reference used in the "File" property. The filename format is the same as in the configuration file.
If your desktop application has access to the NameNode then you can build this job in the desktop application, save it and run it on Spark. There is nothing particular about this job that makes it runnable on Spark, except that the file references involved are resolvable from the Hadoop nodes.
In the installation of DataCleaner you will find the file 'DataCleaner-spark.jar'.
This jar file contains the core of what is needed to run DataCleaner with Apache Spark on Hadoop. It also contains the standard components of DataCleaner.
Upload this jar file to HDFS in the folder /datacleaner/lib.
Upload your DataCleaner license file to /datacleaner/hi_datacleaner.lic.
Upload any extension jar files that you need (for instance Groovy-DataCleaner.jar) to that same /datacleaner/lib folder.
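To summarize, the recommended HDFS layout described above looks like this (the job file name is just an example):

/datacleaner/conf.xml
/datacleaner/hi_datacleaner.lic
/datacleaner/jobs/myjob.analysis.xml
/datacleaner/lib/DataCleaner-spark.jar
/datacleaner/lib/Groovy-DataCleaner.jar   (optional extension)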