The configuration for DataCleaner is represented by the class DataCleanerConfiguration (previously 'AnalyzerBeansConfiguration'). You need a DataCleanerConfiguration instance for most of the operations that follow.
The most convenient way to acquire a DataCleanerConfiguration instance is usually to load it from a file, typically named conf.xml (see the Configuration file chapter for more details on this file format). To load the file, use the JaxbConfigurationReader class, like this:
InputStream inputStream = new FileInputStream("conf.xml");
JaxbConfigurationReader configurationReader = new JaxbConfigurationReader();
DataCleanerConfiguration configuration = configurationReader.read(inputStream);
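The snippet above opens a FileInputStream but never closes it. One way to handle that is try-with-resources, sketched below using only the Java standard library. The names ConfigFiles, StreamReader and load are hypothetical helpers introduced for this illustration and are not part of DataCleaner. Whether JaxbConfigurationReader closes the stream itself is not stated here, so the sketch closes it explicitly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of DataCleaner.
final class ConfigFiles {

    /** Stand-in for any reader that turns an InputStream into a value. */
    interface StreamReader<T> {
        T read(InputStream in) throws IOException;
    }

    /** Opens the file, hands the stream to the reader, and always closes the stream. */
    static <T> T load(Path file, StreamReader<T> reader) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return reader.read(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary file so the sketch runs without DataCleaner on the classpath.
        Path file = Files.createTempFile("conf", ".xml");
        try {
            Files.write(file, "<configuration/>".getBytes(StandardCharsets.UTF_8));
            String content = load(file, in -> new String(in.readAllBytes(), StandardCharsets.UTF_8));
            System.out.println(content);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

In real code the lambda would be replaced by a call to the reader from the snippet above, for example `in -> configurationReader.read(in)`, assuming read(InputStream) declares no checked exception other than IOException.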
Alternatively, you can build the configuration programmatically. This is typically more cumbersome, but it is useful when the configuration has to be built dynamically, for instance from values that are only known at runtime.
Here's an example where we configure DataCleaner with two example datastores and a thread pool of 10 threads:
// Two example datastores: a CSV file and a JDBC database
Datastore datastore1 = new CsvDatastore("my CSV file", "some_data.csv");
boolean multipleConnections = true;
Datastore datastore2 = new JdbcDatastore("my database", "jdbc:vendor://localhost/database", "com.database.Driver", "username", "password", multipleConnections);

// Each replace(...) result is assigned back to 'configuration'
DataCleanerConfigurationImpl configuration = new DataCleanerConfigurationImpl();
configuration = configuration.replace(new MultiThreadedTaskRunner(10));
configuration = configuration.replace(new DatastoreCatalogImpl(datastore1, datastore2));
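Each replace call in the example assigns its result back to 'configuration', which suggests that the call returns a new configuration object instead of modifying the existing one. The sketch below illustrates that pattern with a stand-in class. SketchConfig, replaceThreads and replaceDatastores are hypothetical names with simplified fields, not DataCleaner classes or methods. It shows why dropping the assignment would silently lose the change.

```java
import java.util.List;

// Hypothetical stand-in for an immutable configuration; not a DataCleaner class.
final class SketchConfig {
    private final int threads;
    private final List<String> datastoreNames;

    /** Default configuration: one thread, no datastores. */
    SketchConfig() {
        this(1, List.of());
    }

    private SketchConfig(int threads, List<String> datastoreNames) {
        this.threads = threads;
        this.datastoreNames = List.copyOf(datastoreNames);
    }

    /** Returns a copy with a different thread count; this instance is unchanged. */
    SketchConfig replaceThreads(int newThreads) {
        return new SketchConfig(newThreads, datastoreNames);
    }

    /** Returns a copy with a different set of datastore names; this instance is unchanged. */
    SketchConfig replaceDatastores(String... names) {
        return new SketchConfig(threads, List.of(names));
    }

    int threads() {
        return threads;
    }

    List<String> datastoreNames() {
        return datastoreNames;
    }

    public static void main(String[] args) {
        SketchConfig configuration = new SketchConfig();

        // Result discarded: 'configuration' still holds the default thread count.
        configuration.replaceThreads(10);
        System.out.println(configuration.threads());

        // Result assigned back: the variable now refers to the updated copies.
        configuration = configuration.replaceThreads(10);
        configuration = configuration.replaceDatastores("my CSV file", "my database");
        System.out.println(configuration.threads() + " " + configuration.datastoreNames());
    }
}
```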
Either way, we now have a DataCleanerConfiguration in a variable named 'configuration'. We can then proceed to define the job to run.