Datastores can be configured in the configuration file under the element <datastore-catalog>. The following sections will go into further details with particular types of datastores.
Here are a few examples of common database types.
The DataCleaner User Interface makes it a lot easier to figure out the url (connection string) and driver class part of the connection properties. It's a good place to start if you don't know these properties already.
MySQL
<jdbc-datastore name="MySQL datastore">
<url>jdbc:mysql://hostname:3306/database?defaultFetchSize=-2147483648</url>
<driver>com.mysql.jdbc.Driver</driver>
<username>username</username>
<password>password</password>
<multiple-connections>true</multiple-connections>
</jdbc-datastore>
Oracle
<jdbc-datastore name="Oracle datastore">
<url>jdbc:oracle:thin:@hostname:1521:sid</url>
<driver>oracle.jdbc.OracleDriver</driver>
<username>username</username>
<password>password</password>
<multiple-connections>true</multiple-connections>
</jdbc-datastore>
Microsoft SQL Server
A typical connection to Microsoft SQL server will look like this:
<jdbc-datastore name="MS SQL Server datastore">
<url>jdbc:jtds:sqlserver://hostname/database;useUnicode=true;characterEncoding=UTF-8</url>
<driver>net.sourceforge.jtds.jdbc.Driver</driver>
<username>username</username>
<password>password</password>
<multiple-connections>true</multiple-connections>
</jdbc-datastore>
However, if you want to use an instance name based connection, then the SQL Server Browser service MUST BE RUNNING and then you can include the instance parameter: Here's an example for connecting to a SQLEXPRESS instance:
<url>jdbc:jtds:sqlserver://hostname/database;instance=SQLEXPRESS;useUnicode=true;characterEncoding=UTF-8</url>
This is an example of a CSV file datastore
<csv-datastore name="my_csv_file"> <filename>/path/to/file.csv</filename> <quote-char>"</quote-char> <separator-char>;</separator-char> <encoding>UTF-8</encoding> <fail-on-inconsistencies>true</fail-on-inconsistencies> <header-line-number>1</header-line-number> </csv-datastore>
Files with fixed width values can be registered in two ways - either with a single fixed-width size for all columns, or with individual value-widths.
Here's an example with a fixed width specification for all columns:
<fixed-width-datastore name="FIXED-WIDTH-ALL-COLUMNS"> <filename>/path/to/the/file.txt</filename> <width-specification> <fixed-value-width>20</fixed-value-width> </width-specification> <encoding>UTF-8</encoding> <header-line-number>1</header-line-number> <fail-on-inconsistencies>true</fail-on-inconsistencies> <skip-ebcdic-header>false</skip-ebcdic-header> <eol-present>true</eol-present> </fixed-width-datastore>
Here's an example with individual (2 columns) width specifications:
<fixed-width-datastore name="FIXED-WIDTH-2-COLUMNS"> <filename>/path/to/the/file.txt</filename> <width-specification> <value-width>20</value-width> <value-width>30</value-width> </width-specification> <encoding>UTF-8</encoding> <header-line-number>1</header-line-number> <fail-on-inconsistencies>true</fail-on-inconsistencies> <skip-ebcdic-header>false</skip-ebcdic-header> <eol-present>true</eol-present> </fixed-width-datastore>
Here's an example with an EBCDIC file:
<fixed-width-datastore name="FIXED-WIDTH-EBCDIC"> <filename>/path/to/the/file.ebc</filename> <width-specification> <value-width>2</value-width> <value-width>10</value-width> </width-specification> <encoding>IBM01148</encoding> <header-line-number>0</header-line-number> <fail-on-inconsistencies>true</fail-on-inconsistencies> <skip-ebcdic-header>true</skip-ebcdic-header> <eol-present>false</eol-present> </fixed-width-datastore>
This is an example of an Excel spreadsheet datastore
<excel-datastore name="my_excel_spreadsheet"> <filename>/path/to/file.xls</filename> </excel-datastore>
Defining XML datastores can be done in both a simple (automatically mapped) way, or an advanced (and more performant and memory effective way).
The simple way is just to define a xml-datastore with a filename, like this:
<xml-datastore name="my_xml_datastore"> <filename>/path/to/file.xml</filename> </xml-datastore>
This kind of XML datastore works find when the file size is small and the hierarchy is not too complex. The downside to it is that it tries to automatically detect a table structure that is fitting to represent the XML contents (which is a tree structure, not really a table).
To get around this problem you can also define your own table structure in which you specify the XPaths that make up your rows and the values within your rows. Here's an example:
<xml-datastore name="my_xml_datastore"> <filename>/path/to/file.xml</filename> <table-def> <rowXpath>/greetings/greeting</rowXpath> <valueXpath>/greetings/greeting/how</valueXpath> <valueXpath>/greetings/greeting/what</valueXpath> </table-def> </xml-datastore>
The datastore defines a single table, where each record is defined as the element which matches the XPath "/greetings/greeting". The table has two columns, which are represented by the "how" and "what" elements that are child elements to the row's path.
For more details on the XPath expressions that define the table model of XML datastores, please refer to MetaModel's tutorial on the topic (MetaModel is the data access library used to read data in DataCleaner).
This is an example of an ElasticSearch index datastore
<elasticsearch-datastore name="my_elastic_search_index"> <hostname>localhost</hostname> <port>9300</port> <cluster-name>my_es_cluster</cluster-name> <index-name>my_index</index-name> </elasticsearch-datastore>
This is an example of a fully specified MongoDB datastore, with an example table structure based on two collections.
<mongodb-datastore name="my_mongodb_datastore"> <hostname>localhost</hostname> <port>27017</port> <database-name>my_database</database-name> <username>user</username> <password>pass</password> <table-def> <collection>company_collection</collection> <property> <name>company_name</name> <type>VARCHAR</type> </property> <property> <name>customer</name> <type>BOOLEAN</type> </property> <property> <name>num_employees</name> <type>INTEGER</type> </property> <property> <name>address_details</name> <type>MAP</type> </property> </table-def> <table-def> <collection>person_collection</collection> <property> <name>person_name</name> <type>VARCHAR</type> </property> <property> <name>birthdate</name> <type>DATE</type> </property> <property> <name>emails</name> <type>LIST</type> </property> </table-def> </mongodb-datastore>
If the hostname and port elements are left out, localhost:27017 will be assumed.
If the username and password elements are left out, an anonymous connection will be made.
If there are no table-def elements, the database will be inspected and table definitions will be auto-detected based on the first 1000 documents of each collection.
This is an example of a fully specified CouchDB datastore, with an example table structure based on two CouchDB databases.
<couchdb-datastore name="my_couchdb_datastore"> <hostname>localhost</hostname> <port>5984</port> <username>user</username> <password>pass</password> <ssl>true</ssl> <table-def> <database>company_collection</database> <field> <name>company_name</name> <type>VARCHAR</type> </field> <field> <name>customer</name> <type>BOOLEAN</type> </field> <field> <name>num_employees</name> <type>INTEGER</type> </field> <field> <name>address_details</name> <type>MAP</type> </field> </table-def> <table-def> <database>person_collection</database> <field> <name>person_name</name> <type>VARCHAR</type> </field> <field> <name>birthdate</name> <type>DATE</type> </field> <field> <name>emails</name> <type>LIST</type> </field> </table-def> </couchdb-datastore>
If the hostname and port elements are left out, localhost:5984 will be assumed.
If the username and password elements are left out, an anonymous connection will be made.
If the "ssl" element is false or left out, a regular HTTP connection will be used.
If there are no table-def elements, the database will be inspected and table definitions will be auto-detected based on the first 1000 documents of each database.