Datastores

Datastores can be configured in the configuration file under the element <datastore-catalog>. The following sections will go into further details with particular types of datastores.

Database (JDBC) connections

Here are a few examples of common database types.

Tip

The DataCleaner User Interface makes it a lot easier to figure out the url (connection string) and driver class part of the connection properties. It's a good place to start if you don't know these properties already.

MySQL

				<jdbc-datastore name="MySQL datastore">
				 <url>jdbc:mysql://hostname:3306/database?defaultFetchSize=-2147483648</url>
				 <driver>com.mysql.jdbc.Driver</driver>
				 <username>username</username>
				 <password>password</password>
				 <multiple-connections>true</multiple-connections>
				</jdbc-datastore> 

Oracle

				<jdbc-datastore name="Oracle datastore">
				 <url>jdbc:oracle:thin:@hostname:1521:sid</url>
				 <driver>oracle.jdbc.OracleDriver</driver>
				 <username>username</username>
				 <password>password</password>
				 <multiple-connections>true</multiple-connections>
				</jdbc-datastore> 

Microsoft SQL Server

A typical connection to Microsoft SQL server will look like this:

				<jdbc-datastore name="MS SQL Server datastore">
				 <url>jdbc:jtds:sqlserver://hostname/database;useUnicode=true;characterEncoding=UTF-8</url>
				 <driver>net.sourceforge.jtds.jdbc.Driver</driver>
				 <username>username</username>
				 <password>password</password>
				 <multiple-connections>true</multiple-connections>
				</jdbc-datastore> 

However, if you want to use an instance name based connection, then the SQL Server Browser service MUST BE RUNNING and then you can include the instance parameter: Here's an example for connecting to a SQLEXPRESS instance:

				 <url>jdbc:jtds:sqlserver://hostname/database;instance=SQLEXPRESS;useUnicode=true;characterEncoding=UTF-8</url>
			

Comma-Separated Values (CSV) files

This is an example of a CSV file datastore

				<csv-datastore name="my_csv_file">
				 <filename>/path/to/file.csv</filename>
				 <quote-char>"</quote-char>
				 <separator-char>;</separator-char>
				 <encoding>UTF-8</encoding>
				 <fail-on-inconsistencies>true</fail-on-inconsistencies>
				 <header-line-number>1</header-line-number>
				</csv-datastore> 

Fixed width value files

Files with fixed width values can be registered in two ways - either with a single fixed-width size for all columns, or with individual value-widths.

Here's an example with a fixed width specification for all columns:

				<fixed-width-datastore name="FIXED-WIDTH-ALL-COLUMNS">
				 <filename>/path/to/the/file.txt</filename>
				 <width-specification>
				  <fixed-value-width>20</fixed-value-width>
				 </width-specification>
				 <encoding>UTF-8</encoding>
				 <header-line-number>1</header-line-number>
				 <fail-on-inconsistencies>true</fail-on-inconsistencies>
				 <skip-ebcdic-header>false</skip-ebcdic-header>
				 <eol-present>true</eol-present>
				</fixed-width-datastore>
			

Here's an example with individual (2 columns) width specifications:

				<fixed-width-datastore name="FIXED-WIDTH-2-COLUMNS">
				 <filename>/path/to/the/file.txt</filename>
				 <width-specification>
				  <value-width>20</value-width>
				  <value-width>30</value-width>
				 </width-specification>
				 <encoding>UTF-8</encoding>
				 <header-line-number>1</header-line-number>
				 <fail-on-inconsistencies>true</fail-on-inconsistencies>
				 <skip-ebcdic-header>false</skip-ebcdic-header>
				 <eol-present>true</eol-present>
				</fixed-width-datastore>
			

Here's an example with an EBCDIC file:

				<fixed-width-datastore name="FIXED-WIDTH-EBCDIC">
				 <filename>/path/to/the/file.ebc</filename>
				 <width-specification>
				  <value-width>2</value-width>
				  <value-width>10</value-width>
				 </width-specification>
				 <encoding>IBM01148</encoding>
				 <header-line-number>0</header-line-number>
				 <fail-on-inconsistencies>true</fail-on-inconsistencies>
				 <skip-ebcdic-header>true</skip-ebcdic-header>
				 <eol-present>false</eol-present>
				</fixed-width-datastore>
			

Excel spreadsheets

This is an example of an Excel spreadsheet datastore

				<excel-datastore name="my_excel_spreadsheet">
				 <filename>/path/to/file.xls</filename>
				</excel-datastore> 

XML file datastores

Defining XML datastores can be done in both a simple (automatically mapped) way, or an advanced (and more performant and memory effective way).

The simple way is just to define a xml-datastore with a filename, like this:

				<xml-datastore name="my_xml_datastore">
				 <filename>/path/to/file.xml</filename>
				</xml-datastore> 

This kind of XML datastore works find when the file size is small and the hierarchy is not too complex. The downside to it is that it tries to automatically detect a table structure that is fitting to represent the XML contents (which is a tree structure, not really a table).

To get around this problem you can also define your own table structure in which you specify the XPaths that make up your rows and the values within your rows. Here's an example:

				<xml-datastore name="my_xml_datastore">
				 <filename>/path/to/file.xml</filename>
				 <table-def>
				   <rowXpath>/greetings/greeting</rowXpath>
				   <valueXpath>/greetings/greeting/how</valueXpath>
				   <valueXpath>/greetings/greeting/what</valueXpath>
				 </table-def>
				</xml-datastore> 

The datastore defines a single table, where each record is defined as the element which matches the XPath "/greetings/greeting". The table has two columns, which are represented by the "how" and "what" elements that are child elements to the row's path.

For more details on the XPath expressions that define the table model of XML datastores, please refer to MetaModel's tutorial on the topic (MetaModel is the data access library used to read data in DataCleaner).

ElasticSearch index

This is an example of an ElasticSearch index datastore

				<elasticsearch-datastore name="my_elastic_search_index">
				 <hostname>localhost</hostname>
				 <port>9300</port>
				 <cluster-name>my_es_cluster</cluster-name>
				 <index-name>my_index</index-name>
				</elasticsearch-datastore> 

MongoDB databases

This is an example of a fully specified MongoDB datastore, with an example table structure based on two collections.

				<mongodb-datastore name="my_mongodb_datastore">
				 <hostname>localhost</hostname>
				 <port>27017</port>
				 <database-name>my_database</database-name>
				 <username>user</username>
				 <password>pass</password>
				 <table-def>
				   <collection>company_collection</collection>
				   <property>
				    <name>company_name</name>
				    <type>VARCHAR</type>
				   </property>
				   <property>
				    <name>customer</name>
				    <type>BOOLEAN</type>
				   </property>
				   <property>
				    <name>num_employees</name>
				    <type>INTEGER</type>
				   </property>
				   <property>
				    <name>address_details</name>
				    <type>MAP</type>
				   </property>
				 </table-def>
				 <table-def>
				   <collection>person_collection</collection>
				   <property>
				    <name>person_name</name>
				    <type>VARCHAR</type>
				   </property>
				   <property>
				    <name>birthdate</name>
				    <type>DATE</type>
				   </property>
				   <property>
				    <name>emails</name>
				    <type>LIST</type>
				   </property>
				 </table-def>
				</mongodb-datastore> 

If the hostname and port elements are left out, localhost:27017 will be assumed.

If the username and password elements are left out, an anonymous connection will be made.

If there are no table-def elements, the database will be inspected and table definitions will be auto-detected based on the first 1000 documents of each collection.

CouchDB databases

This is an example of a fully specified CouchDB datastore, with an example table structure based on two CouchDB databases.

				<couchdb-datastore name="my_couchdb_datastore">
				 <hostname>localhost</hostname>
				 <port>5984</port>
				 <username>user</username>
				 <password>pass</password>
				 <ssl>true</ssl>
				 <table-def>
				   <database>company_collection</database>
				   <field>
				    <name>company_name</name>
				    <type>VARCHAR</type>
				   </field>
				   <field>
				    <name>customer</name>
				    <type>BOOLEAN</type>
				   </field>
				   <field>
				    <name>num_employees</name>
				    <type>INTEGER</type>
				   </field>
				   <field>
				    <name>address_details</name>
				    <type>MAP</type>
				   </field>
				 </table-def>
				 <table-def>
				   <database>person_collection</database>
				   <field>
				    <name>person_name</name>
				    <type>VARCHAR</type>
				   </field>
				   <field>
				    <name>birthdate</name>
				    <type>DATE</type>
				   </field>
				   <field>
				    <name>emails</name>
				    <type>LIST</type>
				   </field>
				 </table-def>
				</couchdb-datastore> 

If the hostname and port elements are left out, localhost:5984 will be assumed.

If the username and password elements are left out, an anonymous connection will be made.

If the "ssl" element is false or left out, a regular HTTP connection will be used.

If there are no table-def elements, the database will be inspected and table definitions will be auto-detected based on the first 1000 documents of each database.

Composite datastore

This is an example of a composite datastore. It contains data from 2 other datastores: Datastore 1 and Datastore 2.

				<composite-datastore name="my composite">
					<datastore-name>Datastore 1</datastore-name>
					<datastore-name>Datastore 2</datastore-name>
				</composite-datastore>