Storage provider

The storage provider stores temporary data that is used while an analysis job executes. There are two types of storage: large collections of (single) values, and "annotated rows", i.e. rows that have been sampled or marked with a specific category so that the user can inspect them.

To explain the storage provider configuration, let's look at the default element:

			<storage-provider>
			 <combined>
			  <collections-storage>
			   <berkeley-db/>
			  </collections-storage>
			  <row-annotation-storage>
			   <in-memory max-rows-threshold="1000" max-sets-threshold="200"/>
			  </row-annotation-storage>
			 </combined>
			</storage-provider> 

The element defines a combined storage strategy, in which each of the two types of storage is configured separately.

Collections are stored using Berkeley DB (the berkeley-db element), an embedded database from Oracle. This is the recommended strategy for collections.

Row annotations are stored in memory, with a threshold of at most 1000 rows in each of at most 200 sets. This means that if more than 1000 records are annotated with the same category, the additional records will not be saved (and thus will not be viewable by the user). It also means that at most 200 sample sets will be saved. Further annotations will not be sampled, but metrics will still be counted. Most user scenarios will not require more than 1000 annotated records for inspection, but if this is really necessary, a different strategy can be pursued, as described in the following section.
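
The two thresholds are plain attributes on the in-memory element. Read literally, the default values bound the in-memory store at 1000 x 200 = 200,000 annotated rows. As an illustrative sketch for a memory-constrained installation, both limits could be lowered. The values 500 and 100 below are assumptions chosen for the example, not recommendations:

			  <row-annotation-storage>
			   <!-- illustrative values only: at most 500 rows per set, at most 100 sets -->
			   <in-memory max-rows-threshold="500" max-sets-threshold="100"/>
			  </row-annotation-storage>

With these values, at most 500 x 100 = 50,000 annotated rows would be kept in memory, at the cost of fewer rows being available for inspection.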

Using MongoDB for annotated rows

If you have a local MongoDB instance, you can use it as a store for annotated rows. This is what the configuration looks like:

				  <row-annotation-storage>
				   <custom-storage-provider class-name="org.datacleaner.storage.MongoDbStorageProvider"/>
				  </row-annotation-storage> 

The MongoDB storage provider has shown very good performance, but it adds complexity to the installation, which is why it is still considered experimental and recommended only for advanced users.
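
For context, the row-annotation-storage fragment can be read together with the default element shown at the start of this section. The following is a sketch only, assuming the fragment simply takes the place of the in-memory row annotation storage inside the combined element and that a local MongoDB instance is available:

			<storage-provider>
			 <combined>
			  <!-- collections stay on Berkeley DB, the recommended strategy -->
			  <collections-storage>
			   <berkeley-db/>
			  </collections-storage>
			  <!-- annotated rows go to MongoDB instead of the in-memory store -->
			  <row-annotation-storage>
			   <custom-storage-provider class-name="org.datacleaner.storage.MongoDbStorageProvider"/>
			  </row-annotation-storage>
			 </combined>
			</storage-provider>

In this sketch only the row annotation storage changes. The collections storage is left exactly as in the default configuration.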