fuzzy-row-matcher

Framework for finding similar rows in a JDBC source

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

What things you need to install the software and how to install them

Gradle 5.4.1 or greater if you want to build from source
JVM 8+

Installing

Running with Maven

If you're using Maven simply specify the GAV coordinate below and Maven will do the rest

<dependency>
  <groupId>org.bradfordmiller</groupId>
  <artifactId>fuzzy-row-matcher</artifactId>
  <version>1.0.10</version>
</dependency>

Running with SBT

Add this GAV coordinate to your SBT dependency list

libraryDependencies += "org.bradfordmiller" %% "fuzzy-row-matcher" % "1.0.10"

Running with Gradle

Add this GAV coordinate to your Gradle dependencies section

dependencies {
    ...
    ...
    implementation 'org.bradfordmiller:fuzzy-row-matcher:1.0.10'
}

Using the library

API docs were generated using dokka and are hosted here.

Configuation information is stored using the simple-jndi API so a jndi.properties file will need to be present in src/main/resources and correctly configured for this library to work.

Configuring jndi

See jndi.properties sample file in this project. This will need to be configured and dropped into src/main/resources for the API to work. Actual jndi files will be searched and loaded based upon the path in the org.osjava.sj.root property setting. In the example file for this project, you will see the path is src/main/resources/jndi.

Configuring jndi contexts

Jndi property files can be dropped into the org.osjava.sj.root configured path. In our case, that path is src/main/resources/jndi. There is one context that fuzzy-row-matcher can handle: A javax.sql.DataSource

Datasources are used when reading or writing data using a JDBC interface. These concepts will be explained in detail later. You can see a sample jndi context file here. Note the location of the context file is in the directory set in org.osjava.sj.root: src/test/resources/jndi. All jndi context files must be placed under this directory.

Here is a sample DataSource entry for a sql lite database which is used by this projects unit tests (note that username and password are optional and depend on how the security of the database being targeted is configured):

SqliteChinook/type=javax.sql.DataSource  
SqliteChinook/driver=org.sqlite.JDBC  
SqliteChinook/url=jdbc:sqlite:src/test/resources/data/chinook.db  
SqliteChinook/user=  
SqliteChinook/password=

The jndi name in this case is "SqliteChinook". The context is "default_ds" because the name of the property file is "default_ds.properties".

Adding Jndi entries programatically

Use the JNDIUtils class in the deduper library to add jndi entries programatically

Kotlin code for adding a new DataSource jndi entry to the default_ds.properties jndi file:

import org.bradfordmiller.deduper.jndi.JNDIUtils  
...  
JNDIUtils.addJndiConnection(  
	    "BradTestJNDI_23",  
	    "default_ds",  
	     mapOf(  
		    "type" to "javax.sql.DataSource",  
		    "driver" to "org.sqlite.JDBC",  
		    "url" to "jdbc:sqlite:src/test/resources/data/outputData/real_estate.db",  
		    "user" to "test_user",  
		    "password" to "test_password"  
	    )  
    )

Configuring and running a deduper process

The library uses the builder design pattern to construct the configuration to run a fuzzy matching job.

There are a bunch of options which can be configured as part of a fuzzy matching process. Let's start with the basics. Use the Config class to set up the fuzzy matching job. This object will be passed into the FuzzyRowMatcher class as part of the instantiation of the FuzzyRowMatcher class.

The only required inputs to fuzzy-row-matcher are a JDBC souce in the form of a JNDI Connection and at least one algorithm and it's threshold value. The JNDI connection is set up using the SourceJndi class. Here is some Kotlin code which instantiates a SourceJndi object.

import org.bradfordmiller.fuzzyrowmatcher.config.config.SourceJndi
...  
val csvSourceJndi = SourceJndi("RealEstateIn", "default_ds", "Sacramentorealestatetransactions")

In the above case "RealEstateIn" is the jndi name, "default_ds" is the context name (and correlates to "default_ds.properties"), and "Sacramentorealestatetransactions" is the table to be queried.

By default, a "SELECT *" query will be issued against the table ("Sacramentorealestatetransactions" in this case). It is also possible to pass in a query, rather than a table name, like so:

import org.bradfordmiller.fuzzyrowmatcher.config.SourceJndi  
...  
val csvSourceJndi = SourceJndi("RealEstateIn", "default_ds", "SELECT street from Sacramentorealestatetransactions")

Fuzzy Row Matcher is an engine which can detect row similarity within a table or flat file, so by default it will use every value in the row to create a duplicate. The API also accepts a subset of columns in the table on which to "fuzzily match". Here is some Kotlin code which demonstrates this:

import org.bradfordmiller.fuzzyrowmatcher.config.SourceJndi  
...  
val hashColumns = mutableSetOf("street","city", "state", "zip", "price")  
val csvSourceJndi = SourceJndi("RealEstateIn", "default_ds", "Sacramentorealestatetransactions", hashColumns)

Now only the columns specified in the column set will be considered for detecting duplicates.

Complete example with Jaro-Winkler scoring algorithm

import org.bradfordmiller.fuzzyrowmatcher.config.SourceJndi
import org.bradfordmiller.fuzzyrowmatcher.FuzzyRowMatcher
import org.bradfordmiller.fuzzyrowmatcher.config.Config
...

val csvSourceJndi = SourceJndi("RealEstateIn", "default_ds", "Sacramentorealestatetransactions")

val config = 
Config.ConfigBuilder()
    .sourceJndi(csvSourceJndi)
    .applyJaroDistance(98.0)
    .build()

val frm = FuzzyRowMatcher(config)

val result = frm.fuzzyMatch()

println(result)

The output of this run is:

 FuzzyRowMatcherRpt(rowCount=987, comparisonCount=485606, matchCount=7, duplicateCount=0, algos={JaroDistance=AlgoStats(min=63.25954223397656, firstQuartile=73.01287149712347, median=74.83295979588313, thirdQuartile=76.85458437015583, max=100.0, mean=75.04290962954407, stddeviation=3.0175386794224877)})

So this run of the FuzzyMatch found a total of 987 rows, ran a total of 485606 comparisons against the table, and found a total of 7 rows which are similar. Similar in this case means that there were seven rows in the table which passed the 98.0 percent similar threshold of the JaroWinkler distance algorithm.

Also note that each algorithm returns a set of seven statistics for the run:

The minimum threshold found (in the above output, 63.25954223397656)
The first quartile of the threshold found (in the above output, 73.01287149712347)
The median threshold found (in the above output, 74.83295979588313)
The third quartile of the threshold found (in the above output, 76.85458437015583)
The maximum threshold found (in the above output, 100.0)
The mean threshold found (in the above output, 75.04290962954407)
The standard deviation of the threshold found (in the above output, 3.0175386794224877)

If more algorithms are applied, their statistics will also appear in the algos variable. Note also in the above example all columns in each row were concatenated together because a list of columns was not provided in the SourceJndi object instantiation.

List of algorithms availability for fuzzy matching:

Fuzzy Matcher has the following algorithms available:

The results of these algorithms can be combined for the qualification of similarity, or each algorithm can be applied independently. There is a setting in the config for controlling this:

val config =
    Config.ConfigBuilder()
            .sourceJndi(sourceJndi)
            .applyJaroDistance(98.0)
            .applyLevenshtein(5)
            .aggregateScoreResults(true)
            .build()

In this example, the Jaro-Winkler threshold is set to 98 percent, and the Levenshtein algorithm is set to 5. Because aggregateScoreResults is set to true, only rows which pass the threshold of 98 on Jaro and 5 on Levenshtein will be considered "similar". By default, aggregateScoreResults is set to false. In the case where aggregateScoreResults is set to false rows will be considered "similar" if the Jaro threshold of 98 OR a Levenshtein threshold of 5 is met.

Handling duplicates

The Fuzzy Match engine has the ability to ignore duplicates if you are only interested in similar but not exact matches. This is controlled by the

 .ignoreDupes(true)

method on the Config object. If this flag is set to true then the fuzzy comparisons will not be done when an exact match is detected. This can speed up processing if the table you are interrogating has lots of exact matches. The default value for this is false.

String length differences

Another optimization the engine contains is in regards to string length differences. You can tell the engine to ignore strings which differ greatly in their length. The method for this is

 .strLenDeltaPct(50.0)

on the Config object In this case, the engine will not perform similarity comparisons on strings which differ in length by more than 50 percent.

Persisting the score results

Thus far we have only seen examples of the Fuzzy Matcher aggregating information about statistics and matches, now we will see how to persist the results of the Fuzzy Match run. The API has a setting for a target JNDI location as seen here:

.targetJndi(TargetJndi("SqlLiteTest", "default_ds"))

A full example looks like this:

val config =
                Config.ConfigBuilder()
                        .sourceJndi(sourceJndi)
                        .targetJndi(TargetJndi("SqlLiteTest", "default_ds"))
                        .applyJaroDistance(98.0)
                        .applyLevenshtein(5)
                        .aggregateScoreResults(false)
                        .build()

The TargetJndi object accepts a named jndi resource, in this case "SqlLiteTest" and a context, in this case "default_ds". See the Configuring JNDI and Configuring JNDI Contexts sections on how to add JNDI entries to Fuzzy Matcher. Note that as of right, now the only supported output is to SqlLite.

The output created by Fuzzy Matcher consists of two tables and a view. The tables are created with an appended timestamp for convenience so that we can avoid table name collisions.

The templates for the documentation below can be found here

The first table is simply a "jsonified" view of the row being compared, and has a schema which is as follows:

CREATE TABLE json_data_**TIMESTAMP** (
	id	INTEGER NOT NULL,
	json_row	TEXT NOT NULL,
	PRIMARY KEY(id)
);

Again, this table has a unique id and a string which represents the row in a "jsonified" format. Here is an example row:

id: 1 json_row:

{"zip":"95838","baths":"1","city":"SACRAMENTO","sale_date":"Wed May 21 00:00:00 EDT 2008","street":"3526 HIGH ST","price":"59222","latitude":"38.631913","state":"CA","beds":"2","type":"Residential","sq__ft":"836","longitude":"-121.434879"}

The second table contains qualifying "similar" rows and their scores. The schema for the "scores" table is as follows:

CREATE TABLE scores_**TIMESTAMP** (
	id	INTEGER NOT NULL,
	json_data1_row_id	INTEGER NOT NULL,
	json_data2_row_id	INTEGER NOT NULL,
	jaro_dist_score	REAL NULL,
	levenshtein_distance_score	INTEGER NULL,
	hamming_distance_score	INTEGER NULL,
	jaccard_distance_score	REAL NULL,
	cosine_distance_score	REAL NULL,
	fuzzy_similiarity_score	INTEGER NULL,
	FOREIGN KEY(json_data1_row_id) REFERENCES json_data(id),
	FOREIGN KEY(json_data2_row_id) REFERENCES json_data(id),
	PRIMARY KEY(id,json_data1_row_id,json_data2_row_id)
);

This table contains the id of each row being compared (foreign keys link back to the json_data table described earlier), as well as each algorithm scored calculated between the two rows.

Lastly, there is a view which ties everything together for ease of viewing the results:

CREATE VIEW final_scores_**TIMESTAMP** AS
SELECT a.json_row json_row_1, c.json_row json_row_2,
b.jaro_dist_score,
	b.levenshtein_distance_score	,
	b.hamming_distance_score	,
	b.jaccard_distance_score	,
	b.cosine_distance_score	,
	b.fuzzy_similiarity_score
FROM
json_data_**TIMESTAMP** a
INNER JOIN scores_**TIMESTAMP** b ON a.id = b.json_data1_row_id
INNER JOIN json_data_**TIMESTAMP** c ON c.id = b.json_data2_row_id;

This view simply links back the actual json strings as well as the scores for a simpler viewing of the run.

Note that score values will be populated as NULL in the scores table for those algorithms which were not chosen for the particular run.

Future versions of the software will support more database types to output results to (IE, postgres, mysql, oracle, etc)

Fully configured example with all flags being toggled:

val config =
        Config.ConfigBuilder()
            .sourceJndi(sourceJndi)
            .targetJndi(TargetJndi("SqlLiteTest", "default_ds"))
            .applyJaroDistance(98.0)
            .applyLevenshtein(5)
            .applyFuzzyScore(90)
            .applyCosineDistance(30.0)
            .applyHammingDistance(15)
            .applyJaccardDistance(90.0)
            .strLenDeltaPct(50.0)
            .aggregateScoreResults(false)
            .ignoreDupes(true)
            .build()

    val frm = FuzzyRowMatcher(config)
    val result = frm.fuzzyMatch()

Built With

kotlin - The programming language
simple-jndi - source and target configuration management
csvjdbc - Provides a query API on top of csv
apache-commons-text - For string distance and similarity algorithms
apache-commons-math - For statistical calculations

Versioning

For the versions available, see the tags on this repository.

Authors

Bradford Miller - Initial work - bfm

See also the list of contributors who participated in this project.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details

Acknowledgments

Thanks to PurpleBooth for the README template as seen here

bmiller1009 / fuzzy-row-matcher 1.0.10