Realtime Fuzzy-Join in Pathway

Part 1: Fuzzy joins: 'errare humanum est'

As the ancient maxim says, 'errare humanum est': to err is human. More than two thousands years later, this lesson is still very accurate in our modern world. Everyone makes mistakes and writing does not escape this fate: the longer the text the more mistakes there will be. However, most mistakes we usually make are small and do not hinder understanding.

Unfortunately, computers, just like accountants, don't like mistakes. Computers cannot cope with mistakes. No matter how small the mistake, the computer will just reject the whole answer and throw an error. You have written your 10-digit password but finished with a lower case 'a' instead of a capital 'A'? The passwords obviously do not match, and you shall enter your password again!

While this zero tolerance policy may make sense for security processes, it can be terrible when users have to enter long texts. For example, accountants may have to enter long logs of transactions by hand, creating many opportunities for mistakes. If those logs have to be compared to other logs (e.g. a log automatically generated by a pay station) then mismatches would appear: 'mr' instead of 'Mr'. Mistakes can also come from the way the data has been collected: using nicknames instead of full names, different email addresses etc. While humans could be able to match those logs despite the mistakes, computers cannot.

Does it mean the computer is helpless in those cases, shifting all the tedious work of matching similar but different entries to human? Fortunately not, several mechanisms exist to assist or even perform the matching, and fuzzy join is one of them: a fuzzy join is process which automatically matches entries from different logs despite not having a perfect matching between their keys.

Fuzzy join in Pathway

Fuzzy join is used to perform a join on datasets when the keys do not match exactly. Simple use cases include matching lower case strings with camelCase strings or matching floats with some precision threshold.

Pathway standard library comes with a powerful smart_fuzzy_join functionality. This tutorial is a showcase of its capabilities. We will develop a Data Application which allows for fuzzy-joining two streams of data against each other, and also for maintaining audit entries and updating results on the fly - here is a sneak preview.

Demo animation

The data

We will be doing the fuzzy-join between two datasets on money transfers’ banking logs. When doing banking or bookkeeping, this operation would be known as reconciliation of two sets of transactions records. One dataset comes in a perfectly organized format - csv, the other dataset consists of 'human written' lines describing the transactions.

Here are samples from the datasets:

Data sourced automatically from a bank feed, in 'standard' CSV format

iddateamountrecipientsenderrecipient_acc_nosender_acc_no
02020-06-048946M. PerezJessie RobertsHU30186000000000000008280573ES2314520000000006226902
12014-08-068529C. BarnardMario MillerES8300590000000002968016PL59879710390000000009681693
22017-01-225048S. CardJames PalettaPL65889200090000000009197250PL46193013890000000009427616
32020-09-157541C. BaxterHector HaleyPL40881800090000000005784046DE84733500000003419377
42019-05-253580L. ProuseRonald AdamsPL44124061590000000008986827SI54028570008259759

The first dataset is sourced automatically from a bank feed. Every few seconds a new batch of transactions is saved to transactions/formatA/batch_timestamp.csv.

Transaction logs entered by hand

iddescription
0Received 8521 € on 2014-08-07 by INTERNATIONAL interest payment from ??? to C. Barnard, recipient acc. no. 000002968016 by BANCO DE MADRID, amount EUR €, flat fee 8 €
1EUR 8944 on 2020-06-06 by INTERNATIONAL transfer credited to 00000000008280573 (M. Perez) by BNP Paribas Securities Services, fee EUR 2, amount EUR 8946.
2Finally got 5M quid on 2017-01-23 by DOMESTIC payment from Sergio Marquina to Bella Ciao, r. acc. 0000000009197250, oryg. amount 5_000_048, fees 5 quid.
33578 EUR am 2019-05-25 von INTERNATIONAL dividend payment by Pathway Inc. an L. Prouse, Empfängerkonto 8986827, Betrag 3580 EUR
4Received 7540 EUR on 2020-09-15. Invoice, recipient C. Baxter, 0000000005784046, amount EUR 7541, fees EUR 1

As you can see, it seems that each entry in the first dataset (data sourced automatically) has a corresponding entry in the other dataset (transaction logs entered by hand). In this example we will use the smart_fuzzy_join function from Pathway's standard library to make sure all is correctly matched.

What are we going to obtain?

We want to obtain a table in which the matchings are expressed, e.g. the entry 0 for the first table corresponds to the entry 1 in the second table. In addition, we will include the confidence, a number expressing how confident we are in the matching.

Code

First things first - imports.

import pandas as pdimport pathway as pw

And now, here come a few lines of code that read two datasets, try to match rows, and report matchings by writing to a csv file.

The data is read from csv files. For the purpose of this demonstration we will simply print a table with matchings found on the data sample presented above. But the code below works also in a production environment. In production:

  • All csv files will be dynamically ingested from these directories in their order of appearance.
  • The output will be updated immediately as new data appears at input.

First, we need our csv files:

%%capture --no-display!wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_1_transactionsA.csv!wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_1_transactionsB.csv
def match_transactions(transactionsA, transactionsB):    matching = pw.ml.smart_table_ops.fuzzy_match_tables(transactionsA, transactionsB)    transactionsA_reconciled = (        pw.Table.empty(left=pw.Pointer, right=pw.Pointer, confidence=float)        .update_rows(transactionsA.select(left=None, right=None, confidence=0.0))        .update_rows(            matching.select(                matching.left, matching.right, confidence=matching.weight            ).with_id(matching.left)        )    )    return transactionsA_reconcileddef reconcile_transactions(    formatA="fuzzy_join_part_1_transactionsA.csv",    formatB="fuzzy_join_part_1_transactionsB.csv",):    transactionsA = pw.csv.read(formatA)    transactionsB = pw.csv.read(formatB)    transactionsA_reconciled = match_transactions(transactionsA, transactionsB)    return transactionsA_reconciledpw.debug.compute_and_print(reconcile_transactions())
   | left | right | confidence
^0 | ^0   | ^1    | 1.0
^1 | ^1   | ^0    | 0.8333333333333333
^2 | ^2   | ^2    | 1.3333333333333333
^3 | ^3   | ^4    | 1.3333333333333333
^4 | ^4   | ^3    | 1.5

Success, all matchings were found!

Super easy, few lines of code and you flawlessly manage datasets in different formats. Hassle-free.

Scaling with Pathway

smart_fuzzy_join is able to handle much bigger datasets. Feel free to test it on your own data or use the full datasets from this tutorial, available in this Google Spreadsheet.

In the tutorial we just printed a matching found on a small data sample. In a dynamic production environment:

  • All csv files will be dynamically ingested from these directories in order of appearance.
  • The output will be updated immediately as new data appears at input.

Conclusion and follow-up tasks

While errors are human and we are unlikely to stop making some, we can free ourselves of the pain of correcting them each time something goes wrong. Sometimes, entries are harder to match and may require help: in that case you can check out our extension and see how we extend our pipeline with an auditor that supervises the process of reconciliation. From now on, you have no excuses for having mismatching logs: 'errare humanum est, perseverare diabolicum'!

If you would like to get some more experience with Pathway, you can try those two challenges:

Challenge 1

Extend the match_transactions function so that, after finding a matching, it extends the first input table (standard csv format) with columns 'fees' and 'currency'.

Challenge 2

Try to augment the datasets so that they are still reasonable but smart_fuzzy_join fails to find all matchings 😉