As the ancient maxim says, 'errare humanum est': to err is human. More than two thousands years later, this lesson is still very accurate in our modern world. Everyone makes mistakes and writing does not escape this fate: the longer the text the more mistakes there will be. However, most mistakes we usually make are small and do not hinder understanding.
Unfortunately, computers, just like accountants, don't like mistakes. Computers cannot cope with mistakes. No matter how small the mistake, the computer will just reject the whole answer and throw an error. You have written your 10-digit password but finished with a lower case 'a' instead of a capital 'A'? The passwords obviously do not match, and you shall enter your password again!
While this zero tolerance policy may make sense for security processes, it can be terrible when users have to enter long texts. For example, accountants may have to enter long logs of transactions by hand, creating many opportunities for mistakes. If those logs have to be compared to other logs (e.g. a log automatically generated by a pay station) then mismatches would appear: 'mr' instead of 'Mr'. Mistakes can also come from the way the data has been collected: using nicknames instead of full names, different email addresses etc. While humans could be able to match those logs despite the mistakes, computers cannot.
Does it mean the computer is helpless in those cases, shifting all the tedious work of matching similar but different entries to human? Fortunately not, several mechanisms exist to assist or even perform the matching, and fuzzy join is one of them: a fuzzy join is process which automatically matches entries from different logs despite not having a perfect matching between their keys.
Fuzzy join is used to perform a join on datasets when the keys do not match exactly. Simple use cases include matching lower case strings with camelCase strings or matching floats with some precision threshold.
Pathway standard library comes with a powerful
This tutorial is a showcase of its capabilities. We will develop a Data Application which allows for fuzzy-joining
two streams of data against each other, and also for maintaining audit entries and updating results on the fly - here is a sneak preview.
We will be doing the fuzzy-join between two datasets on money transfers’ banking logs. When doing banking or bookkeeping, this operation would be known as reconciliation of two sets of transactions records. One dataset comes in a perfectly organized format - csv, the other dataset consists of 'human written' lines describing the transactions.
Here are samples from the datasets:
Data sourced automatically from a bank feed, in 'standard' CSV format
|0||2020-06-04||8946||M. Perez||Jessie Roberts||HU30186000000000000008280573||ES2314520000000006226902|
|1||2014-08-06||8529||C. Barnard||Mario Miller||ES8300590000000002968016||PL59879710390000000009681693|
|2||2017-01-22||5048||S. Card||James Paletta||PL65889200090000000009197250||PL46193013890000000009427616|
|3||2020-09-15||7541||C. Baxter||Hector Haley||PL40881800090000000005784046||DE84733500000003419377|
|4||2019-05-25||3580||L. Prouse||Ronald Adams||PL44124061590000000008986827||SI54028570008259759|
The first dataset is sourced automatically from a bank feed. Every few seconds a new batch of transactions is saved to
Transaction logs entered by hand
|0||Received 8521 € on 2014-08-07 by INTERNATIONAL interest payment from ??? to C. Barnard, recipient acc. no. 000002968016 by BANCO DE MADRID, amount EUR €, flat fee 8 €|
|1||EUR 8944 on 2020-06-06 by INTERNATIONAL transfer credited to 00000000008280573 (M. Perez) by BNP Paribas Securities Services, fee EUR 2, amount EUR 8946.|
|2||Finally got 5M quid on 2017-01-23 by DOMESTIC payment from Sergio Marquina to Bella Ciao, r. acc. 0000000009197250, oryg. amount 5_000_048, fees 5 quid.|
|3||3578 EUR am 2019-05-25 von INTERNATIONAL dividend payment by Pathway Inc. an L. Prouse, Empfängerkonto 8986827, Betrag 3580 EUR|
|4||Received 7540 EUR on 2020-09-15. Invoice, recipient C. Baxter, 0000000005784046, amount EUR 7541, fees EUR 1|
As you can see, it seems that each entry in the first dataset (data sourced automatically) has a corresponding entry in the other dataset (transaction logs entered by hand).
In this example we will use the
smart_fuzzy_join function from Pathway's standard library to make sure all is correctly matched.
We want to obtain a table in which the matchings are expressed, e.g. the entry 0 for the first table corresponds to the entry 1 in the second table. In addition, we will include the confidence, a number expressing how confident we are in the matching.
First things first - imports.
import pandas as pdimport pathway as pw
And now, here come a few lines of code that read two datasets, try to match rows, and report matchings by writing to a csv file.
The data is read from csv files. For the purpose of this demonstration we will simply print a table with matchings found on the data sample presented above. But the code below works also in a production environment. In production:
- All csv files will be dynamically ingested from these directories in their order of appearance.
- The output will be updated immediately as new data appears at input.
# Uncomment to download the required files.# %%capture --no-display# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_1_transactionsA.csv -O transactionsA.csv# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/fuzzy_join_part_1_transactionsB.csv -O transactionsB.csv
We use our csv connectors to read the csv files:
transactionsA = pw.csv.read("./transactionsA.csv", ["","date","amount","recipient","sender","recipient_acc_no","sender_acc_no"], id_columns=["recipient_acc_no"])transactionsA = transactionsA.select(*pw.this.without(pw.this[""]))transactionsB = pw.csv.read("./transactionsB.csv", ["","description"], id_columns=["description"])transactionsB = transactionsB.select(*pw.this.without(pw.this[""]))pw.debug.compute_and_print(transactionsA)pw.debug.compute_and_print(transactionsB)
| recipient_acc_no | date | amount | recipient | sender | sender_acc_no ^F0T7SKD... | ES8300590000000002968016 | 2014-08-06 | 8529 | C. Barnard | Mario Miller | PL59879710390000000009681693 ^XWS98WZ... | HU30186000000000000008280573 | 2020-06-04 | 8946 | M. Perez | Jessie Roberts | ES2314520000000006226902 ^ZE12NM1... | PL40881800090000000005784046 | 2020-09-15 | 7541 | C. Baxter | Hector Haley | DE84733500000003419377 ^1KZWWMG... | PL44124061590000000008986827 | 2019-05-25 | 3580 | L. Prouse | Ronald Adams | SI54028570008259759 ^EVAJ2VF... | PL65889200090000000009197250 | 2017-01-22 | 5048 | S. Card | James Paletta | PL46193013890000000009427616 | description ^BMFPASY... | Received EUR 3578 on 2019-05-25 by INTERNATIONAL dividend from R. Adams to L. Prouse, recipient account 0000000008986827 by None, amount EUR 3580, fees EUR 2 ^EB801YB... | Received EUR 5043 on 2017-01-23 by DOMESTIC payment from J. Paletta to S. Card, recipient account 0000000009197250 by None, amount EUR 5048, fees EUR 5 ^V3N1B3G... | Received EUR 7540 on 2020-09-15 by INTERNATIONAL invoice from H. Haley to C. Baxter, recipient account 0000000005784046 by None, amount EUR 7541, fees EUR 1 ^JWF56Q4... | Received EUR 8521 on 2014-08-07 by INTERNATIONAL interest from M. Miller to C. Barnard, recipient account 000002968016 by BANCO DE MADRID, amount EUR 8529, fees EUR 8 ^KPS156S... | Received EUR 8944 on 2020-06-06 by INTERNATIONAL transaction from J. Roberts to M. Perez, recipient account 00000000008280573 by BNP Paribas Securities Services, amount EUR 8946, fees EUR 2
Then we use our fuzzy join functions to do the reconciliation between the two tables.
def match_transactions(transactionsA, transactionsB): matching = pw.ml.smart_table_ops.fuzzy_match_tables(transactionsA, transactionsB) transactionsA_reconciled = ( pw.Table.empty(left=pw.Pointer, right=pw.Pointer, confidence=float) .update_rows(transactionsA.select(left=None, right=None, confidence=0.0)) .update_rows( matching.select( pw.this.left, pw.this.right, confidence=pw.this.weight ).with_id(pw.this.left) ) ) return transactionsA_reconciledpw.debug.compute_and_print(match_transactions(transactionsA, transactionsB))
| left | right | confidence ^XWS98WZ... | ^XWS98WZ... | ^KPS156S... | 1.0 ^EVAJ2VF... | ^EVAJ2VF... | ^EB801YB... | 1.0 ^F0T7SKD... | ^F0T7SKD... | ^JWF56Q4... | 0.8333333333333333 ^ZE12NM1... | ^ZE12NM1... | ^V3N1B3G... | 1.3333333333333333 ^1KZWWMG... | ^1KZWWMG... | ^BMFPASY... | 1.5
Success, all matchings were found!
Super easy, few lines of code and you flawlessly manage datasets in different formats. Hassle-free.
smart_fuzzy_join is able to handle much bigger datasets.
Feel free to test it on your own data or use the full datasets from this tutorial,
available in this Google Spreadsheet.
In the tutorial we just printed a matching found on a small data sample. In a dynamic production environment:
- All csv files will be dynamically ingested from these directories in order of appearance.
- The output will be updated immediately as new data appears at input.
While errors are human and we are unlikely to stop making some, we can free ourselves of the pain of correcting them each time something goes wrong. Sometimes, entries are harder to match and may require help: in that case you can check out our extension and see how we extend our pipeline with an auditor that supervises the process of reconciliation. From now on, you have no excuses for having mismatching logs: 'errare humanum est, perseverare diabolicum'!
If you would like to get some more experience with Pathway, you can try those two challenges:
match_transactions function so that, after finding a matching, it extends the first input table (standard csv format) with columns 'fees' and 'currency'.
Try to augment the datasets so that they are still reasonable but
smart_fuzzy_join fails to find all matchings 😉