Summary – SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS.

Sometimes we have to join two different data set about same entities but variations in entity name, this makes direct join process incapable of handling the joins and results in mismatch. Here fuzzy join is handy as it uses soundex algorithm  to convert names to the an index based on how it is pronounced.

Pandas is a library in python used by a lot of data scientists, it handles a lot of heavy load in data manipulation.

Jupyter notebook is a python library that provides us an interface where we can code in bits and pieces and see results, and yes it is a bliss for data scientists.

It is recommended to use Anaconda distribution of python, in this tutorial python 3.6 has been used.

If you want to see how to install python (Anaconda Distribution) and run Jupyter notebook see this tutorial .

Download Code for this tutorial.

Two Steps:

  1. Reading data and convert names to Soundex Index.
  2. Merge data and write output.

The Problem

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-problem

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-problem

Problem: Sometimes we have to join two different data set about same entities but variations in entity name, this makes direct join process incapable of handling the joins and results in mismatch..

Procedure

Prepare data and convert names to Soundex index

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-1

Prepare data and convert names to Soundex index

Import libraries:
#Import Libraries
import panda as pd
import jellyfish
Read Datasets:
# Read first dataset
df1 = pd.read_csv('data1.csv')

# Read second dataset
df2 = pd.read_csv('data2.csv')

Create Soundex Index to join data:
# Create join column in first dataset
df1['soundex'] = df1['data1'].apply(lambda x: jellyfish.soundex(x))
# Create join column in second dataset
df2['soundex'] = df2['data3'].apply(lambda x: jellyfish.soundex(x))

Merge data and write output.

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-2

Merge data and write output.

Merge data:
# Merge both datasets
df = pd.merge(df1, df2, on=['soundex'])
Write output data:
# Export file to csv
df.to_csv('joined_data.csv', encoding='utf-8', index=False)

Tadda done!!

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-feature

Tadda done

if you like the tutorial please spread the word and comment below if you have any questions.

Download Code for this tutorial.