DATASCIENCE-1-2018.11.28-SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS

Summary – SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS.

Sometimes we have to join two different data set about same entities but variations in entity name, this makes direct join process incapable of handling the joins and results in mismatch. Here fuzzy join is handy as it uses soundex algorithm to convert names to the an index based on how it is pronounced.

Pandas is a library in python used by a lot of data scientists, it handles a lot of heavy load in data manipulation.

Jupyter notebook is a python library that provides us an interface where we can code in bits and pieces and see results, and yes it is a bliss for data scientists.

It is recommended to use Anaconda distribution of python, in this tutorial python 3.6 has been used.

If you want to see how to install python (Anaconda Distribution) and run Jupyter notebook see this tutorial .

Download Code for this tutorial.

Two Steps:

Reading data and convert names to Soundex Index.
Merge data and write output.

The Problem

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-problem

Problem: Sometimes we have to join two different data set about same entities but variations in entity name, this makes direct join process incapable of handling the joins and results in mismatch..

Procedure

Prepare data and convert names to Soundex index

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-1

Prepare data and convert names to Soundex index

Import libraries:

#Import Libraries
import panda as pd
import jellyfish

Read Datasets:

# Read first dataset
df1 = pd.read_csv('data1.csv')

# Read second dataset
df2 = pd.read_csv('data2.csv')

Create Soundex Index to join data:

# Create join column in first dataset
df1['soundex'] = df1['data1'].apply(lambda x: jellyfish.soundex(x))

# Create join column in second dataset
df2['soundex'] = df2['data3'].apply(lambda x: jellyfish.soundex(x))

Merge data and write output.

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-2

Merge data and write output.

Merge data:

# Merge both datasets
df = pd.merge(df1, df2, on=['soundex'])

Write output data:

# Export file to csv
df.to_csv('joined_data.csv', encoding='utf-8', index=False)

Tadda done!!

DATASCIENCE-1-2018.11.28-SUPERFAST-FUZZY-JOIN-IN-PYTHON-USING-PANDAS-feature

Tadda done

if you like the tutorial please spread the word and comment below if you have any questions.

DATASCIENCE-1-2018.11.28-SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS

Published by Sandy on November 28, 2018November 28, 2018

Summary – SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS.

Download Code for this tutorial.

The Problem

Procedure

Prepare data and convert names to Soundex index

Import libraries:

Read Datasets:

Create Soundex Index to join data:

Merge data:

Write output data:

Tadda done!!

Download Code for this tutorial.

ARCGIS

GIS0016 2019.09.04 Landuse Pixelation to Identify Major Landuse by Area in each Grid

City Maps

MAP-4 Kanpur City Map 28-04-2019

City Maps

MAP-3 Kishangarh City Map 28-04-2019

July 2024
S	M	T	W	T	F	S
« Jun
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

DATASCIENCE-1-2018.11.28-SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS

Published by Sandy on November 28, 2018November 28, 2018

Summary – SUPERFAST FUZZY JOIN IN PYTHON USING PANDAS.

The Problem

Procedure

Prepare data and convert names to Soundex index

Import libraries:

Read Datasets:

Create Soundex Index to join data:

Merge data:

Write output data:

Tadda done!!

Related Posts