Is it possible to do a fuzzy-match merge with Python pandas?

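I have two DataFrames that I want to merge on a column. However, because of alternate spellings, differing numbers of spaces, and the presence or absence of diacritical marks, I would like to be able to merge rows as long as the values are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib).

Say one DataFrame has the following data: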

from pandas import DataFrame

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I would like to get this resulting DataFrame:

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

Answer:

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then do a join:

In [23]: import difflib

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]:
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
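One caveat with the `[0]` indexing above: `get_close_matches` returns an empty list when no candidate clears its similarity cutoff (0.6 by default), so a label with no close match would raise an `IndexError`. A minimal sketch of a more defensive variant follows; the helper name `closest_or_self` and the extra `'xyz'` row are illustrative assumptions, not part of the original answer.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({"number": [1, 2, 3, 4, 5]},
                   index=["one", "two", "three", "four", "five"])
# 'xyz' is an extra, made-up label that has no close match in df1.
df2 = pd.DataFrame({"letter": ["a", "b", "c", "d", "e", "f"]},
                   index=["one", "too", "three", "fours", "five", "xyz"])


def closest_or_self(label, candidates, cutoff=0.6):
    """Return the best difflib match for label, or label itself if none clears cutoff."""
    matches = difflib.get_close_matches(label, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else label


candidates = list(df1.index)
df2_aligned = df2.copy()
df2_aligned.index = [closest_or_self(label, candidates) for label in df2.index]

# A left join keeps every row of df1; the unmatched 'xyz' row is simply left out.
result = df1.join(df2_aligned)
print(result)
```

Raising `cutoff` makes the matching stricter; leaving unmatched labels unchanged means they drop out of the left join instead of stopping the whole operation.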

If these were columns, you could apply the same approach to the column and then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
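The question also mentions differing whitespace and missing diacritics. One possible way to handle those is to normalize the names before comparing them and to store the match in a separate column, so the original spelling in df2 is not overwritten. The sketch below is one such approach under stated assumptions: the sample names, the `normalize` and `match_name` helpers, and the `matched_name` column are all made up for illustration.

```python
import difflib
import unicodedata

import pandas as pd


def normalize(text):
    """Strip diacritics, collapse runs of whitespace, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.split()).lower()


# Illustrative data: df2 has a diacritic, doubled spaces, a typo, and a plural.
df1 = pd.DataFrame({"number": [1, 2, 3],
                    "name": ["Zoe Smith", "new york", "four"]})
df2 = pd.DataFrame({"letter": ["a", "b", "c"],
                    "name": ["Zoë  Smith", "New  Yrok", "fours"]})

# Map each normalized df1 name back to its original spelling.
lookup = {normalize(name): name for name in df1["name"]}


def match_name(name, cutoff=0.6):
    """Return the df1 name closest to `name`, or None when nothing is similar enough."""
    hits = difflib.get_close_matches(normalize(name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


df2["matched_name"] = df2["name"].map(match_name)

# Merge on the matched key; df2's original spelling survives as 'name_df2'.
merged = df1.merge(df2, left_on="name", right_on="matched_name",
                   suffixes=("", "_df2"))
print(merged)
```

Keeping `matched_name` alongside the original `name` makes it easy to inspect which rows were matched fuzzily before trusting the merged result.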

Source: utcz.com/qa/434822.html
