Python fuzzywuzzy error: expected string or buffer


Question

I'm using fuzzywuzzy to find near matches in a CSV of company names. I'm comparing manually matched strings with the unmatched strings in the hope of finding some useful close matches. However, I'm getting an "expected string or buffer" error inside fuzzywuzzy. My code is:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]  
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()


    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

This produces the following error:

Traceback (most recent call last):
File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
mmm = process.extractOne(sssf, sss_true) # find best choice
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
best_list = extract(query, choices, processor, scorer, limit=1)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
processed = processor(choice)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

I think this has something to do with reading the file into pandas with an encoding specified, which I added to prevent UnicodeDecodeErrors but which had the knock-on effect of causing this error. I've tried to coerce the value using str(sssf), but that doesn't work.

So, I've isolated a line that is causing the error: #N/A,,,,,, (line 29 in the data pasted below). I assumed it was the # that was causing the error, but strangely it's not. It's the A character that causes the problem, because the file works when it is removed. What is strange to me is that the string two rows below is N/A, which parses fine, yet row 29 won't parse when I delete the # symbol, even though the field then appears identical to the one below.

sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,

Recommended answer

By default, pandas.read_csv parses the string 'N/A' (along with other default markers such as '#N/A') as Not a Number (NaN), i.e. a missing value.

In your case, that means that you end up with a nan value rather than a string. In your sample data set, this happens in two places:

The line you highlight in the question (#N/A) becomes the third-from-last unmatched entry, so sss_false[-3] is nan.

The N/A line becomes the last matched entry, so sss_true[-1] is nan.
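
To see the effect in isolation, here is a minimal sketch, assuming Python 3 and a recent pandas. The three-row sample is a hypothetical cut-down of the CSV in the question, and the regular expression is only a stand-in for the one fuzzywuzzy applies internally.

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical sample mirroring the tail of the question's CSV.
sample = StringIO(
    "sss,sid,match_manual\n"
    "#N/A,,\n"
    "N A,,\n"
    "N/A,red_general_xtr,true\n"
)

df = pd.read_csv(sample)
values = df["sss"].values.tolist()

# '#N/A' and 'N/A' are default NA markers, so they arrive as float nan;
# 'N A' (with a space) is not a marker and stays a string.
kinds = [type(v).__name__ for v in values]
print(kinds)

# A regex substitution, like the one in fuzzywuzzy's string processing,
# raises TypeError when it is handed a float instead of a string.
try:
    re.sub(r"\W", " ", values[0])
    raised = False
except TypeError:
    raised = True
print(raised)
```

On Python 3 the message reads "expected string or bytes-like object" rather than "expected string or buffer", but the cause is the same.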

If you want to parse the string 'N/A' as a string instead of nan, the way to do this is to replace

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")

with

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')

na_values : list-like or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to

So, the above modification tells pandas to recognize only the empty string as NA and to discard the default markers, including 'N/A' and '#N/A'.
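
As a quick check of that behaviour, the sketch below reads a hypothetical three-row sample (not the full file from the question) with the same two options, assuming Python 3. It passes na_values as a one-element list, which should be equivalent to the bare '' used above.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample mirroring the tail of the question's CSV.
sample = (
    "sss,sid,match_manual\n"
    "#N/A,,\n"
    "N A,,\n"
    "N/A,red_general_xtr,true\n"
)

# Only the empty string counts as missing; 'N/A' and '#N/A' stay as text.
df = pd.read_csv(StringIO(sample), keep_default_na=False, na_values=[""])

names = df["sss"].values.tolist()

# Empty match_manual cells are still NaN, so the isnull() split keeps working.
unmatched = df[df["match_manual"].isnull()]["sss"].values.tolist()

print(names)
print(unmatched)
```

Because empty fields are still treated as missing, the isnull()/notnull() split on match_manual from the question continues to work unchanged.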

If you want to discard lines with 'N/A' in the first column instead, you need to remove the nan members from sss_true and sss_false. One way to do this is:

from pandas import notnull  # nan entries (from 'N/A' or '#N/A') fail this test
sss_true = [x for x in sss_true if notnull(x)]
sss_false = [x for x in sss_false if notnull(x)]
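
An alternative is to drop the affected rows at the DataFrame level before building the lists. This is only a sketch, assuming Python 3; the four-row sample is hypothetical, and DataFrame.dropna with subset is swapped in here in place of the list comprehensions.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: one ordinary matched row plus the problem rows.
sample = (
    "sss,sid,match_manual\n"
    "N9 DESIGN,07342091_com,true\n"
    "#N/A,,\n"
    "N A,,\n"
    "N/A,red_general_xtr,true\n"
)

df = pd.read_csv(StringIO(sample))

# Remove every row whose name was parsed as NaN ('#N/A' and 'N/A' here).
df = df.dropna(subset=["sss"])

sss_false = df[df["match_manual"].isnull()]["sss"].tolist()
sss_true = df[df["match_manual"].notnull()]["sss"].tolist()

print(sss_false)
print(sss_true)
```

Both lists then contain only strings, so they can be passed to process.extractOne as in the question.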
