Python Pandas Decimal Mark EU to US


Problem description

I read the mails about EU to US decimal mark conversion; they helped a lot, but I still feel I need some help from the experts. My data comes from an ERP system with numbers in a format like "1'000'000,32", and I would simply like to convert them into something like "1000000.32" for further processing in pandas.

My current solution to obtain the US format starting from the EU one looks like this:

...
# read_csv and merge, clean ... different CSV files
# result = merge(some_DataFrame_EU_format, ...)
...
result.to_csv(path, sep=';')
result = read_csv(path, sep=';',
                  converters={'column_name': lambda x: float(x.replace('.', '').replace(',', '.'))})
...
result.to_csv(path, sep=';')

I had the feeling this was a slow way to change the ',' into '.' because of the read_csv and to_csv round trip (and the disk), so I was willing to try the .replace method directly on the DataFrame to save some processing time.

My initial attempt was something like the below (which I read elsewhere here on the forum):

result['column_name'] = result['column_name'].replace('.', '')
result['column_name'] = result['column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

This did not work and resulted in an 'invalid literal for float' error.
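A note on why this fails: without regex=True, Series.replace matches whole cell values rather than substrings, so a string like "1.000.000,32" is left untouched and the following astype(float) raises. A vectorized substring replacement goes through the .str accessor instead; a minimal sketch, assuming 'column_name' holds strings and a reasonably recent pandas:

result['column_name'] = (result['column_name']
                         .str.replace('.', '', regex=False)   # strip thousands separators
                         .str.replace(',', '.', regex=False)  # EU decimal comma -> US dot
                         .astype(float))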

So I moved on to:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '')
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

The above worked, but to my surprise it turned out to be about 3 times slower than the read_csv/converters solution. Using the version below helped somewhat:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '').replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

I read the fine manuals and I know that read_csv is optimized, but I did not really expect a read/write/read/write cycle to be three times faster than a for loop!
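Much of that gap likely comes from the per-cell .ix lookups and assignments rather than from the string handling itself; pushing the same conversion into a single column-wise apply avoids that overhead. A minimal sketch, again assuming 'column_name' holds strings:

# One apply call over the column instead of a Python loop with
# per-cell .ix indexing and assignment.
result['column_name'] = result['column_name'].apply(
    lambda x: float(x.replace('.', '').replace(',', '.')))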

Do you think it might be worth working more on this? Any suggestions? Or is it better to stay with the repeated write/read/write approach?

My file is about 30k rows x 150 columns; the read/write/read(convert)/write takes about 18 seconds, while the .ix loop takes above 52 seconds with the first kind of loop (and about 32 with the grouped .replace).

What is your experience with converting DataFrames from EU to US format? Any suggested methods to improve this? What about 'mapping' or 'locale'? Might they be faster?

Thank you so much, Fabio.

P.S. I realize I was 'verbose' and not 'pythonic' enough .. sorry .. I'm still learning :-)

Solution

In fact there are thousands and decimal parameters in read_csv (see the pandas read_csv documentation), but unfortunately the two don't yet work together (see this github issue).
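For illustration, this is roughly how the call would look once the two options can be combined (recent pandas versions do accept them together); the file name, separator and thousands character below are assumptions based on the question:

import pandas as pd

# Hypothetical ERP export where ' is the thousands separator and ,
# the decimal mark, e.g. a cell containing 1'000'000,32
result = pd.read_csv('erp_export.csv', sep=';', thousands="'", decimal=',')

# Numeric columns are parsed straight to float, no converters needed
print(result.dtypes)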
