Pandas script modifying numbers to long float numbers when it shouldn't even be modifying that column/element
I have a pandas script that is giving me a headache because it keeps modifying data it shouldn't touch. The example below reproduces the issue every time. (It took me forever to find out what was causing it.)
Basically, if you compare the original file to the modified testing2.csv, you'll see that numbers like 0.357 on the first line turn into 0.35700000000000004, yet on line 2 the number 0.1128 doesn't change at all...
It should NOT be modifying these numbers; they should all stay exactly as they are.
testing.py
import re
import pandas

# Each block in the text file will be one element of this list
matchers = [[]]
i = 0
with open('testing.txt') as infile:
    for line in infile:
        line = line.strip()
        # Blocks are separated by blank lines
        if len(line) == 0:
            i += 1
            matchers.append([])
            # Assume there are always two blank lines between items
            # and just skip to the next line
            next(infile)
            continue
        matchers[i].append(line)

# This regular expression matches the variable number of students in each block
studentlike = re.compile(r'(\d+) (.+) (\d+/\d+)')
# These are the names of the fields we expect at the end of each block
datanames = ['Data', 'misc2', 'bla3']
# We will build a table containing a list of elements for each student
table = []
for matcher in matchers:
    # We use an iterator over the block lines to make indexing simpler
    it = iter(matcher)
    # The first two elements are match values
    m1, m2 = next(it), next(it)
    # Then there are a number of students
    students = []
    for possiblestudent in it:
        m = studentlike.match(possiblestudent)
        if m:
            students.append(list(m.groups()))
        else:
            break
    # After the students come the data elements, which we read into a dictionary
    # We also add in the last possible student line as that didn't match the student re
    dataitems = dict(item.split() for item in [possiblestudent] + list(it))
    # Finally we construct the table
    for student in students:
        # We use the dictionary .get() method to return blanks for the missing fields
        table.append([m1, m2] + student + [dataitems.get(d, '') for d in datanames])

textcols = ['MATCH2', 'MATCH1', 'TITLE01', 'MATCH3', 'TITLE02', 'Data', 'misc2', 'bla3']
csvdata = pandas.read_csv('testing.csv')
textdata = pandas.DataFrame(table, columns=textcols)

# Add any new columns
newCols = textdata.columns.difference(csvdata.columns)
for c in newCols:
    csvdata[c] = None

mergecols = ['MATCH2', 'MATCH1', 'MATCH3']
csvdata.set_index(mergecols, inplace=True, drop=False)
textdata.set_index(mergecols, inplace=True, drop=False)
csvdata.update(textdata)
csvdata.to_csv('testing2.csv', index=False)
testing.csv
- http://pastebin.com/raw.php?i=HxVE0nA0 (Uploaded because of file size)
testing.txt
MData (N/A)
DMATCH1
3 Tommy 144512/23332
1 Jim 90000/222311
1 Elz M 90000/222311
1 Ben 90000/222311
Data $50.90
misc2 $10.40
bla3 $20.20


MData (B/B)
DMATCH2
4 James Smith 2333/114441
4 Mike 90000/222311
4 Jessica Long 2333/114441
Data $50.90
bla3 $5.44
Does anyone have any idea how to fix this?
Thanks in advance
- Hyflex
Try this :)
csvdata = pandas.read_csv('testing.csv', dtype={'TITLE5' : 'object', 'TITLE5.1' : 'object', 'TITLE5.2' : 'object', 'TITLE5.3' : 'object'})
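The reason this should help, as far as I can tell, is that a column read as object keeps the literal text from the file, so the value never goes through a string-to-float conversion and back again. Here is a minimal sketch of that round trip. The inline CSV with a single TITLE5 column is made up for illustration (it is not the real testing.csv), and the float_precision='round_trip' variant is an alternative I am assuming may also be acceptable if the columns need to stay numeric.

```python
import io
import pandas

# Hypothetical stand-in for testing.csv: one column holding the values from the question.
raw = "TITLE5\n0.357\n0.1128\n"

# Read the column as text so pandas never converts it to a float.
as_text = pandas.read_csv(io.StringIO(raw), dtype={'TITLE5': 'object'})
out = io.StringIO()
as_text.to_csv(out, index=False)
text_round_trip = out.getvalue()

# Alternative: keep the column numeric but ask for the round-trip float parser.
as_float = pandas.read_csv(io.StringIO(raw), float_precision='round_trip')

print(text_round_trip)
print(as_float['TITLE5'].tolist())
```

With the object dtype the values are strings, so any arithmetic on those columns would need an explicit conversion first. If the columns are only being carried through to testing2.csv, that is probably the simpler option.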