Patch over missing rows in CSV file in Python


Problem description


I've got a CSV file that contains rows for every minute of the day for multiple days. It is generated by a data acquisition system that sometimes misses a few rows.

The data looks like this: a datetime field followed by some integers.

"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"

There are missing rows in the above (real data) example. As the data doesn't change very much between samples, I'd like to just copy the last valid data into the missing rows. The problem I'm having is detecting which rows are missing.

I'm processing the CSV with a Python program I've cobbled together (I'm very new to Python). This works to process the data I have.

import csv
import datetime

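# 'rb' is the Python 2 convention for csv; on Python 3 use open(..., newline='')
with open("minutedata.csv", 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        # first field is the timestamp, the remaining five are integers
        date = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        v1 = int(row[1])
        v2 = int(row[2])
        v3 = int(row[3])
        v4 = int(row[4])
        v5 = int(row[5])
        ...(process values)...

...(save data)...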

I'm unsure how to check if the current row is next in sequence, or comes after some missing rows.
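For reference, a gap check without pandas could look like the sketch below. It assumes Python 3, a fixed one-minute sampling interval, and timestamps that always fall on the same second of the minute; `fill_gaps`, `SAMPLE`, `STEP` and `FMT` are names made up for this illustration, and the sample is a shortened copy of the data above. Each row's timestamp is compared with the previous one, and the previous values are repeated for every minute that was skipped.

```python
import csv
import datetime
import io

# Shortened copy of the sample data; 03:03 through 03:06 are missing.
SAMPLE = '''"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
'''

STEP = datetime.timedelta(minutes=1)  # assumed sampling interval
FMT = "%Y-%m-%d %H:%M:%S"


def fill_gaps(rows, step=STEP):
    """Yield (timestamp, values) pairs, repeating the last values across gaps."""
    prev_time = None
    prev_values = None
    for row in rows:
        time = datetime.datetime.strptime(row[0], FMT)
        values = [int(v) for v in row[1:]]
        if prev_time is not None:
            # Emit one synthetic row per missing minute before this row.
            expected = prev_time + step
            while expected < time:
                yield expected, prev_values
                expected += step
        yield time, values
        prev_time, prev_values = time, values


filled = list(fill_gaps(csv.reader(io.StringIO(SAMPLE))))
for time, values in filled:
    print(time, values)
```

With a real file, `csv.reader(io.StringIO(SAMPLE))` would be replaced by a reader over the opened file.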

Edit to add:

I'm trying to use pandas now, thanks to jeremycg for the pointer.

I've added a header row to the CSV, so now it looks like:

time,v1,v2,v3,v4,v5
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"

The processing code is now:

import pandas as pd
import io
z = pd.read_csv('minutedata.csv')
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']),freq="1min")).ffill()
for row in z:
    date = datetime.datetime.strptime (row [0],"%Y-%m-%d %H:%M:%S")
    v1 = int(row[1])
    v2 = int(row[2])
    v3 = int(row[3])
    v4 = int(row[4])
    v5 = int(row[5])
    ...(process values)...

...(save data)...

but this errors out:

Traceback (most recent call last):
  File "process_day.py", line 14, in <module>
    z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2821, in reindex
    **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2259, in reindex
    fill_value, copy).__finalize__(self)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2767, in _reindex_axes
    fill_value, limit, tolerance)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2778, in _reindex_index
    allow_dups=False)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in _reindex_with_indexers
    copy=copy)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3839, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 2494, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

I'm lost as to what it is now claiming is broken.

See the comment further down for the fix.
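A ValueError like the one above is what pandas raises when the index being reindexed contains repeated labels, which here would mean the same timestamp appearing on more than one row. The sketch below reproduces that on a small made-up frame (the data and the names `full_range`, `deduped` and `raised` are invented for illustration) and shows that dropping the repeats with `duplicated()` lets the reindex go through. The exact wording of the message differs between pandas versions.

```python
import pandas as pd

# Hypothetical data with one repeated timestamp (03:01:02 appears twice).
z = pd.DataFrame({
    "time": pd.to_datetime([
        "2017-01-07 03:00:02",
        "2017-01-07 03:01:02",
        "2017-01-07 03:01:02",
        "2017-01-07 03:03:02",
    ]),
    "v1": [7, 7, 6, 7],
})
full_range = pd.date_range(z["time"].min(), z["time"].max(), freq="1min")

try:
    z.set_index("time").reindex(full_range)
    raised = False
except ValueError as exc:
    raised = True
    print("reindex failed:", exc)

# Keep only the first row for each timestamp; after that the reindex succeeds.
deduped = z[~z["time"].duplicated()]
filled = deduped.set_index("time").reindex(full_range).ffill()
print(filled)
```

Keeping the first row of each duplicate is a choice made for this sketch; whether the first or the last sample is the right one to keep depends on the acquisition system.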

The working code is now:

import pandas as pd

z = pd.read_csv('minutedata1.csv')
# drop rows that repeat a timestamp; reindex needs a unique index
z = z[~z.time.duplicated()]
z['time'] = pd.to_datetime(z['time'])
# set_index/reindex/ffill return a new DataFrame, so assign the result back
z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
for index, row in z.iterrows():
    date = index  # the index is already a Timestamp (a datetime subclass)
    v1 = int(row['v1'])
    v2 = int(row['v2'])
    v3 = int(row['v3'])
    v4 = int(row['v4'])
    v5 = int(row['v5'])
    ...(process values)...

...(save data)...

My sincere thanks to everyone that helped. - David

Solution

You should probably be using pandas for this, as it is made for this kind of stuff.

First read the csv:

import pandas as pd
import io
x = '''
time,a,b,c,d,e
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"''' #your data, with added headers
z = pd.read_csv(io.StringIO(x)) #you can use your file name here

Now z is a pandas DataFrame:

z.head()

                  time  a  b  c   d  e
0  2017-01-07 03:00:02  7  3  2  13  0
1  2017-01-07 03:01:02  7  3  2  13  0
2  2017-01-07 03:02:02  7  3  2  12  0
3  2017-01-07 03:07:02  7  3  2  12  0
4  2017-01-07 03:08:02  6  3  2  12  1

We want to convert the 'time' column to datetime values:

z['time'] = pd.to_datetime(z['time'])

Set the 'index' of the dataframe to be the time, then reindex over our range:

z = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min"))
z

                       a    b    c     d    e
2017-01-07 03:00:02  7.0  3.0  2.0  13.0  0.0
2017-01-07 03:01:02  7.0  3.0  2.0  13.0  0.0
2017-01-07 03:02:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:03:02  NaN  NaN  NaN   NaN  NaN
2017-01-07 03:04:02  NaN  NaN  NaN   NaN  NaN
2017-01-07 03:05:02  NaN  NaN  NaN   NaN  NaN
2017-01-07 03:06:02  NaN  NaN  NaN   NaN  NaN
2017-01-07 03:07:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:08:02  6.0  3.0  2.0  12.0  1.0
2017-01-07 03:09:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:10:02  6.0  3.0  2.0  11.0  1.0

Then use .ffill() to fill in from the previous value:

z.ffill()

                       a    b    c     d    e
2017-01-07 03:00:02  7.0  3.0  2.0  13.0  0.0
2017-01-07 03:01:02  7.0  3.0  2.0  13.0  0.0
2017-01-07 03:02:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:03:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:04:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:05:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:06:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:07:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:08:02  6.0  3.0  2.0  12.0  1.0
2017-01-07 03:09:02  7.0  3.0  2.0  12.0  0.0
2017-01-07 03:10:02  6.0  3.0  2.0  11.0  1.0

or, all together:

z = pd.read_csv(io.StringIO(x))
z['time'] = pd.to_datetime(z['time'])
z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min")).ffill()
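The filled values above come back as floats (7.0 rather than 7) because the reindex step introduces NaN. If integers are wanted again, one possible follow-up is sketched below: the same steps, assigned to a variable, cast back to int once every gap has been filled, and written out as CSV text. The names `full_range`, `filled` and `out` are made up for this sketch, and the cast assumes the first row is present so that no NaN survives the forward fill.

```python
import io

import pandas as pd

x = '''time,a,b,c,d,e
"2017-01-07 03:00:02","7","3","2","13","0"
"2017-01-07 03:01:02","7","3","2","13","0"
"2017-01-07 03:02:02","7","3","2","12","0"
"2017-01-07 03:07:02","7","3","2","12","0"
"2017-01-07 03:08:02","6","3","2","12","1"
"2017-01-07 03:09:02","7","3","2","12","0"
"2017-01-07 03:10:02","6","3","2","11","1"
'''

# Parse the 'time' column as datetimes while reading, then use it as the index.
z = pd.read_csv(io.StringIO(x), parse_dates=["time"]).set_index("time")

# One label per minute between the first and last sample.
full_range = pd.date_range(z.index.min(), z.index.max(), freq="1min")

# Reindex, forward-fill the gaps, and cast back to integers.
filled = z.reindex(full_range).ffill().astype(int)
filled.index.name = "time"  # reindex drops the index name; restore it for the header

out = filled.to_csv()  # pass a file name instead to write to disk
print(out)
```

`ffill()` also accepts a `limit` argument, which caps how many consecutive missing rows are filled; that may be worth setting if long outages should stay visible as gaps.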
