Trouble optimizing python interpolation script
Problem description
I am interpolating arrival times for some public transportation data I have. I have a working script, but it seems to be running in quadratic time. Here is the script:
    import pandas as pd

    # read the txt file
    st = pd.read_csv('interpolated_test.csv')

    # sort first by trip_id, then by stop_sequence
    sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

    # reset the index values in prep. for iteration
    reindexed = sorted_st.reset_index(drop=True)

    # for each row in 'arrival_time' that has a value of hh:mm:ss
    for i in reindexed['arrival_time']:
    # for i in range(len(reindexed['arrival_time'])):
        if pd.isnull(i) == False:
            # splice hh:mm:ss
            hour = int(i[:2])
            minute = int(i[3:5])
            # assign hh:mm:ss to numeric value
            minute_value = (hour * 60) + minute
            # replace current string with int value
            # takes ~655s to execute on Macbook Pro w/ entire stop_times.txt
            # runs in quadratic time
            reindexed = reindexed.replace(i,minute_value)

    # interpolate and write out
    new = reindexed.apply(pd.Series.interpolate)
    print(new)
Here is a link to the csv: https://gist.github.com/adampitchie/0192933ed0eba122ba7e
I shortened the csv so you can run the file without waiting for it to finish.
This should be low-hanging fruit for anybody familiar with pandas, but I'm stuck and any help would be appreciated.
[UPDATE] So I tried running the same code with the FULL CSV FILE, and I get this error:
    Traceback (most recent call last):
      File "/Users/tester/Desktop/ETL/interpolate.py", line 49, in <module>
        reindexed[col].dt.hour * 60
      File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
      File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 2513, in dt
        raise TypeError("Can only use .dt accessor with datetimelike values")
    TypeError: Can only use .dt accessor with datetimelike values
It looks like pd.to_datetime(reindexed[col]) is not working. Here is the code, for the sake of completeness:

    import pandas as pd

    st = pd.read_csv('csv/stop_times.csv')
    sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])
    reindexed = sorted_st.reset_index(drop=True)

    for col in ('arrival_time', 'departure_time'):
        reindexed[col] = pd.to_datetime(reindexed[col])
        reindexed[col] = (
            reindexed[col].dt.hour * 60
            + reindexed[col].dt.minute)
        reindexed[col] = reindexed[col].interpolate()

    print(reindexed.iloc[:, :3])
Solution

Whenever you can, try to phrase computations as operations on whole columns rather than on rows or individual items. Instead of handling each value in reindexed['arrival_time'] one at a time, you can convert the whole column into datetime64s using pd.to_datetime. A Series of datetime64s has a dt attribute which allows you to access the hours and minutes as integers. So you can express the calculation for the whole column like this:

    for col in ('arrival_time', 'departure_time'):
        reindexed[col] = pd.to_datetime(reindexed[col])
        reindexed[col] = (
            reindexed[col].dt.hour * 60
            + reindexed[col].dt.minute)
        reindexed[col] = reindexed[col].interpolate()

    print(reindexed.iloc[:5, :3])
yields
       trip_id  arrival_time  departure_time
    0  1423492    647.000000      647.000000
    1  1423492    649.666667      649.666667
    2  1423492    652.333333      652.333333
    3  1423492    655.000000      655.000000
    4  1423492    655.750000      655.750000
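As a small illustration of the column-wise approach, the sketch below applies the same steps to a made-up Series of four times with two gaps. The sample values and the explicit format='%H:%M:%S' argument are assumptions for this example and are not part of the original script. It only works when every hour is below 24.

```python
import pandas as pd

# Made-up sample: two known times with two missing values in between.
times = pd.Series(['10:47:00', None, None, '10:55:00'])

# Parse the whole column at once; missing values become NaT.
parsed = pd.to_datetime(times, format='%H:%M:%S')

# Minutes since midnight, computed for the whole column in one expression.
minutes = parsed.dt.hour * 60 + parsed.dt.minute

# Linear interpolation fills the two gaps.
filled = minutes.interpolate()
print(filled)
```

No Python-level loop runs over the rows here: parsing, arithmetic and interpolation each operate on the entire column.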
Debugging TypeError: Can only use .dt accessor with datetimelike values:

Indeed, as you pointed out, pd.to_datetime is not converting the times to datetime64s. Instead, it is just returning the same data as strings. In the pandas version used here, pd.to_datetime returns the input unchanged when it encounters an error trying to convert the input to datetimes. You can gather a bit more information about what is going wrong by adding the errors='raise' parameter:

    pd.to_datetime(reindexed['arrival_time'], errors='raise')
raises
ValueError: hour must be in 0..23
So aha -- the time format probably has times whose hours exceed 23.
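Another way to locate such values, sketched here under the assumption of a reasonably recent pandas, is to parse with errors='coerce' and look at the entries that had a value but came back as NaT. The three sample strings are invented for the example.

```python
import pandas as pd

# Made-up sample: one valid time, one with hour > 23, one missing value.
times = pd.Series(['10:47:00', '26:09:00', None])

# errors='coerce' turns values that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(times, format='%H:%M:%S', errors='coerce')

# Rows that had a value but failed to parse are the problematic ones.
bad = times[parsed.isna() & times.notna()]
print(bad)
```

This separates values that were missing all along from values that exist but do not fit the hh:mm:ss format with hours in 0..23.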
Using

    col = 'arrival_time'
    x = reindexed[col]
    mask = x.str.extract(r'(\d+):(\d+):(\d+)')[0].astype('int') > 23

we can see examples of rows where the hour is greater than 23:

    In [48]: x[mask].head()
    Out[48]:
    42605    26:09:00
    42610    26:12:00
    42611    26:20:00
    42612    26:30:00
    42613    26:35:00
    Name: arrival_time, dtype: object
x.str.extract splits the arrival time strings using the regex pattern r'(\d+):(\d+):(\d+)'. It returns a DataFrame with three columns.

This piece of debugging code suggests a workaround. Instead of pd.to_datetime, we could use x.str.extract to find the hours and minutes:

    import pandas as pd

    st = pd.read_csv('csv/stop_times.csv')
    sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])
    reindexed = sorted_st.reset_index(drop=True)

    for col in ('arrival_time', 'departure_time'):
        df = reindexed[col].str.extract(
            r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)').astype('float')
        reindexed[col] = df['hour'] * 60 + df['minute']
        reindexed[col] = reindexed[col].interpolate()

    print(reindexed.iloc[:5, :3])
yields
       trip_id  arrival_time  departure_time
    0  1423492    647.000000      647.000000
    1  1423492    649.666667      649.666667
    2  1423492    652.333333      652.333333
    3  1423492    655.000000      655.000000
    4  1423492    655.750000      655.750000
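For readers on newer pandas releases, where DataFrame.sort is no longer available, the same workaround might look roughly like the sketch below. It uses sort_values in place of sort and reads a small in-memory CSV. The sample rows and the helper name to_minutes are hypothetical and chosen only for illustration; the real stop_times.csv may have more columns.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as stop_times.csv (values are made up).
csv_text = """trip_id,arrival_time,departure_time,stop_sequence
1,25:50:00,25:50:00,1
1,,,2
1,26:10:00,26:10:00,3
2,10:47:00,10:47:00,1
2,,,2
2,,,3
2,10:55:00,10:55:00,4
"""

def to_minutes(col):
    """Convert 'hh:mm:ss' strings to minutes since midnight; hours may exceed 23."""
    parts = col.str.extract(r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)')
    parts = parts.astype('float')  # rows without a time stay NaN
    return parts['hour'] * 60 + parts['minute']

st = pd.read_csv(io.StringIO(csv_text))

# sort_values takes the place of the older DataFrame.sort call
reindexed = (st.sort_values(['trip_id', 'stop_sequence'], ascending=[False, True])
               .reset_index(drop=True))

for col in ('arrival_time', 'departure_time'):
    reindexed[col] = to_minutes(reindexed[col]).interpolate()

print(reindexed)
```

As in the code above, the interpolation runs down the whole sorted column, so a gap at the start or end of a trip would be filled using values from the neighbouring trip.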