Trouble optimizing python interpolation script

Question

I am interpolating arrival times for some public transportation data I have. I have a working script, but it seems to be running in quadratic time. Here is the script:

import pandas as pd

#read the txt file
st = pd.read_csv('interpolated_test.csv')

# sort first by trip_id, then by stop_sequence
sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

# reset the index values in prep. for iteration
reindexed = sorted_st.reset_index(drop=True)

# for each row in 'arrival_time' that has a value of hh:mm:ss
for i in reindexed['arrival_time']:
# for i in range(len(reindexed['arrival_time'])):
    if pd.isnull(i) == False:
        # splice hh:mm:ss
        hour = int(i[:2])
        minute = int(i[3:5])
        # assign hh:mm:ss to numeric value
        minute_value = (hour * 60) + minute

        # replace current string with int value
        # takes ~655s to execute on Macbook Pro w/ entire stop_times.txt
        # runs in quadratic time
        reindexed = reindexed.replace(i,minute_value)

# interpolate and write out
new = reindexed.apply(pd.Series.interpolate)
print(new)

Here is a link to the csv: https://gist.github.com/adampitchie/0192933ed0eba122ba7e

I shortened the csv so you can run the file without waiting for it to finish.

This should be low-hanging fruit for anybody familiar with pandas, but I'm stuck and any help would be appreciated.

[UPDATE] So I tried running the same code with the FULL CSV FILE, and I get this error:

Traceback (most recent call last):
  File "/Users/tester/Desktop/ETL/interpolate.py", line 49, in <module>
    reindexed[col].dt.hour * 60
  File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
  File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 2513, in dt
    raise TypeError("Can only use .dt accessor with datetimelike values")
TypeError: Can only use .dt accessor with datetimelike values

It looks like pd.to_datetime(reindexed[col]) is not working. Here is the code, for the sake of completeness:

import pandas as pd

st = pd.read_csv('csv/stop_times.csv')

sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

reindexed = sorted_st.reset_index(drop=True)

for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:, :3])

Solution

Whenever you can, try to phrase computations as operations on whole columns rather than row by row or item by item. Instead of handling each value in reindexed['arrival_time'] one at a time, you can convert the whole column into datetime64s using pd.to_datetime. A Series of datetime64s has a dt attribute which allows you to access the hours and minutes as integers. So you can express the calculation for the whole column like this:

for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:5, :3])

yields

    trip_id  arrival_time  departure_time
0   1423492    647.000000      647.000000
1   1423492    649.666667      649.666667
2   1423492    652.333333      652.333333
3   1423492    655.000000      655.000000
4   1423492    655.750000      655.750000
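
For anyone who wants to try this without the full file, here is a minimal self-contained sketch of the same column-wise conversion. Only the column names come from the question; the four-row frame and its values are made up for illustration, and the explicit format='%H:%M:%S' is an assumption about how the times are written.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of stop_times; values are illustrative.
reindexed = pd.DataFrame({
    'trip_id': [1, 1, 1, 1],
    'arrival_time': ['10:47:00', np.nan, np.nan, '10:55:00'],
    'departure_time': ['10:47:00', np.nan, np.nan, '10:55:00'],
})

for col in ('arrival_time', 'departure_time'):
    # Parse the whole column at once; missing entries become NaT.
    parsed = pd.to_datetime(reindexed[col], format='%H:%M:%S')
    # Minutes since midnight; NaT rows come out as NaN.
    minutes = parsed.dt.hour * 60 + parsed.dt.minute
    # Fill the gaps linearly between the known values.
    reindexed[col] = minutes.interpolate()

print(reindexed)
```

There is no Python-level loop over rows here: each step works on the entire column, which is what avoids rescanning the whole frame once per value as the replace-in-a-loop version does.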


Debugging TypeError: Can only use .dt accessor with datetimelike values:

Indeed, as you pointed out, pd.to_datetime is not converting the times to datetime64s. Instead, it is just returning the same data as strings. In this version of pandas, pd.to_datetime by default returns the input unchanged when it encounters an error trying to convert the input to datetimes. You can gather a bit more information about what is going wrong by adding the errors='raise' parameter:

pd.to_datetime(reindexed['arrival_time'], errors='raise')

raises

ValueError: hour must be in 0..23

So aha -- the time format probably has times whose hours exceed 23.
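
As a cross-check that does not need a regex, the offending rows can also be located by letting the parser mark its failures. This is only a sketch: the three sample values are hypothetical, and it assumes a pandas recent enough to accept errors='coerce' and that the times are written as HH:MM:SS.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one valid time, one missing value, one time past midnight.
x = pd.Series(['23:50:00', np.nan, '26:09:00'], name='arrival_time')

# errors='coerce' turns anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(x, format='%H:%M:%S', errors='coerce')

# Rows that had a value but failed to parse are the suspicious ones.
bad = parsed.isna() & x.notna()
print(x[bad])
```

Comparing against x.notna() matters because genuinely missing times also come back as NaT and should not be reported as malformed.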

Using

col = 'arrival_time'
x = reindexed[col]
# astype('float') rather than 'int', so rows with missing times (NaN) do not raise
mask = x.str.extract(r'(\d+):(\d+):(\d+)')[0].astype('float') > 23

we can see examples of rows where the hours is greater than 23:

In [48]: x[mask].head()
Out[48]: 
42605    26:09:00
42610    26:12:00
42611    26:20:00
42612    26:30:00
42613    26:35:00
Name: arrival_time, dtype: object

x.str.extract splits the arrival time strings using the regex pattern r'(\d+):(\d+):(\d+)'. It returns a DataFrame with three columns, one per capture group.
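
To see what that DataFrame looks like, here is a small sketch on made-up values (the sample Series is hypothetical). Passing expand=True is an assumption about wanting a DataFrame regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including a missing entry.
x = pd.Series(['10:47:00', np.nan, '26:09:00'], name='arrival_time')

# One output column per capture group; expand=True asks for a DataFrame.
parts = x.str.extract(r'(\d+):(\d+):(\d+)', expand=True)
print(parts)

# Column 0 holds the hour strings; rows with no match stay NaN.
hours = parts[0].astype('float')
print(hours > 23)
```

The unnamed groups become columns 0, 1 and 2, which is why the mask above indexes with [0] to get the hours.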

This piece of debugging code suggests a workaround. Instead of pd.to_datetime, we could use x.str.extract to find the hours and minutes:

import pandas as pd

st = pd.read_csv('csv/stop_times.csv')

# DataFrame.sort has been removed from current pandas; sort_values is the replacement
sorted_st = st.sort_values(['trip_id','stop_sequence'], ascending=[False,True])

reindexed = sorted_st.reset_index(drop=True)

for col in ('arrival_time', 'departure_time'):
    df = reindexed[col].str.extract(
        r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)').astype('float')
    reindexed[col] = df['hour'] * 60 + df['minute']
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:5, :3])

yields

   trip_id  arrival_time  departure_time
0  1423492    647.000000      647.000000
1  1423492    649.666667      649.666667
2  1423492    652.333333      652.333333
3  1423492    655.000000      655.000000
4  1423492    655.750000      655.750000
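
A different route, not part of the answer above, is to treat the strings as durations rather than clock times. The sketch below swaps in pd.to_timedelta, which reads HH:MM:SS as elapsed time and so is not limited to hours 0..23. The sample frame is hypothetical, and flooring to whole minutes is an assumption made to match the hour * 60 + minute formula.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a gap and a time past midnight.
reindexed = pd.DataFrame({
    'arrival_time': ['23:50:00', np.nan, '26:09:00'],
})

# to_timedelta reads 'HH:MM:SS' as a duration, so 26:09:00 is accepted.
delta = pd.to_timedelta(reindexed['arrival_time'])
# Whole minutes, dropping seconds as in the hour * 60 + minute formula.
minutes = delta.dt.total_seconds() // 60
reindexed['arrival_time'] = minutes.interpolate()
print(reindexed)
```

Whether this is preferable to str.extract depends on the data: it avoids the regex, but it will raise on strings that are not valid durations.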
