Trouble optimizing python interpolation script

Views: 128  Published: 2017/2/24 21:36:38  Tags: python csv pandas


Problem description

I am interpolating arrival times for some public transportation data I have. I have a working script, but it seems to be running in quadratic time. Here is the script:
import pandas as pd

#read the txt file
st = pd.read_csv('interpolated_test.csv')

# sort first by trip_id, then by stop_sequence
sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

# reset the index values in prep. for iteration
reindexed = sorted_st.reset_index(drop=True)

# for each row in 'arrival_time' that has a value of hh:mm:ss
for i in reindexed['arrival_time']:
# for i in range(len(reindexed['arrival_time'])):
    if pd.isnull(i) == False:
        # splice hh:mm:ss
        hour = int(i[:2])
        minute = int(i[3:5])
        # assign hh:mm:ss to numeric value
        minute_value = (hour * 60) + minute

        # replace current string with int value
        # takes ~655s to execute on Macbook Pro w/ entire stop_times.txt
        # runs in quadratic time
        reindexed = reindexed.replace(i,minute_value)

# interpolate and write out
new = reindexed.apply(pd.Series.interpolate)
print(new)
Here is a link to the csv: https://gist.github.com/adampitchie/0192933ed0eba122ba7e

I shortened the csv so you can run the file without waiting for it to finish.

This should be low-hanging fruit for anybody familiar with pandas, but I'm stuck and any help would be appreciated.

[UPDATE]
So I tried running the same code with the FULL CSV FILE, and I get this error:
Traceback (most recent call last):
  File "/Users/tester/Desktop/ETL/interpolate.py", line 49, in <module>
    reindexed[col].dt.hour * 60
  File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
  File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 2513, in dt
    raise TypeError("Can only use .dt accessor with datetimelike values")
TypeError: Can only use .dt accessor with datetimelike values
It looks like pd.to_datetime(reindexed[col]) is not working.
Here is the code, for the sake of completeness:
import pandas as pd

st = pd.read_csv('csv/stop_times.csv')

sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

reindexed = sorted_st.reset_index(drop=True)

for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:, :3])

Solution
Whenever you can, try to phrase computations as operations on whole columns rather than rows, or item-by-item. Instead of handling each value in reindexed['arrival_time'] one at a time, you can convert the whole column into datetime64s using pd.to_datetime. A Series of datetime64s has a dt attribute which allows you to access the hour and minutes as integers. So you can express the calculation for the whole column like this:
for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:5, :3])
yields
    trip_id  arrival_time  departure_time
0   1423492    647.000000      647.000000
1   1423492    649.666667      649.666667
2   1423492    652.333333      652.333333
3   1423492    655.000000      655.000000
4   1423492    655.750000      655.750000
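The column-wise approach can be tried without the CSV. The following is a minimal sketch on a hypothetical four-row sample (the data is made up for illustration and is not taken from the linked gist); it assumes every hour is at most 23, so pd.to_datetime can parse the strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two known arrival times with two missing stops between them.
sample = pd.DataFrame({
    "trip_id": [1423492] * 4,
    "arrival_time": ["10:47:00", np.nan, np.nan, "10:55:00"],
})

# Parse the whole column in one call; missing entries become NaT.
parsed = pd.to_datetime(sample["arrival_time"], format="%H:%M:%S")

# Minutes since midnight, computed for the entire column (NaT rows become NaN).
minutes = parsed.dt.hour * 60 + parsed.dt.minute

# Linear interpolation fills the gaps between the known values.
sample["arrival_time"] = minutes.interpolate()
print(sample)
```

Each step touches the column once, whereas calling reindexed.replace inside a loop rescans the whole DataFrame for every value, which is consistent with the quadratic running time described in the question.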




Debugging TypeError: Can only use .dt accessor with datetimelike values:

Indeed, as you pointed out, pd.to_datetime is not converting the times to datetime64s. Instead, it is just returning the same data as strings. In the pandas version used here, pd.to_datetime returns the input unchanged when it encounters an error trying to convert the input to datetimes. (Newer pandas releases default to errors='raise', so the same call would fail loudly instead.) You can gather a bit more information about what is going wrong by adding the errors='raise' parameter:
pd.to_datetime(reindexed['arrival_time'], errors='raise')
raises
ValueError: hour must be in 0..23
So aha -- the time format probably has times whose hours exceed 23.
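Another way to locate the offending values, sketched here as an assumption rather than as part of the original answer, is to parse with errors='coerce' and look at which non-missing inputs turned into NaT. The sample Series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one valid time, one missing value, one time past midnight.
times = pd.Series(["10:47:00", np.nan, "26:09:00"])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")

# Rows that had a value but failed to parse are the problematic ones.
bad = times[parsed.isna() & times.notna()]
print(bad)
```

Only the 26:09:00 row is selected, because %H accepts hours 00 through 23 and the genuinely missing row is excluded by the notna() check.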

Using
col = 'arrival_time'
x = reindexed[col]
# cast to float rather than int so that rows with a missing time (NaN) do not raise
mask = x.str.extract(r'(\d+):(\d+):(\d+)')[0].astype('float') > 23
we can see examples of rows where the hour is greater than 23:
In [48]: x[mask].head()
Out[48]: 
42605    26:09:00
42610    26:12:00
42611    26:20:00
42612    26:30:00
42613    26:35:00
Name: arrival_time, dtype: object
The x.str.extract call splits the arrival time strings using the regex pattern
r'(\d+):(\d+):(\d+)'. It returns a DataFrame with three columns, one per capture group.
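To see what str.extract produces, here is a small sketch on a hypothetical Series (not the real stop_times data). It assumes a pandas version in which str.extract expands multiple groups into a DataFrame by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value and one hour above 23.
x = pd.Series(["10:47:00", np.nan, "26:09:00"], name="arrival_time")

# One column per capture group: 0 = hours, 1 = minutes, 2 = seconds.
# Rows that are missing or do not match the pattern come back as NaN.
parts = x.str.extract(r"(\d+):(\d+):(\d+)")
print(parts)

# The extracted pieces are strings, so convert before comparing.
hours = parts[0].astype("float")
mask = hours > 23
print(x[mask])
```

Comparing NaN with 23 evaluates to False, so missing rows simply drop out of the mask.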

This piece of debugging code suggests a workaround. Instead of pd.to_datetime, 
we could use x.str.extract to find the hours and minutes:
import pandas as pd

st = pd.read_csv('csv/stop_times.csv')

# DataFrame.sort was removed in later pandas releases; sort_values is its replacement
sorted_st = st.sort_values(['trip_id','stop_sequence'], ascending=[False,True])

reindexed = sorted_st.reset_index(drop=True)

for col in ('arrival_time', 'departure_time'):
    df = reindexed[col].str.extract(
        r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)').astype('float')
    reindexed[col] = df['hour'] * 60 + df['minute']
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:5, :3])
yields
   trip_id  arrival_time  departure_time
0  1423492    647.000000      647.000000
1  1423492    649.666667      649.666667
2  1423492    652.333333      652.333333
3  1423492    655.000000      655.000000
4  1423492    655.750000      655.750000
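As a possible alternative that the original answer does not use, pd.to_timedelta treats hh:mm:ss strings as durations, so hours above 23 should parse without a regex. The sketch below uses a hypothetical Series and, like the code above, drops the seconds.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that crosses midnight, with one missing stop time.
times = pd.Series(["23:55:00", np.nan, "24:05:00", "26:09:00"])

# Durations are not limited to 0..23 hours; missing entries become NaT.
delta = pd.to_timedelta(times)

# Whole minutes since midnight of the service day (NaT rows become NaN).
minutes = delta.dt.total_seconds() // 60

# Fill the gap linearly, as in the original script.
filled = minutes.interpolate()
print(filled.tolist())  # [1435.0, 1440.0, 1445.0, 1569.0]
```

Whether this is preferable to str.extract depends on how cleanly the real file is formatted; malformed strings would make pd.to_timedelta raise unless errors='coerce' is passed.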


                        