根据参数或数据框行下方列中的差异对df进行细分或分组? [英] segmenting or grouping a df based on parameters or differences within columns going down the dataframe rows?

查看:88
本文介绍了根据参数或数据框行下方列中的差异对df进行细分或分组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图弄清楚是否有一种方法,我有一个带有多个字段的数据框,并且我想根据特定列的值是否在x的数量之内,将该数据框分段或分组为一个新的数据框。

I was trying to figure out if there was a way in which where I had a dataframe with multiple fields and I wanted to segment or group the dataframe into a new dataframe based on if the values of specific columns were within x amount of each other?

   I.D  |      Created_Time            | Home_Longitude | Home_Latitude | Work_Longitude | Home_Latitude
  Faa1      2019-02-23 20:01:13.362           -77.0364            38.8951    -72.0364      38.8951

以上是原始df具有多行的外观。
我想创建一个新的数据框,其中所有行或ID包含的创建时间都在x分钟之间,并且在另一个房屋x英里内使用Haversine,在另一个作品内x英里内使用Haversine 。

Above is how the original df looks with multiple rows. I want to create a new dataframe where all rows or I.Ds contain created times that are within x amount of minutes of each other, and using haversine within x miles of one another homes, and x miles within one another work.

因此,基本上试图将数据框过滤到一个df中,该df仅包含在创建时间的x分钟内,在另一个房屋内x英里,在x内x英里内的行每个工作列的值。

So Basically trying to filter this dataframe into a df that only contains rows that are within x minutes of order created time, x miles within one another homes and , x miles within each work column value.

推荐答案

我是通过


  1. 计算相对于第一行的距离(以英里为单位)和时间


    • 我的逻辑


      • 如果n行位于第一行的 x分钟/英里之内,则那n行位于之内 x彼此的分钟/英里

  1. calculating the distances (in miles) and time relative to the first row
    • My logic
      • if n rows are within x minutes/miles of the first row, then those n rows are within x minutes/miles of each other

生成一些虚拟数据

  • random co-ordinates
# Generate random Lat-Long points
def newpoint():
   return uniform(-180,180), uniform(-90, 90)
home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))

df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)

# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))

# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude
0   0 2012-10-01 00:00:00      -48.885981     -39.412351      -68.756244      24.739860
1   1 2012-10-01 00:05:00       58.584893      59.851739     -119.978429     -87.687858
2   2 2012-10-01 00:10:00      -18.623484      85.435248      -14.204142      -3.693993
3   3 2012-10-01 00:15:00      -29.721788      71.671103      -69.833253     -12.446204
4   4 2012-10-01 00:20:00      168.257968     -13.247833       60.979050     -18.393925

使用Haversine距离公式(向量化的Haversine距离公式,以公里为单位

Create Python helper function with haversine distance formula (vectorized haversine distance formula, in km)

def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    slightly modified version: of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

使用Haversine公式计算距离(相对于第一行),然后将km转换为英里

Calculate distances (relative to first row) in km, using haversine formula. Then, convert km to miles

df['Home_dist_miles'] = \
    haversine(df.Home_Longitude, df.Home_Latitude,
                 df.loc[0, 'Home_Longitude'], df.loc[0, 'Home_Latitude'])*0.621371
df['Work_dist_miles'] = \
    haversine(df.Work_Longitude, df.Work_Latitude,
                 df.loc[0, 'Work_Longitude'], df.loc[0, 'Work_Latitude'])*0.621371

计算时差,以分钟为单位(相对于第一行)


  • 这里的虚拟数据,时间差将是5分钟的倍数(但在实际数据中,它们可能是任意值)

df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60

应用过滤器(方法1),然后选择满足OP中所述条件的任意两行

Apply filters (method 1) and then select any 2 rows that satisfy the conditions stated in the OP

home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0

应用过滤器(方法2),然后选择任何满足以下条件的行在OP中

multi_query = """Home_dist_miles<=12000 & \
                Work_dist_miles<=8000 & \
                time_min<=25"""
df_filtered = df.query(multi_query)

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0

这篇关于根据参数或数据框行下方列中的差异对df进行细分或分组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆