Labelled datatypes in Python

Question

I am computing geodesic distances between a point and multiple line segments. Each line segment has a unique identifying number. I want to return distances from my distances function such that they are both intrinsically tied together. I would also like to maintain functionality, as in sort the distances, and index them with either the label or position, and get back both the distance data and the label. Something like a Pandas Series with an index, but I cannot use a series because the data is returned into a Pandas DataFrame, which then expands the series and makes a mess. Here is an example:

In [1]: '''Note that all this happens inside an apply function of a Pandas Series'''
        labels = [25622, 25621, 25620, 25619, 25618]
        dist = vect_dist_funct(pt, labels) #vect_dist_funct does the computations, and returns distances in meters
        dist
Out[1]: array([296780.2217658355, 296572.4476883276, 296364.21166884096,
               296156.4366241771, 295948.6610171968], dtype=object)

What I want however, is something like this dict, where the labels and distances are inherently tied to each other:

{25622 : 296780.2217658355,
 25621 : 296572.4476883276,
 25620 : 296364.21166884096,
 25619 : 296156.4366241771,
 25618 : 295948.6610171968}

But now I have lost functionality of the values. I cannot easily sort them, or compare them, or anything. I looked at Numpy Structured Arrays, and they seem workable, but if I am not able to sort the distances, and get the index of the closest segment, it will not be of much use to me. Is there any other datatype that I can use?
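As a rough illustration of the structured-array option mentioned above, the sketch below keeps each label and distance in the same record, so sorting or indexing never separates them. The field names `label` and `dist` are assumptions made for this example, and the values are the ones shown in the output above.

```python
import numpy as np

labels = np.array([25622, 25621, 25620, 25619, 25618])
dist = np.array([296780.2217658355, 296572.4476883276, 296364.21166884096,
                 296156.4366241771, 295948.6610171968])

# One record per segment: the label and its distance live in the same row.
pairs = np.empty(len(labels), dtype=[("label", "i8"), ("dist", "f8")])
pairs["label"] = labels
pairs["dist"] = dist

# Sort whole records by distance; the labels move with their distances.
by_dist = np.sort(pairs, order="dist")
closest_label = int(by_dist["label"][0])

# Positional lookup of the closest segment in the unsorted array.
closest_pos = int(np.argmin(pairs["dist"]))

# Label-based lookup through a boolean mask.
dist_for_label = float(pairs["dist"][pairs["label"] == 25620][0])
```

Label lookup through a mask is a linear scan, which should be acceptable for the handful of candidates an R-tree query returns per point.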

Long Story and Background

I am trying to do a spatial join. I get the indexes of the segments a point is most likely closer to by searching in a RTree (example). Those are the indexes in labels. Then I look through the line geometries table to find the line geometry for those selected labels, and compute the distances of the points to each of the line segment.

The next steps involve sanity checking the spatial join. The nearest segment is not the best join candidate in some cases, and the join needs to be evaluated on other parameters. Therefore, my plan is to work from the closest segment outward. That would involve sorting the distances, getting the index of the closest segment, then looking up that index in the segment table and extracting other properties of the line for inspection. If a match can be confirmed, the segment is accepted; otherwise it is rejected and the algorithm moves to the next closest segment.

A data type that does all this is what I am looking for, without breaking the link between each distance and the segment from which it was computed.
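The closest-first acceptance loop described above might be sketched as follows. The helper name `first_accepted` and the `is_match` predicate are hypothetical stand-ins for the attribute checks, which are not shown in the question.

```python
def first_accepted(labels, dist, is_match):
    """Return (label, distance) of the nearest candidate that passes is_match.

    is_match is a hypothetical predicate standing in for the attribute
    checks described above. Returns None when every candidate is rejected.
    """
    # Pair each label with its distance first, then sort the pairs,
    # so the two can never get out of alignment.
    for label, d in sorted(zip(labels, dist), key=lambda pair: pair[1]):
        if is_match(label):
            return label, d
    return None


labels = [25622, 25621, 25620, 25619, 25618]
dist = [296780.2217658355, 296572.4476883276, 296364.21166884096,
        296156.4366241771, 295948.6610171968]

# Toy check: pretend the nearest segment (25618) fails the sanity check.
rejected = {25618}
result = first_accepted(labels, dist, lambda label: label not in rejected)
```

Here `result` is the second-nearest segment, because the nearest one was rejected by the toy predicate.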

Problem with using Pandas

So this is how the function is actually being called:

joined = points['geometry'].apply(pointer, centroid=line['centroid'], tree_idx=tree_idx)

Then inside pointer, this happens:

import numpy as np
import pandas as pd

def pointer(point, centroid, tree_idx):
    # IDs of the lines whose bounding boxes intersect the point
    intersect = list(tree_idx.intersection(point.bounds))
    if len(intersect) > 0:
        points = pd.Series([point.coords[0]]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        # Index the distances by line ID so each distance keeps its label
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')

Then, joined looks like this: a wide DataFrame with one row per point and one column per line ID, mostly NaN.

This is because distances between all points (the rows are points) and all lines (the columns are lines) are not computed. That would be cost prohibitive (4M points and 180k lines per state, with 50 states in the whole dataset). Also, the DataFrame merge operation that produces joined increases the run time 7 times compared to returning two Numpy arrays. The problem with returning two Numpy arrays is that it is not easy to keep the distances and the line IDs aligned all the time.
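One way to avoid the wide, mostly empty frame is to have the applied function return a plain Python object instead of a Series, since `Series.apply` only expands the result into a DataFrame when the function returns a Series. The sketch below is a toy under stated assumptions: `nearest_pairs`, its labels, and its one-dimensional distance formula are made up for illustration and do not reproduce `pointer` or `vect_dist_funct`.

```python
import pandas as pd

def nearest_pairs(x):
    # Hypothetical stand-in for pointer: three candidate labels with
    # made-up one-dimensional "distances" from the value x.
    labels = [10, 11, 12]
    dist = [abs(x - 3.0), abs(x - 1.0), abs(x - 2.0)]
    # A sorted list of (label, distance) tuples keeps the two tied together.
    return sorted(zip(labels, dist), key=lambda pair: pair[1])

s = pd.Series([0.0, 5.0], index=["a", "b"])

# Each cell holds one list of tuples, so the result stays a Series
# with one entry per point instead of one column per label.
out = s.apply(nearest_pairs)
closest_for_a = out["a"][0]
```

The cost is that the cells are Python objects, so vectorized operations across points are no longer available without unpacking them first.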

Examples of points, lines, tree_idx

Note that this dataset is truncated in both columns and rows. I am only including the relevant columns, not the rest of the data:

points:

                        geometry
id      
88400001394219  0.00    POINT (-105.2363291 39.6988139)
                0.25    POINT (-105.2372017334178 39.69899060448157)
                0.50    POINT (-105.2380177896182 39.69933953105642)
                0.75    POINT (-105.2387202141595 39.69988447162143)
                1.00    POINT (-105.2393222 39.7005405)
88400002400701  0.00    POINT (-104.7102833 39.8318348)
                0.25    POINT (-104.7102827 39.831966625)
                0.50    POINT (-104.7102821 39.83209845)
                0.75    POINT (-104.7102815 39.832230275)
                1.00    POINT (-104.7102809 39.8323621)

So this is basically interpolated points on lines. The line id is the first level of index, and the second level is the percent where the point was interpolated. This forms the first dataset, the dataset to which I want to bring some attributes from the second dataset.

lines:

        geometry                                            centroid
id      
71345   POLYGON ((-103.2077992965318 40.58026765162965...   (-103.20073265160862, 40.576450381964975)
71346   POLYGON ((-103.2069505830457 40.58155121711739...   (-103.19987394433825, 40.57774903464972)
71347   POLYGON ((-103.2061017677045 40.58283487609803...   (-103.19901204453959, 40.57905245493993)
71348   POLYGON ((-103.2052000154291 40.58419853220472...   (-103.19815200508097, 40.58035300329024)
71349   POLYGON ((-103.2043512639656 40.58548197865339...   (-103.19729445792181, 40.58164972491414)
71350   POLYGON ((-103.2035025651746 40.5867652936463,...   (-103.1964362470977, 40.5829473948391)
71351   POLYGON ((-103.2026535431035 40.58804903349249...   (-103.19557847342394, 40.58424434094705)
71352   POLYGON ((-103.201804801526 40.58933229190573,...   (-103.19472966696722, 40.58552767098465)
71353   POLYGON ((-103.2009557884142 40.59061590473365...   (-103.19388484652855, 40.58680427447224)
71354   POLYGON ((-103.2001001699726 40.59190793446012...   (-103.19303392095904, 40.5880882237994)

This is part of the second dataset (the labels mentioned at the beginning of this question are the index of this dataset). The goal is to transfer attributes from this dataset to the points dataset in an intelligent manner. The first step is to find the nearest line to each of the points. Then I will compare some attributes from the points dataset with the lines dataset, and confirm or reject a join, as I mentioned.

tree_idx:

tree_idx is created using the following code:

import rtree
lines_bounds = lines['geometry'].apply(lambda x: x.bounds)
tree_idx = rtree.index.Index()
for i in lines_bounds.index:
    tree_idx.insert(i, lines_bounds.loc[i])

Recommended Answer

So I think your overall problem is that you are creating a DataFrame whose column labels are the intersect values (the line IDs). I think what you want is a DataFrame where one column contains the intersect values and another contains the distances. I will try to give you code that I think will help, but it is hard to be certain without your original data, so you may need to modify it somewhat to get it to work.

First, I would modify vect_dist_funct so if the first argument is a scalar, it creates the correct-length list, and if the second is empty it returns NaN.
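Since `vect_dist_funct` itself is not shown, the change can only be sketched as a wrapper around it. In the sketch below, `dist_or_nan` is a hypothetical name, and `toy_dist` is a plain Euclidean stand-in used only so the example runs; it is not the geodesic computation from the question.

```python
import numpy as np

def dist_or_nan(point, polygons, dist_func):
    """Broadcast a single point against all polygons; NaN when there are none.

    dist_func plays the role of vect_dist_funct and is assumed to take two
    equal-length sequences of coordinates.
    """
    polygons = list(polygons)
    if len(polygons) == 0:
        return np.array([np.nan])
    # Repeat the scalar point so both arguments have the same length.
    points = [point] * len(polygons)
    return np.asarray(dist_func(points, polygons), dtype=float)


def toy_dist(points, polygons):
    # Euclidean stand-in for the real distance function (illustration only).
    return [((px - qx) ** 2 + (py - qy) ** 2) ** 0.5
            for (px, py), (qx, qy) in zip(points, polygons)]


some = dist_or_nan((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)], toy_dist)
none = dist_or_nan((0.0, 0.0), [], toy_dist)
```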

Next I would add all the useful values as columns to the DataFrame:

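# tree_idx.intersection returns a generator, so materialize it before building the array
points['intersect'] = points['geometry'].apply(lambda x: np.array(list(tree_idx.intersection(x.bounds))))
# centroid is the centroid column of the lines table, as passed to pointer in the question
points['polygons'] = points['intersect'].apply(lambda x: centroid.loc[x].values)
points['coords0'] = points['geometry'].apply(lambda x: x.coords[0])
points['dist'] = points.apply(lambda x: vect_dist_funct(x.coords0, x.polygons), axis=1)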

This will give you a column with all the distances in it. If you really want the intersect values to be accessible, you can then create a DataFrame with just the intersect values and distances, and then add the intersect values as another MultiIndex level to avoid too many NaN values:

pairs = points.apply(lambda x: pd.DataFrame([x['intersect'], x['dist']], index=['intersect', 'dist']).T.stack(), axis=1)
pairs = pairs.stack(level=0).set_index('intersect', append=True)
pairs.index = pairs.index.droplevel(level=2)

This should give you a Series where the first index is the id, the second is the percent, the third is the intersect, and the value is the distance.
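To make the target shape concrete, here is a small self-contained sketch that builds the same kind of long Series from toy data, using a plain comprehension rather than the stack-based code above. The frame `toy`, its IDs, and its distance values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the points frame after the 'intersect' and 'dist'
# columns have been added; all values are made up.
toy = pd.DataFrame(
    {"intersect": [[71345, 71346], [71347]],
     "dist": [[12.5, 7.0], [3.2]]},
    index=pd.MultiIndex.from_tuples([(884, 0.0), (884, 0.25)],
                                    names=["id", "pct"]),
)

# One record per (point, candidate line) pair.
records = [
    (pid, pct, label, d)
    for (pid, pct), row in toy.iterrows()
    for label, d in zip(row["intersect"], row["dist"])
]
long = (
    pd.DataFrame(records, columns=["id", "pct", "intersect", "dist"])
    .set_index(["id", "pct", "intersect"])["dist"]
)

# Nearest candidate per point: idxmin returns the full index tuple,
# whose last element is the line ID.
nearest = long.groupby(level=["id", "pct"]).idxmin()
nearest_line = nearest.loc[(884, 0.0)][-1]
```

In this long format there are no NaN placeholders, and sorting with `long.sort_values()` keeps every distance attached to its point and line ID.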
