Is there a faster way to update dataframe column values based on conditions?

Views: 80 | Posted: 2020/5/24 3:48:29 | Tags: python, pandas, dataframe, data-processing

Question

I am trying to process a dataframe. This includes creating new columns and updating their values based on the values in other columns. More concretely, I have a predefined "source" column that I want to classify. Each source can fall under three different categories: 'source_dtp', 'source_dtot', and 'source_cash'. I want to add three new columns to the dataframe, each containing either 1 or 0, based on the original "source" column.

I am currently able to do this, but it is really slow.

Sample of the original column:

                        source
_id
AV4MdG6Ihowv-SKBN_nB    DTP
AV4Mc2vNhowv-SKBN_Rn    Cash 1
AV4MeisikOpWpLdepWy6    DTP
AV4MeRh6howv-SKBOBOn    Cash 1
AV4Mezwchowv-SKBOB_S    DTOT
AV4MeB7yhowv-SKBOA5b    DTP

Desired output:

                        source_dtp  source_dtot  source_cash
_id
AV4MdG6Ihowv-SKBN_nB    1.0         0.0          0.0
AV4Mc2vNhowv-SKBN_Rn    0.0         0.0          1.0
AV4MeisikOpWpLdepWy6    1.0         0.0          0.0
AV4MeRh6howv-SKBOBOn    0.0         0.0          1.0
AV4Mezwchowv-SKBOB_S    0.0         1.0          0.0
AV4MeB7yhowv-SKBOA5b    1.0         0.0          0.0

This is my current approach, but it's very slow. I would much prefer a vectorized way of doing this, but I don't know how, as the condition is quite elaborate.

import numpy as np  # `data` is assumed to be an existing pandas DataFrame

# For 'source' we will use the following classes:
source_cats = ['source_dtp', 'source_dtot', 'source_cash']
# [0, 0, 0] would imply 'other', hence no need for a fourth category

# add new features to dataframe, initializing to nan
for cat in source_cats:
    data[cat] = np.nan

for row in data.itertuples():
    # create series to hold the result per row e.g. [1, 0, 0] for `cash`
    cat = [0, 0, 0]
    index = row[0]
    # to string as some entries are numerical
    source_type = str(data.loc[index, 'source']).lower()
    if 'dtp' in source_type:
        cat[0] = 1
    if 'dtot' in source_type:
        cat[1] = 1
    if 'cash' in source_type:
        cat[2] = 1
    data.loc[index, source_cats] = cat

I am using itertuples() as it proved faster than iterrows().

Is there a faster way of achieving the same functionality as above?
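One possible vectorized alternative to the row loop, offered as a sketch rather than a confirmed answer: normalize the "source" column once, then run one substring test per category with `Series.str.contains`. The sample frame below is hypothetical and only mirrors the rows shown in the question; the substring checks are assumed to match the `'dtp' in source_type` tests in the loop.

```python
import pandas as pd

# Hypothetical sample mirroring the "source" column from the question
data = pd.DataFrame(
    {"source": ["DTP", "Cash 1", "DTP", "Cash 1", "DTOT", "DTP"]},
    index=pd.Index(
        [
            "AV4MdG6Ihowv-SKBN_nB",
            "AV4Mc2vNhowv-SKBN_Rn",
            "AV4MeisikOpWpLdepWy6",
            "AV4MeRh6howv-SKBOBOn",
            "AV4Mezwchowv-SKBOB_S",
            "AV4MeB7yhowv-SKBOA5b",
        ],
        name="_id",
    ),
)

# Normalize once: cast to str (some entries may be numeric) and lowercase
source = data["source"].astype(str).str.lower()

# One vectorized, literal (non-regex) substring test per category
for keyword in ["dtp", "dtot", "cash"]:
    data[f"source_{keyword}"] = (
        source.str.contains(keyword, regex=False).astype(float)
    )

print(data)
```

Each `str.contains` call processes the whole column at once, so the per-row `data.loc[...]` reads and writes of the loop disappear. A row matching none of the keywords ends up as all zeros, which corresponds to the implicit 'other' class.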

This is not just about creating a one-hot encoding. It boils down to updating column values depending on the value of another column. For example, if I have a certain location_id, I want to update its respective longitude and latitude columns based on that original id (without iterating the way I do above, because that is really slow for large datasets).
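For the location_id case, one way this could be handled without iterating is a lookup table indexed by the id, combined with `Series.map`. The sketch below rests on assumptions: the `coords` table, its coordinate values, and the `df` frame are all hypothetical, since the question does not show where the coordinates come from.

```python
import pandas as pd

# Hypothetical lookup table: one row per location_id (values are made up)
coords = pd.DataFrame(
    {"latitude": [52.37, 48.86], "longitude": [4.90, 2.35]},
    index=pd.Index([101, 102], name="location_id"),
)

# Hypothetical data; 999 has no entry in the lookup table
df = pd.DataFrame({"location_id": [101, 102, 101, 999]})

# Mapping with a Series looks each id up in that Series' index;
# ids missing from the lookup table become NaN
df["latitude"] = df["location_id"].map(coords["latitude"])
df["longitude"] = df["location_id"].map(coords["longitude"])

print(df)
```

When many columns need to be filled from the same table, `df.merge(coords, on="location_id", how="left")` is a comparable option that attaches them in one step.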
