Use Apache Beam `GroupByKey` and construct a new column - Python


Problem description


From this question: How to group data and construct a new column - python pandas?, I know how to group by multiple columns and construct a new unique ID using pandas. But if I want to use Apache Beam in Python to achieve the same thing described in that question, how can I do it and then write the new data to a newline-delimited JSON file (each line is one unique_id with an array of the objects belonging to that unique_id)?

Assume the dataset is stored in a CSV file.
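For reference, the pandas technique from the linked question can be sketched as follows. The column names (`postcode`, `house_number`, `occupant`) and the inline sample data are assumptions for illustration, not taken from the question:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the CSV file.
csv_data = io.StringIO(
    "postcode,house_number,occupant\n"
    "AB1,1,alice\n"
    "AB1,1,bob\n"
    "CD2,7,carol\n"
)
df = pd.read_csv(csv_data)

# ngroup() numbers each (postcode, house_number) group 0, 1, 2, ...
df.insert(0, "unique_id",
          df.groupby(["postcode", "house_number"], sort=False).ngroup())
```

Here the two "AB1, 1" rows get unique_id 0 and the "CD2, 7" row gets unique_id 1. It is exactly this `ngroup()` call that the Beam DataFrame API does not support.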

I'm new to Apache Beam; here's what I have now:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

# `cols` is the list of column names for the CSV (not shown in the question).
with beam.Pipeline() as p:
    df = p | read_csv("example.csv", names=cols)
    # Try to number each (postcode, house_number) group, as in pandas:
    df.insert(0, 'unique_id',
              df.groupby(['postcode', 'house_number'], sort=False).ngroup())
    df.to_csv('test_output')

This gave me an error:

NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)

This is really annoying. I'm not very familiar with Apache Beam; can someone help, please?

(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)

Solution

Assigning consecutive integers to a set is not something that's very amenable to parallel computation, and it's not very stable either. Is there any reason another identifier (e.g. the tuple (postcode, house_number), or a hash of it) would not be suitable?
