Use Apache Beam `GroupByKey` and construct a new column - Python
Problem description
From this question: How to group data and construct a new column - python pandas?, I know how to group by multiple columns and construct a new unique ID using pandas. But if I want to use Apache Beam in Python to achieve the same thing described in that question, how can I do it and then write the new data to a newline-delimited JSON file (each line is one unique_id with an array of objects that belong to that unique_id)?
Assume the dataset is stored in a CSV file.
I'm new to Apache Beam; here's what I have now:
import pandas
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    df = p | read_csv("example.csv", names=cols)
    agg_df = df.insert(0, 'unique_id',
                       df.groupby(['postcode', 'house_number'], sort=False).ngroup())
    agg_df.to_csv('test_output')
This gave me an error:
NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)
This is really annoying. I'm not very familiar with Apache Beam; can someone help, please?
(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)
Assigning consecutive integers to a set is not very amenable to parallel computation. It's also not very stable, since the numbering depends on the order in which the groups are encountered. Is there any reason another identifier (e.g. the tuple (postcode, house_number) or its hash) would not be suitable?
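The suggestion above can be sketched with the core Beam API rather than the DataFrame API. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the CSV has no header row (as `names=cols` in the question implies), and the helper names `parse_row`, `key_by_address`, `to_json_line` and `run`, the `records` field, and the truncated SHA-256 digest used as `unique_id` are all illustrative choices that do not appear in the original question.

```python
import csv
import hashlib
import json


def parse_row(line, cols):
    """Parse one CSV line into a dict keyed by the given column names."""
    values = next(csv.reader([line]))
    return dict(zip(cols, values))


def key_by_address(row):
    """Pair each row with its (postcode, house_number) grouping key."""
    return (row["postcode"], row["house_number"]), row


def to_json_line(keyed_group):
    """Turn (key, rows) into one JSON line with a hash-based unique_id.

    The ID is a truncated SHA-256 of the key (an assumption of this sketch),
    so it is stable across runs and needs no global counter.
    """
    key, rows = keyed_group
    digest = hashlib.sha256(json.dumps(list(key)).encode("utf-8")).hexdigest()
    return json.dumps({"unique_id": digest[:16], "records": list(rows)},
                      sort_keys=True)


def run(input_path, output_prefix, cols):
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Parse" >> beam.Map(parse_row, cols)
            | "KeyByAddress" >> beam.Map(key_by_address)
            | "Group" >> beam.GroupByKey()
            | "ToJson" >> beam.Map(to_json_line)
            | "Write" >> beam.io.WriteToText(output_prefix,
                                             file_name_suffix=".json")
        )
```

Calling `run("example.csv", "test_output", cols)` would write one JSON object per line, one per (postcode, house_number) group. The order of records inside a group is not guaranteed, and the output may be split across several shard files.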