Use Apache Beam `GroupByKey` and construct a new column - Python


Problem description


From this question: How to group data and construct a new column - python pandas?, I know how to group by multiple columns and construct a new unique ID using pandas. But if I want to use Apache Beam in Python to achieve the same thing described in that question, how can I do it and then write the new data to a newline-delimited JSON file (each line is one unique_id with an array of the objects belonging to that unique_id)?

Assume the dataset is stored in a CSV file.
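For reference, the pandas technique from the linked question can be sketched as follows. The column names (`postcode`, `house_number`, `occupant`) and the inline sample data are assumptions for illustration, not taken from the question:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the CSV file.
csv_data = io.StringIO(
    "postcode,house_number,occupant\n"
    "AB1,1,alice\n"
    "AB1,1,bob\n"
    "CD2,7,carol\n"
)
df = pd.read_csv(csv_data)

# ngroup() numbers each (postcode, house_number) group 0, 1, 2, ...
df.insert(0, "unique_id",
          df.groupby(["postcode", "house_number"], sort=False).ngroup())
```

Here the two "AB1, 1" rows get unique_id 0 and the "CD2, 7" row gets unique_id 1. It is exactly this `ngroup()` call that the Beam DataFrame API does not support.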

I'm new to Apache Beam; here's what I have now:

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

# `cols` is the list of column names for the CSV (not shown in the question).
with beam.Pipeline() as p:
    df = p | read_csv("example.csv", names=cols)
    # Try to number each (postcode, house_number) group, as in pandas:
    df.insert(0, 'unique_id',
              df.groupby(['postcode', 'house_number'], sort=False).ngroup())
    df.to_csv('test_output')

This gave me an error:

NotImplementedError: 'ngroup' is not yet supported (BEAM-9547)

This is really annoying. I'm not very familiar with Apache Beam; can someone help, please?

(ref: https://beam.apache.org/documentation/dsls/dataframes/overview/)

Solution

Assigning consecutive integers to a set is not something that's very amenable to parallel computation, and it's not very stable either. Is there any reason another identifier (e.g. the tuple (postcode, house_number), or a hash of it) would not be suitable?
