Creating/accessing dataframe inside the transformation of another dataframe


Question

I'm retrofitting some existing code to use Spark. I have multiple data frames that hold different data sets. While transforming my main data frame (or my main data set), I need to use data from the other data frames to complete the transformation. I also have a situation (at least in the current structure) where I need to create new data frames inside the transformation function of another data frame.

I'm trying to determine the following:

  1. Can I access one data frame inside the transformation function of another data frame?
  2. Can a data frame be created on an executor, inside the transformation function of a data frame?

Pointers on how to deal with such a situation would be very helpful.

Answer

The answer to both questions is NO:

DataFrames are driver-side abstractions of distributed collections. They cannot be used, created, or referenced in any executor-side transformation.

Why? DataFrames (like RDDs and Datasets) can only be used within the context of an active SparkSession - without it, the DataFrame cannot "point" to its partitions on the active executors; the SparkSession should be thought of as a live "connection" to the cluster of executors.

Now, if you try using a DataFrame inside another transformation, that DataFrame would have to be serialized on the driver side, sent to the executor(s), and then deserialized there. But this deserialized instance (in a separate JVM) would necessarily lose its SparkSession - that "connection" was from the driver to the executors, and it does not exist on the executor we're now operating in.

So what should you do? You have a few options for referencing one DataFrame's data in another, and choosing the right one is mostly dependent on the amount of data that would have to be shuffled (or - transferred between executors):

  1. Collect one of the DataFrames (if you can guarantee it's small!), and then use the resulting local collection (either directly or as a broadcast variable) in any transformation.

  2. Join the two DataFrames on some common fields. This is a very common solution, as the logic of using one DataFrame's data when transforming another usually has to do with some kind of "lookup" for the right value based on some subset of the columns. This use case translates into a JOIN operation rather naturally.

  3. Use set operators like except, intersect and union, if they provide the logical operation you're after.

