Creating/accessing dataframe inside the transformation of another dataframe


Problem description


I'm retrofitting some existing code to use Spark. I have multiple data frames that hold different data sets. While transforming my main data frame (or my main data set), I need to use data from the other data frames to complete the transformation. I also have a situation (at least in the current structure) where I need to create new data frames in the transformation function of another data frame.


I'm trying to determine the following:

  1. Can I access one data frame inside the transformation function of another data frame?
  2. Can I create a data frame on the executors, inside the transformation function of a data frame?


Pointers on how to deal with such a situation would be very helpful.

Recommended answer

The answer to both questions is:


DataFrames are driver-side abstractions of distributed collections. They cannot be used, created, or referenced in any executor-side transformation.


Why? DataFrames (like RDDs and Datasets) can only be used within the context of an active SparkSession - without it, the DataFrame cannot "point" to its partitions on the active executors. The SparkSession should be thought of as a live "connection" to the cluster of executors.


Now, if you try using a DataFrame inside another transformation, that DataFrame would have to be serialized on the driver side, sent to the executor(s), and then deserialized there. But this deserialized instance (in a separate JVM) would necessarily lose its SparkSession - that "connection" was from the driver to the executor, not from this new executor we're now operating in.


So what should you do? You have a few options for referencing one DataFrame's data in another, and choosing the right one depends mostly on the amount of data that would have to be shuffled (or transferred between executors):


  1. Collect one of the DataFrames (if you can guarantee it's small!), and then use the resulting local collection (either directly or using spark.broadcast) in any transformation.


  2. Join the two DataFrames on some common fields. This is a very common solution, as the logic of using one DataFrame's data when transforming another usually has to do with some kind of "lookup" for the right value based on some subset of the columns. This use case translates into a JOIN operation rather naturally.


  3. Use set operators like except, intersect and union, if they provide the logical operation you're after.
