What does rdd mean in a PySpark DataFrame?
Question
I am new to PySpark. I am wondering what rdd means in a PySpark DataFrame.
weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)
These two lines of code produce the same output. I am wondering what effect adding rdd has:
weatherData.collect()
weatherData.rdd.collect()
Answer
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its .rdd attribute, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the .toDF() method.
In general, it is recommended to use a DataFrame where possible due to the built-in query optimization.