Sparklyr: how to explode a list column into their own columns in Spark table?


Problem description

My question is similar to the one here, but I'm having problems implementing the answer, and I cannot comment in that thread.

So, I have a big CSV file that contains nested data: two columns separated by whitespace (say the first column is Y, the second column is X). Column X is itself a comma-separated list of values.

21.66 2.643227,1.2698358,2.6338573,1.8812188,3.8708665,...
35.15 3.422151,-0.59515584,2.4994135,-0.19701914,4.0771823,...
15.22 2.8302398,1.9080592,-0.68780196,3.1878228,4.6600842,...
...

I want to read this CSV into two different Spark tables using sparklyr.

So far this is what I've been doing:

  1. Use spark_read_csv to import all CSV contents into a Spark data table

    df = spark_read_csv(sc, path = "path", name = "simData", delimiter = " ", header = "false", infer_schema = "false")

    The result is a Spark table named simData with 2 columns: C0 and C1

  2. Use dplyr to select the first and second columns, and then register them as new tables named simY and simX respectively

    simY <- df %>% select(C0) %>% sdf_register("simY")

    simX <- df %>% select(C1) %>% sdf_register("simX")

  3. Split the values in simX using the ft_regex_tokenizer function, following the answer written here (see the end-to-end sketch after this list).

    ft_regex_tokenizer(input_DF, input.col = "COL", output.col = "ResultCols", pattern = '\\###')
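
Taken together, steps 1-3 look roughly like this (a minimal sketch; the connection setup, the file name sim.csv, and the comma pattern passed to the tokenizer are assumptions, not the exact code I ran):

library(sparklyr)
library(dplyr)

# Assumption: adjust master and config to your cluster
sc <- spark_connect(master = "yarn-client")

# Step 1: read the whitespace-delimited CSV, no header, no schema inference
df <- spark_read_csv(sc, path = "sim.csv", name = "simData",
                     delimiter = " ", header = "false", infer_schema = "false")

# Step 2: register each column as its own table
simY <- df %>% select(C0) %>% sdf_register("simY")
simX <- df %>% select(C1) %>% sdf_register("simX")

# Step 3: tokenize the comma-separated values in C1
tokenized <- ft_regex_tokenizer(simX, input.col = "C1",
                                output.col = "Result", pattern = ",")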

But when I try to head the result using dplyr:

Source:   query [6 x 1]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

        Result
        <list>
1 <list [789]>
2 <list [789]>
3 <list [789]>
4 <list [789]>
5 <list [789]>
6 <list [789]>

I want to turn this into a new Spark table and convert the type to double. Is there any way to do this? I've considered collecting the data into R (using dplyr), converting it to a matrix, and then running strsplit on each row, but I don't think this is a workable solution because the CSV size can go up to 40GB.
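
For reference, that collect-and-strsplit fallback would look roughly like this (a sketch only; it pulls the whole column into driver memory, which is exactly what rules it out at 40GB):

local_x <- simX %>% collect()  # collects every row into local R memory
mat <- do.call(rbind, lapply(strsplit(local_x$C1, ","), as.numeric))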

EDIT: Spark version is 1.6.0

Solution

Let's say your data looks like this:

library(dplyr)
library(sparklyr)

df <- data.frame(text = c("1.0,2.0,3.0", "4.0,5.0,6.0"))
sdf <- copy_to(sc, df, "df", overwrite = TRUE)

and you've already created a spark_connection; then you can do the following:

n <- 3

# There is no function syntax for array access in Hive
# so we have to build [] expressions
# CAST(... AS double) could be handled in sparklyr / dplyr with as.numeric
exprs <- lapply(
  0:(n - 1), 
  function(i) paste("CAST(bits[", i, "] AS double) AS x", i, sep=""))
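
# For n = 3, exprs is a list of the three strings:
#   "CAST(bits[0] AS double) AS x0"
#   "CAST(bits[1] AS double) AS x1"
#   "CAST(bits[2] AS double) AS x2"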

sdf %>%
  # Convert to Spark DataFrame
  spark_dataframe() %>% 
  # Use an expression with split to build an array column
  invoke("selectExpr", list("split(text, ',') AS bits")) %>%
  # Select individual columns
  invoke("selectExpr", exprs) %>%
  # Register a temporary view ("registerTempTable" in Spark 1.x)
  invoke("createOrReplaceTempView", "exploded_df")

And use dplyr::tbl to get the result back as a sparklyr object:

tbl(sc, "exploded_df")

Source:   query [2 x 3]
Database: spark connection master=local[8] app=sparklyr local=TRUE

     x0    x1    x2
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
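
As the comment in the code notes, the CAST(... AS double) part could instead be handled on the sparklyr / dplyr side with as.numeric; a sketch of that variant (same exploded_df pattern, with the cast deferred to dplyr, which generates the CAST in the SQL it emits):

# Build the [] expressions without the CAST
exprs <- lapply(
  0:(n - 1),
  function(i) paste("bits[", i, "] AS x", i, sep=""))

# ... build and register "exploded_df" exactly as above, then:
tbl(sc, "exploded_df") %>%
  mutate(x0 = as.numeric(x0), x1 = as.numeric(x1), x2 = as.numeric(x2))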
