Sparklyr: how to explode a list column into their own columns in Spark table?
Problem description
My question is similar to the one here, but I'm having problems implementing the answer, and I cannot comment in that thread.

So, I have a big CSV file that contains nested data: 2 columns separated by whitespace (say the first column is Y, the second column is X). Column X itself is also a comma-separated value.

21.66 2.643227,1.2698358,2.6338573,1.8812188,3.8708665,...
35.15 3.422151,-0.59515584,2.4994135,-0.19701914,4.0771823,...
15.22 2.8302398,1.9080592,-0.68780196,3.1878228,4.6600842,...
...
I want to read this CSV into 2 different Spark tables using sparklyr. So far this is what I've been doing:
1. Use spark_read_csv to import all CSV contents into a Spark data table:

df = spark_read_csv(sc, path = "path", name = "simData", delimiter = " ", header = "false", infer_schema = "false")

The result is a Spark table named simData with 2 columns: C0 and C1.
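Since infer_schema = "false", both columns are imported as strings, so C0 (Y) will also need a cast to double at some point. A quick schema check (a minimal sketch, assuming the connection sc and the table df from above):

library(sparklyr)

# Both C0 and C1 are string columns at this point, because
# infer_schema = "false" skips type inference
sdf_schema(df)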
2. Use dplyr to select the first and second columns, and then register them as new tables named Y and X respectively:

simY <- df %>% select(C0) %>% sdf_register("simY")
simX <- df %>% select(C1) %>% sdf_register("simX")
3. Split the values in simX using the ft_regex_tokenizer function, following the answer written here:

ft_regex_tokenizer(input_DF, input.col = "COL", output.col = "ResultCols", pattern = '\\###')
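For the data above, the separator inside C1 is a comma, so the call would presumably look more like this (a sketch; simX and the column names follow the question's naming, and simX_split is a made-up name):

library(sparklyr)

# Tokenize the comma-separated string into a single array<string>
# column; note this does NOT produce separate numeric columns,
# which is why head() below shows a <list> column
simX_split <- ft_regex_tokenizer(
  simX,
  input.col  = "C1",
  output.col = "ResultCols",
  pattern    = ","
)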
But when I try to head it using dplyr:

Source:   query [6 x 1]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

        Result
        <list>
1 <list [789]>
2 <list [789]>
3 <list [789]>
4 <list [789]>
5 <list [789]>
6 <list [789]>
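You can confirm on the Spark side that the tokenizer produced one array column rather than 789 separate columns (a quick check, continuing the simX_split sketch above):

library(sparklyr)

# The output column is ArrayType(StringType); dplyr renders such
# columns as <list>, matching the head() output above
sdf_schema(simX_split)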
I want to turn this into a new Spark table and convert the type to double. Is there any way to do this?

I've considered collecting the data into R (using dplyr), converting it to a matrix, and then doing strsplit for each row, but I don't think this is a solution because the CSV size can go up to 40 GB.

EDIT: Spark version is 1.6.0
Solution

Let's say your data look like this
library(dplyr)
library(sparklyr)

df <- data.frame(text = c("1.0,2.0,3.0", "4.0,5.0,6.0"))
sdf <- copy_to(sc, df, "df", overwrite = TRUE)
and you've already created a spark_connection. Then you can do the following:

n <- 3

# There is no function syntax for array access in Hive,
# so we have to build [] expressions.
# CAST(... AS double) could be handled in sparklyr / dplyr with as.numeric
exprs <- lapply(
  0:(n - 1),
  function(i) paste("CAST(bits[", i, "] AS double) AS x", i, sep = ""))

sdf %>%
  # Convert to Spark DataFrame
  spark_dataframe() %>%
  # Use expression with split
  invoke("selectExpr", list("split(text, ',') AS bits")) %>%
  # Select individual columns
  invoke("selectExpr", exprs) %>%
  # Register table in the metastore ("registerTempTable" in Spark 1.x)
  invoke("createOrReplaceTempView", "exploded_df")
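As the CAST comment hints, the casting could instead be left to dplyr: split into plain string columns first, then convert with as.numeric, which dplyr translates to CAST(... AS DOUBLE) in the generated Spark SQL. A sketch for the same toy data (the view name exploded_chr is made up here):

library(dplyr)
library(sparklyr)

# Same split, but keep the pieces as strings (no CAST in the SQL)
exprs_chr <- lapply(
  0:(n - 1),
  function(i) paste("bits[", i, "] AS x", i, sep = ""))

sdf %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("split(text, ',') AS bits")) %>%
  invoke("selectExpr", exprs_chr) %>%
  invoke("createOrReplaceTempView", "exploded_chr")

# as.numeric inside mutate() becomes CAST(... AS DOUBLE) in Spark SQL
tbl(sc, "exploded_chr") %>%
  mutate(x0 = as.numeric(x0),
         x1 = as.numeric(x1),
         x2 = as.numeric(x2))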
And use dplyr::tbl to get back a sparklyr object:

tbl(sc, "exploded_df")
Source:   query [2 x 3]
Database: spark connection master=local[8] app=sparklyr local=TRUE

     x0    x1    x2
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
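Applied back to the question, the same recipe should work on simX, with text replaced by C1 and n set to the number of values per row (789, per the head() output above). A sketch; since the question's Spark is 1.6.0, the registration call is registerTempTable, as noted in the comment above, and simX_wide is a made-up view name:

library(dplyr)
library(sparklyr)

n <- 789  # values per row, per the <list [789]> output above

exprs <- lapply(
  0:(n - 1),
  function(i) paste("CAST(bits[", i, "] AS double) AS x", i, sep = ""))

simX %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("split(C1, ',') AS bits")) %>%
  invoke("selectExpr", exprs) %>%
  # Spark 1.6: registerTempTable instead of createOrReplaceTempView
  invoke("registerTempTable", "simX_wide")

tbl(sc, "simX_wide")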