Changing nested column names using SparklyR in R


Problem Description


I have referred to all the links mentioned here:

1) Link-1 2) Link-2 3) Link-3 4) Link-4

The following R code was written using the sparklyr package. It reads a huge JSON file and creates the database schema.

sc <- spark_connect(master = "local", config = conf, version = '2.2.0') # Connection
sample_tbl <- spark_read_json(sc,name="example",path="example.json", header = TRUE, 
                              memory = FALSE, overwrite = TRUE) # reads JSON file
sample_tbl <- sdf_schema_viewer(sample_tbl) # to create db schema
df <- tbl(sc,"example") # to create lookup table

It has created the following database schema.

Now,

If I rename a first-level column, it works.

For example,

df %>% rename(ent = entities)

But when I try the same for a 2nd-level nested column, it doesn't rename.

df %>% rename(e_hashtags = entities.hashtags)

It shows the following error:

Error in .f(.x[[i]], ...) : object 'entities.hashtags' not found

Question

My question is: how can I also rename 3rd- and 4th-level nested columns?

Please refer to the database schema mentioned above.

Solution

Spark as such doesn't support renaming individual nested fields. You have to either cast or rebuild the whole structure. For simplicity, let's assume the data looks as follows:

cat('{"contributors": "foo", "coordinates": "bar", "entities": {"hashtags": ["foo", "bar"], "media": "missing"}}',  file = "/tmp/example.json")
df <- spark_read_json(sc, "df", "/tmp/example.json", overwrite=TRUE)

df %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)

with the following simple string representation:

df %>% 
  spark_dataframe() %>% 
  invoke("schema") %>% 
  invoke("simpleString") %>% 
  cat(sep = "\n")

struct<contributors:string,coordinates:string,entities:struct<hashtags:array<string>,media:string>>

With cast you have to define an expression using a matching type description (this is just the entities portion of the simpleString above, with the field names you want):

expr_cast <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "CAST(entities AS struct<e_hashtags:array<string>,media:string>)"
)

df_cast <- df %>% 
  spark_dataframe() %>% 
  invoke("withColumn", "entities", expr_cast) %>% 
  sdf_register()

df_cast %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- e_hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
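
The same cast approach extends to deeper levels of nesting: you simply nest the struct type descriptions and rename fields at whatever depth you need, keeping every type identical to the original schema. A hypothetical sketch, assuming entities additionally contained a urls struct with url and expanded_url string fields (it doesn't in the toy data above):

# Hypothetical: rename the 2nd-level `hashtags` field and a 3rd-level
# `urls.url` field in one cast; only the names change, the types must
# match the original schema exactly
expr_deep <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "CAST(entities AS struct<e_hashtags:array<string>,media:string,urls:struct<u:string,expanded_url:string>>)"
)

The resulting expression is applied with invoke("withColumn", "entities", expr_deep) exactly as above.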

To rebuild the structure you have to match all of its components (note that, unlike the cast variant, the resulting struct column is no longer nullable):

expr_struct <- invoke_static(
  sc, "org.apache.spark.sql.functions", "expr",
  "struct(entities.hashtags AS e_hashtags, entities.media)"
)

df_struct <- df %>% 
  spark_dataframe() %>% 
  invoke("withColumn", "entities", expr_struct) %>% 
  sdf_register()

df_struct %>% spark_dataframe() %>% invoke("schema") %>% invoke("treeString") %>% cat()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- entities: struct (nullable = false)
 |    |-- e_hashtags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- media: string (nullable = true)
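
Either way, once the modified frame is registered you can query the renamed field with ordinary Spark SQL. A small usage sketch, assuming the rebuilt frame is registered under the explicit (hypothetical) name "df_struct" rather than letting sdf_register() pick a random one:

# Register under an explicit name so it can be referenced from SQL
df_struct <- df %>% 
  spark_dataframe() %>% 
  invoke("withColumn", "entities", expr_struct) %>% 
  sdf_register("df_struct")

# Query the renamed nested field through sparklyr's DBI backend
DBI::dbGetQuery(sc, "SELECT entities.e_hashtags FROM df_struct")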
