How to flatten the data of different data types by using the sparklyr package?


Question

Introduction

R code is written using the sparklyr package to create the database schema. [Reproducible code and database are given.]

Existing result

root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
|     |-- hashtags (array) : [string]
|     |-- media (array)
|     |     |-- additional_media_info (struct)
|     |     |       |-- description : string
|     |     |       |-- embeddable : boolean
|     |     |       |-- monetizable : boolean
|     |     |-- display_url : string
|     |     |-- id : long
|     |     |-- id_str : string
|     |-- urls (array)     
|-- extended_entities (struct)
|-- retweeted_status (struct)
|-- user (struct)

I want to flatten this structure as below:

Expected result

root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
|-- entities.hashtags (array) : [string]
|-- entities.media (array)
|-- entities.media.additional_media_info (struct)
|-- entities.media.additional_media_info.description : string
|-- entities.media.additional_media_info.embeddable : boolean
|-- entities.media.additional_media_info.monetizable : boolean
|-- entities.media.display_url : string
|-- entities.media.id : long
|-- entities.media.id_str : string
|-- entities.urls (array)     
|-- extended_entities (struct)
|-- retweeted_status (struct)
|-- user (struct)

Database: navigate to Data-178 KB. Then copy the numbered items to a text file named "example", and save it to a directory named "../example.json/" created in your working directory.

The R code below reproduces the example:

Code

library(sparklyr)
library(dplyr)
library(devtools)
devtools::install_github("mitre/sparklyr.nested")
# If Spark is not installed, then also need:
# spark_install(version = "2.2.0")
library(sparklyr.nested)
library(testthat)
library(jsonlite)

Sys.setenv(SPARK_HOME = "/usr/lib/spark")
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "20g"
conf$'sparklyr.shell.driver-memory' <- "20g"
conf$spark.executor.cores <- 16
conf$spark.executor.memory <- "20G"
conf$spark.yarn.am.cores <- 16
conf$spark.yarn.am.memory <- "20G"
conf$spark.executor.instances <- 8
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32

sc <- spark_connect(master = "local", config = conf, version = "2.2.0")  # connection
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE)
sdf_schema_viewer(sample_tbl)  # view the database schema

Efforts made

I used jsonlite, but it could not read the big file, not even in chunks; it ran for a seemingly endless time. So I turned to sparklyr, which works wonders here and reads a billion records within a few seconds. I studied further how to flatten the records down to deep nesting levels (in jsonlite this is done with the flatten() function), but no such feature is available in sparklyr; only first-level flattening is possible there.

I want to flatten the data of different data types and write the output to a CSV file.

Answer

Alright, so this is one possible way to unnest it all.

You can use the schema information to create all of the nested names, e.g. entities.media.additional_media_info; then you can just use SQL to select them.

This is a bit labour-intensive and may not generalise, but it works.

I would like to think this should be quick too, as it is only a SELECT statement.

columns_to_flatten <- sdf_schema_json(sample_tbl, simplify = TRUE) %>%
  # using the rlist package for ease of use
  rlist::list.flatten(use.names = TRUE) %>%
  # get the names
  names() %>%
  # remove the contents of brackets and whitespace
  gsub("\\(.*?\\)|\\s", "", .) %>%
  # add an alias for each column name, with dots replaced by double underscores;
  # this avoids the duplicate names that bare leaf names would otherwise produce
  {paste(., "AS", gsub("\\.", "__", .))} %>%
  # required, otherwise it doesn't seem to work
  sub("variants", "variants[0]", .)

# construct query
sql_statement <- paste("SELECT",
                       paste(columns_to_flatten, collapse = ", "),
                       "FROM example")

# execute on spark cluster, save as table in cluster
spark_session(sc) %>%
  sparklyr::invoke("sql", sql_statement) %>%
  sparklyr::invoke("createOrReplaceTempView", "flattened_example")


tbl(sc, "flattened_example") %>%
  sdf_schema_viewer()
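To produce the CSV file the question asks for, the flattened table can then be exported with sparklyr's spark_write_csv(). A minimal sketch, assuming the connection sc and the flattened_example view from above; the output path "flattened_output_csv" is a hypothetical name, not from the original post:

```r
# Sketch only: export the flattened table as CSV.
# "flattened_output_csv" is a hypothetical output directory.
flattened_tbl <- tbl(sc, "flattened_example")
spark_write_csv(flattened_tbl, path = "flattened_output_csv",
                header = TRUE, mode = "overwrite")
```

Note that Spark writes a directory of part files rather than a single CSV; the parts can be concatenated afterwards if one file is needed.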

The generated SQL looks rather simple, just long:

SELECT
  contributors AS contributors,
  coordinates AS coordinates,
  created_at AS created_at,
  display_text_range AS display_text_range,
  entities.hashtags.indices AS entities__hashtags__indices,
  entities.hashtags.text AS entities__hashtags__text,
  entities.media.additional_media_info.description AS entities__media__additional_media_info__description,
  entities.media.additional_media_info.embeddable AS entities__media__additional_media_info__embeddable,
  entities.media.additional_media_info.monetizable AS entities__media__additional_media_info__monetizable,
  entities.media.additional_media_info.title AS entities__media__additional_media_info__title,
  entities.media.display_url AS entities__media__display_url,
  entities.media.expanded_url AS entities__media__expanded_url,
  entities.media.id AS entities__media__id,
  entities.media.id_str AS entities__media__id_str,
  entities.media.indices AS entities__media__indices,
  entities.media.media_url AS entities__media__media_url,
  entities.media.media_url_https AS entities__media__media_url_https,
  entities.media.sizes.large.h AS entities__media__sizes__large__h,
  entities.media.sizes.large.resize AS entities__media__sizes__large__resize,
  entities.media.sizes.large.w AS entities__media__sizes__large__w
FROM example
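The alias rule used above (dots replaced by double underscores) can be checked on its own, without a Spark connection. A small illustration with two made-up nested column names:

```r
# Illustration only: apply the same dot-to-double-underscore alias rule
# to two made-up nested column names.
nested_names <- c("entities.media.id", "user.screen_name")
aliases <- paste(nested_names, "AS", gsub("\\.", "__", nested_names))
print(aliases)
# "entities.media.id AS entities__media__id" "user.screen_name AS user__screen_name"
```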
