R extracting lists within dataframes

Question

What is the best way to parse lists embedded in variables within a dataframe?

When parsing JSON in R (I typically use the jsonlite package), I frequently end up with data frame columns containing lists (of other lists or data frames). A trivial example of this is parsing Twitter stream data, where the coordinates are returned as a list of longitude and latitude. A more complex example (and the one I am currently wrestling with) is a JSON of doctors that parses the addresses into a list of data frames. Here is some example data illustrating the structure (this is public data, by the way):

> str(df)
Classes ‘tbl_df’ and 'data.frame':  2 obs. of  2 variables:
 $ addresses:List of 2
  ..$ :'data.frame':    1 obs. of  6 variables:
  .. ..$ address  : chr "Department of Palliative Care"
  .. ..$ address_2: chr "2525 Cumberland Parkway, SE"
  .. ..$ city     : chr "Atlanta"
  .. ..$ state    : chr "GA"
  .. ..$ zip      : chr "30305"
  .. ..$ phone    : chr "4043650966"
  ..$ :'data.frame':    2 obs. of  6 variables:
  .. ..$ address  : chr  "5445 Meridian Mark Road" "3619 South Fulton Avenue"
  .. ..$ address_2: chr  "Suite 370" ""
  .. ..$ city     : chr  "Atlanta" "Hapeville"
  .. ..$ state    : chr  "GA" "GA"
  .. ..$ zip      : chr  "30342" "30354"
  .. ..$ phone    : chr  "4047652020" "4047652020"
 $ npi      : chr  "1497831390" "1578667986"

jsonlite has a function (flatten) for expanding embedded data frames into individual columns, but it does not work on list columns.
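
To illustrate the distinction, here is a minimal sketch with made-up data (the `id` values, the coordinates, and the shortened `addresses` column are assumptions for illustration, not part of the original data). It shows `jsonlite::flatten` expanding a nested data frame column while leaving a list column untouched:

```r
library(jsonlite)

# Hypothetical data: one nested data frame column and one list column
df <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
df$coordinates <- data.frame(lon = c(-84.39, -84.41), lat = c(33.75, 33.66))
df$addresses <- list(
  data.frame(city = "Atlanta", stringsAsFactors = FALSE),
  data.frame(city = c("Atlanta", "Hapeville"), stringsAsFactors = FALSE)
)

flat <- flatten(df)

# The nested data frame becomes coordinates.lon and coordinates.lat;
# the list column addresses is carried over as a list, still nested
names(flat)
is.list(flat$addresses)
```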

In the Twitter example, I can extract the list items to variables in the same dataframe using a for loop:

for (i in seq_len(nrow(df2))) {
  # coordinates is sometimes empty, so check before extracting
  coords <- df2$coordinates.coordinates[[i]]
  if (length(coords) > 0) {
    df2[i, "coordinates.lon"] <- coords[1]
    df2[i, "coordinates.lat"] <- coords[2]
  }
}

In the Doctor example, since each doctor can have multiple addresses, I need to create a new dataset.

library(dplyr)
addresses <- data.frame()
for (i in seq_len(nrow(df))) {
  # data frame of addresses for doctor i
  x <- df$addresses[[i]]
  # need an identifier to link each address back to its doctor
  x$id <- df[[i, "npi"]]
  addresses <- bind_rows(addresses, x)
}

While both of these examples work, they are a) slow and b) not the "R" way of doing things (as I understand it).

So, my question is: what's a better, faster, more "R" way of extracting lists from data frame variables?

Recommended answer

Thanks to Richard Scriven. unnest in tidyr gave me exactly what I needed.
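
The answer does not include code, so what follows is only a sketch of how `unnest` might be applied to the doctor data. It assumes tidyr 1.0 or later (for the `cols` argument) and rebuilds a trimmed-down version of the example, keeping just the `address` and `city` fields; the variable name `doctors` is chosen here for illustration.

```r
library(tidyr)

# Trimmed-down stand-in for the data in the question: one row per doctor,
# with a list column holding a data frame of addresses for each doctor
doctors <- data.frame(npi = c("1497831390", "1578667986"),
                      stringsAsFactors = FALSE)
doctors$addresses <- list(
  data.frame(address = "Department of Palliative Care",
             city = "Atlanta",
             stringsAsFactors = FALSE),
  data.frame(address = c("5445 Meridian Mark Road", "3619 South Fulton Avenue"),
             city = c("Atlanta", "Hapeville"),
             stringsAsFactors = FALSE)
)

# One row per address; npi is repeated for each of a doctor's addresses
addresses <- unnest(doctors, cols = addresses)
addresses
```

This replaces the explicit loop and the repeated `bind_rows` calls: the identifier column is carried along automatically, so no separate `id` assignment is needed.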
