R: extracting lists within data frames
Question
What is the best way to parse lists embedded in variables within a dataframe?
When parsing JSON in R (I typically use the jsonlite package), I frequently end up with data frame columns that contain lists (of other lists or of data frames). A trivial example is parsing Twitter stream data, where the coordinates are returned as a list of longitude and latitude. A more complex example (and the one I am currently wrestling with) is a JSON file of doctors whose addresses are parsed into a list of data frames. Here is some example data illustrating the structure (this is public data, by the way):
> str(df)
Classes ‘tbl_df’ and 'data.frame': 2 obs. of 2 variables:
$ addresses:List of 2
..$ :'data.frame': 1 obs. of 6 variables:
.. ..$ address : chr "Department of Palliative Care"
.. ..$ address_2: chr "2525 Cumberland Parkway, SE"
.. ..$ city : chr "Atlanta"
.. ..$ state : chr "GA"
.. ..$ zip : chr "30305"
.. ..$ phone : chr "4043650966"
..$ :'data.frame': 2 obs. of 6 variables:
.. ..$ address : chr "5445 Meridian Mark Road" "3619 South Fulton Avenue"
.. ..$ address_2: chr "Suite 370" ""
.. ..$ city : chr "Atlanta" "Hapeville"
.. ..$ state : chr "GA" "GA"
.. ..$ zip : chr "30342" "30354"
.. ..$ phone : chr "4047652020" "4047652020"
$ npi : chr "1497831390" "1578667986"
jsonlite has a function (flatten) for extracting embedded data frames to individual variables, but it does not work on lists.
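For contrast, here is a minimal sketch of the case that flatten does handle: a data frame stored as a column of another data frame. The data is made up for illustration, and the call is written as jsonlite::flatten because other packages also export a function named flatten.

```r
library(jsonlite)

# Made-up example: "coordinates" is itself a data frame nested in x
x <- data.frame(id = 1:2)
x$coordinates <- data.frame(lon = c(-84.39, -122.42), lat = c(33.75, 37.77))

# flatten() promotes the nested columns to top-level columns,
# prefixing them with the name of the parent column
flat <- jsonlite::flatten(x)
names(flat)
```

A list column, where each row holds its own vector or data frame, is left untouched by this call, which is the situation described in the question.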
In the Twitter example, I can extract the list items to variables in the same dataframe using a for loop:
for (i in seq_len(nrow(df2))) {
  # coordinates is sometimes empty, so check before extracting
  if (length(df2$coordinates.coordinates[[i]]) > 0) {
    df2[i, "coordinates.lon"] <- df2$coordinates.coordinates[[i]][1]
    df2[i, "coordinates.lat"] <- df2$coordinates.coordinates[[i]][2]
  }
}
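One possible loop-free version of the same extraction uses vapply, which returns a plain numeric vector and so can be assigned to a column in one step. This is only a sketch: the sample data and the helper name pick are made up for illustration, and it assumes each non-empty element is a numeric vector with longitude first and latitude second, as in the loop above.

```r
# Made-up sample data: a list column in which one row has no coordinates
df2 <- data.frame(id = 1:3)
df2$coordinates.coordinates <- list(c(-84.39, 33.75), NULL, c(-122.42, 37.77))

# Hypothetical helper: return the i-th element of x, or NA if x is too short
pick <- function(x, i) if (length(x) >= i) as.numeric(x[[i]]) else NA_real_

# vapply checks that every result is a single numeric value
df2$coordinates.lon <- vapply(df2$coordinates.coordinates, pick, numeric(1), i = 1)
df2$coordinates.lat <- vapply(df2$coordinates.coordinates, pick, numeric(1), i = 2)
```

Rows with empty coordinates end up as NA in both new columns, which matches what the loop leaves behind when it skips a row.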
In the Doctor example, since each Doctor can have multiple addresses, I need to create a new dataset.
library(dplyr)
addresses <- data.frame()
for (i in seq_len(nrow(df))) {
  x <- df$addresses[[i]]
  # need an identifier to link each address back to its doctor
  x$id <- df[[i, "npi"]]
  addresses <- bind_rows(addresses, x)
}
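Growing a data frame inside a loop copies it on every iteration, which is likely where much of the slowness comes from. A sketch of one alternative that binds everything in a single call: name each list element by its npi and let bind_rows write those names into an identifier column through its .id argument. The sample data below is a cut-down, made-up version of the structure shown earlier, and it assumes the npi values are unique.

```r
library(dplyr)

# Cut-down sample with the same shape as the str() output above
df <- data.frame(npi = c("1497831390", "1578667986"), stringsAsFactors = FALSE)
df$addresses <- list(
  data.frame(address = "Department of Palliative Care", city = "Atlanta",
             stringsAsFactors = FALSE),
  data.frame(address = c("5445 Meridian Mark Road", "3619 South Fulton Avenue"),
             city = c("Atlanta", "Hapeville"), stringsAsFactors = FALSE)
)

# Name each nested data frame by its npi, then bind them all at once;
# .id stores the list names in a new column called "npi"
addresses <- bind_rows(setNames(df$addresses, df$npi), .id = "npi")
```

The result has one row per address, with the doctor's npi repeated for doctors who have several addresses.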
While both of these examples work, they are both a) slow and b) not the "R" way of doing things (as I understand it).
So, my question is: what's a better, faster, more "R" way of extracting lists from data frame variables?
Recommended answer
Thanks to Richard Scriven. unnest in tidyr gave me exactly what I needed.
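The answer does not include code, so the following is only a sketch of how unnest might be applied to the doctor data. The sample data is a cut-down, made-up version of the structure in the question, and the cols argument assumes tidyr 1.0.0 or later.

```r
library(tibble)
library(tidyr)

# Cut-down sample: one row per doctor, addresses stored as a list of data frames
df <- tibble(
  npi = c("1497831390", "1578667986"),
  addresses = list(
    data.frame(address = "Department of Palliative Care", city = "Atlanta",
               stringsAsFactors = FALSE),
    data.frame(address = c("5445 Meridian Mark Road", "3619 South Fulton Avenue"),
               city = c("Atlanta", "Hapeville"), stringsAsFactors = FALSE)
  )
)

# unnest() expands each nested data frame into rows and repeats npi alongside
addresses <- unnest(df, cols = addresses)
```

Each address becomes its own row and keeps the npi of the doctor it belongs to, so no separate identifier step is needed.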