Convert one row to multiple rows per subject in a data frame by splitting text data in R
Problem description
I have a dataset with a patient identifier and a text field containing a summary of medical findings (1 row per patient). I would like to create a dataset with multiple rows per patient by splitting the text field so that each sentence of the summary falls on a different row. Subsequently, I would like to parse each row for certain keywords and negation terms. An example of the structure of the data frame is (the letters represent the sentences):
ID Summary
1 aaaaa. bb. c
2 d. eee. ff. g. h
3 i. j
4 k
I would like to split the text field at the "." to convert it to:
ID Summary
1 aaaaa
1 bb
1 c
2 d
2 eee
2 ff
2 g
2 h
3 i
3 j
4 k
R code to create the initial data frame:
ID <- c(1, 2, 3, 4)
Summary <- c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k")
df <- data.frame(cbind(ID, Summary))
df$ID <- as.numeric(df$ID)
df$Summary <- as.character(df$Summary)
The following previous posting provides a nice solution: Breaking up (melting) text data in a column in R?
I used the following code from that posting which works for this sample dataset:
dflong <- by(df, df$ID, FUN = function(x) {
  sentence <- unlist(strsplit(x$Summary, "[.]"))
  data.frame(ID = x$ID, Summary = sentence)
})
dflong2 <- do.call(rbind, dflong)
However, when I try to apply to my larger dataset (>200,000 rows), I get the error message:
Error in data.frame(ID = x$ID, Summary = sentence) : arguments imply differing number of rows: 1, 0
I reduced the data frame down to test it on a smaller dataset and I still get this error message any time the number of rows is >57.
Is there another approach to take that can handle a larger number of rows? Any advice is appreciated. Thank you.
Recommended answer
Using data.table:
library(data.table)
dt = data.table(df)
dt[, strsplit(Summary, ". ", fixed = TRUE), by = ID]
# ID V1
# 1: 1 aaaaa
# 2: 1 bb
# 3: 1 c
# 4: 2 d
# 5: 2 eee
# 6: 2 ff
# 7: 2 g
# 8: 2 h
# 9: 3 i
#10: 3 j
#11: 4 k
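The `1, 0` in the error message from the question suggests that at least one patient produced zero sentences. Below is a minimal sketch of how that can happen, assuming the larger dataset contains an empty Summary string. The `bad` data frame and its ID value are hypothetical and not taken from the original data.

```r
# Hypothetical row with an empty Summary, as might occur in the larger dataset
bad <- data.frame(ID = 58, Summary = "", stringsAsFactors = FALSE)

# strsplit() on an empty string yields a zero-length character vector
sentence <- unlist(strsplit(bad$Summary, "[.]"))
length(sentence)

# Combining 1 ID with 0 sentences raises the error seen in the question;
# try() captures it so the script keeps running
res <- try(data.frame(ID = bad$ID, Summary = sentence), silent = TRUE)
inherits(res, "try-error")

# Row positions whose Summary splits into zero sentences
empty_rows <- which(lengths(strsplit(bad$Summary, "[.]")) == 0)
```

Running the last line against the full data frame (replacing `bad` with `df`) should point to the rows that trigger the failure.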
There are many ways to address @agstudy's comment about an empty Summary, but here's a fun one:
dt[, c(tmp = "",  # placeholder value; this column is deleted right after
       # its only purpose is to force the size of the output table,
       # which data.table will kindly fill with NAs for us
       Summary = strsplit(Summary, ". ", fixed = TRUE)),
   by = ID][, tmp := NULL]
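For comparison, here is a base R sketch of the same idea that avoids the per-group `data.frame()` call entirely. It is not part of the original answer. The fifth row with an empty Summary is an assumption added for illustration, and replacing zero-length results with `NA` is one possible choice (dropping those patients would be another).

```r
# Sample data from the question, plus a hypothetical ID 5 with an empty Summary
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Summary = c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k", ""),
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: it returns a list with one element per row
parts <- strsplit(df$Summary, ". ", fixed = TRUE)

# Replace zero-length results with NA so every ID keeps at least one row
parts[lengths(parts) == 0] <- list(NA_character_)

# Repeat each ID once per sentence and flatten the list into one column
dflong <- data.frame(
  ID = rep(df$ID, lengths(parts)),
  Summary = unlist(parts),
  stringsAsFactors = FALSE
)
```

Because the split and the `rep()` are each done once over the whole column, this should also scale to a few hundred thousand rows more comfortably than calling `data.frame()` once per patient.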