Splitting text column into ragged multiple new columns in a data table in R


Problem description

I have a data table containing 20000+ rows and one column. The string in each row has a different number of words. I want to split the words and put each of them in a new column. I know how to do it one word at a time:

Data[, Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split = " "), "[", 1))]

(Data is my data table and complaint is the name of the column.)

Obviously, this is not efficient, because each row contains a different number of words and the line has to be repeated for every word position.

Could you please tell me about a more efficient way to do this?

Recommended answer

Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not
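KFB's sample data is not reproduced in this thread. A minimal stand-in that is consistent with the output above might look like the sketch below; the object name df and the column name x are taken from the call above, while the two strings are inferred from the printed result and are an assumption.

```r
# Hypothetical stand-in for KFB's sample data, inferred from the printed output
df <- data.frame(
  x = c("This is interesting", "This actually is not"),
  stringsAsFactors = FALSE
)

# Split each string on single spaces; the result is a ragged list
words <- strsplit(df$x, " ", fixed = TRUE)

# Number of words per row; the widest row decides how many new columns are needed
word_counts <- lengths(words)
max_words <- max(word_counts)
```

With this df in place, the widest row has four words, which matches the four columns x_1 to x_4 in the cSplit result.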

Another (blazingly fast) option is to use stri_split_fixed with simplify = TRUE (from "stringi"), which will apparently make its way into the "splitstackshape" code soon:

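library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4] 
# [1,] "This" "is"       "interesting" NA   
# [2,] "This" "actually" "is"          "not"
# Note: the output above is as originally posted. Per the stringi documentation,
# current releases pad short rows with "" under simplify = TRUE; simplify = NA pads with NA.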
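stri_split_fixed returns a character matrix, not new columns in the table. One possible way to attach the result to a data.table is sketched below. The table name DT, the Word1, Word2, ... column names, and the sample strings are illustrative assumptions, and simplify = NA is used here so that short rows are padded with NA.

```r
library(data.table)
library(stringi)

# Illustrative data; DT and its contents are assumptions, not the asker's data
DT <- data.table(x = c("This is interesting", "This actually is not"))

# Character matrix with one column per word position, short rows padded with NA
m <- stri_split_fixed(DT$x, " ", simplify = NA)

# One new column name per matrix column (naming scheme is hypothetical)
new_cols <- paste0("Word", seq_len(ncol(m)))

# Add all word columns by reference in a single assignment
DT[, (new_cols) := as.data.table(m)]
```

For the question's table, the same pattern would presumably use Data$complaint in place of DT$x.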
