Manipulate char vectors inside a data.table object in R

Question


I'm still a bit new to using data.table and to understanding all its subtleties. I've looked in the docs and at other examples on SO but couldn't find what I want, so please help!

I have a data.table which is basically a char vector (each entry being a sentence)

DT=c("I love you","she loves me")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)

# > DT
#            text
# 1:   I love you
# 2: she loves me

What I'd like to do, is to be able to perform some basic string operations inside the DT object. For example, add a new column where I would have a char vector for which each entry is a WORD from the string in the "text" column.

so I'd like to have for example a new column charvec where

> DT[1]$charvec
[1] "I"    "love" "you"

Of course, I would like to do it the data.table way, ultra-fast, because I need to do this kind of thing on files which are >1 GB each, and to use more complex and computation-heavy functions. So no use of apply, lapply, or mapply.

My closest attempt to do something which looks like it is as follows:

myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
# > DU2
#            text      charvec
# 1:   I love you   I,love,you
# 2: she loves me she,loves,me

For example, to make a function which removes the first word of each sentence, I did this

myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
# > DV2
#            text  charvec
# 1:   I love you love,you
# 2: she loves me loves,me

the trouble is, in the column charvec, i've got a list and not a vector...

> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"

1) How can I do what I want? Other kinds of functions I'm thinking of using include subsetting the char vector, applying some hash to it, etc.

2) BTW, can I get to DU2 or DV2 in one line instead of two?

3) I don't understand the data.table syntax well. Why is it that with the command list() inside the [...], the column V1 vanishes?

4) On another thread, I read a bit about the function cSplit. Is it any good? Is it a function adapted to data.table objects?
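(Editorial note on question 2: since strsplit is vectorised over its input, the list column can in fact be built in a single step, with no by= grouping and no *apply call. A minimal sketch:)

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))
# strsplit returns a list with one character vector per sentence,
# which data.table stores directly as a list column
DU2 <- DT[, .(text, charvec = strsplit(text, " ", fixed = TRUE))]
DU2[1, charvec[[1]]]
# [1] "I"    "love" "you"
```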

thanks very much

UPDATE

Thanks to @Ananda Mahto. Perhaps I should make my ultimate objective clearer: I have a huge file of 10,000,000 sentences stored as strings. As a first step of the project, I want to hash the first 5 words of each sentence. 10,000,000 sentences wouldn't even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, which makes around ten 1 GB files. The following code takes several minutes on my laptop for just a single file.

library(data.table); library(digest);
num_row=1000000
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
rawdata <- DT

hash2 <- function(word){ #using library(digest)
        as.numeric(paste("0x",digest(word,algo="murmur32"),sep=""))
}
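(The hash2 helper relies on R parsing "0x"-prefixed strings as hexadecimal: digest(word, algo = "murmur32") returns an 8-character hex string, and pasting "0x" in front lets as.numeric() convert it. The conversion step alone, in base R:)

```r
# "0x"-prefixed strings are parsed as hexadecimal by as.numeric()
as.numeric(paste0("0x", "1f"))   # 31
as.numeric(paste0("0x", "ff"))   # 255
# a 32-bit hash can reach 0xffffffff, which exceeds .Machine$integer.max;
# that is why as.numeric() is used rather than as.integer()/strtoi()
as.numeric(paste0("0x", "ffffffff"))  # 4294967295
```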

then,

print(system.time({ 

        colnames(rawdata) <- "sentence"
        rawdata <- lapply(rawdata,strsplit," ")

        sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
        hash_list <- sapply(sentences_begin,hash2)
        # remove(rawdata)
})) ## end of print system.time for loading the data

I know I'm pushing R to its limits here, but I'm struggling to find faster implementations, and I was thinking about data.table features... hence all my questions.

Here is an implementation excluding lapply, but it's actually slower!

print(system.time({
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]

myfun2 <- function(l){l[[1]][2:6]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]

rebuildsentence <- function(S){
        paste(S,collapse=" ") }

myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}

DW1 <- DV2[,myfun3(charvec),by=text]

})) #end of system.time

In this implementation with data.table, there is no lapply, so I hoped the hashing would be faster. However, because in every column I have a list instead of a char vector, this may slow the whole thing down significantly (?).

Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I hoped to speed that up with a more efficient data structure. People using Python, Java, etc. do a similar job in a few seconds.

Of course, another road would be to find a faster hash function, but I assumed the one in the digest package was already optimized.

Solution

I'm not really sure what you're after, but you can try cSplit_l from my "splitstackshape" package to get to your list column:

library(splitstackshape)
DU <- cSplit_l(DT, "DT", " ")

Then, you can write a function like the following to remove values from the list column:

RemovePos <- function(inList, pos = 1) {
  lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
}

Example usage:

DU[, list(RemovePos(DT_list, 1)), by = DT]
#              DT       V1
# 1:   I love you love,you
# 2: she loves me loves,me
DU[, list(RemovePos(DT_list, 2)), by = DT]
#              DT     V1
# 1:   I love you  I,you
# 2: she loves me she,me
DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
#              DT  V1
# 1:   I love you you
# 2: she loves me  me
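(The guard pos[pos <= length(x)] inside RemovePos is what makes out-of-range positions a silent no-op instead of an error; in isolation:)

```r
x <- c("I", "love", "you")
pos <- c(1, 5)                  # position 5 does not exist in x
x[-c(pos[pos <= length(x)])]    # only position 1 survives the guard and is dropped
# [1] "love" "you"
```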


Update

Based on your loathing of `lapply`, maybe you can try something like the following:

## make a copy of your "text" column
DT[, vals := text]

## Use `cSplit` to create a "long" dataset. 
## Add a column to indicate the word's position in the text.
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
DTL
#            text  vals ind
# 1:   I love you     I   1
# 2:   I love you  love   2
# 3:   I love you   you   3
# 4: she loves me   she   1
# 5: she loves me loves   2
# 6: she loves me    me   3

## Now, you can extract values easily
DTL[ind == 1]
#            text vals ind
# 1:   I love you    I   1
# 2: she loves me  she   1
DTL[ind %in% c(1, 3)]
#            text vals ind
# 1:   I love you    I   1
# 2:   I love you  you   3
# 3: she loves me  she   1
# 4: she loves me   me   3
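(The long format also lends itself to the first-words goal from the update: filter on ind, then paste the words back together per sentence. A sketch using data.table alone, rebuilding the long table with strsplit so that splitstackshape is not required:)

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))
# long form equivalent to cSplit(DT, "vals", " ", "long")
DTL <- DT[, .(vals = strsplit(text, " ", fixed = TRUE)[[1]]), by = text]
DTL[, ind := seq_len(.N), by = text]
# collapse the first two words of each sentence back into a prefix
prefixes <- DTL[ind <= 2, .(prefix = paste(vals, collapse = " ")), by = text]
prefixes$prefix
# [1] "I love"    "she loves"
```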


Update 2

I don't know what type of timings you are getting, but as I mentioned in a comment, you can perhaps try using regular expressions so that you don't have to split and then paste the string back together.

Here's a sample....

Set up some data to play with:

library(data.table)
DT <- data.table(
  text = c("This is a sentence with a lot of words.",
           "This is a sentence with some more words.",
           "Words and words and even some more words.",
           "But, I don't know how you want to deal with punctuation...",
           "Just one more sentence, for easy multiplication.")
)

DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))

Test the gsub pattern to extract 5 words from each sentence....

## Regex to extract first five words -- this should work....
patt <- "^((?:\\S+\\s+){4}\\S+).*"

## Check out some of the timings
system.time(temp <- DT2[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#    0.03    0.00    0.03 
system.time(temp2 <- DT3[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#       3       0       3 
head(temp)
# [1] "This is a sentence with"     "This is a sentence with"     "Words and words and even"   
# [4] "But, I don't know how"       "Just one more sentence, for" "This is a sentence with" 
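(One caveat with the pattern: when a sentence has fewer than five words it simply does not match, and gsub then returns the input unchanged, so short sentences pass through whole. A quick check; perl = TRUE is added here so the (?: ) non-capturing group is guaranteed to be supported:)

```r
patt <- "^((?:\\S+\\s+){4}\\S+).*"
gsub(patt, "\\1", "one two three four five six seven", perl = TRUE)
# [1] "one two three four five"
# fewer than five words: no match, so the sentence comes back as-is
gsub(patt, "\\1", "too short", perl = TRUE)
# [1] "too short"
```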

My guess at what you're looking to do....

## I'm assuming you want something like this....
## Takes about a minute on my system. 
## ... but note the system time for the creation of "temp2" (without digest)
## Not sure if I interpreted your hash requirement correctly....
system.time(out <- DT3[
  , firstFive := gsub(patt, "\\1", text)][
  , firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
#    user  system elapsed 
#   62.14    0.05   62.20 

head(out)
#                                                          text                   firstFive firstFiveHash
# 1:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
# 2:                   This is a sentence with some more words.     This is a sentence with    4179639471
# 3:                  Words and words and even some more words.    Words and words and even    2556713080
# 4: But, I don't know how you want to deal with punctuation...       But, I don't know how    3765680401
# 5:           Just one more sentence, for easy multiplication. Just one more sentence, for     298317689
# 6:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
