Manipulate char vectors inside a data.table object in R

Question


I'm still a bit new to using data.table and to understanding all its subtleties. I've looked in the docs and at other examples on SO but couldn't find what I want, so please help!

I have a data.table which is basically a char vector (each entry being a sentence)

DT=c("I love you","she loves me")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)

# > DT
#            text
# 1:   I love you
# 2: she loves me

What I'd like to do, is to be able to perform some basic string operations inside the DT object. For example, add a new column where I would have a char vector for which each entry is a WORD from the string in the "text" column.

so I'd like to have for example a new column charvec where

> DT[1]$charvec
[1] "I"    "love" "you"

Of course, I would like to do it the data.table way, ultra-fast, because I need to do this kind of thing on files which are >1 GB each, and to use more complex and computation-heavy functions. So no use of apply, lapply, or mapply.

My closest attempt to do something which looks like it is as follows:

myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
# > DU2
#            text      charvec
# 1:   I love you   I,love,you
# 2: she loves me she,loves,me

For example, to make a function which removes the first word of each sentence, I did this

myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
# > DV2
#            text  charvec
# 1:   I love you love,you
# 2: she loves me loves,me

the trouble is, in the column charvec, i've got a list and not a vector...

> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"

1) How can I do what I want? Other kinds of functions I'm thinking of using include subsetting the char vector, applying some hash to it, etc.

2) BTW, can I get to DU2 or DV2 in one line instead of two?

3) I don't understand the data.table syntax well. Why is it that with the command list() inside the [...], the column V1 vanishes?

4) On another thread, I read a bit about the function cSplit. Is it any good? Is it a function adapted to data.table objects?
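(Editorial note on question 2: since strsplit is vectorised over its input, the list column can in fact be built in a single step, with no by= grouping and no *apply call. A minimal sketch:)

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))
# strsplit returns a list with one character vector per sentence,
# which data.table stores directly as a list column
DU2 <- DT[, .(text, charvec = strsplit(text, " ", fixed = TRUE))]
DU2[1, charvec[[1]]]
# [1] "I"    "love" "you"
```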

thanks very much

UPDATE

Thanks to @Ananda Mahto. Perhaps I should make my ultimate objective clearer: I have a huge file of 10,000,000 sentences stored as strings. As a first step of the project, I want to hash the first 5 words of each sentence. 10,000,000 sentences wouldn't even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, which makes around ten 1 GB files. The following code takes several minutes on my laptop for just a single file.

library(data.table); library(digest);
num_row=1000000
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
rawdata <- DT

hash2 <- function(word){ #using library(digest)
        as.numeric(paste("0x",digest(word,algo="murmur32"),sep=""))
}
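(The hash2 helper relies on R parsing "0x"-prefixed strings as hexadecimal: digest(word, algo = "murmur32") returns an 8-character hex string, and pasting "0x" in front lets as.numeric() convert it. The conversion step alone, in base R:)

```r
# "0x"-prefixed strings are parsed as hexadecimal by as.numeric()
as.numeric(paste0("0x", "1f"))   # 31
as.numeric(paste0("0x", "ff"))   # 255
# a 32-bit hash can reach 0xffffffff, which exceeds .Machine$integer.max;
# that is why as.numeric() is used rather than as.integer()/strtoi()
as.numeric(paste0("0x", "ffffffff"))  # 4294967295
```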

then,

print(system.time({ 

        colnames(rawdata) <- "sentence"
        rawdata <- lapply(rawdata,strsplit," ")

        sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
        hash_list <- sapply(sentences_begin,hash2)
        # remove(rawdata)
})) ## end of print system.time for loading the data

I know I'm pushing R to its limits here, but I'm struggling to find faster implementations, and I was thinking about data.table features... hence all my questions.

Here is an implementation excluding lapply, but it's actually slower!

print(system.time({
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]

myfun2 <- function(l){l[[1]][2:6]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]

rebuildsentence <- function(S){
        paste(S,collapse=" ") }

myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}

DW1 <- DV2[,myfun3(charvec),by=text]

})) #end of system.time

In this implementation with data.table, there is no lapply, so I hoped the hashing would be faster. However, because in every column I have a list instead of a char vector, this may slow the whole thing down significantly (?).

Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I hoped to speed that up with a more efficient data structure. People using Python, Java, etc. do a similar job in a few seconds.

Of course, another road would be to find a faster hash function, but I assumed the one in the digest package was already optimized.

Solution

I'm not really sure what you're after, but you can try cSplit_l from my "splitstackshape" package to get to your list column:

library(splitstackshape)
DU <- cSplit_l(DT, "DT", " ")

Then, you can write a function like the following to remove values from the list column:

RemovePos <- function(inList, pos = 1) {
  lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
}

Example usage:

DU[, list(RemovePos(DT_list, 1)), by = DT]
#              DT       V1
# 1:   I love you love,you
# 2: she loves me loves,me
DU[, list(RemovePos(DT_list, 2)), by = DT]
#              DT     V1
# 1:   I love you  I,you
# 2: she loves me she,me
DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
#              DT  V1
# 1:   I love you you
# 2: she loves me  me
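(The guard pos[pos <= length(x)] inside RemovePos is what makes out-of-range positions a silent no-op instead of an error; in isolation:)

```r
x <- c("I", "love", "you")
pos <- c(1, 5)                  # position 5 does not exist in x
x[-c(pos[pos <= length(x)])]    # only position 1 survives the guard and is dropped
# [1] "love" "you"
```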


Update

Based on your loathing of `lapply`, maybe you can try something like the following:

## make a copy of your "text" column
DT[, vals := text]

## Use `cSplit` to create a "long" dataset. 
## Add a column to indicate the word's position in the text.
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
DTL
#            text  vals ind
# 1:   I love you     I   1
# 2:   I love you  love   2
# 3:   I love you   you   3
# 4: she loves me   she   1
# 5: she loves me loves   2
# 6: she loves me    me   3

## Now, you can extract values easily
DTL[ind == 1]
#            text vals ind
# 1:   I love you    I   1
# 2: she loves me  she   1
DTL[ind %in% c(1, 3)]
#            text vals ind
# 1:   I love you    I   1
# 2:   I love you  you   3
# 3: she loves me  she   1
# 4: she loves me   me   3
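(The long format also lends itself to the first-words goal from the update: filter on ind, then paste the words back together per sentence. A sketch using data.table alone, rebuilding the long table with strsplit so that splitstackshape is not required:)

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))
# long form equivalent to cSplit(DT, "vals", " ", "long")
DTL <- DT[, .(vals = strsplit(text, " ", fixed = TRUE)[[1]]), by = text]
DTL[, ind := seq_len(.N), by = text]
# collapse the first two words of each sentence back into a prefix
prefixes <- DTL[ind <= 2, .(prefix = paste(vals, collapse = " ")), by = text]
prefixes$prefix
# [1] "I love"    "she loves"
```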


Update 2

I don't know what type of timings you are getting, but as I mentioned in a comment, you can perhaps try using regular expressions so that you don't have to split and then paste the string back together.

Here's a sample....

Set up some data to play with:

library(data.table)
DT <- data.table(
  text = c("This is a sentence with a lot of words.",
           "This is a sentence with some more words.",
           "Words and words and even some more words.",
           "But, I don't know how you want to deal with punctuation...",
           "Just one more sentence, for easy multiplication.")
)

DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))

Test the gsub pattern to extract 5 words from each sentence....

## Regex to extract first five words -- this should work....
patt <- "^((?:\\S+\\s+){4}\\S+).*"

## Check out some of the timings
system.time(temp <- DT2[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#    0.03    0.00    0.03 
system.time(temp2 <- DT3[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#       3       0       3 
head(temp)
# [1] "This is a sentence with"     "This is a sentence with"     "Words and words and even"   
# [4] "But, I don't know how"       "Just one more sentence, for" "This is a sentence with" 
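(One caveat with the pattern: when a sentence has fewer than five words it simply does not match, and gsub then returns the input unchanged, so short sentences pass through whole. A quick check; perl = TRUE is added here so the (?: ) non-capturing group is guaranteed to be supported:)

```r
patt <- "^((?:\\S+\\s+){4}\\S+).*"
gsub(patt, "\\1", "one two three four five six seven", perl = TRUE)
# [1] "one two three four five"
# fewer than five words: no match, so the sentence comes back as-is
gsub(patt, "\\1", "too short", perl = TRUE)
# [1] "too short"
```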

My guess at what you're looking to do....

## I'm assuming you want something like this....
## Takes about a minute on my system. 
## ... but note the system time for the creation of "temp2" (without digest)
## Not sure if I interpreted your hash requirement correctly....
system.time(out <- DT3[
  , firstFive := gsub(patt, "\\1", text)][
  , firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
#    user  system elapsed 
#   62.14    0.05   62.20 

head(out)
#                                                          text                   firstFive firstFiveHash
# 1:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
# 2:                   This is a sentence with some more words.     This is a sentence with    4179639471
# 3:                  Words and words and even some more words.    Words and words and even    2556713080
# 4: But, I don't know how you want to deal with punctuation...       But, I don't know how    3765680401
# 5:           Just one more sentence, for easy multiplication. Just one more sentence, for     298317689
# 6:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
