Retaining unique identifiers (e.g., record ID) when using tm functions - doesn't work with lots of data?


Question

I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) IDs while pre-processing. I have a solution that works on a subset of the data but fails with all the data (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with me using a workaround and relying on row names. For example, it appears to work with the first ~15K posts, but when I keep subsetting, it fails. I realize my code is less than elegant, so I'm happy to learn better or completely different solutions. All I care about is keeping the IDs when I go to a VCorpus and then back again. I'm new to the tm package, and to the readTabular function in particular. (Note: I ran tolower and removeWords before making the VCorpus, as I originally thought that was part of the issue.)

Working code is below:

library(tm)     # text-mining functions used below
library(dplyr)  # %>% and select()

# Three example posts with their Facebook record IDs
fb = data.frame(RecordContent = c("I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
                                  "Photo fails of this morning. Really Joe?",
                                  "This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"),
                FromRecordId = c(682245468452447, 737891849554475, 453178808037464),
                stringsAsFactors = FALSE)

Remove punctuation and convert to lower case

fb$RC = tolower(gsub("[[:punct:]]", "", fb$RecordContent)) 
fb$RC2 = removeWords(fb$RC, stopwords("english"))

Step 1: create a special reader function to retain the record IDs

myReader = readTabular(mapping=list(content="RC2", id="FromRecordId"))

Step 2: make my corpus. Read in the data using DataframeSource and the custom reader function, where each FB post is a document

corpus.test = VCorpus(DataframeSource(fb), readerControl = list(reader = myReader))

Step 3: clean and stem

corpus.test2 = corpus.test %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument, language = "english") %>%
  as.VCorpus()

Step 4: convert the corpus back to a character vector. The row names are now the IDs

fb2 = data.frame(unlist(sapply(corpus.test2, `[`, "content")), stringsAsFactors = F)

Step 5: make a new ID variable for the later merge, name the variables, and prepare to merge back onto the original dataset

fb2$ID = row.names(fb2)
# Strip only a literal ".content" suffix; an unescaped "." would match any character
fb2$RC.ID = sub("\\.content$", "", fb2$ID)
colnames(fb2)[1] = "RC.stem"
fb3 = select(fb2, RC.ID, RC.stem)
row.names(fb3) = NULL

Recommended answer

I think the IDs are stored and retained by default by the tm package. You can fetch them all (in a vectorized manner) with

meta(corpus.test, "id")

$`682245468452447`
[1] "682245468452447"

$`737891849554475`
[1] "737891849554475"

$`453178808037464`
[1] "453178808037464"
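Building on that, the IDs can be read straight from the document metadata when converting back to a data frame, so nothing depends on row names. The following is a minimal sketch, not the asker's exact pipeline. It assumes a toy data frame with the same column names as the question, stores the IDs as character, and builds the corpus with VectorSource while setting each id explicitly, because readTabular may not be available in newer tm releases. Stemming is left out to keep the example short.

```r
library(tm)

# Toy stand-in for the pre-processed posts; IDs are stored as character
# so they are never reformatted as numbers.
fb <- data.frame(
  FromRecordId = c("682245468452447", "737891849554475", "453178808037464"),
  RC2 = c("im dating celebrity skip 202",
          "photo fails morning really joe",
          "piece almost two years making"),
  stringsAsFactors = FALSE
)

# Build the corpus, then attach each record ID as document-level metadata.
corpus <- VCorpus(VectorSource(fb$RC2))
for (i in seq_len(length(corpus))) {
  meta(corpus[[i]], "id") <- fb$FromRecordId[i]
}

# Transformations applied with tm_map() keep the document metadata.
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Read ID and text back out of each document; no row names involved.
idx <- seq_len(length(corpus))
fb3 <- data.frame(
  RC.ID   = vapply(idx, function(i) meta(corpus[[i]], "id"), character(1)),
  RC.stem = vapply(idx, function(i) paste(content(corpus[[i]]), collapse = " "),
                   character(1)),
  stringsAsFactors = FALSE
)

# Join the cleaned text back onto the original rows by ID.
merged <- merge(fb, fb3, by.x = "FromRecordId", by.y = "RC.ID")
```

Because each ID travels with its document, the same code should apply unchanged to a chunk of the data or to the full set.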

I'd recommend reading the documentation of the tm::meta() function, but it's not very good.

You can also add arbitrary metadata (as key-value pairs) to each item in the corpus, as well as collection-level metadata.
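As a rough illustration of both kinds of metadata, the sketch below sets one key-value pair on a single document and one on the corpus as a whole. The tag names "source" and "project" and their values are made up for the example.

```r
library(tm)

corpus <- VCorpus(VectorSource(c("first post", "second post")))

# Document-level key-value pair on a single item (hypothetical tag name).
meta(corpus[[1]], "source") <- "facebook"

# Collection-level metadata stored on the corpus itself (hypothetical tag name).
meta(corpus, "project", type = "corpus") <- "fb-preprocessing"

# Inspect what was stored.
meta(corpus[[1]], "source")
meta(corpus, type = "corpus")
```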
