CPU-and-memory efficient NGram extraction with R

Question

I wrote an algorithm that extracts NGrams (bigrams, trigrams, ... up to 5-grams) from a list of 50000 street addresses. My goal is to have, for each address, a boolean vector representing whether each NGram is present in the address. Each address will therefore be characterized by a vector of attributes, and I can then cluster the addresses.

The algorithm works this way: I start with the bigrams and compute all the combinations of (a-z, 0-9, / and tabulation), for example: aa, ab, ac, ..., a8, a9, a/, a , ba, bb, ... Then I loop over each address and record, for every bigram, a 0 or 1 (bigram absent or present). Afterward, I compute the trigrams for the bigrams that occur the most, and so on.

My problem is the time the algorithm takes to run. Another problem: R reaches its maximum capacity when there are more than 10000 NGrams. That is to be expected, because a 50000*10000 matrix is huge. I need your ideas to optimize the algorithm or to change it. Thank you.
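
To put rough numbers on the two problems described in the question, here is a small back-of-the-envelope calculation in R. It assumes the alphabet is exactly the 38 symbols listed (a-z, 0-9, "/" and one whitespace character) and that the matrix is stored densely; the variable names are illustrative only.

```r
# Alphabet from the question: a-z, 0-9, "/" and one whitespace character.
alphabet_size <- 26 + 10 + 2            # 38 symbols (assumption: exactly these)

# Number of candidate n-grams if every combination is enumerated up front.
n <- 2:5
candidates <- alphabet_size^n
names(candidates) <- paste0(n, "-grams")
print(candidates)

# Memory needed for a dense 50000 x 10000 matrix.
cells <- 50000 * 10000                  # 5e8 cells
gib_double  <- cells * 8 / 2^30         # numeric (double): 8 bytes per cell
gib_logical <- cells * 4 / 2^30         # logical or integer: 4 bytes per cell
print(c(double = gib_double, logical = gib_logical))
```

Under these assumptions there are 1444 possible bigrams but more than 79 million possible 5-grams, and the dense matrix needs roughly 3.7 GiB as doubles (about 1.9 GiB as logicals). Most of those cells would be zero, which is why counting only the NGrams that actually occur and storing them sparsely is worth considering.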

Recommended answer

Try the quanteda package, using this method. If you just want tokenized texts, replace dfm( with tokenize(.

I'd be very interested to know how it works on your 50,000 street addresses. We've put a lot of effort into making dfm() very fast and robust.

myDfm <- dfm(c("1780 wemmel", "2015 schlemmel"), what = "character", 
             ngram = 1:5, concatenator = "", 
             removePunct = FALSE, removeNumbers = FALSE, 
             removeSeparators = FALSE, verbose = FALSE)
t(myDfm) # for easier viewing
#         docs
# features text1 text2
#           1     1
# s         0     1
# sc        0     1
# sch       0     1
# schl      0     1
# w         1     0
# we        1     0
# wem       1     0
# wemm      1     0
# 0         1     1
# 0         1     0
# 0 w       1     0
# 0 we      1     0
# 0 wem     1     0
# 01        0     1
# 015       0     1
# 015       0     1
# 015 s     0     1
# 1         1     1
# 15        0     1
# 15        0     1
# 15 s      0     1
# 15 sc     0     1
# 17        1     0
# 178       1     0
# 1780      1     0
# 1780      1     0
# 2         0     1
# 20        0     1
# 201       0     1
# 2015      0     1
# 2015      0     1
# 5         0     1
# 5         0     1
# 5 s       0     1
# 5 sc      0     1
# 5 sch     0     1
# 7         1     0
# 78        1     0
# 780       1     0
# 780       1     0
# 780 w     1     0
# 8         1     0
# 80        1     0
# 80        1     0
# 80 w      1     0
# 80 we     1     0
# c         0     1
# ch        0     1
# chl       0     1
# chle      0     1
# chlem     0     1
# e         2     2
# el        1     1
# em        1     1
# emm       1     1
# emme      1     1
# emmel     1     1
# h         0     1
# hl        0     1
# hle       0     1
# hlem      0     1
# hlemm     0     1
# l         1     2
# le        0     1
# lem       0     1
# lemm      0     1
# lemme     0     1
# m         2     2
# me        1     1
# mel       1     1
# mm        1     1
# mme       1     1
# mmel      1     1
# s         0     1
# sc        0     1
# sch       0     1
# schl      0     1
# schle     0     1
# w         1     0
# we        1     0
# wem       1     0
# wemm      1     0
# wemme     1     0
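
For readers who want to see the underlying idea without quanteda, the following is a minimal sketch in base R plus the Matrix package. It is not how dfm() is implemented; it only illustrates extracting the character NGrams that actually occur and storing presence/absence in a sparse matrix. The helper `char_ngrams` and the threshold `min_docs` are hypothetical names introduced for this example.

```r
library(Matrix)  # recommended package shipped with standard R installations

# Hypothetical helper: the distinct character n-grams (lengths in `n`) of one string.
char_ngrams <- function(x, n = 1:5) {
  len <- nchar(x)
  grams <- lapply(n[n <= len], function(k) {
    starts <- seq_len(len - k + 1)
    substring(x, starts, starts + k - 1)
  })
  unique(unlist(grams))
}

addresses <- c("1780 wemmel", "2015 schlemmel")

# Only n-grams observed in the data enter the vocabulary.
grams_per_doc <- lapply(addresses, char_ngrams, n = 1:5)
vocab <- sort(unique(unlist(grams_per_doc)))

# Sparse logical matrix: one row per address, one column per observed n-gram.
# Only the (address, n-gram) pairs that are present are stored.
i <- rep(seq_along(grams_per_doc), lengths(grams_per_doc))
j <- match(unlist(grams_per_doc), vocab)
presence <- sparseMatrix(i = i, j = j, x = TRUE,
                         dims = c(length(addresses), length(vocab)),
                         dimnames = list(addresses, vocab))

# Optionally keep only n-grams found in at least `min_docs` addresses
# (an assumed threshold, to be tuned for the clustering step).
min_docs <- 2
frequent <- presence[, colSums(presence) >= min_docs, drop = FALSE]
print(colnames(frequent))
```

Because each address contains only a few dozen distinct NGrams, the sparse matrix stores a few dozen entries per row rather than one cell per vocabulary entry, which is the main memory saving over a dense 0/1 matrix.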
