用Julia生成ngram [英] Generate ngrams with Julia

查看:91
本文介绍了用Julia生成ngram的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

要在Julia中生成单词bigrams,我可以简单地浏览原始列表和删除第一个元素的列表,例如:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

要生成三字母组,我可以使用相同的collect(zip(...))惯用法来获取:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog") 

但是我必须手动添加第3个列表以进行浏览,是否有一种惯用的方式,使得我可以执行 n -gram的任何顺序?

例如我想避免这样做以提取5克分子:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

解决方案

对于任何长度的n-gram,这都是一个干净的内衬.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

它使用生成器理解将元素数量k迭代到drop.然后,使用splat(...)运算符将Drop s解压缩为zip,最后将collect s Zip分解为Array.

julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

如您所见,这与您的解决方案非常相似-仅添加了一个简单的理解即可对drop的元素数量进行迭代,从而使长度可以动态化.

To generate word bigrams in Julia, I could simply zip through the original list and a list that drops the first element, e.g.:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

To generate a trigram I could use the same collect(zip(...)) idiom to get:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog") 

But I have to manually add in the 3rd list to zip through, is there an idiomatic way such that I can do any order of n-gram?

e.g. I'll like to avoid doing this to extract 5-gram:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

解决方案

Here's a clean one-liner for n-grams of any length.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

It uses a generator comprehension to iterate over the number of elements, k, to drop. Then, using the splat (...) operator, it unpacks the Drops into zip, and finally collects the Zip into an Array.

julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

As you can see, this is very similar to your solution - only a simple comprehension was added to iterate over the number of elements to drop, so that the length could be dynamic.

这篇关于用Julia生成ngram的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆