Generate ngrams with Julia
Problem description
To generate word bigrams in Julia, I could simply zip through the original list and a list that drops the first element, e.g.:
julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
"the"
"lazy"
"fox"
"jumps"
"over"
"the"
"brown"
"dog"
julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
To generate trigrams I could use the same collect(zip(...)) idiom:
julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox")
("lazy","fox","jumps")
("fox","jumps","over")
("jumps","over","the")
("over","the","brown")
("the","brown","dog")
But I have to manually add the third list to zip through. Is there an idiomatic way to generate n-grams of any order?
For example, I'd like to avoid writing this to extract 5-grams:
julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
Here's a clean one-liner for n-grams of any length.
ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))
It uses a generator expression to iterate over k, the number of elements to drop. Then, using the splat (...) operator, it unpacks the resulting Drop iterators into zip, and finally collects the Zip into an Array.
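The three steps in that one-liner can be pulled apart to see what each piece contributes. This is only a sketch, and it assumes Julia 1.x, where drop is not exported from Base and has to be called as Iterators.drop; the intermediate names (shifted, zipped, grams) are illustrative and not part of the original answer.

```julia
# Assumes Julia 1.x: `drop` lives in Base.Iterators and is not exported.
s = split("the lazy fox jumps over the brown dog")
n = 3

# Generator: one lazily shifted view of `s` per offset k = 0, 1, ..., n-1.
shifted = (Iterators.drop(s, k) for k in 0:n-1)

# Splat: pass each shifted view to `zip` as a separate argument.
zipped = zip(shifted...)

# Collect: materialize the lazy zip into a Vector of n-tuples.
# `zip` stops at the shortest input, so the result has length(s) - n + 1 entries.
grams = collect(zipped)
```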
julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
As you can see, this is very similar to your solution - only a simple comprehension was added to iterate over the number of elements to drop, so that the length could be dynamic.
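The transcripts above come from an older Julia release in which drop was available without a prefix. On Julia 1.x the same idea might be written as follows; the function name ngrams and the argument check are assumptions added for illustration, not part of the original answer.

```julia
# Hypothetical helper, assuming Julia 1.x (Iterators.drop instead of drop).
function ngrams(tokens, n::Integer)
    # Guard against n < 1, which would leave `zip` with nothing to zip over.
    n >= 1 || throw(ArgumentError("n must be at least 1"))
    # Zip the token list against itself shifted by 0, 1, ..., n-1 positions.
    return collect(zip((Iterators.drop(tokens, k) for k in 0:n-1)...))
end

s = split("the lazy fox jumps over the brown dog")
bigrams = ngrams(s, 2)
fivegrams = ngrams(s, 5)
```

Each element of the result is a tuple of n consecutive tokens, matching the output shown for ngram(s, 2) and ngram(s, 5) above.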