quanteda kwic regex operation
Question
Further edit to the original question.
The question arose from my expectation that regexes would work identically, or nearly so, to grep or to some programming language. Below is what I expected; the fact that it did not happen prompted my question (using cygwin):
echo "regex unusual operation will deport into a different" > out.txt
grep "will * dep" out.txt
"regex unusual operation will deport into a different"
Original question
I am trying to follow https://github.com/kbenoit/ITAUR/blob/master/README.md to learn quanteda, after seeing that everybody who uses this package finds it very good.
In demo.R, on line 22, I find the line:
kwic(immigCorpus, "deport", window = 3)
Its output is:
[BNP, 157] The BNP will | deport | all foreigners convicted
[BNP, 1946] . 2. | Deport | all illegal immigrants
[BNP, 1952] immigrants We shall | deport | all illegal immigrants
[BNP, 2585] Criminals We shall | deport | all criminal entrants
To try it out and learn the basics, I ran
kwic(immigCorpus, "will *depo", window = 3, valuetype = "regex")
expecting to get
[BNP, 157] The BNP will | deport | all foreigners convicted
but I got:
kwic object with 0 rows
A similar attempt,
kwic(immigCorpus, ".*will *depo.*", window = 3, valuetype = "regex")
gives the same result:
kwic object with 0 rows
Why is that? Tokenization? If so, how should I write the regex?
PS Thanks for this great package
Recommended answer
You are trying to match a phrase with your pattern. By default, the pattern argument is treated as a space-separated list of keywords, and each keyword is matched against individual tokens, so a regex that spans a space finds nothing. Wrapping the pattern in phrase() matches it against a sequence of tokens instead, so you can get your expected result using:
> kwic(immigCorpus, phrase("will deport"), window = 3)
[BNP, 156:157] - The BNP | will deport | all foreigners convicted
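To see why the grep-style pattern returns nothing, here is a minimal base R sketch of the idea. It does not use quanteda's real tokenizer: splitting on whitespace with strsplit() is a crude stand-in, and the variable names are only illustrative.

```r
txt <- "The BNP will deport all foreigners convicted"

# grep-style: the regex runs over the whole string, so the space can match
whole_match <- grepl("will *depo", txt)

# token-style: the same regex is tried on each token separately;
# no single token contains both "will" and "depo"
toks <- strsplit(txt, "\\s+")[[1]]
token_match <- any(grepl("will *depo", toks))

# phrase-style: one sub-pattern per token, matched on consecutive tokens
parts <- c("will", "depo")
first  <- grepl(parts[1], toks[-length(toks)])  # tokens 1..n-1
second <- grepl(parts[2], toks[-1])             # tokens 2..n
hits <- which(first & second)
phrase_hit <- paste(toks[hits], toks[hits + 1])

print(whole_match)
print(token_match)
print(phrase_hit)
```

The first check succeeds and the second fails, which mirrors the difference between grep on a line of text and kwic on tokens. The third mimics what phrase() does conceptually.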
valuetype = "regex" makes sense if you are using a regular expression. For example, to get both shall deport and will deport, use:
> kwic(immigCorpus, phrase("(will|shall) deport"), window = 3, valuetype = "regex")
[BNP, 156:157] - The BNP | will deport | all foreigners convicted
[BNP, 1951:1952] illegal immigrants We | shall deport | all illegal immigrants
[BNP, 2584:2585] Foreign Criminals We | shall deport | all criminal entrants
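If you want to experiment without the immigCorpus object, a small self-contained sketch might look like the following. It assumes a recent quanteda version, where kwic() is applied to a tokens object rather than directly to a corpus. The sample sentences and the names toks and kw are made up for illustration.

```r
library(quanteda)

# Made-up sample text; tokens() splits it into word and punctuation tokens
toks <- tokens("We shall deport all. The BNP will deport all. They may deport some.")

# Each space-separated part of the phrase is a regex matched against one token,
# and the parts must match consecutive tokens
kw <- kwic(toks, pattern = phrase("(will|shall) deport"),
           window = 3, valuetype = "regex")
print(kw)
```

Only the shall deport and will deport sequences should be reported. "may deport" is skipped because its first token does not match the alternation.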
See the kwic documentation.