quanteda kwic regex operation

Question

Further edit to the original question.
The question arose from my expectation that regexes would work the same, or nearly the same, as in "grep" or in a programming language. The following is what I expected, and the fact that it did not happen prompted my question (using Cygwin):

echo "regex unusual operation will deport into a different" > out.txt
grep "will * dep" out.txt
"regex unusual operation will deport into a different"


Original question
I am trying to follow https://github.com/kbenoit/ITAUR/blob/master/README.md
to learn quanteda, after seeing that everybody who uses this package finds it very good.
In demo.R, on line 22, I find:

kwic(immigCorpus, "deport", window = 3)  

whose output is:

[BNP, 157]        The BNP will | deport | all foreigners convicted  
[BNP, 1946]                . 2. | Deport | all illegal immigrants    
[BNP, 1952] immigrants We shall | deport | all illegal immigrants  
[BNP, 2585]  Criminals We shall | deport | all criminal entrants  

To try out and learn the basics, I ran

kwic(immigCorpus, "will *depo", window = 3, valuetype = "regex")

expecting to get

[BNP, 157]        The BNP will | deport | all foreigners convicted

but I got

kwic object with 0 rows

A similar attempt,

kwic(immigCorpus, ".*will *depo.*", window = 3, valuetype = "regex")

gives the same result:

kwic object with 0 rows

Why is that? Is it tokenization? If so, how should I write the regex?

PS: Thanks for this great package.

Recommended answer

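You are trying to match a phrase with your pattern. By default, the pattern argument is treated as a space-separated list of keywords, and the search is performed against this list. Because kwic() matches patterns against individual tokens, a regex that spans whitespace, such as "will *depo", cannot match any single token. Wrapping the pattern in phrase() makes it a multi-token sequence, so you can get your expected result using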

> kwic(immigCorpus, phrase("will deport"), window = 3)
[BNP, 156:157] - The BNP | will deport | all foreigners convicted

Setting valuetype = "regex" only makes sense if the pattern actually is a regex. For example, to get both "shall deport" and "will deport", use

> kwic(immigCorpus, phrase("(will|shall) deport"), window = 3, valuetype = "regex")

   [BNP, 156:157]             - The BNP | will deport  | all foreigners convicted
 [BNP, 1951:1952] illegal immigrants We | shall deport | all illegal immigrants  
 [BNP, 2584:2585]  Foreign Criminals We | shall deport | all criminal entrants 
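
The difference between grep-style and token-wise matching can be sketched in base R. This is only an illustration under simplifying assumptions, not quanteda's actual implementation: a whitespace split stands in for the tokenizer, and match_phrase is a hypothetical helper written for this example.

```r
# Illustration only: base R, not quanteda's actual implementation.
text <- "The BNP will deport all foreigners convicted"

# Assumption: a plain whitespace split stands in for real tokenization.
toks <- strsplit(text, " +")[[1]]

# grep-style matching sees the whole line, so a pattern spanning a space matches.
grepl("will *depo", text)        # TRUE

# Token-wise matching tests each token on its own; no single token
# contains both "will" and "depo", so nothing matches.
any(grepl("will *depo", toks))   # FALSE

# Hypothetical helper: phrase-style matching, one regex per token,
# all of them matched at consecutive token positions.
match_phrase <- function(tokens, patterns) {
  n <- length(tokens)
  k <- length(patterns)
  if (k > n) return(integer(0))
  starts <- seq_len(n - k + 1)
  hits <- vapply(starts, function(i) {
    window <- tokens[i:(i + k - 1)]
    all(mapply(function(p, t) grepl(p, t, ignore.case = TRUE), patterns, window))
  }, logical(1))
  starts[hits]
}

# Start position of the first token of each matching two-token sequence.
match_phrase(toks, c("(will|shall)", "deport"))   # 3
```

Here the phrase pattern is split into one regex per token, which is why "(will|shall) deport" can succeed where a single regex containing a space cannot.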

See the kwic documentation.