Translate from TAG format to Regex for Corpus

Problem description

I'm working with a corpus linguistics tool called AntConc, where you have a document where every word is tagged as a part of speech (noun, adjective, etc), and you use specific commands to pull out matches. For example, if I was looking for a noun (which is tagged NN), I would use *_NN and it would find every noun in the document.

I need to translate my *_TAG syntax into Python regex, and I have no idea how to do that. For example, I have a phrase in TAG format: *_PP$ *_NN *_DT *_JJ *_NN (this translates to possessive pronoun, noun, determiner, adjective, noun; it would find things like "her voice an exact duplicate").

How does one go about changing things like that to regex? For now, I'll take just that basic stuff. Later I'll worry about figuring out how to do "or" and "if this then this" and whatnot.

If you need more info about the tags, try searching for POS tags CLAWS, which should give you a list.

Thanks so much for your help!

Solution

So I did some research and found this PDF file describing the notion of embedded tags and non-embedded tags. You are looking to find the embedded tags. So, if I'm correct, the input would look like this, right?

her_PP$ voice_NN an_DT exact_JJ duplicate_NN

Except that it would be part of a larger body of text, and you don't know the actual words, only the _XX tags.

In a regex, you have to be more specific than *. What you want in place of the * is one or more of any character that can be part of a word (letters, but possibly hyphens as well). For the noun, that gives:

[\w-]+_NN

This means a character class [...] of word characters \w and the hyphen -, repeated one or more times +, followed by _NN.
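
As a quick check in Python, here is a minimal sketch of that noun pattern applied with the standard `re` module. The sample string is the one from this answer; the variable names are only illustrative.

```python
import re

# Sample tagged text from the answer above.
tagged = "her_PP$ voice_NN an_DT exact_JJ duplicate_NN"

# One or more word characters or hyphens, followed by the literal tag _NN.
noun = re.compile(r"[\w-]+_NN")

print(noun.findall(tagged))  # ['voice_NN', 'duplicate_NN']
```

Note that, as written, the pattern also matches the beginning of any longer tag that starts with NN, so it may need tightening depending on the tag set.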

The possessive pronoun tag contains a $, which has a special meaning in regexes. If you want the literal character $ and not its special meaning, you need to escape it with a preceding \ like so:

[\w-]+_PP\$
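
If the tags come from user input, escaping by hand gets tedious. A small sketch, assuming Python, is to let `re.escape` add the backslash; the variable names here are just for illustration.

```python
import re

tag = "PP$"

# re.escape backslash-escapes regex metacharacters such as $.
escaped = re.escape(tag)
print(escaped)  # PP\$

pattern = re.compile(r"[\w-]+_" + escaped)
print(pattern.findall("her_PP$ voice_NN"))  # ['her_PP$']
```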

Lastly, you want to consider which characters are allowed in between the words. It could be just whitespace such as spaces, tabs and newlines, which would be \s+. It could also be "any character that isn't a word character", to allow periods, commas, quotes, colons, etc. That would be \W+ (note the uppercase W, which is the opposite of the lowercase \w).
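
To make the difference between the two separators concrete, here is a small sketch in Python. The sample text with a comma attached to a tag is an assumption made for the demonstration, not something taken from a real corpus.

```python
import re

# Assumed sample: a comma directly follows the noun tag.
text = "her_PP$ voice_NN, an_DT exact_JJ duplicate_NN"

strict = re.compile(r"[\w-]+_NN\s+[\w-]+_DT")  # whitespace only
loose = re.compile(r"[\w-]+_NN\W+[\w-]+_DT")   # any non-word characters

print(strict.search(text))          # None: the comma is not whitespace
print(loose.search(text).group(0))  # voice_NN, an_DT
```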

Combined, this amounts to:

[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN
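
The same construction can be automated. Below is a rough sketch of a converter in Python; `tag_query_to_regex` is a hypothetical helper written for this example (not part of AntConc or any library), and it assumes every token in the query has the form *_TAG.

```python
import re

def tag_query_to_regex(query, separator=r"\W+"):
    """Turn a query such as '*_PP$ *_NN' into a regex string.

    Hypothetical helper: it only understands tokens of the form *_TAG.
    """
    parts = []
    for token in query.split():
        word, _, tag = token.rpartition("_")
        if word != "*" or not tag:
            raise ValueError(f"unsupported token: {token!r}")
        # re.escape handles tags such as PP$ that contain metacharacters.
        parts.append(r"[\w-]+_" + re.escape(tag))
    return separator.join(parts)

regex = tag_query_to_regex("*_PP$ *_NN *_DT *_JJ *_NN")
print(regex)

# Assumed sample text, extended from the example in this answer.
tagged = "was_VBD her_PP$ voice_NN an_DT exact_JJ duplicate_NN of_IN mine_PP"
match = re.search(regex, tagged)
print(match.group(0) if match else None)
```

The generated string is the same as the hand-written pattern, and the search returns the five-word phrase from the question. Pass separator=r"\s+" if only whitespace should be allowed between words.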

To match "an undetermined number of unknown words", you would use this:

(?:[\w-]+\W+)*?

So the part that matches the word, [\w-]+, and the part that goes in between, \W+, are wrapped into a non-capturing group (?:...). That group occurs 0 or more times because of the *, but as few times as possible because of the ?, which avoids greediness. You can see it here and remove or add an X to see it will still match.
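
To see what the lazy quantifier changes, here is a minimal Python sketch that looks for a possessive pronoun followed, after any number of words, by an adjective. The sample sentence with two adjectives is an assumption made for the demonstration.

```python
import re

# Assumed sample with two adjectives, so lazy and greedy differ.
tagged = "her_PP$ voice_NN an_DT exact_JJ clear_JJ duplicate_NN"

# *? skips as few intermediate words as possible.
lazy = re.search(r"[\w-]+_PP\$\W+(?:[\w-]+\W+)*?[\w-]+_JJ", tagged)
# * skips as many intermediate words as possible.
greedy = re.search(r"[\w-]+_PP\$\W+(?:[\w-]+\W+)*[\w-]+_JJ", tagged)

print(lazy.group(0))    # her_PP$ voice_NN an_DT exact_JJ
print(greedy.group(0))  # her_PP$ voice_NN an_DT exact_JJ clear_JJ
```

The lazy version stops at the first adjective it can reach, while the greedy one runs on to the last.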
