How to add tags to negated words in strings that follow "not", "no" and "never"
Question
How do I add the tag NEG_ to all words that follow "not", "no" and "never" until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
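Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.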
Any idea how to solve this?
Recommended answer
To make up for the Perl features that Python's re regex engine lacks, you can pass a lambda expression as the replacement argument of re.sub to build the replacement dynamically:
    import re

    string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"

    # Outer re.sub selects each negated span; the lambda tags the words inside it.
    transformed = re.sub(
        r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
        lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
        string,
        flags=re.IGNORECASE)
    print(transformed)
This prints:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
That is: your negation keyword (\b is a word boundary, (?:...) a non-capturing group), followed by word characters and whitespace (\w is [0-9a-zA-Z_], plus other Unicode word characters in Python 3 unless re.ASCII is set; \s is any kind of whitespace), up to the first character that is neither a word character nor whitespace (acting as punctuation).
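To see what this first step selects on its own, here is a minimal sketch that lists the matched spans for the question's input. The name NEGATION_SPAN is only illustrative and not part of the original answer.

```python
import re

# Outer pattern from the answer: a negation keyword, then word characters
# and whitespace, up to and including the first punctuation character.
# NEGATION_SPAN is a hypothetical name chosen for this sketch.
NEGATION_SPAN = re.compile(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]', re.IGNORECASE)

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")

# The pattern has no capturing groups, so findall returns the full matches.
spans = NEGATION_SPAN.findall(text)
print(spans)  # ['never going to work,', 'not play so well,']
```

Each span starts at the keyword and ends at the punctuation mark, so the text after the comma is left alone.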
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match up to the end of the string as well.
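The difference shows up on a sentence with no closing punctuation. The sketch below compares the two patterns; the helper tag_negated and the constant names are assumptions made for illustration, wrapping the same re.sub call as the answer.

```python
import re

def tag_negated(text, pattern):
    """Prefix NEG_ to every word inside each span matched by `pattern`.

    Hypothetical helper for this sketch; it wraps the answer's nested re.sub.
    """
    return re.sub(
        pattern,
        lambda m: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', m.group(0)),
        text,
        flags=re.IGNORECASE,
    )

WITH_PUNCT = r'\b(?:not|never|no)\b[\w\s]+[^\w\s]'   # punctuation required
WITHOUT_PUNCT = r'\b(?:not|never|no)\b[\w\s]+'       # may run to end of string

text = "I will never do that"
strict = tag_negated(text, WITH_PUNCT)
relaxed = tag_negated(text, WITHOUT_PUNCT)
print(strict)   # I will never do that
print(relaxed)  # I will never NEG_do NEG_that
```

With the punctuation required, the unterminated sentence is not matched at all and comes back unchanged; without it, the span simply stops where the word characters and whitespace run out.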
Now you're dealing with strings like "never going to work,". Just select the words preceded by spaces with
(\s+)(\w+)
and replace them with what you want:
\1NEG_\2
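As a quick check of this second step in isolation, the following sketch applies the inner substitution to one already-selected fragment. The variable names are illustrative only.

```python
import re

# One span as selected by the outer pattern.
fragment = "never going to work,"

# Group 1 keeps the original whitespace, group 2 is the word to tag.
tagged = re.sub(r'(\s+)(\w+)', r'\1NEG_\2', fragment)
print(tagged)  # never NEG_going NEG_to NEG_work,
```

The keyword itself stays untagged because it sits at the start of the fragment with no whitespace before it, and the trailing comma is untouched because it is not a word character.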