Replace single quotes with double quotes, excluding some elements


Question

I want to replace all single quotes in a string with double quotes, except for occurrences such as "n't", "'ll", "'m", etc.

input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""

Code 1: (@https://stackoverflow.com/users/918959/antti-haapala)

import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

There are 3 cases: ' is NOT preceded and NOT followed by an alphanumeric character; or it is not preceded, but is followed, by an alphanumeric character; or it is preceded, but not followed, by an alphanumeric character.

Issue: This doesn't work on words that end in an apostrophe, i.e. most plural possessives, and it also doesn't work on informal abbreviations that start with an apostrophe.
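The three cases and the issue can be reproduced with a minimal sketch. The question's own input is used first; the two other strings are made-up illustrations, not taken from the question.

```python
import re

def convert_regex(text):
    # Replace every ' that does not sit between two word characters.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

# The question's input: the contraction is kept, the quotation is converted.
print(convert_regex("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"

# Illustrative strings (assumptions, not from the question) showing the issue:
print(convert_regex("the dogs' bowls"))   # plural possessive -> the dogs" bowls
print(convert_regex("'tis the season"))   # leading apostrophe -> "tis the season
```

Both failures follow directly from the rule: an apostrophe at the edge of a word looks exactly like a quotation mark to these lookarounds.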

Code 2: (@https://stackoverflow.com/users/953482/kevin)

def convert_text_func(s):
    c = "_"  # placeholder character; must NOT appear in the string
    assert c not in s
    # Map each protected word to a copy whose apostrophe is swapped for the placeholder.
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k, v in protected.items():  # items() replaces Python 2's iteritems()
        s = s.replace(k, v)
    s = s.replace("'", '"')
    for k, v in protected.items():
        s = s.replace(v, k)
    return s

The set of words is too large to specify; for example, how would one specify persons' etc.? Please help.

Edit 1: I am using @anubhava's brilliant answer, but I am facing an issue: the approach sometimes fails on words translated from another language. Code:

text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)

Problem:

In the text 'Kumbh melas', melas is a Hindi-to-English translation, not a plural possessive noun.

Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
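The failure can be reproduced with a minimal sketch that wraps the regex from Edit 1 in a helper. The name convert_edit1 is hypothetical, and the last string is a made-up illustration of the plural possessive case that the (?<!s) lookbehind is meant to protect.

```python
import re

def convert_edit1(text):
    # Hypothetical wrapper around the regex from Edit 1: skip ' after "s"
    # and ' before common contraction endings.
    return re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)

kumbh = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
print(convert_edit1(kumbh))
# Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,

print(convert_edit1("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"

# Illustrative string (an assumption, not from the question):
print(convert_edit1("the dogs' bowls"))
# the dogs' bowls
```

The same lookbehind that keeps dogs' intact also keeps the closing quote after melas, because the regex only sees that the preceding letter is s.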

I am looking to add a condition that somehow fixes this. Human-level intervention is the last option.

Edit 2: A naive and long approach to fix this:

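import string
import enchant  # third-party package: pyenchant

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenizer helper, not shown in the question
    punctuations = set(string.punctuation)
    for i, word in enumerate(words):
        print(i, word)  # debug output
        # A non-dictionary word directly followed by ' is assumed to end a quotation.
        # The bounds check keeps words[i + 1] from running past the last token.
        if i + 1 < len(words) and word not in punctuations and not d.check(word) and words[i + 1] == "'":
            text = text.replace(word + "'", word + '"')
    return text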

Are there any corner cases I am missing, or is there a better approach?

Recommended answer

First attempt

You can also use this regex:

(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))

Demo on regex101

This regex matches a whole sentence/word together with both quotation marks, opening and closing, and also captures the quoted content in group 1, so you can replace the matched part with "\1".

  • (?<!\w) - negative lookbehind for a word character, to exclude words like "you'll", etc., but to allow the regex to match quotations after characters like \n, :, ;, . or -, etc. Assuming that there will always be whitespace before a quotation is risky,
  • ' - single quotation mark,
  • ((?:.|\n)+?'?) - capturing group 1, wrapping a non-capturing group: one or more of any character or newline (to match multiline sentences) with a lazy quantifier (to avoid matching from the first to the last single quotation mark), followed by an optional single quotation mark, in case there are two in a row,
  • '(?!\w) - single quotation mark not followed by a word character, to exclude text like "i'm", "you're", etc., where the quotation mark is between letters.
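The replacement described above can be sketched in Python as follows. The helper name convert_first_attempt is hypothetical; the regex and the "\1" replacement are the ones given in the answer.

```python
import re

# The first-attempt regex; group 1 captures the quoted content.
FIRST_ATTEMPT = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def convert_first_attempt(text):
    # Hypothetical helper: rewrap the captured content in double quotes.
    return FIRST_ATTEMPT.sub(r'"\1"', text)

print(convert_first_attempt("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"

kumbh = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
print(convert_first_attempt(kumbh))
# Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,

# A quotation containing a plural possessive is closed too early:
print(convert_first_attempt("'the classes' hours'"))
# "the classes" hours'
```

Because the whole quotation is matched as a pair, a closing quote after an s (as in melas) is handled, but an apostrophe after an s inside the quotation is mistaken for the closing quote.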

However, it still has a problem with sentences in which an apostrophe occurs after a word ending in s, like: 'the classes' hours'. I think it is impossible for a regex to distinguish whether an s followed by ' should be treated as the end of a quotation or as an s with an apostrophe. But I figured out a limited workaround for this problem, with this regex:

(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))

Demo on regex101

Python implementation

It adds an alternative for cases with s': (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where:

  • (?<!s)'(?!\w) - if there is no s before ', match as in the regex above (first attempt),
  • (?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before ', end the match on this ' only if no other ' followed by a non-word character occurs in the following text, before the end of the text or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w is there to include in such a match any ' that sits between letters, as in i'm, etc.

This regex should mismatch only if there are several s' cases in a row. Still, it is far from a perfect solution.
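A minimal Python sketch of this workaround, assuming the same "\1" replacement as in the first attempt (the helper name convert_quotes is hypothetical):

```python
import re

# The second regex from the answer, split into two concatenated pieces
# for readability only; the pattern itself is unchanged.
PATTERN = re.compile(
    r"(?:(?<!\w)'((?:.|\n)+?'?)"
    r"(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))"
)

def convert_quotes(text):
    # Hypothetical helper: group 1 holds the quoted content.
    return PATTERN.sub(r'"\1"', text)

print(convert_quotes("'the classes' hours'"))
# "the classes' hours"

kumbh = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
print(convert_quotes(kumbh))
# Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,

print(convert_quotes("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"
```

For an ' after s, the negative lookahead checks whether a later candidate closing quote exists; only when none does is this ' accepted as the end of the quotation.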

Also, when using \w there is always a chance that ' occurs after a symbol, or after a character that is not in [a-zA-Z_0-9] but is still a letter, such as a local-language character, and it will then be treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}), or with something like (?<=^|[,.?!)\s]), i.e. a positive lookaround for the characters that can occur in a sentence before a quotation. However, such a list could be quite long.
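In Python specifically, the built-in re module does not support \p{L}. For str patterns in Python 3, however, \w is Unicode-aware unless re.ASCII is passed, so the problem described above shows up mainly in ASCII mode. The sketch below illustrates this with a simplified quote pattern (an assumption, equivalent in effect to Code 1) and uses the class [^\W\d_] as a rough stdlib stand-in for \p{L}. The sample string is made up.

```python
import re

# Simplified pattern: a ' that is not flanked by word characters on both sides.
QUOTE = r"(?<!\w)'|'(?!\w)"

# Illustrative string (an assumption): \u00e9 is the letter e with an acute accent.
text = "the caf\u00e9's 'menu'"

# ASCII-only \w: the accented letter is not a word character, so the
# apostrophe after it is wrongly treated as an opening quote.
ascii_result = re.sub(QUOTE, '"', text, flags=re.ASCII)

# Default Unicode \w (Python 3 str patterns): the apostrophe is kept.
unicode_result = re.sub(QUOTE, '"', text)

# Approximate "letters only" class: word characters minus digits and underscore.
LETTER = r"[^\W\d_]"
letter_result = re.sub(rf"(?<!{LETTER})'|'(?!{LETTER})", '"', text)

print(ascii_result)
print(unicode_result)
print(letter_result)
```

The letters-only class is stricter than \w: an apostrophe next to a digit or underscore would then count as a quotation mark, which may or may not be what is wanted.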
