Text mining in R: package and regex to replace smart (curly) quotes
Question
I have a bunch of texts like the one below, containing various smart quotes, both single and double. With the packages I know of, the best I can do is remove those characters, but I want them replaced with normal quotes.
textclean::replace_non_ascii("You don‘t get “your” money’s worth")
Received output: "You dont get your moneys worth"
Expected output: "You don't get "your" money's worth"
I would also appreciate a regex that replaces all such quotes in one shot.
Thanks!
Recommended answer
Use two gsub operations: 1) replace the double curly quotes, 2) replace the single curly quotes:
> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"
See the online R demo. Tested on both Linux and Windows, with the same result.
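The snippet above assumes that `text` already holds the sample sentence. Below is a minimal self-contained sketch of the same two calls, assuming UTF-8 input. Writing the curly quotes as `\u` escapes is my own choice, not part of the original answer; it keeps the script independent of the source file's encoding.

```r
# Sample sentence from the question, with the curly quotes spelled as escapes:
# \u2018 / \u2019 are the single curly quotes, \u201C / \u201D the double ones
text <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth"

# Inner call: single curly quotes -> '
# Outer call: double curly quotes -> "
res <- gsub("[\u201C\u201D]", "\"", gsub("[\u2018\u2019]", "'", text))

cat(res, sep = "\n")
```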
The [“”] construct is a positive character class that matches any single character defined in the class.
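Because each substitution here maps one character to exactly one character, base R's `chartr()` can do both replacements in a single call. That is one possible reading of the "one shot" request. This is a sketch of an alternative, not the approach taken in the answer, and it assumes UTF-8 input.

```r
text <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth"

# chartr(old, new, x) replaces the i-th character of `old`
# with the i-th character of `new`; both strings have four characters here.
old <- "\u2018\u2019\u201C\u201D"
new <- "''\"\""

res <- chartr(old, new, text)
cat(res, sep = "\n")
```

Unlike a character class, `chartr()` needs one replacement character per input character, so the `new` string grows along with the list of quote variants.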
To normalize all characters that resemble double quotes, you might want to use
> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»״“”„‟≪≫《》〝〞〟＂″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF-8")))
> cat(res, sep="\n")
You don't get "your" money's worth
Here, [«»״“”„‟≪≫《》〝〞〟＂″‶] matches
« 00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» 00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
״ 05F4 HEBREW PUNCTUATION GERSHAYIM
“ 201C LEFT DOUBLE QUOTATION MARK
” 201D RIGHT DOUBLE QUOTATION MARK
„ 201E DOUBLE LOW-9 QUOTATION MARK
‟ 201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪ 226A MUCH LESS-THAN
≫ 226B MUCH GREATER-THAN
《 300A LEFT DOUBLE ANGLE BRACKET
》 300B RIGHT DOUBLE ANGLE BRACKET
〝 301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 301E DOUBLE PRIME QUOTATION MARK
〟 301F LOW DOUBLE PRIME QUOTATION MARK
＂ FF02 FULLWIDTH QUOTATION MARK
″ 2033 DOUBLE PRIME
‶ 2036 REVERSED DOUBLE PRIME
The [ʻʼʽ٬‘’‚‛՚︐] class is used to normalize some characters that resemble single quotes:
ʻ 02BB MODIFIER LETTER TURNED COMMA
ʼ 02BC MODIFIER LETTER APOSTROPHE
ʽ 02BD MODIFIER LETTER REVERSED COMMA
٬ 066C ARABIC THOUSANDS SEPARATOR
‘ 2018 LEFT SINGLE QUOTATION MARK
’ 2019 RIGHT SINGLE QUOTATION MARK
‚ 201A SINGLE LOW-9 QUOTATION MARK
‛ 201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚ 055A ARMENIAN APOSTROPHE
︐ FE10 PRESENTATION FORM FOR VERTICAL COMMA
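Typing these look-alike characters by hand is error-prone, since several are hard to tell apart or get normalized when copied. One way around that is to build both classes from the code points listed above. This is a hedged sketch: `make_class` and `normalize_quotes` are hypothetical helper names, and the input is assumed to be convertible to UTF-8.

```r
# Code points copied from the two lists above
sngl_cp <- c(0x02BB, 0x02BC, 0x02BD, 0x066C, 0x2018, 0x2019,
             0x201A, 0x201B, 0x055A, 0xFE10)
dbl_cp  <- c(0x00AB, 0x00BB, 0x05F4, 0x201C, 0x201D, 0x201E, 0x201F,
             0x226A, 0x226B, 0x300A, 0x300B, 0x301D, 0x301E, 0x301F,
             0xFF02, 0x2033, 0x2036)

# intToUtf8() turns a vector of code points into one UTF-8 string,
# which is then wrapped in [...] to form a character class.
make_class <- function(cp) paste0("[", intToUtf8(cp), "]")

# Hypothetical helper: single-quote look-alikes first, then double ones.
normalize_quotes <- function(x) {
  x <- enc2utf8(x)
  x <- gsub(make_class(sngl_cp), "'", x)
  gsub(make_class(dbl_cp), "\"", x)
}

out <- normalize_quotes("\u00ABGuillemets\u00BB and \u201Alow\u2018 quotes")
cat(out, sep = "\n")
```

Keeping the code points in plain vectors also makes it easy to add or drop a character, for example 066C, if Arabic thousands separators should be left alone.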