Text Mining R Package & Regex to Handle and Replace Smart Curly Quotes


Question

I've got a bunch of texts like the one below, containing various smart quotes, both single and double. With the packages I'm aware of, all I can do is remove those characters, but I want them to be replaced with normal straight quotes.

textclean::replace_non_ascii("You don‘t get “your” money’s worth")

Received output: You dont get your moneys worth

Expected output: You don't get "your" money's worth

I would also appreciate it if someone has a regex that replaces all such quotes in one shot.

Thanks!

Recommended answer

Use two gsub operations: 1) one to replace the double curly quotes, 2) one to replace the single curly quotes:

> gsub("[“”]", "\"", gsub("[‘’]", "'", text))
[1] "You don't get \"your\" money's worth"

See the online R demo. Tested on both Linux and Windows, with the same result.

The [“”] construct is a positive character class that matches any single character defined in the class.
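Because each curly quote here maps to exactly one replacement character, a single base R chartr call is one possible way to do the replacement in one shot. The sketch below is an alternative to the two gsub calls, not the answer's own method. It writes the sample text and the quote characters as Unicode escapes so the script itself stays ASCII; the variable names are illustrative.

```r
# Sample text with curly quotes, written as Unicode escapes:
# \u2018 and \u2019 are the single curly quotes, \u201C and \u201D the double ones.
text <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth"

# chartr translates character by character: the i-th character of the
# first argument is replaced by the i-th character of the second.
res <- chartr("\u2018\u2019\u201C\u201D", "''\"\"", text)

cat(res, sep = "\n")
# You don't get "your" money's worth
```

chartr only covers one-to-one character replacements; if a quote-like character ever needs to become more than one character, gsub is the more flexible tool.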

To normalize all characters that resemble single or double quotes, you might want to use

> sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
> dbl_quot_rx = "[«»״“”„‟≪≫《》〝〞〟＂″‶]"
> res = gsub(dbl_quot_rx, "\"", gsub(sngl_quot_rx, "'", `Encoding<-`(text, "UTF-8")))
> cat(res, sep="\n")
You don't get "your" money's worth

Here, [«»״“”„‟≪≫《》〝〞〟＂″‶] matches

«   00AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
»   00BB  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
״   05F4  HEBREW PUNCTUATION GERSHAYIM
“   201C  LEFT DOUBLE QUOTATION MARK
”   201D  RIGHT DOUBLE QUOTATION MARK
„   201E  DOUBLE LOW-9 QUOTATION MARK
‟   201F  DOUBLE HIGH-REVERSED-9 QUOTATION MARK
≪   226A  MUCH LESS-THAN
≫   226B  MUCH GREATER-THAN
《  300A  LEFT DOUBLE ANGLE BRACKET
》  300B  RIGHT DOUBLE ANGLE BRACKET
〝  301D  REVERSED DOUBLE PRIME QUOTATION MARK
〞  301E  DOUBLE PRIME QUOTATION MARK
〟  301F  LOW DOUBLE PRIME QUOTATION MARK
＂  FF02  FULLWIDTH QUOTATION MARK
″   2033  DOUBLE PRIME
‶   2036  REVERSED DOUBLE PRIME
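To find out which of these code points actually occur in a given text, one option is to list the non-ASCII code points it contains and look them up in the table. The following is a minimal sketch using base R's utf8ToInt; the sample text and variable names are illustrative only.

```r
# Sample text with curly quotes, written as Unicode escapes.
text <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth"

# utf8ToInt returns one integer code point per character of a single string.
codes <- utf8ToInt(text)

# Keep each non-ASCII code point once, in order of first appearance.
non_ascii <- unique(codes[codes > 127L])

# Format as four-digit hex, the notation used in the table.
hex <- sprintf("%04X", non_ascii)
print(hex)
# [1] "2018" "201C" "201D" "2019"
```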

The [ʻʼʽ٬‘’‚‛՚︐] class is used to normalize some characters that resemble single quotes:

ʻ  02BB  MODIFIER LETTER TURNED COMMA
ʼ  02BC  MODIFIER LETTER APOSTROPHE
ʽ  02BD  MODIFIER LETTER REVERSED COMMA
٬  066C  ARABIC THOUSANDS SEPARATOR
‘  2018  LEFT SINGLE QUOTATION MARK
’  2019  RIGHT SINGLE QUOTATION MARK
‚  201A  SINGLE LOW-9 QUOTATION MARK
‛  201B  SINGLE HIGH-REVERSED-9 QUOTATION MARK
՚   055A  ARMENIAN APOSTROPHE
︐  FE10  PRESENTATION FORM FOR VERTICAL COMMA
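Pasting these characters into a script can go wrong when the file encoding is not UTF-8. One way around that, sketched here as an assumption rather than as part of the answer, is to build both character classes from the hex code points listed in the tables. The helper names make_class and normalize_quotes are hypothetical and do not come from any package.

```r
# Code points copied from the two tables above.
sngl_hex <- c("02BB", "02BC", "02BD", "066C", "2018",
              "2019", "201A", "201B", "055A", "FE10")
dbl_hex  <- c("00AB", "00BB", "05F4", "201C", "201D", "201E",
              "201F", "226A", "226B", "300A", "300B", "301D",
              "301E", "301F", "FF02", "2033", "2036")

# Hypothetical helper: turn hex code points into a bracketed character class.
make_class <- function(hex) {
  paste0("[", intToUtf8(strtoi(hex, base = 16L)), "]")
}

# Hypothetical helper: replace single-quote look-alikes first, then double ones.
normalize_quotes <- function(x) {
  x <- gsub(make_class(sngl_hex), "'", x)
  gsub(make_class(dbl_hex), "\"", x)
}

# Sample text: curly quotes plus a pair of angle quotation marks.
text <- "You don\u2018t get \u201Cyour\u201D money\u2019s worth \u00ABok\u00BB"
res <- normalize_quotes(text)
cat(res, sep = "\n")
# You don't get "your" money's worth "ok"
```

Note that some of the listed characters, such as the Arabic thousands separator or the much-less-than sign, have legitimate uses of their own, so it may be worth trimming the lists to what the corpus actually needs.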
