Why doesn't R's gsub (or regexp) for punctuation match all punctuation?


Question

I am working on cleaning up a text-based data file and cannot figure out why gsub("[[:punct:]]", "", X1) does not match all punctuation. Unfortunately, I cannot replicate the problem here, which makes me think it is a character encoding issue: the punctuation in question looks obviously different from standard ASCII.

Is this a problem that I can solve after reading in the files, or do I have to do something at the front end? For example, Hadley's post on an encoding issue makes me think that I need to specify the encoding when I read the files. However, I am reading a bunch of different txt files from a folder, so I am not sure what the best solution is. Basically, I just want to retain all letters [A-Za-z] and exclude everything else. (That said, gsub("[^A-Za-z]", "", X1) doesn't work either!)

Any suggestions on handling this problem would be greatly appreciated!

Recommended answer

The punctuation character is probably outside the ASCII range. By default, [[:punct:]] contains only ASCII punctuation characters, but you can extend the class to Unicode with the (*UCP) directive. That alone is not enough: you also need to tell the regex engine, with (*UTF), to read the target string as a UTF-encoded string (otherwise a multibyte encoded character will be seen as several one-byte characters). So:

gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)



Note: these two directives exist only in perl mode (perl = TRUE) and must be at the very beginning of the pattern.
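To see the effect of the directives, here is a minimal sketch that compares the plain class with the Unicode-aware one. The sample string is hypothetical (it is not from the question) and uses \u escapes for curly quotes, a long dash, and an ellipsis so the script itself stays ASCII.

```r
# Hypothetical sample text: curly quotes (U+201C, U+201D), a long dash
# (U+2014), an ellipsis (U+2026), and one ASCII exclamation mark.
x <- "\u201cHello\u201d \u2014 world\u2026 done!"

# Perl mode without (*UCP): the POSIX class covers ASCII punctuation only
ascii_only <- gsub("[[:punct:]]", "", x, perl = TRUE)

# Perl mode with both directives: the class is extended to Unicode
unicode_aware <- gsub("(*UCP)(*UTF)[[:punct:]]", "", x, perl = TRUE)

print(ascii_only)
print(unicode_aware)
```

Printing both results side by side should make it easy to check whether the leftover characters in your own data are non-ASCII punctuation.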

Note 2: you can do much the same like this:

gsub("(*UTF)\\pP+", "", X1, perl = TRUE)



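Because \pP is a shorthand for all Unicode punctuation characters, (*UCP) is no longer needed. Note that \pP covers only the Unicode punctuation categories, so symbol characters such as $ or + (which [[:punct:]] matches) are left in place.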
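For the case of many txt files in one folder, the same pattern can be applied right after reading each file. The sketch below is only an illustration under stated assumptions: the helper name strip_punct_in_folder is hypothetical, and it assumes the files really are UTF-8 encoded (the encoding argument of readLines only marks the strings, it does not convert them).

```r
# Hypothetical helper (not from the original answer): read every .txt file
# in `dir`, declare the text as UTF-8, and strip punctuation.
strip_punct_in_folder <- function(dir, encoding = "UTF-8") {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  cleaned <- lapply(files, function(f) {
    # Assumption: the file is UTF-8; `encoding` marks the strings as such
    lines <- readLines(f, encoding = encoding, warn = FALSE)
    gsub("(*UCP)(*UTF)[[:punct:]]", "", lines, perl = TRUE)
  })
  names(cleaned) <- basename(files)
  cleaned
}

# Usage with a throwaway folder and one sample file
demo_dir <- tempfile("txt_demo")
dir.create(demo_dir)
writeLines("\u201cquoted\u201d text, with commas\u2026",
           file.path(demo_dir, "a.txt"), useBytes = TRUE)
result <- strip_punct_in_folder(demo_dir)
print(result[["a.txt"]])
```

If the files turn out to be in another encoding, converting them first (for example with iconv) would likely be needed before the pattern behaves as expected.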
