How to write a custom removePunctuation() function to better deal with Unicode chars?


Question

In the source code of the tm text-mining R package, in the file transform.R, the removePunctuation() function is currently defined as:

function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}

I need to parse and mine some abstracts from a science conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, particularly at word boundaries. There are the usual ASCII punctuation characters, but also a few Unicode dashes, Unicode quotes, math symbols...

There are also URLs in the text, and there the intra-word punctuation characters need to be preserved. tm's built-in removePunctuation() function is too aggressive for this.
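To illustrate the point about URLs, here is a minimal sketch that applies the same pattern as the default branch of removePunctuation() to a made-up string. The helper name strip_all_punct and the example URL are assumptions used only for illustration.

```r
# Same pattern as the default (preserve_intra_word_dashes = FALSE) branch above.
strip_all_punct <- function(x) gsub("[[:punct:]]+", "", x)

# Hypothetical input containing a URL.
url_text <- "see http://example.org/a-b.html"

# Every ':', '/', '.' and '-' inside the URL is removed as well.
stripped <- strip_all_punct(url_text)
print(stripped)
```

The URL collapses into a single run of letters, which is why a gentler, boundary-only removal is needed here.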

So I need a custom removePunctuation() function that removes punctuation according to my requirements.

My custom Unicode function currently looks like this, but it does not work as expected. I use R only rarely, so getting things done in R takes some time, even for the simplest tasks.

My function:

corpus <- tm_map(corpus, rmPunc =  function(x){ 
# lookbehinds 
# need to be careful to specify fixed-width conditions 
# so that it can be used in lookbehind

x <- gsub('(.*?)(?<=^[[:punct:]’"":±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’"":±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’"":±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’"":±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’"":±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ; 
# lookaheads (can use variable-width conditions) 
x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’"":±]+)$',"\1 ", x, perl=TRUE) ;

# remove all strings that consist *only* of punct chars 
gsub('^[[:punct:]’"":±</>]+$',"", x, perl=TRUE) ;

}

It does not work as expected. I think it doesn't do anything at all. The punctuation is still in the term-document matrix, see:

 head(Terms(tdm), n=30)

  [1] "<></>"                      "---"                       
  [3] "--,"                        ":</>"                      
  [5] ":()"                        "/)."                       
  [7] "/++"                        "/++,"                      
  [9] "..,"                        "..."                       
 [11] "...,"                       "..)"                       
 [13] ""","                        "(|)"                       
 [15] "(/)"                        "(.."                       
 [17] "(..,"                       "()=(|=)."                  
 [19] "(),"                        "()."                       
 [21] "(&)"                        "++,"                       
 [23] "(0°"                        "0.001),"                   
 [25] "0.003"                      "=0.005)"                   
 [27] "0.006"                      "=0.007)"                   
 [29] "000km"                      "0.01)" 
...

So my questions are:

  1. Why doesn't the call to my function(){} have the desired effect? How can my function be improved?
  2. Are Unicode regex pattern classes such as \P{ASCII} or \P{PUNCT} supported in R's Perl-compatible regular expressions? I think they aren't (by default); according to PCRE: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported."

Recommended answer

As much as I like Susana's answer, it breaks the Corpus in newer versions of tm (the documents are no longer PlainTextDocument objects, and the metadata is destroyed).

You will get a list and the following error:

Error in UseMethod("meta", x) : 
no applicable method for 'meta' applied to an object of class "character"

Using

tm_map(your_corpus, PlainTextDocument)

will give you back your corpus, but with a broken $meta (in particular, the document IDs will be missing).

Solution

Use content_transformer:

toSpace <- content_transformer(function(x,pattern)
    gsub(pattern," ", x))
your_corpus <- tm_map(your_corpus,toSpace,"„")

Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/
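The same idea can be pushed further towards the requirements in the question. What follows is only a sketch, not part of the cited source: it assumes that "punctuation" means the Unicode general categories P (punctuation) and S (symbols), which R's regex engine accepts as \p{P} and \p{S} when perl = TRUE, and that tokens are separated by whitespace. The helper name strip_edge_punct is made up for this example. It trims punctuation only at the start and end of each token, so punctuation inside URLs and numbers survives, and tokens made up solely of punctuation disappear.

```r
# Hypothetical helper: trim Unicode punctuation/symbols at token edges only.
strip_edge_punct <- function(x) {
  # Leading or trailing run of punctuation (P) or symbol (S) characters.
  edge <- "^[\\p{P}\\p{S}]+|[\\p{P}\\p{S}]+$"
  vapply(strsplit(x, "[[:space:]]+"), function(tokens) {
    tokens <- gsub(edge, "", tokens, perl = TRUE)
    # Drop tokens that consisted only of punctuation (now empty strings).
    paste(tokens[nzchar(tokens)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up sample with curly quotes, an en dash, a plus-minus sign and a URL.
sample_text <- "\u201cSee http://example.org/a-b.html\u201d \u2013 (p = 0.01), \u00b15"
cleaned <- strip_edge_punct(sample_text)
print(cleaned)

# With tm installed, it could be wrapped like toSpace above (not run here):
# your_corpus <- tm_map(your_corpus, content_transformer(strip_edge_punct))
```

Splitting into tokens first avoids the fixed-width lookbehind constraint mentioned in the question, at the cost of normalising all whitespace runs to single spaces.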

This function removes everything that is neither alphanumeric nor whitespace (e.g. UTF-8 emoticons):

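removeNonAlnum <- function(x){
  # Only the leading ^ negates the class; a second ^ inside the brackets
  # would be a literal caret and would therefore be kept in the text.
  gsub("[^[:alnum:][:space:]]","",x)
}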
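The interpretation of POSIX classes such as [[:alnum:]] depends on the locale. A variant of removeNonAlnum that states the character classes explicitly through Unicode properties might look like the sketch below. The function name keep_letters_digits and the sample string are assumptions for illustration; \p{L}, \p{M} and \p{N} are the Unicode letter, combining-mark and number categories, available with perl = TRUE.

```r
# Hypothetical variant: keep letters, combining marks, digits and whitespace.
keep_letters_digits <- function(x) {
  gsub("[^\\p{L}\\p{M}\\p{N}\\s]", "", x, perl = TRUE)
}

# Made-up sample: accented letters, a smiley (U+263A) and a '#'.
sample_text <- "na\u00efve \u263a caf\u00e9 #1"
kept <- keep_letters_digits(sample_text)
print(kept)
```

The accented letters are kept, while the smiley and the '#' are dropped; the space on each side of the removed smiley remains, so a follow-up whitespace cleanup (for example tm's stripWhitespace) may be wanted.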
