How to remove punctuation?


Problem Description


I am using the tokenizer from NLTK in Python.

There are a whole bunch of answers for removing punctuation on the forum already. However, none of them addresses all of the following issues together:

  1. More than one symbol in a row. For example, the sentence: He said,"that's it." Because there is a comma followed by a quotation mark, the tokenizer won't remove ." in the sentence. The tokenizer will give ['He', 'said', ',"', 'that', 's', 'it.'] instead of ['He', 'said', 'that', 's', 'it']. Some other examples include '...', '--', '!?', ',"', and so on.
  2. Remove the symbol at the end of the sentence, i.e. the sentence: Hello World. The tokenizer will give ['Hello', 'World.'] instead of ['Hello', 'World']. Notice the period at the end of the word 'World'. Some other examples include '--' and ',' at the beginning, middle, or end of any word.
  3. Remove characters with symbols in front and after, i.e. '*u*', "''", '""'

Is there an elegant way of solving all of these problems?
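To make the desired behavior concrete, all three cases can be sketched with Python's re module alone (clean_tokens is a hypothetical helper of mine, not anything from NLTK):

```python
import re

def clean_tokens(text):
    # First drop any single word character sandwiched between two
    # punctuation marks (e.g. the 'u' in '*u*'), per requirement 3.
    text = re.sub(r'[^\w\s]\w[^\w\s]', '', text)
    # Then keep only runs of word characters, which discards repeated,
    # leading, and trailing punctuation (requirements 1 and 2).
    return re.findall(r'\w+', text)

print(clean_tokens('''He said,"that's it." *u* Hello, World.'''))
```

This is only a rough stand-in for requirement 3: it drops single punctuation-wrapped letters, not longer punctuation-wrapped words.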

Solution

If you want to tokenize your string all in one shot, I think your only choice will be to use nltk.tokenize.RegexpTokenizer. The following approach will allow you to use punctuation as a marker to remove characters of the alphabet (as noted in your third requirement) before removing the punctuation altogether. In other words, this approach will remove *u* before stripping all punctuation.

One way to go about this, then, is to tokenize on gaps like so:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = '''He said,"that's it." *u* Hello, World.'''
>>> toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
>>> toker.tokenize(s)
['He', 'said', 'that', 's', 'it', 'Hello', 'World']  # omits *u* per your third requirement

This should meet all three of the criteria you specified above. Note, however, that this tokenizer will not return tokens such as "A". Furthermore, I only tokenize on single letters that begin and end with punctuation. Otherwise, "Go." would not return a token. You may need to nuance the regex in other ways, depending on what your data looks like and what your expectations are.
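For readers without NLTK installed, gaps=True behaves here like re.split on the same pattern. A minimal equivalent sketch (gap_tokenize is my own name, and I have made the groups non-capturing so that re.split does not also return the separators):

```python
import re

# Non-capturing version of the answer's gap pattern: a single word
# character wrapped in punctuation, or any non-word character.
GAP = r"(?:(?<=[^\w\s])\w(?=[^\w\s])|\W)+"

def gap_tokenize(text):
    # re.split cuts the string at every gap; drop the empty strings
    # produced when a gap touches the start or end of the input.
    return [tok for tok in re.split(GAP, text) if tok]

print(gap_tokenize('''He said,"that's it." *u* Hello, World.'''))
```

Since the punctuation-wrapped 'u' is absorbed into the surrounding gap, it disappears along with the punctuation, matching the RegexpTokenizer output above.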
