Python Regular Expression nltk website extraction

Question

Hi, I have never had to deal with regex before and I'm trying to preprocess some raw text with Python and NLTK. When I tried to tokenize the document using:

import nltk

sentence_re = r'''(?x)        # set flag to allow verbose regexps
  ([A-Z])(\.[A-Z])+\.?        # abbreviations, e.g. U.S.A.
| \w+(-\w+)*                  # words with optional internal hyphens
| \$?\d+(\.\d+)?%?            # currency and percentages, e.g. $12.40, 82%
| \#?\w+|\@?\w+               # hashtags and @ signs
| \.\.\.                      # ellipsis
| [][.,;"'?()\-_`]            # these are separate tokens (hyphen escaped so it stays literal)
| (?:http://|www\.)[^"\' ]+   # websites
'''
tokens = nltk.regexp_tokenize(corpus, sentence_re)

it's not able to keep a whole website URL as one single token:

print tokens[:50]
['on', '#Seamonkey', '(', 'SM', ')', '-', 'I', 'had', 'a', 'short', 'chirp',   'exchange', 'with', '@angie1234p', 'at', 'the', '18thDec', ';', 'btw', 'SM', 'is', 'faster', 'has', 'also', 'an', 'agile', '...', '1', '/', '2', "'", '...', 'user', 'community', '-', 'http', ':', '/', '/', 'bit', '.', 'ly', '/', 'XnF5', '+', 'ICR', 'http', ':', '/', '/']

Any help is greatly appreciated. Thanks so much!

- Flori

Answer

In this tokenizer, regular expressions are used to specify what the tokens you want to extract from the text should look like. I'm a bit confused about which of the many regular expressions above you used, but for a very simple tokenization into non-whitespace tokens you could use:

>>> corpus = "this is a sentence. and another sentence. my homepage is http://test.com"
>>> nltk.regexp_tokenize(corpus, r"\S+")
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

which is equivalent to:

>>> corpus.split()
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

Another approach could be to use the nltk functions sent_tokenize() and nltk.word_tokenize():

>>> sentences = nltk.sent_tokenize(corpus)
>>> sentences
['this is a sentence.', 'and another sentence.', 'my homepage is http://test.com']
>>> for sentence in sentences:
...     print nltk.word_tokenize(sentence)
['this', 'is', 'a', 'sentence', '.']
['and', 'another', 'sentence', '.']
['my', 'homepage', 'is', 'http', ':', '//test.com']

Though if your text contains lots of website URLs, this might not be the best choice. Information about the different tokenizers in NLTK can be found in the NLTK documentation.
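
For illustration, the same pattern also works through the RegexpTokenizer class that regexp_tokenize wraps, which is handy if you want to build the tokenizer once and reuse it:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\S+')
>>> tokenizer.tokenize(corpus)
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']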

If you just want to extract URLs from the corpus, you could use a regular expression like this:

nltk.regexp_tokenize(corpus, r'(?:http://|https://|www\.)[^"\' ]+')
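
The group is written as non-capturing, (?:...), and the dot in www\. is escaped; capturing groups can change what findall-style tokenizers return, and a bare . would match any character. On the example corpus from above, this picks out just the URL:

>>> nltk.regexp_tokenize(corpus, r'(?:http://|https://|www\.)[^"\' ]+')
['http://test.com']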

Hope this helps. If this was not the answer you were looking for, please try to explain a bit more precisely what you want to do and how exactly you want your tokens to look (e.g. an example input/output you would like to have), and we can help find the right regular expression.
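
As a rough sketch of how whole URLs could be kept as single tokens (illustrative, not a definitive pattern): regex alternation tries branches left to right, so the \w+ branch in your original pattern consumes "http" before the website branch is ever reached. Putting the URL branch first, and using only non-capturing groups:

>>> url_aware_re = r'''(?x)           # verbose regexp
...     (?:https?://|www\.)[^\s"']+   # URLs first, so they win over word-by-word matching
...   | \w+(?:-\w+)*                  # words with optional internal hyphens
...   | \.\.\.                        # ellipsis
...   | [^\w\s]                       # any other non-space character is its own token
... '''
>>> nltk.regexp_tokenize(corpus, url_aware_re)
['this', 'is', 'a', 'sentence', '.', 'and', 'another', 'sentence', '.', 'my', 'homepage', 'is', 'http://test.com']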
