RegEx Tokenizer: split text into words, digits, punctuation, and spacing (do not delete anything)


Problem Description

I almost found the answer to this question in this thread (samplebias's answer); however, I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need this to preserve the order in which each of these things occurs (which the code in that thread already does).

So, what I've found is something like this:

    from nltk.tokenize import *
    txt = "Today it's   07.May 2011. Or 2.999."
    regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
    ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']

But this is the kind of list I need to yield:

    ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']
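
In other words, nothing may be deleted: joining the tokens back together should reproduce the input. A minimal sketch of that check (the input string here is an assumption: the expected list above joins back to a string with a single space plus a tab before 07.May, rather than the three spaces in the snippet above):

    expected = ['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ',
                '2011', '.', ' ', 'Or', ' ', '2.999', '.']
    # The tokenizer must not drop characters, so concatenating the tokens
    # has to reproduce the input (assumed here to contain a space and a tab).
    assert ''.join(expected) == "Today it's \t07.May 2011. Or 2.999."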

Regex has always been one of my weak points, so after a couple of hours of research I'm still stumped. Thank you!!

Recommended Answer

I think that something like this should work for you. There is probably more in that regex than there needs to be, but your requirements are somewhat vague and don't exactly match up with the expected output you provided.

>>> import re
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> p = re.compile(r"\d+|[-'a-z]+|[ ]+|\s+|[.,]+|\S+", re.I)
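>>> # Each match's start offset becomes a slice boundary below, so the slices
>>> # cover every character of txt, including the whitespace between matches.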
>>> slice_starts = [m.start() for m in p.finditer(txt)] + [None]
>>> [txt[s:e] for s, e in zip(slice_starts, slice_starts[1:])]
['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']
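
If the goal is to reproduce the exact list from the question in one pass, a single re.findall pattern that also matches individual whitespace characters can do it. This is only a sketch, not part of the original answer: the alternatives chosen here (word/digit runs with internal . or , kept together, an apostrophe-led suffix such as 's as one token, each whitespace character on its own, and any leftover character by itself) are assumptions about what should count as a token. Continuing the same session, with txt and re already defined above:

>>> re.findall(r"\w+(?:[.,]\w+)*|'\w+|\s|\S", txt)
['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']

Because every character is matched by one of the alternatives, joining this result back together reproduces txt unchanged, which satisfies the "do not delete anything" requirement.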
