Splitting strings using re.split


Problem description

I have multiple strings (>1000) of the form:

\r\nSenor Sisig\nThe Chairman\nCupkates\nLittle Green Cyclo\nSanguchon\nSeoul on Wheels\nKasa Indian\n\nGo Streatery\nWhip Out!\nLiba Falafel\nGrilled Cheese Bandits\r\n

The strings may have a whitespace before the '\n'

How do I split these strings (in an efficient way) so as to avoid getting any empty or duplicate (the whitespace case) elements?

I was using:

re.split(r'\r|\n', str)

EDIT: some more examples:

\r\nThe Creme Brulee Cart \r\nCurry Up Now\r\nKoJa Kitchen\r\nAn the Go\r\nPacific Puffs\r\nEbbett's Good to Go\r\nFiveten Burger\r\nGo Streatery\r\nHiyaaa\r\nSAJJ\r\nKinder's Truck\r\nBlue Saigon\r
\r\nThe Chairman\r\nSanguchon\r\nSeoul on Wheels\r\nGo Streatery\r\nStreet Dog Truck\r\nKinder's Truck\r\nYummi BBQ\r\nLexie's Frozen Custard\r\nDrewski's Hot Rod Kitchen\r
\n An the Go \n Cheese Gone Wild \n Cupkates \n Curry Up Now \n Fins on the Hoof\n KoJa Kitchen\n Lobsta Truck \n Oui Chef \n Sanguchon\n Senor Sisig \n The Chairman \n The Rib Whip 

thanks!

Solution

Your example doesn't show any "whitespace before the \n" except for a single optional \r.

If that's all you're trying to handle, instead of splitting on either \r or \n, split on a possible \r and a definite \n:

re.split(r"\r?\n", s)
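As a quick check, here is a small sketch of that pattern on a shortened string in the same shape as the question's input (the sample value is made up for illustration):

```python
import re

# Shortened, hypothetical sample mixing \n and \r\n line endings.
sample = "\r\nSenor Sisig\nThe Chairman\r\nCupkates\r\n"

# An optional \r followed by a required \n is treated as one separator.
parts = re.split(r"\r?\n", sample)
print(parts)
```

The names come out clean, but the leading and trailing line breaks still produce an empty string at each end of the list.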

Of course that's assuming you don't have any bare \r without \n to handle. If you need to handle \r, \r\n, and \n all equally (similar to Python's universal newline support…):

re.split(r"\r\n|\r|\n", s)

Or, more simply, if a run of consecutive line breaks should count as a single separator:

re.split(r"[\r\n]+", s)
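One thing to watch with re.split is that a capturing group in the pattern puts the matched separators into the result, and that the order of alternatives matters for \r\n. A small sketch with a made-up string, comparing the variants:

```python
import re

# Hypothetical input containing \r\n, a bare \r, and a blank line (\n\n).
text = "a\r\nb\rc\n\nd"

# With a capturing group, re.split also returns the last captured separator.
with_group = re.split(r"(\r|\n)+", text)

# A character class splits on the same runs without adding separators.
without_group = re.split(r"[\r\n]+", text)

# Listing \r\n first treats it as one separator; blank lines show up as ''.
one_per_break = re.split(r"\r\n|\r|\n", text)

print(with_group)
print(without_group)
print(one_per_break)
```

Which variant fits depends on whether blank lines in the input carry meaning; for a plain list of names, collapsing runs is usually the more convenient choice.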

If you want to remove leading spaces, tabs, multiple \r, etc., you could do that in the regexp, or just call lstrip on each result:

map(str.lstrip, re.split(r"\r|\n", s))

… but that can leave you with empty elements. You could filter those out, but it's probably better to just split on any run of whitespace that ends with a \n instead:

re.split(r"\s*\n", s)

That will still leave empty elements at the start and end, because your string starts and ends with newlines, and that's what re.split is supposed to do. If you want to eliminate them, you can either strip the string before parsing, or toss the end values after parsing:

re.split(r"\s*\n", s.strip())
re.split(r"\s*\n", s)[1:-1]
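To make the difference concrete, here is a sketch on a made-up string that starts and ends with a line break and has trailing whitespace and a blank line in the middle:

```python
import re

# Hypothetical sample: leading \r\n, a trailing space, a blank line, trailing \r\n.
s = "\r\nSenor Sisig \nThe Chairman\n\nCupkates\r\n"

raw = re.split(r"\s*\n", s)                     # keeps '' at both ends
stripped_first = re.split(r"\s*\n", s.strip())  # strip before splitting
sliced_after = re.split(r"\s*\n", s)[1:-1]      # drop the end values afterwards

print(raw)
print(stripped_first)
print(sliced_after)
```

The slicing variant is only safe when the string is known to begin and end with a line break; otherwise it throws away real elements, so stripping first is the more forgiving option.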

I think one of these last two is exactly what you want… but that's really just a guess based on the limited information you gave. If not, then one of the others (along with its explanation) should hopefully be enough for you to write what you really want.


From your new examples, it looks like what you really want to split on is any run of whitespace that includes at least one \n. And your input may or may not have newlines at the start and end (your first example has both, your second has \r\n at the start but nothing at the end…), and you want to ignore them if it does. So:

re.split(r"\s*\n\s*", s.strip())
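A minimal sketch of this wrapped in a helper, run against shortened versions of the three input shapes from the question. The function name split_names and the empty-input guard are my own additions, not part of the original answer:

```python
import re

def split_names(s):
    """Split on any run of whitespace that contains at least one newline."""
    s = s.strip()
    if not s:
        # re.split on an empty string would return [''], so return [] instead.
        return []
    return re.split(r"\s*\n\s*", s)

# Shortened samples in the shapes shown in the question.
examples = [
    "\r\nThe Creme Brulee Cart \r\nCurry Up Now\r\nBlue Saigon\r",
    "\n An the Go \n Cheese Gone Wild \n Cupkates \n The Rib Whip ",
    "\r\nSenor Sisig\nKasa Indian\n\nGo Streatery\r\n",
]
results = [split_names(e) for e in examples]
for r in results:
    print(r)
```

Spaces inside a name are left alone because the pattern only matches whitespace runs that include a \n.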


However, at this point, it might be worth asking why you're trying to parse this as a string instead of as a text file. Assuming you got these from some file or file-like object, instead of this:

with open(path, 'r') as f:
    s = f.read()
    results = re.split(regexpr, s.strip())

… something like this might be a lot more readable, and more than fast enough (maybe not as fast as the optimal regexp, but still so fast that any wasted string-processing time is swamped by the actual file reading time anyway):

with open(path, 'r') as f:
    results = filter(None, map(str.strip, f))

Especially if you just want to iterate over this list once, in which case (assuming either Python 3.x, or using ifilter and imap from itertools if 2.x) this version doesn't have to read the whole file into memory and process it before you start doing your actual work. Because the result is lazy in that case, it has to be consumed inside the with block, before the file is closed.
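The same line-by-line idea can be written as a small generator. This is only a sketch: iter_names is a hypothetical helper name, and io.StringIO stands in for a real file so the example runs on its own.

```python
import io

def iter_names(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        name = line.strip()
        if name:
            yield name

# io.StringIO is a stand-in for a file object returned by open(path).
fake_file = io.StringIO("\n An the Go \n Cupkates \n\n The Rib Whip \n")
names = list(iter_names(fake_file))
print(names)
```

With a real file, the loop over iter_names(f) would sit inside the with open(...) block, since lines are only read as the generator is consumed.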
