Most efficient way to split strings in Python
Problem description
My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:
Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5
Explanation: this particular example would come from a list where the first two items are a title and a date, while Items 3 to 5 would be invited people (their number can be anything from zero to n, where n is the number of registered users on the server).
From what I see, I have the following options:
- Use split() repeatedly
- Use regular expressions and the regex functions
- Some other Python function I haven't thought of yet (there probably are some)
For this example, solution 1 would involve splitting at | and then splitting the last element of the resulting list at <>, while solution 2 would probably result in a regular expression like:
((.+)|)+((.+)(<>)?)+
Okay, this regex is horrible, I can see that myself. It is also untested. But you get the idea.
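For reference, solution 1 applied to the example string can be sketched like this (the variable names are my own, purely for illustration):

```python
# Solution 1: split at "|" first, then split the last element at "<>".
raw = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"

parts = [p.strip() for p in raw.split("|")]
title, date = parts[0], parts[1]
invited = [p.strip() for p in parts[-1].split("<>")]

print(title)    # Item 1
print(date)     # Item 2
print(invited)  # ['Item 3', 'Item 4', 'Item 5']
```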
Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated by | and for strings that completely lack the <>. At least the regular-expression-based solution would do that.
My understanding is that split() would use more memory (since you basically get two resulting lists, one that splits at | and a second one that splits at <>), but I don't know enough about Python's implementation of regular expressions to judge how the regex would perform. split() is also less dynamic than a regular expression when it comes to different numbers of items and the absence of the second separator. Still, I can't shake the impression that Python can do this better without regular expressions, which is why I am asking.
Some notes:
- Yes, I could just benchmark both solutions, but I'm trying to learn something about Python in general and how it works here; if I just benchmark these two, I still don't know which Python functions I have missed.
- Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
- Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts with the separator <>, so a simple flat list as generated by re.split("\||<>", input) (as proposed by @obmarg) will not work too well. Solutions fitting this criterion are much appreciated.
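As an aside (this is my own sketch, not something from the original discussion): re.split can keep track of which separator was matched if the pattern is wrapped in a capturing group, which sidesteps the flat-list problem, at the cost of interleaving delimiters with items:

```python
import re

# A capturing group in the pattern makes re.split return the matched
# delimiters alongside the items, so "|"-separated parts can be told
# apart from "<>"-separated ones.
tokens = re.split(r"(\||<>)", "a|b|f<>g<>h")
print(tokens)  # ['a', '|', 'b', '|', 'f', '<>', 'g', '<>', 'h']
```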
To sum the question up: which solution would be the most efficient one, and for what reasons?
Due to multiple requests, I have run some timeit benchmarks on the split() solution, the first regular expression proposed by @obmarg, and the solutions by @mgibsonbr and @duncan:
import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split(r"\||<>", input)

def mgibsonbr(input):  # Solution by @mgibsonbr
    items = re.split(r'\||<>', input)  # Split input into items
    offset = 0
    result = []  # The result: strings for regular items, lists for <> separated ones
    acc = None
    for i in items:
        delimiter = '|' if offset + len(i) < len(input) and input[offset + len(i)] == '|' else '<>'
        offset += len(i) + len(delimiter)
        if delimiter == '<>':  # Will always put the item in a list
            if acc is None:
                acc = [i]  # Create one if it doesn't exist
                result.append(acc)
            else:
                acc.append(i)
        else:
            if acc is not None:  # If there was a list, put the last item in it
                acc.append(i)
            else:
                result.append(i)  # Add the regular items
            acc = None  # Clear the list, since what comes next is a regular item or a new list
    return result

def split2(input):  # Solution by @duncan
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')", "from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')", "from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')", "from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')", "from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import regexit").timeit()
Results:
mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646
At the moment, it looks like split2 by @duncan beats all other algorithms regardless of length (with this limited dataset, at least), and it also looks like @mgibsonbr's solution has some performance issues (sorry about that, but thanks for the solution regardless).
Thanks everyone for the input.
Accepted answer
I was slightly surprised that split() performed so badly in your code, so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also, you're calling split() an extra time on each string. Get rid of those, and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.
import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def split2(input):
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

def regexit(input):
    return re.split(r"\||<>", input)

rSplitter = re.compile(r"\||<>")

def regexit2(input):
    return rSplitter.split(input)

print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')", "from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')", "from __main__ import split2").timeit())
print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')", "from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')", "from __main__ import regexit2").timeit())
print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import split2").timeit())
print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')", "from __main__ import regexit2").timeit())
Which gives the following results:
split: 1.8427431439631619
split2: 1.0897291360306554
regex: 1.6694280610536225
regex2: 1.2277749050408602
split: 14.356198082969058
split2: 8.009285948995966
regex: 9.526430513011292
regex2: 9.083608677960001
And of course split2() gives the nested lists that you wanted, whereas the regex solution doesn't.
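To make that difference concrete, here is what the two approaches return for a short input (the two functions are re-defined from above so the snippet stands alone):

```python
import re

def split2(input):  # same logic as @duncan's solution above
    res1, res2 = [], []
    for r in input.split("|"):
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

# split2 keeps the "|"-separated and "<>"-separated parts apart...
print(split2("a|b|f<>ge<>ah"))              # (['a', 'b'], [['f', 'ge', 'ah']])
# ...while a plain re.split flattens everything into one list.
print(re.split(r"\||<>", "a|b|f<>ge<>ah"))  # ['a', 'b', 'f', 'ge', 'ah']
```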
I've updated this answer to include @F1Rumors' suggestion that compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions, so the saving is not as much as you might expect. I think it usually isn't worth doing for speed (though it can be in some cases), but it is often worthwhile for making the code clearer.
I also updated the code so it runs on Python 3.