Most efficient way to split strings in Python


Question

My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:

Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5

Explanation: This particular example would come from a list where the first two items are a title and a date, while Item 3 to Item 5 would be invited people (the number of those can be anything from zero to n, where n is the number of registered users on the server).

From what I see, I have the following options:

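  1. Repeated use of split()
  2. Using a regular expression and the functions of the re module
  3. Some other Python functions I have not thought of yet (there are probably some)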

Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:

((.+)|)+((.+)(<>)?)+

Okay, this RegEx is horrible, I can see that myself. It is also untested. But you get the idea.
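Solution 1 as described above could look roughly like the sketch below. The function name `split_fields` is hypothetical, and the sketch rests on two assumptions not stated in the question: the last |-separated field always holds the invitee list (possibly empty), and the whitespace around the separators should be stripped.

```python
def split_fields(line):
    """Split 'title | date | a <> b <> c' into (fields, invitees)."""
    fields = line.split("|")
    # Assumption: only the last |-separated field contains <>-separated names.
    last = fields.pop()
    # Drop empty entries so an empty invitee field yields an empty list.
    invitees = [name.strip() for name in last.split("<>") if name.strip()]
    return [field.strip() for field in fields], invitees


print(split_fields("Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"))
# (['Item 1', 'Item 2'], ['Item 3', 'Item 4', 'Item 5'])
```

Because str.split() returns the whole string as a single element when the separator is absent, the same code also copes with a last field that contains no <> at all.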

Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated with | and strings that completely lack the <>. At least the regular-expression-based solution would do that.

My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and a second one that splits at <>), but I don't know enough about Python's implementation of regular expressions to judge how the RegEx would perform. split() is also less dynamic than a regular expression when it comes to different numbers of items and the absence of the second separator. Still, I can't shake the impression that Python can do this better without regular expressions, which is why I am asking.

Some notes:

  • Yes, I could just benchmark both solutions, but I'm trying to learn something about Python in general and how it works here, and if I just benchmark these two, I still don't know what Python functions I have missed.
  • Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
  • Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts separated by <>, so a simple flat list as generated by re.split(r"\||<>", input) (as proposed by @obmarg) will not work too well. Solutions fitting this criterion are much appreciated.
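To illustrate the last note: a plain re.split() on both delimiters loses the information about which separator stood between two items. One possible way to keep it, shown here only as a sketch and not as anything proposed in the thread, is to wrap the pattern in a capturing group, in which case re.split() also returns the delimiters themselves.

```python
import re

sample = "a|b|c|de|f<>ge<>ah"

# Flat split: the result no longer shows which delimiter separated the items.
flat = re.split(r"\||<>", sample)
print(flat)
# ['a', 'b', 'c', 'de', 'f', 'ge', 'ah']

# With a capturing group, re.split() keeps each matched delimiter in the list.
with_delimiters = re.split(r"(\||<>)", sample)
print(with_delimiters)
# ['a', '|', 'b', '|', 'c', '|', 'de', '|', 'f', '<>', 'ge', '<>', 'ah']
```

The second form still needs a pass over the list to group the <>-separated items, so it does not by itself produce the nested structure asked for.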

To sum the question up: which solution would be the most efficient one, and for what reasons?

Due to multiple requests, I have run some timeit on the split()-solution and the first proposed regular expression by @obmarg, as well as the solutions by @mgibsonbr and @duncan:

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split(r"\||<>", input)


def mgibsonbr(input): # Solution by @mgibsonbr
    items = re.split(r'\||<>', input) # Split input in items
    offset = 0
    result = [] # The result: strings for regular items, lists for <> separated ones
    acc = None
    for i in items:
        delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
        offset += len(i) + len(delimiter)
        if delimiter == '<>': # Will always put the item in a list
            if acc is None:
                acc = [i] # Create one if doesn't exist
                result.append(acc)
            else:
                acc.append(i)
        else:
            if acc is not None: # If there was a list, put the last item in it
                acc.append(i)
            else:
                result.append(i) # Add the regular items
            acc = None # Clear the list, since what will come next is a regular item or a new list
    return result

def split2(input): # Solution by @duncan
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()

Results:

mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646

At the moment, it looks like split2 by @duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like @mgibsonbr's solution has some performance issues (Sorry 'bout that, but thanks for the solution regardless).

Thanks to everyone for your input.

Recommended answer

I was slightly surprised that split() performed so badly in your code so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.
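A side effect of calling list.remove() on the list being iterated can be shown with a small sketch. The two functions below follow the logic of splitit() and split2() from the question (only the parameter name is changed), and the input string is a made-up example with two adjacent <> groups; it is not one of the benchmark strings.

```python
def splitit(text):
    res0 = text.split("|")
    res = []
    for element in res0:
        t = element.split("<>")  # second split() call on every element
        if t != [element]:
            # remove() scans the list for the element and shifts the
            # remaining items left while the for loop is still running.
            res0.remove(element)
            res.append(t)
    return res0, res


def split2(text):
    res1, res2 = [], []
    for r in text.split("|"):
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2


print(splitit("a|b<>c|d<>e|f"))
# (['a', 'd<>e', 'f'], [['b', 'c']])
print(split2("a|b<>c|d<>e|f"))
# (['a', 'f'], [['b', 'c'], ['d', 'e']])
```

Besides the extra cost, the removal makes the loop skip the element that follows each removed one, so 'd<>e' is never split here. The benchmark strings do not expose this because no two <> groups are adjacent in them.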

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def split2(input):
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

def regexit(input):
    return re.split(r"\||<>", input)

# Compile the pattern once, at import time
rSplitter = re.compile(r"\||<>")

def regexit2(input):
    return rSplitter.split(input)

print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
print("split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())

Which gives the following results:

split: 1.8427431439631619
split2: 1.0897291360306554
regex: 1.6694280610536225
regex2: 1.2277749050408602
split: 14.356198082969058
split2: 8.009285948995966
regex: 9.526430513011292
regex2: 9.083608677960001

And of course, split2() gives the nested lists that you wanted, whereas the regex solution doesn't.

I've updated this answer to include @F1Rumors' suggestion that compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions, so the saving is not as much as you might expect. I think it usually isn't worth doing for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.
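The two calling styles compared here can be put side by side in a short sketch. The names `PATTERN`, `SPLITTER`, `split_uncompiled` and `split_compiled` are made up for illustration; only re.split() and re.compile() come from the standard library.

```python
import re

PATTERN = r"\||<>"

# Compiled once at import time; every later call reuses this object directly.
SPLITTER = re.compile(PATTERN)


def split_uncompiled(text):
    # The module-level function looks the pattern up in re's internal cache
    # on each call, which is cheap but not free.
    return re.split(PATTERN, text)


def split_compiled(text):
    return SPLITTER.split(text)


sample = "a|b|c|de|f<>ge<>ah"
print(split_uncompiled(sample) == split_compiled(sample))
# True
```

Both functions return the same list, so the choice is about per-call overhead and readability rather than behavior.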

Also I updated the code so it runs on Python 3.
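For anyone re-running the comparison on Python 3, the repeated timeit lines could be condensed into a small harness like the sketch below. The `benchmark` helper and the constants are hypothetical, the long sample only approximates the long benchmark string used above, and the call count is far lower than timeit's default of one million, so the absolute numbers are not comparable to the results listed earlier.

```python
import re
import timeit

SHORT = "a|b|c|de|f<>ge<>ah"
LONG = SHORT * 11  # assumption: approximates the long benchmark string

SPLITTER = re.compile(r"\||<>")


def split2(text):
    plain, grouped = [], []
    for part in text.split("|"):
        if "<>" in part:
            grouped.append(part.split("<>"))
        else:
            plain.append(part)
    return plain, grouped


def regexit2(text):
    return SPLITTER.split(text)


def benchmark(funcs, samples, number=10_000):
    """Return {(function name, sample label): seconds for `number` calls}."""
    timings = {}
    for label, sample in samples.items():
        for func in funcs:
            # The lambda adds a small, constant call overhead to each measurement.
            timings[(func.__name__, label)] = timeit.timeit(
                lambda: func(sample), number=number
            )
    return timings


if __name__ == "__main__":
    results = benchmark([split2, regexit2], {"short": SHORT, "long": LONG})
    for (name, label), seconds in results.items():
        print(f"{name:9} {label:5} {seconds:.4f}s")
```

Passing callables instead of statement strings avoids the "from __main__ import ..." setup argument, at the price of the extra lambda call per iteration.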
