Add leading zeros based on condition in python


Question


I have a dataframe with 5 million rows. Let's say the dataframe looks like this:

>>> df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})
>>> df
       Random
0          86
1  7639103627
2          96
3          32
4  1469476501

Note that the Random column is stored as a string.

If the number in column Random has fewer than 9 digits, I want to add leading zeros to make it 9 digits. If the number has 9 or more digits, I want to add leading zeros to make it 20 digits.
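For reference, Python's built-in `str.zfill()` does exactly this kind of left-padding on a single string; a minimal illustration of the two target widths:

```python
# str.zfill(width) left-pads a string with zeros up to the given width
print("86".zfill(9))           # fewer than 9 digits -> pad to 9
# → 000000086
print("7639103627".zfill(20))  # 9 or more digits -> pad to 20
# → 00000000007639103627
```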

What I have done is this:

for i in range(0, len(df['Random'])):
    if len(df['Random'][i]) < 9:
        df['Random'][i] = df['Random'][i].zfill(9)
    else:
        df['Random'][i] = df['Random'][i].zfill(20)

Since the number of rows is over 5 million, this process takes a lot of time! (Performance was 5 it/sec, measured with tqdm; the estimated time of completion was in days!)

Is there an easier and faster way of performing this task?

Solution

You need to vectorize this; select the rows using a boolean index and use .str.zfill() on the resulting subsets:

# select the right rows to avoid wasting time operating on longer strings
shorter = df.Random.str.len() < 9
longer = ~shorter
df.Random[shorter] = df.Random[shorter].str.zfill(9)
df.Random[longer] = df.Random[longer].str.zfill(20)

Note: I did not use np.where() because we wouldn't want to double the work. A vectorized df.Random.str.zfill() is faster than looping over the rows, but doing it twice still takes more time than doing it just once for each set of rows.
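Applied to the sample frame from the question, the masked-assignment approach can be sketched end to end. (The `.loc` indexing here is a small variation on the answer's `df.Random[mask] = ...` form to avoid pandas chained-assignment warnings; the logic is the same.)

```python
import pandas as pd

df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})

# boolean masks splitting the column by string length
shorter = df.Random.str.len() < 9
longer = ~shorter

# pad each subset to its target width; .loc avoids chained assignment
df.loc[shorter, "Random"] = df.Random[shorter].str.zfill(9)
df.loc[longer, "Random"] = df.Random[longer].str.zfill(20)

print(df.Random.tolist())
# → ['000000086', '00000000007639103627', '000000096', '000000032', '00000000001469476501']
```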

Speed comparison on 1 million rows of strings with values of random lengths (from 5 characters all the way up to 30):

In [1]: import numpy as np, pandas as pd

In [2]: import platform; print(platform.python_version_tuple(), platform.platform(), pd.__version__, np.__version__, sep="\n")
('3', '7', '3')
Darwin-17.7.0-x86_64-i386-64bit
0.24.2
1.16.4

In [3]: !sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

In [4]: from random import choices, randrange

In [5]: def randvalue(chars="0123456789", _c=choices, _r=randrange):
   ...:     return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")
   ...:

In [6]: df = pd.DataFrame(data={"Random": [randvalue() for _ in range(10**6)]})

In [7]: %%timeit
   ...: target = df.copy()
   ...: shorter = target.Random.str.len() < 9
   ...: longer = ~shorter
   ...: target.Random[shorter] = target.Random[shorter].str.zfill(9)
   ...: target.Random[longer] = target.Random[longer].str.zfill(20)
   ...:
   ...:
825 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %%timeit
   ...: target = df.copy()
   ...: target.Random = np.where(target.Random.str.len()<9,target.Random.str.zfill(9),target.Random.str.zfill(20))
   ...:
   ...:
929 ms ± 69.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(The target = df.copy() line is needed to make sure that each repeated test run is isolated from the one before.)

Conclusion: on 1 million rows, using np.where() is about 10% slower.

However, using df.Random.apply(), as proposed by jackbicknell14, beats either method by a huge margin:

In [9]: def fill_zeros(x, _len=len, _zfill=str.zfill):
   ...:     # len() and str.zfill() are cached as parameters for performance
   ...:     return _zfill(x, 9 if _len(x) < 9 else 20)

In [10]: %%timeit
    ...: target = df.copy()
    ...: target.Random = target.Random.apply(fill_zeros)
    ...:
    ...:
299 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's about 3 times faster!
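As a quick sanity check (a sketch, not part of the original answer), fill_zeros produces the same padded widths on the question's sample values as the masked approach:

```python
def fill_zeros(x, _len=len, _zfill=str.zfill):
    # len() and str.zfill() are cached as default parameters for performance
    return _zfill(x, 9 if _len(x) < 9 else 20)

values = "86 7639103627 96 32 1469476501".split()
print([fill_zeros(v) for v in values])
# → ['000000086', '00000000007639103627', '000000096', '000000032', '00000000001469476501']
```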
