Add leading zeros based on condition in python
Problem description
I have a dataframe with 5 million rows. Let's say the dataframe looked like below:
>>> df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})
>>> df
Random
0 86
1 7639103627
2 96
3 32
4 1469476501
Note that the Random
column is stored as a string.
If the number in column Random has fewer than 9 digits, I want to add leading zeros to make it 9 digits. If the number has 9 or more digits, I want to add leading zeros to make it 20 digits.
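For reference, `str.zfill(width)` left-pads a numeric string with zeros up to the given width, so the two cases from the requirement look like this:

```python
# str.zfill(width) pads a string with leading zeros up to the given width.
print("86".zfill(9))           # fewer than 9 digits -> pad to 9
print("7639103627".zfill(20))  # 9 or more digits -> pad to 20
```

This prints `000000086` and `00000000007639103627`.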
What I have done is this:
for i in range(0,len(df['Random'])):
if len(df['Random'][i]) < 9:
df['Random'][i]=df['Random'][i].zfill(9)
else:
df['Random'][i]=df['Random'][i].zfill(20)
Since the number of rows is over 5 million, this process takes a lot of time! (performance was 5it/sec. Tested using tqdm, estimated time of completion was in days!).
Is there an easier and faster way of performing this task?
You need to vectorize this; select the rows using a boolean index and use .str.zfill()
on the resulting subsets:
# select the right rows to avoid wasting time operating on longer strings
shorter = df.Random.str.len() < 9
longer = ~shorter
df.Random[shorter] = df.Random[shorter].str.zfill(9)
df.Random[longer] = df.Random[longer].str.zfill(20)
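As a minimal runnable sketch on the sample frame from the question (written with .loc rather than chained indexing, which modern pandas warns about or ignores on assignment):

```python
import pandas as pd

df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})

# Boolean masks split the rows so each subset is zero-filled only once.
shorter = df["Random"].str.len() < 9
longer = ~shorter
df.loc[shorter, "Random"] = df.loc[shorter, "Random"].str.zfill(9)
df.loc[longer, "Random"] = df.loc[longer, "Random"].str.zfill(20)

print(df["Random"].tolist())
# ['000000086', '00000000007639103627', '000000096', '000000032', '00000000001469476501']
```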
Note: I did not use np.where()
because we wouldn't want to double the work. A vectorized df.Random.str.zfill()
is faster than looping over the rows, but doing it twice still takes more time than doing it just once for each set of rows.
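For comparison, the np.where() version produces the same result but pads every string twice, once for each branch, before selecting one per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})

# Both .str.zfill() calls run over the whole column; np.where() then
# picks one result per row, so each string is padded twice.
df["Random"] = np.where(
    df["Random"].str.len() < 9,
    df["Random"].str.zfill(9),
    df["Random"].str.zfill(20),
)
print(df["Random"].tolist())
# ['000000086', '00000000007639103627', '000000096', '000000032', '00000000001469476501']
```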
Speed comparison on 1 million rows of strings with values of random lengths (from 5 characters all the way up to 30):
In [1]: import numpy as np, pandas as pd
In [2]: import platform; print(platform.python_version_tuple(), platform.platform(), pd.__version__, np.__version__, sep="\n")
('3', '7', '3')
Darwin-17.7.0-x86_64-i386-64bit
0.24.2
1.16.4
In [3]: !sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
In [4]: from random import choices, randrange
In [5]: def randvalue(chars="0123456789", _c=choices, _r=randrange):
...: return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")
...:
In [6]: df = pd.DataFrame(data={"Random": [randvalue() for _ in range(10**6)]})
In [7]: %%timeit
...: target = df.copy()
...: shorter = target.Random.str.len() < 9
...: longer = ~shorter
...: target.Random[shorter] = target.Random[shorter].str.zfill(9)
...: target.Random[longer] = target.Random[longer].str.zfill(20)
...:
...:
825 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: target = df.copy()
...: target.Random = np.where(target.Random.str.len()<9,target.Random.str.zfill(9),target.Random.str.zfill(20))
...:
...:
929 ms ± 69.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(The target = df.copy()
line is needed to make sure that each repeated test run is isolated from the one before.)
Conclusion: on 1 million rows, using np.where()
is about 10% slower.
However, using df.Random.apply()
, as proposed by jackbicknell14, beats either method by a huge margin:
In [9]: def fill_zeros(x, _len=len, _zfill=str.zfill):
...: # len() and str.zfill() are cached as parameters for performance
...: return _zfill(x, 9 if _len(x) < 9 else 20)
In [10]: %%timeit
...: target = df.copy()
...: target.Random = target.Random.apply(fill_zeros)
...:
...:
299 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's about 3 times faster!
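Pulled out of the timing transcript, the winning apply() approach is a short, self-contained sketch:

```python
import pandas as pd

def fill_zeros(x, _len=len, _zfill=str.zfill):
    # len() and str.zfill() are cached as default arguments for speed.
    return _zfill(x, 9 if _len(x) < 9 else 20)

df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})
df["Random"] = df["Random"].apply(fill_zeros)
print(df["Random"].tolist())
# ['000000086', '00000000007639103627', '000000096', '000000032', '00000000001469476501']
```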