RegEx to split address into three distinct Series [Part 1]
Problem description
I'm learning Python by experimenting with a data set containing customer information.
The DataFrame structure is the following (these are made-up records):
import pandas as pd
import numpy as np
df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ],
'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan],
})
print(df)
   cus_name                address  postal_code
0     James             Main St 59       1410.0
1      Mary      Yellow Av 11 b.F1       1210.0
2     David   Terrazzo Way 100-102       1020.0
3     Linda                    NaN          NaN
4    George  Hamilton St 159 b.A/B       1310.0
5  Jennifer                    NaN          NaN
6      John           Henry St 7 D       1080.0
7     Maria   Mc-Kenzie Av 40P b.1       1190.0
8   Charles   Neptune Av 14 15 b.G       1040.0
9     Helen                    NaN          NaN
I'm particularly interested in the address Series. Specifically, my goal is to "split" the street, number, and box information into three distinct Series.
For instance, after the transformation, the first and eighth records/rows (index 0 and 7) should look like this:
| cus_name | street | number | box | postal_code |
|----------|--------------|--------|-----|-------------|
| James | Main St | 59 | NaN | 1410 |
| Maria | Mc-Kenzie Av | 40P | 1 | 1190.0 |
At first, I had no idea how to tackle this problem. After doing some research here, I found some interesting related posts that use regular expressions.
Since I'm no expert in Python (nor in regular expressions), I thought I could start by identifying the pattern in the address Series. In fact, each address has the following pattern:
- The street part, located at the beginning of the string. It is composed of one or more words separated by a white-space character or a dash (e.g. Mc-Kenzie Av);
- The number part, located in the middle of the string. It is composed of one or more alpha-numeric words separated by a white-space character or a dash (e.g. 100-102, 7 D);
- The box part, located at the end of the string. It always immediately follows the b. characters and is composed of one word containing alpha-numeric characters and possibly some special characters (e.g. A/B, F1).
I'm asking for help to achieve my desired goal using regular expressions (if regex is the solution).
Recommended answer
Another regex approach:
In [913]: df[['street', 'number', 'box']] = df.address.str.extract(r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?', expand=True)
In [914]: df
Out[914]:
   cus_name                address  postal_code        street   number  box
0     James             Main St 59       1410.0       Main St       59  NaN
1      Mary      Yellow Av 11 b.F1       1210.0     Yellow Av       11   F1
2     David   Terrazzo Way 100-102       1020.0  Terrazzo Way  100-102  NaN
3     Linda                    NaN          NaN           NaN      NaN  NaN
4    George  Hamilton St 159 b.A/B       1310.0   Hamilton St      159  A/B
5  Jennifer                    NaN          NaN           NaN      NaN  NaN
6      John           Henry St 7 D       1080.0      Henry St      7 D  NaN
7     Maria   Mc-Kenzie Av 40P b.1       1190.0  Mc-Kenzie Av      40P    1
8   Charles   Neptune Av 14 15 b.G       1040.0    Neptune Av    14 15    G
9     Helen                    NaN          NaN           NaN      NaN  NaN
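To make the expression easier to follow, here is a sketch of the same pattern written with re.VERBOSE and named groups, followed by one possible way to get the column layout asked for in the question. The group names (street, number, box) and the variable names ADDRESS_PATTERN, parts and result are chosen here for illustration only; the pattern itself is unchanged and has only been checked against the sample addresses above, so other address formats may need adjustments.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cus_name': ['James', 'Mary', 'David', 'Linda', 'George',
                 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
    'address': ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan,
                'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D',
                'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan],
    'postal_code': [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan],
})

# Same expression as in the answer, split over several lines so that
# each piece can carry a comment (whitespace is ignored under re.VERBOSE).
ADDRESS_PATTERN = r"""
    (?P<street>\D+)   # street: leading run of non-digit characters
    \s+               # whitespace between street and number
    (?P<number>
        \d+           # the number part starts with digits
        [\s-]?        # optional single space or dash ("100-102", "7 D")
        (?!b)         # do not run into the "b." box marker
        \w*           # optional alphanumeric tail ("P", "102", "D")
    )
    (?:\s+b\.)?       # optional " b." marker, not captured
    (?P<box>\S+)?     # box: the non-space text after the marker, if any
"""

# With named groups, str.extract uses the group names as column names.
parts = df['address'].str.extract(ADDRESS_PATTERN, flags=re.VERBOSE)

# Replace the original address column by the three new ones,
# in the column order shown in the question.
result = pd.concat([df.drop(columns='address'), parts], axis=1)[
    ['cus_name', 'street', 'number', 'box', 'postal_code']
]
print(result)
```

Rows whose address is NaN simply end up with NaN in all three new columns, and addresses without a b. marker get NaN in box.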