RegEx to split address into three distinct Series [Part 1]


Question

I'm learning Python by experimenting with a data set containing customer information.

The DataFrame structure is the following (these are made-up records):

import pandas as pd
import numpy as np

df = pd.DataFrame({'cus_name' : ['James', 'Mary', 'David', 'Linda', 'George', 'Jennifer', 'John', 'Maria', 'Charles', 'Helen'],
                   'address' : ['Main St 59', 'Yellow Av 11 b.F1', 'Terrazzo Way 100-102', np.nan, 'Hamilton St 159 b.A/B', np.nan, 'Henry St 7 D', 'Mc-Kenzie Av 40P b.1', 'Neptune Av 14 15 b.G', np.nan ], 
                   'postal_code' : [1410, 1210, 1020, np.nan, 1310, np.nan, 1080, 1190, 1040, np.nan], 
                  })

print(df)

   cus_name                address  postal_code
0     James             Main St 59       1410.0
1      Mary      Yellow Av 11 b.F1       1210.0
2     David   Terrazzo Way 100-102       1020.0
3     Linda                    NaN          NaN
4    George  Hamilton St 159 b.A/B       1310.0
5  Jennifer                    NaN          NaN
6      John           Henry St 7 D       1080.0
7     Maria   Mc-Kenzie Av 40P b.1       1190.0
8   Charles   Neptune Av 14 15 b.G       1040.0
9     Helen                    NaN          NaN

I'm particularly interested in the address Series. Specifically, my goal is to "split" the information about the street, number, and box into three distinct Series.

For instance, after the transformation, the first and eighth records/rows should look like this:

| cus_name | street       | number | box | postal_code |
|----------|--------------|--------|-----|-------------|
| James    | Main St      | 59     | NaN | 1410.0      |
| Maria    | Mc-Kenzie Av | 40P    | 1   | 1190.0      |

At first, I had no idea how to tackle this problem. After doing some research here, I found some interesting related posts that use regular expressions.

Since I'm no expert in Python (nor in regular expressions), I thought I could start by identifying the pattern in the address Series. In fact, each address has the following pattern:

  • The street part, which is located at the beginning of the string. It is composed of one or more words separated by a white-space character or a dash (e.g. Mc-Kenzie Av);

  • The number part, which is located in the middle of the string. It is composed of one or more alpha-numeric words separated by a white-space character or a dash (e.g. 100-102, 7 D);

  • The box part, which is located at the end of the string. It always immediately follows the b. characters and is composed of one word containing alpha-numeric characters and possibly some special characters (e.g. A/B, F1).

I'm asking for help to achieve my desired goal using regular expressions (if regex is the right solution).

Recommended answer

Another regex approach:

In [913]: df[['street', 'number', 'box']] = df.address.str.extract(r'(\D+)\s+(\d+[\s-]?(?!b)\w*)(?:\s+b\.)?(\S+)?', expand=True)

In [914]: df
Out[914]: 
   cus_name                address  postal_code        street   number  box
0     James             Main St 59       1410.0       Main St       59  NaN
1      Mary      Yellow Av 11 b.F1       1210.0     Yellow Av       11   F1
2     David   Terrazzo Way 100-102       1020.0  Terrazzo Way  100-102  NaN
3     Linda                    NaN          NaN           NaN      NaN  NaN
4    George  Hamilton St 159 b.A/B       1310.0   Hamilton St      159  A/B
5  Jennifer                    NaN          NaN           NaN      NaN  NaN
6      John           Henry St 7 D       1080.0      Henry St      7 D  NaN
7     Maria   Mc-Kenzie Av 40P b.1       1190.0  Mc-Kenzie Av      40P    1
8   Charles   Neptune Av 14 15 b.G       1040.0    Neptune Av    14 15    G
9     Helen                    NaN          NaN           NaN      NaN  NaN
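
For readers who find the one-line pattern hard to follow, here is a minimal sketch of the same regex written in verbose mode with named groups. The constant name ADDRESS_PATTERN and the small sample Series are only for illustration and are not part of the original answer. With named groups, str.extract names the resulting columns after the groups.

```python
import re

import numpy as np
import pandas as pd

# Same pattern as in the answer, spread over several lines with comments.
# ADDRESS_PATTERN is a hypothetical name chosen for this sketch.
ADDRESS_PATTERN = r"""
    (?P<street>\D+)      # street: leading run of non-digit characters
    \s+                  # whitespace between street and number
    (?P<number>
        \d+              # the number starts with one or more digits
        [\s-]?           # optional single space or dash separator
        (?!b)            # the next character must not be "b" (the "b." marker)
        \w*              # optional trailing part such as "102", "15", "D" or "P"
    )
    (?:\s+b\.)?          # optional "b." marker, matched but not captured
    (?P<box>\S+)?        # box: any non-space text after the marker
"""

# A few of the made-up addresses from the question, plus a missing value.
sample = pd.Series(
    ["Main St 59", "Yellow Av 11 b.F1", "Neptune Av 14 15 b.G", np.nan]
)

# re.VERBOSE lets the pattern contain whitespace and comments.
parts = sample.str.extract(ADDRESS_PATTERN, flags=re.VERBOSE)
print(parts)
```

Note that this relies on the assumptions stated in the question: the street part contains no digits, and the second part of a number never starts with the letter b. Addresses that break those assumptions would likely need a different pattern.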
