How to use parse from phonenumbers Python library on a pandas data frame?


Problem description

How can I parse phone numbers from a pandas data frame, ideally using the phonenumbers library?

I am trying to use a port of Google's libphonenumber library on Python, https://pypi.org/project/phonenumbers/.

I have a data frame with 3 million phone numbers from many countries. I have a column with the phone number, and a column with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.

I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.

df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), str(df.region_code)))

The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.

df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), "US"))

I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:

for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(
        str(df.phone_number[i]), str(df.region_code[i]))

Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.

I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.

Recommended answer

Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:

df['phone_number_clean'] = df.apply(lambda x: 
                              phonenumbers.parse(str(x.phone_number), 
                                                 str(x.region_code)), 
                              axis='columns')

Differences:

  1. You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (Print the output of df.phone_number to the console: what is returned is all the information that your lambda function will be given.)

The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing in the other direction, which would give it one whole column at a time ([phonenumber0, phonenumber1, phonenumber2...]).

Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows, i.e., x.phone_number and x.region_code.
