How to use parse from the phonenumbers Python library on a pandas data frame?
Question
How can I parse phone numbers from a pandas data frame, ideally using the phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python, https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. One column holds the phone number and another holds the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding region code, but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), "US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(
        str(df.phone_number[i]), str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Recommended answer
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a data frame, rather than over the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(
    lambda x: phonenumbers.parse(str(x.phone_number), str(x.region_code)),
    axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire data frame (i.e., df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (Print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given.)
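A small sketch of why the original attempt fails, assuming the same column names as the question and made-up sample values: inside df.phone_number.apply, the lambda's x is a single phone number, but df.region_code still refers to the whole column, so str(df.region_code) is the printed form of the entire Series rather than one region code.

```python
import pandas as pd

# Sample values are assumptions; column names follow the question.
df = pd.DataFrame({
    "phone_number": ["650 253 0000", "020 8366 1177"],
    "region_code": ["US", "GB"],
})

# This is what the original lambda passed as the region on every call:
region_arg = str(df.region_code)
print(region_arg)  # a multi-line dump of the whole column, not a code like "US"
```

That string is not a region the library recognises, which is consistent with the "Missing or invalid default region" error reported in the question.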
The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1], ...), as opposed to slicing in the other direction, which would give it one column at a time ([phonenumber0, phonenumber1, phonenumber2, ...]).
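The two slicing directions can be compared on a toy data frame using pandas alone. This is only an illustration with assumed sample values; the lambdas just build strings so that it is visible what each call receives.

```python
import pandas as pd

# Sample values are assumptions; column names follow the question.
df = pd.DataFrame({
    "phone_number": ["6502530000", "2083661177"],
    "region_code": ["US", "GB"],
})

# axis="columns" (same as axis=1): the function gets one row per call,
# as a Series whose labels are the column names.
by_row = df.apply(lambda x: f"{x.phone_number}/{x.region_code}", axis="columns")

# axis="index" (same as axis=0, the default): the function gets one whole
# column per call, so only values from that single column are available.
by_col = df.apply(lambda x: "|".join(x), axis="index")

print(by_row.tolist())
print(by_col.to_dict())
```

The first result has one entry per row, while the second has one entry per column, which is why only the row-wise form can pair each number with its own region code.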
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.region_code.