pythonic way to parse/split URLs in a pandas dataframe
Question
I have a df that has thousands of links like the ones below, for different users, in a column labeled url:
https://www.google.com/something
https://mail.google.com/anohtersomething
https://calendar.google.com/somethingelse
https://www.amazon.com/yetanotherthing
我有以下代码:
from urllib.parse import urlsplit  # Python 3; in Python 2 this lived in the urlparse module

df['protocol'] = ''
df['domain'] = ''
df['path'] = ''
df['query'] = ''
df['fragment'] = ''
unique_urls = df.url.unique()
l = len(unique_urls)
i = 0
for url in unique_urls:
    i += 1
    print("\r%d / %d" % (i, l), end='')
    split = urlsplit(url)
    row_index = df.url == url
    df.loc[row_index, 'protocol'] = split.scheme
    df.loc[row_index, 'domain'] = split.netloc
    df.loc[row_index, 'path'] = split.path
    df.loc[row_index, 'query'] = split.query
    df.loc[row_index, 'fragment'] = split.fragment
The code is able to parse and split the urls correctly, but it is slow since I am iterating over each row of the df. Is there a more efficient way to parse the URLs?
Answer
You can use Series.map to accomplish the same in one line (`urlsplit` here is `urllib.parse.urlsplit`):

df['protocol'], df['domain'], df['path'], df['query'], df['fragment'] = zip(*df['url'].map(urlsplit))
Using timeit, this ran in 2.31 ms per loop instead of 179 ms per loop as in the original method, when run on 186 urls. (Note, however, that the code is not optimized for duplicates and will run the same url through urlsplit multiple times.)
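One way to address that caveat, sketched here as a suggestion rather than part of the original answer: split only the unique urls once, then map the parsed results back onto the column, so duplicates never hit `urlsplit` twice:

```python
from urllib.parse import urlsplit

import pandas as pd

df = pd.DataFrame({'url': [
    'https://www.google.com/something',
    'https://mail.google.com/anohtersomething',
    'https://www.google.com/something',  # duplicate: parsed only once below
]})

# Parse each unique url exactly once, then look the result up per row.
parts = {url: urlsplit(url) for url in df['url'].unique()}
split = df['url'].map(parts)
df['protocol'] = split.map(lambda s: s.scheme)
df['domain'] = split.map(lambda s: s.netloc)
df['path'] = split.map(lambda s: s.path)
```

With many repeated urls this does the expensive parsing once per distinct value instead of once per row.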
Full code:
import pandas
from urllib.parse import urlsplit

urls = ['https://www.google.com/something', 'https://mail.google.com/anohtersomething', 'https://www.amazon.com/yetanotherthing'] # tested with a list of 186 urls instead
df = pandas.DataFrame({'url': urls})
df['protocol'], df['domain'], df['path'], df['query'], df['fragment'] = zip(*df['url'].map(urlsplit))
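An equivalent formulation, sketched as an alternative (not from the original answer): since `urlsplit` returns a named tuple, `apply(pd.Series)` expands its five fields into columns directly, which avoids the `zip(*...)` transpose:

```python
from urllib.parse import urlsplit

import pandas as pd

df = pd.DataFrame({'url': ['https://www.google.com/something',
                           'https://mail.google.com/anohtersomething']})

# Each SplitResult tuple becomes one row of a five-column DataFrame.
parts = df['url'].apply(urlsplit).apply(pd.Series)
parts.columns = ['protocol', 'domain', 'path', 'query', 'fragment']
df = df.join(parts)
```

This is typically slower than the `zip` one-liner (an extra `apply` pass), but some find it easier to read.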