Pythonic way to parse/split URLs in a pandas DataFrame


Question

I have a df with thousands of links like the ones below, for different users, in a column labeled url:

https://www.google.com/something
https://mail.google.com/anohtersomething
https://calendar.google.com/somethingelse
https://www.amazon.com/yetanotherthing

I have the following code:

import urllib.parse as urlparse  # Python 3; on Python 2 this was `import urlparse`

# Pre-create the output columns
df['protocol'] = ''
df['domain'] = ''
df['path'] = ''
df['query'] = ''
df['fragment'] = ''
unique_urls = df.url.unique()
l = len(unique_urls)
i = 0
for url in unique_urls:
    i += 1
    print("\r%d / %d" % (i, l), end="")  # progress counter
    split = urlparse.urlsplit(url)
    row_index = df.url == url  # boolean mask of the rows holding this URL
    df.loc[row_index, 'protocol'] = split.scheme
    df.loc[row_index, 'domain'] = split.netloc
    df.loc[row_index, 'path'] = split.path
    df.loc[row_index, 'query'] = split.query
    df.loc[row_index, 'fragment'] = split.fragment

The code parses and splits the URLs correctly, but it is slow because I am looping over the URLs in the df one at a time. Is there a more efficient way to parse the URLs?

Recommended answer

You can use Series.map to accomplish the same in one line:

df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*df['url'].map(urlparse.urlsplit))

Using timeit, this ran in 2.31 ms per loop instead of 179 ms per loop as in the original method, when run on 186 URLs. (Note, however, that the code is not optimized for duplicates and will run the same URLs through urlparse multiple times.)
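If the duplicate parsing mentioned in that note matters for your data, one possible workaround is to parse each distinct URL once and join the results back onto the rows. The following is only a sketch under assumptions: it targets Python 3 (where urlsplit lives in urllib.parse), the sample data and the names parts and result are hypothetical, and it has not been timed against the one-liner.

```python
from urllib.parse import urlsplit

import pandas as pd

# Hypothetical sample data containing a repeated URL
df = pd.DataFrame({'url': [
    'https://www.google.com/something',
    'https://mail.google.com/anohtersomething',
    'https://www.google.com/something',
    'https://www.amazon.com/yetanotherthing?x=1#top',
]})

columns = ['protocol', 'domain', 'path', 'query', 'fragment']

# Parse each distinct URL exactly once
unique_urls = df['url'].unique()
parts = pd.DataFrame(
    [tuple(urlsplit(u)) for u in unique_urls],
    columns=columns,
    index=pd.Index(unique_urls, name='url'),
)

# Left-join the parsed components back onto every row by its URL
result = df.join(parts, on='url')
print(result)
```

Whether this beats the plain Series.map version depends on how many duplicates the column contains, so it is worth measuring with timeit on your own data.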

Full code for the one-liner above:

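import pandas
import urllib.parse as urlparse  # Python 3; on Python 2 use `import urlparse`

urls = ['https://www.google.com/something','https://mail.google.com/anohtersomething','https://www.amazon.com/yetanotherthing'] # tested with list of 186 urls instead
df = pandas.DataFrame({'url': urls})
# Parse every row's URL and unpack the five components into separate columns
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*df['url'].map(urlparse.urlsplit))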
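To see why the unpacking on the left-hand side works, it may help to look at the pieces without pandas. urlsplit returns a five-field named tuple (scheme, netloc, path, query, fragment), and zip(*...) transposes a sequence of those tuples into five sequences, one per field. A small illustration, assuming Python 3 and using made-up example URLs:

```python
from urllib.parse import urlsplit

# Hypothetical example URLs
urls = [
    'https://www.google.com/something?q=1#top',
    'https://mail.google.com/anohtersomething',
]

# Each result is a named tuple: (scheme, netloc, path, query, fragment)
parsed = [urlsplit(u) for u in urls]

# zip(*parsed) turns N rows of 5 fields into 5 tuples of N values each
protocol, domain, path, query, fragment = zip(*parsed)

print(domain)    # ('www.google.com', 'mail.google.com')
print(fragment)  # ('top', '')
```

Each of the five resulting tuples has one entry per URL, which is why they can be assigned directly to five DataFrame columns.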
