Group values with common domain and page values

Question

As a follow-up to a previous question, "Parsing URI parameter and keyword value pairs", I would like to group URLs that have the same domain and page name, followed by all of their parameters and values. The URLs may have the same or a different number of parameters and/or respective values. The URL/page value is printed once, followed by all of its parameter and keyword values.

I am looking for an answer that uses Python to parse, group and print the values. I have not been able to find one via Google or Stack Overflow.

Example source of URLs with various parameters and values:

www.domain.com/page?id_eve=479989&adm=no
www.domain.com/page?id_eve=47&adm=yes
www.domain.com/page?id_eve=479
domain.com/cal?view=month
domain.com/cal?view=day
ww2.domain.com/cal?date=2007-04-14
ww2.domain.com/cal?date=2007-08-19
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//support.domain.com/downloads/index.asp&unique=12345
blog.news.org/news/calendar.php?view=day&date=2011-12-10
www.domain.edu/some/folder/image.php?l=adm&y=5&id=2&page=http%3A//.domain.com/downloads/index.asp&unique=12345
blog.news.org/news/calendar.php?view=month&date=2011-12-10

Example output I am looking for: each URL, followed by the list of parameter/value combinations collected from all of the matching URLs in the original source.

www.domain.com/page
id_eve=479989
id_eve=47
id_eve=479
adm=no
adm=yes
domain.com/cal
view=month
view=day
ww2.domain.com/cal
date=2007-04-14
date=2007-08-19
www.domain.edu/some/folder/image.php
l=adm
l=adm
id=2
id=2
page=http%3A//.domain.com/downloads/index.asp
page=http%3A//support.domain.com/downloads/index.asp

Recommended answer

Use defaultdict() to collect the parameters per URL path:

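from collections import defaultdict
# Python 3 locations; in Python 2 these lived in the urllib and urlparse modules
from urllib.parse import parse_qsl, quote, urlparse


# Map each URL path to the list of "key=value" strings seen for it
urls = defaultdict(list)
with open('links.txt') as f:
    for url in f:
        # The sample URLs have no scheme, so the host ends up in parsed_url.path
        parsed_url = urlparse(url.strip())
        # parse_qsl decodes the values; quote() re-encodes them for printing
        params = parse_qsl(parsed_url.query, keep_blank_values=True)
        for key, value in params:
            urls[parsed_url.path].append("%s=%s" % (key, quote(value)))

# printing results
for url, params in urls.items():
    print(url)
    for param in params:
        print(param)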

Prints (the order of the groups follows the dict's iteration order, so it may differ):

ww2.domain.com/cal
date=2007-04-14
date=2007-08-19
www.domain.edu/some/folder/image.php
l=adm
y=5
id=2
page=http%3A//support.domain.com/downloads/index.asp
unique=12345
l=adm
y=5
id=2
page=http%3A//.domain.com/downloads/index.asp
unique=12345
domain.com/cal
view=month
view=day
www.domain.com/page
id_eve=479989
adm=no
id_eve=47
adm=yes
id_eve=479
blog.news.org/news/calendar.php
view=day
date=2011-12-10
view=month
date=2011-12-10
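
A note on why the path works as the grouping key here: the sample URLs carry no scheme, so urlparse leaves netloc empty and the host lands in path. The sketch below assumes Python 3 and uses a hypothetical helper, split_url (not part of the original answer), that joins netloc and path so URLs with and without "http://" end up under the same key.

```python
from urllib.parse import urlparse


def split_url(url):
    """Return (page, query) for a URL that may or may not carry a scheme."""
    parsed = urlparse(url.strip())
    # Without a scheme, netloc is empty and the host is part of path.
    # With a scheme, netloc holds the host, so joining both gives the same key.
    return parsed.netloc + parsed.path, parsed.query


print(split_url("www.domain.com/page?id_eve=479&adm=no"))
print(split_url("http://www.domain.com/page?id_eve=479&adm=no"))
```

Both calls should yield the same page key, which matters if links.txt ever mixes the two forms.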

Hope that helps.
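
The output above lists parameters in the order they occur, while the example output in the question groups them by parameter name (all id_eve values, then all adm values). One possible variation, offered as a sketch rather than as part of the original answer, is a nested defaultdict keyed first by page and then by parameter name. The function name group_params and the in-memory sample list are assumptions made for illustration; duplicates are kept, as in the question's example.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, quote, urlparse


def group_params(lines):
    """Map each page to {parameter name: [values in the order they were seen]}."""
    pages = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parsed = urlparse(line.strip())
        # Join netloc and path so scheme-less and full URLs share one key
        page = parsed.netloc + parsed.path
        for key, value in parse_qsl(parsed.query, keep_blank_values=True):
            # quote() re-encodes the value that parse_qsl decoded
            pages[page][key].append(quote(value))
    return pages


# Illustrative input taken from the question's sample URLs
sample = [
    "www.domain.com/page?id_eve=479989&adm=no",
    "www.domain.com/page?id_eve=47&adm=yes",
    "www.domain.com/page?id_eve=479",
]

for page, params in group_params(sample).items():
    print(page)
    for key, values in params.items():
        for value in values:
            print("%s=%s" % (key, value))
```

On Python 3.7 and later, dicts keep insertion order, so pages and parameter names come out in the order they were first encountered.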
