How to extract html links with a matching word from a website using python


Problem description


I have a URL, say http://www.bbc.com/news/world/asia/. From just this page, I want to extract all the links that contain India or INDIA or india (the match should be case-insensitive).

Clicking any of the output links should take me to the corresponding page. For example, these are a few link texts that contain india: India shock over Dhoni retirement and India fog continues to cause chaos. Clicking these links should redirect me to http://www.bbc.com/news/world-asia-india-30640436 and http://www.bbc.com/news/world-asia-india-30630274 respectively.

import re
import requests
from bs4 import BeautifulSoup, SoupStrainer

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
only_links = SoupStrainer('a', href=re.compile('india'))
print(only_links)

I wrote this very basic, minimal code in Python 3.4.2.

Solution

You need to search for the word india in the displayed text, not in the href attribute. To do this you'll need a custom match function instead:

from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)

The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.
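The same filter can be written as a named function, which is a little easier to read and document than a lambda. This sketch runs it against a small sample document (hypothetical markup standing in for the BBC page, parsed offline):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the live BBC page.
html = """
<a href="/news/world-asia-india-1">India shock over Dhoni retirement</a>
<a href="/news/world-asia-2">Asia markets rise</a>
<a name="anchor">India</a>
<a href="/x"><span>India fog continues</span></a>
"""

def india_links(tag):
    """True for <a> tags that have an href and mention 'india' in their text."""
    # find_all only calls this on Tag objects, so tag.name is always present.
    return (tag.name == 'a'
            and 'href' in tag.attrs
            and 'india' in tag.get_text().lower())

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all(india_links)
print([t['href'] for t in results])
```

Note that the third anchor has no href attribute and is skipped, while the fourth matches because get_text() includes the text of the nested span.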

Note that I used the requests response object .content attribute; leave decoding to BeautifulSoup!

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content)
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
 <a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
 <a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
 <a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
 <a href="/news/world/asia/india/">India</a>,
 <a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
 <a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
 <a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
 <a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
 <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]

Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we had to use the lambda search because a search with a text regular expression would not have found that element: the matching text (Special report: India Direct) is not the only element in the tag, so it is not found.

A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means the Court boost to India BJP chief headline text is not a direct child of the link tag.
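The difference can be demonstrated on a toy document (hypothetical markup; the filter keyword is called string= in current BeautifulSoup, text= in older versions):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the problem link above: the second anchor
# contains an <img> plus text, so the anchor's .string is None.
html = """
<a href="/plain">India</a>
<a href="/mixed"><img src="x.jpg"/>Special report: India Direct</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# string= compares against tag.string, which is None whenever the tag has
# more than one child, so the mixed-content link is missed:
by_string = soup.find_all('a', string=re.compile('india', re.I))
print([t['href'] for t in by_string])

# A match function sees the full rendered text via get_text():
by_func = soup.find_all(lambda t: t.name == 'a' and 'href' in t.attrs
                        and 'india' in t.get_text().lower())
print([t['href'] for t in by_func])
```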

You can extract just the links with:

from urllib.parse import urljoin

result_links = [urljoin(url, tag['href']) for tag in results]

where all relative URLs are resolved relative to the original URL:

>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30647504',
 'http://www.bbc.com/news/world-asia-india-30640444',
 'http://www.bbc.com/news/world-asia-india-30640436',
 'http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30630274',
 'http://www.bbc.com/news/world-asia-india-30632852',
 'http://www.bbc.com/sport/0/cricket/30632182',
 'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
 'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
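Note that http://www.bbc.com/news/world/asia/india/ appears twice in this list, because the page links to the India index twice. If you only want unique URLs, dict.fromkeys dedupes while preserving order (Python 3.7+; a sketch using a shortened copy of the list above):

```python
# Shortened copy of the result_links list from the demo above.
result_links = [
    'http://www.bbc.com/news/world/asia/india/',
    'http://www.bbc.com/news/world-asia-india-30647504',
    'http://www.bbc.com/news/world/asia/india/',
]
# dict keys are unique and (since Python 3.7) insertion-ordered.
unique_links = list(dict.fromkeys(result_links))
print(unique_links)
```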
