How to extract html links with a matching word from a website using python

Question

I have a URL, say http://www.bbc.com/news/world/asia/. From just this page I want to extract all the links whose text contains India, INDIA or india (the match should be case insensitive).

If I click any of the output links it should take me to the corresponding page; for example, India shock over Dhoni retirement and India fog continues to cause chaos are two such links. If I click these links I should be redirected to http://www.bbc.com/news/world-asia-india-30640436 and http://www.bbc.com/news/world-asia-india-30630274 respectively.

import re

import requests
from bs4 import BeautifulSoup, SoupStrainer  # SoupStrainer must be imported from bs4

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
only_links = SoupStrainer('a', href=re.compile('india'))
print(only_links)

I wrote this very basic, minimal code in Python 3.4.2.

Answer

You need to search for the word india in the displayed text. To do this you'll need a custom function instead:

from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)

The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.
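The same filter can also be written as a named function, since find_all() accepts any callable that takes a tag and returns a boolean. A sketch, checked here against a toy snippet rather than the live BBC page:

```python
from bs4 import BeautifulSoup

def india_links(tag):
    # Match <a> tags that carry an href and mention "india" in their rendered text
    return (tag.name == 'a' and
            tag.has_attr('href') and
            'india' in tag.get_text().lower())

# Quick check: only the <a> that has an href should match
soup = BeautifulSoup('<a href="/x">India news</a><a>India</a><p>india</p>',
                     'html.parser')
print(len(soup.find_all(india_links)))  # 1
```

tag.has_attr('href') is the documented equivalent of 'href' in tag.attrs, and reads a little more clearly.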

Note that I used the requests response object .content attribute; leave decoding to BeautifulSoup!
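A small illustration of why that matters (a sketch, assuming a recent BeautifulSoup with the bundled html.parser): given raw bytes, BeautifulSoup can read the charset declared inside the markup itself, whereas r.text decodes using a guess from the HTTP headers, which servers often get wrong.

```python
from bs4 import BeautifulSoup

# Raw bytes with the encoding declared inside the document itself:
raw = '<meta charset="utf-8"/><a href="/in">Índia</a>'.encode('utf-8')

# BeautifulSoup's encoding detection sees the <meta> declaration in the
# bytes and decodes correctly, non-ASCII text included:
soup = BeautifulSoup(raw, 'html.parser')
print(soup.a.get_text())  # Índia
```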

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content)
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
 <a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
 <a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
 <a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
 <a href="/news/world/asia/india/">India</a>,
 <a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
 <a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
 <a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
 <a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
 <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]

Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we had to use the lambda search because a text regular-expression search would not have found that element: the contained text (Special report: India Direct) is not the only element in the tag, so a string match never sees it.

A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means that the Court boost to India BJP chief headline text is not a direct child of the link tag.
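A minimal reconstruction of that failure mode (the img src here is made up for illustration): a string/regex filter only inspects a tag's .string, which is None as soon as the tag has more than one child, while get_text() joins all nested text.

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="/news/world-radio-and-tv-15386555">'
        '<img src="report.jpg"/>Special report: India Direct</a>')
soup = BeautifulSoup(html, 'html.parser')

# The <a> has two children (an <img> and the text), so tag.string is None
# and a string/regex search never gets to see "India":
print(soup.find_all('a', string=re.compile('india', re.I)))  # []

# get_text() joins all nested strings, so the tag-function approach matches:
matches = soup.find_all(lambda t: t.name == 'a' and
                                  'india' in t.get_text().lower())
print(len(matches))  # 1
```

(In BeautifulSoup versions before 4.4 the string= argument was called text=.)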

You can extract just the links with:

from urllib.parse import urljoin

result_links = [urljoin(url, tag['href']) for tag in results]

where all relative URLs are resolved relative to the original URL:

>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30647504',
 'http://www.bbc.com/news/world-asia-india-30640444',
 'http://www.bbc.com/news/world-asia-india-30640436',
 'http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30630274',
 'http://www.bbc.com/news/world-asia-india-30632852',
 'http://www.bbc.com/sport/0/cricket/30632182',
 'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
 'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
