How to extract html links with a matching word from a website using python
Question
I have a URL, say http://www.bbc.com/news/world/asia/. Just on this page I want to extract all the links that contain India or INDIA or india (the match should be case insensitive).
If I click any of the output links it should take me to the corresponding page; for example, "India shock over Dhoni retirement" and "India fog continues to cause chaos" are a few lines that contain india. If I click these links I should be redirected to http://www.bbc.com/news/world-asia-india-30640436 and http://www.bbc.com/news/world-asia-india-30630274 respectively.
import re
import requests
from bs4 import BeautifulSoup, SoupStrainer

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
only_links = SoupStrainer('a', href=re.compile('india'))
print(only_links)
I wrote this very basic, minimal code in Python 3.4.2.
Answer
You need to search for the word india in the displayed text. To do this you'll need a custom function instead:
from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           'india' in tag.get_text().lower())
results = soup.find_all(india_links)
The india_links lambda finds all tags that are <a> links with an href attribute and contain india (case insensitive) somewhere in the displayed text.
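The same filter can be written as a named function and tried out on a small hand-written snippet; this is a minimal sketch, and the markup below is made up to stand in for the live BBC page:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the live BBC page (hypothetical snippet).
html = """
<a href="/news/world-asia-india-30640436">India shock over Dhoni retirement</a>
<a href="/news/world-asia-12345678">Asia markets rally</a>
<a name="top">India section (no href)</a>
"""
soup = BeautifulSoup(html, "html.parser")

def india_links(tag):
    # Same three tests as the lambda: is it an <a> tag, does it carry
    # an href attribute, and does its rendered text mention "india"
    # in any letter case?
    return (tag.name == 'a' and
            'href' in tag.attrs and
            'india' in tag.get_text().lower())

results = soup.find_all(india_links)
print([tag['href'] for tag in results])
```

Only the first link qualifies: the second has no india in its text, and the third carries a name attribute but no href.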
Note that I used the requests response object's .content attribute; leave decoding to BeautifulSoup!
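To see why passing bytes matters, here is a small sketch on a hand-written document (the markup and bytes are hypothetical): given the raw bytes that r.content would deliver, BeautifulSoup can consult the page's own charset declaration rather than trusting whatever encoding requests guessed for r.text.

```python
from bs4 import BeautifulSoup

# A Latin-1 encoded page that declares its own charset (hypothetical
# document); requests' r.content would hand us exactly these raw bytes.
raw = (b'<html><head><meta charset="iso-8859-1"></head>'
       b'<body><a href="/cafe">caf\xe9</a></body></html>')

# Given bytes, BeautifulSoup reads the <meta charset> declaration and
# decodes 0xE9 as the intended accented character.
soup = BeautifulSoup(raw, "html.parser")
print(soup.a.get_text())
```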
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content, "html.parser")
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
<a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
<a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
<a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
<a href="/news/world/asia/india/">India</a>,
<a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
<a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
<a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
<a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
<a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]
Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we had to use the lambda search, because a search with a text regular expression would not have found that element: the contained text (Special report: India Direct) is not the only element in the tag and thus would not be matched.
A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means that the Court boost to India BJP chief headline text is not a direct child element of the link tag.
You can extract just the links with:
from urllib.parse import urljoin
result_links = [urljoin(url, tag['href']) for tag in results]
where all relative URLs are resolved relative to the original URL:
>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30647504',
'http://www.bbc.com/news/world-asia-india-30640444',
'http://www.bbc.com/news/world-asia-india-30640436',
'http://www.bbc.com/news/world/asia/india/',
'http://www.bbc.com/news/world-asia-india-30630274',
'http://www.bbc.com/news/world-asia-india-30632852',
'http://www.bbc.com/sport/0/cricket/30632182',
'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']