Scraping a page for URLs using Beautifulsoup


Question

I can scrape the page down to the headlines, no problem. The URLs are another story. They are fragments that get appended to the end of the base URL - I understand that. What do I need to do to pull the related URLs for storage, in the format base_url.scraped_fragment?

from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import csv
import MySQLdb
import re


html = urlopen("http://advances.sciencemag.org/")
soup = BeautifulSoup(html.read().decode('utf-8'), "lxml")
#links = soup.findAll("a","href")
headlines = soup.findAll("div", "highwire-cite-title media__headline__title")
for headline in headlines:
    text = headline.get_text()
    print text

Answer

First of all, there should be a space between the class names:

highwire-cite-title media__headline__title
               HERE^
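
As a side note, passing a multi-class string this way makes BeautifulSoup match the class attribute verbatim, so it silently finds nothing if the page emits the two classes in a different order. Below is a minimal sketch of a more tolerant alternative using a CSS selector via select(); the class names come from the question, everything else is illustrative:

import requests
from bs4 import BeautifulSoup

base_url = "http://advances.sciencemag.org"
soup = BeautifulSoup(requests.get(base_url).content, "lxml")

# select() matches elements carrying both classes, in any order,
# unlike the exact-string match of findAll("div", "class-a class-b").
for div in soup.select("div.highwire-cite-title.media__headline__title"):
    print(div.get_text(strip=True))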

Anyway, since you need the links, you should locate the a elements and use urljoin() to make absolute URLs:

from urlparse import urljoin

import requests
from bs4 import BeautifulSoup


base_url = "http://advances.sciencemag.org"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

headlines = soup.find_all(class_="highwire-cite-linked-title")
for headline in headlines:
    print(urljoin(base_url, headline["href"]))

Prints:

http://advances.sciencemag.org/content/2/4/e1600069
http://advances.sciencemag.org/content/2/4/e1501914
http://advances.sciencemag.org/content/2/4/e1501737
...
http://advances.sciencemag.org/content/2/2
http://advances.sciencemag.org/content/2/1
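
For what it's worth, the same approach carries over to Python 3, where urljoin() lives in urllib.parse. And since the question imports csv with storage in mind, here is a hedged sketch that pairs each headline with its absolute URL and writes both to a CSV file; the class name comes from the answer above, while the file name and column headers are only illustrative:

from urllib.parse import urljoin  # Python 3 home of urljoin

import csv
import requests
from bs4 import BeautifulSoup

base_url = "http://advances.sciencemag.org"
soup = BeautifulSoup(requests.get(base_url).content, "lxml")

# The matched elements are <a> tags, so the text is the headline and
# the href is the fragment to resolve against the base URL.
with open("headlines.csv", "w", newline="") as f:  # illustrative file name
    writer = csv.writer(f)
    writer.writerow(["headline", "url"])  # illustrative column names
    for link in soup.find_all("a", class_="highwire-cite-linked-title"):
        writer.writerow([link.get_text(strip=True),
                         urljoin(base_url, link["href"])])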

