How to scrape href with Python 3.5 and BeautifulSoup
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
Here's my code:
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, ..., None, None] back. I need a list of all the hrefs from that class.
Any ideas?
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href instances. As I see in your link, a lot of the href tags have # inside them. You can avoid these with a simple regex for proper links, or just ignore the # symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some trash links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need.
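For example, a sketch of such a regex filter, assuming Kickstarter project links start with /projects/ (an assumption worth checking against the live page) and using a made-up sample of scraped hrefs:

```python
import re

# Hypothetical sample of scraped hrefs, for illustration only
hrefs = ["#", "/discover?ref=nav", "/projects/1234/some-project?ref=discovery"]

# Assumption: project links begin with /projects/
project_re = re.compile(r"^/projects/")
project_links = [h for h in hrefs if project_re.match(h)]
print(project_links)  # -> ['/projects/1234/some-project?ref=discovery']
```

The same pattern can be applied directly inside the list comprehension over soup.find_all('a', href=True).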
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
    print(i.a['href'])
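As an aside on why the question's code returned [None, None, ...]: BeautifulSoup's attribute-style access (tag.href) searches for a child tag named href rather than reading the HTML attribute, so it comes back None; HTML attributes need dictionary-style access (tag['href']). A minimal illustration with a made-up snippet mirroring the question's markup:

```python
from bs4 import BeautifulSoup

# Made-up HTML mirroring the structure scraped in the question
html = '<h6 class="project-title"><a href="/projects/42/demo">Demo</a></h6>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("h6", {"class": "project-title"}).find("a")

print(link.href)     # None: this looks for a child <href> tag, not the attribute
print(link["href"])  # /projects/42/demo
```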