How to scrape href with Python 3.5 and BeautifulSoup


Problem description

I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.

This is my code:

#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")


#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)

I get [None, None, ..., None, None] back. I need a list of all the href values from the elements with that class.

Any ideas?

Solution

Try something like this:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

soup = BeautifulSoup(thepage, "html.parser")

project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)

This will return every href on the page. Looking at your link, many of the href attributes are just #. You can filter these out with a simple regex that matches only proper links, or just skip the # values:

project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]

This will still give you some unwanted links such as /discover?ref=nav, so if you want to narrow the list down, use a regex that matches only the links you need.
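As a rough sketch of narrowing the list down with a regex: find_all accepts a compiled pattern as an attribute filter, so only anchors whose href matches are returned. The inline HTML and the "/projects/" prefix below are assumptions made for illustration, not the real Kickstarter markup, so adjust the pattern to whatever the page actually contains.

```python
import re
from bs4 import BeautifulSoup

# Small inline sample standing in for the downloaded page (hypothetical markup).
sample_html = """
<a href="#">menu</a>
<a href="/discover?ref=nav">Discover</a>
<a href="/projects/1234/some-project?ref=category">Some project</a>
<a href="/projects/5678/another-project">Another project</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Assumption: project links start with "/projects/"; change this to fit the real page.
project_pattern = re.compile(r"^/projects/")

# Passing the compiled pattern as href= keeps only anchors whose href matches it.
project_href = [a["href"] for a in soup.find_all("a", href=project_pattern)]
print(project_href)
```

With the real page, pass thepage to BeautifulSoup instead of sample_html and keep the rest unchanged.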

EDIT:

To solve the problem you mentioned in the comments:

soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
    print(i.a['href'])
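For reference, the list of None values in the question most likely comes from writing .href on a tag. In BeautifulSoup, dot access on a tag looks for a child tag with that name rather than an attribute, so it returns None when no such child exists. Attributes are read with subscript access or .get(). The snippet below is a minimal illustration on made-up markup (the h6 element and its link are assumptions, not copied from the real page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup resembling one project title from the question.
sample_html = (
    '<h6 class="project-title">'
    '<a href="/projects/1234/some-project">Some project</a>'
    "</h6>"
)
soup = BeautifulSoup(sample_html, "html.parser")
link = soup.find("h6", {"class": "project-title"}).a

# Dot access searches for a child tag named "href"; there is none, so this is None.
child_lookup = link.href

# Subscript access reads the attribute and raises KeyError if it is missing.
attr_value = link["href"]

# .get() reads the attribute and returns None instead of raising if it is missing.
safe_value = link.get("href")
missing_value = link.get("title")

print(child_lookup, attr_value, safe_value, missing_value)
```

Using .get() is the safer choice when some of the matched tags may not carry the attribute at all.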
