How to scrape href with Python 3.5 and BeautifulSoup
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
Here's my code:
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, ..., None, None] back. I need a list of all the hrefs from that class.
Any ideas?
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return all the href instances. As I see in your link, a lot of the href tags have # inside them. You can avoid these with a simple regex for proper links, or just ignore the # symbols.
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some trash links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need.
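For example, a sketch of such a regex filter, assuming Kickstarter project links start with /projects/ (an assumption worth checking against the live page) and using a made-up sample of scraped hrefs:

```python
import re

# Hypothetical sample of scraped hrefs, for illustration only
hrefs = ["#", "/discover?ref=nav", "/projects/1234/some-project?ref=discovery"]

# Assumption: project links begin with /projects/
project_re = re.compile(r"^/projects/")
project_links = [h for h in hrefs if project_re.match(h)]
print(project_links)  # -> ['/projects/1234/some-project?ref=discovery']
```

The same pattern can be applied directly inside the list comprehension over soup.find_all('a', href=True).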
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
    print(i.a['href'])
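As an aside on why the question's code returned [None, None, ...]: BeautifulSoup's attribute-style access (tag.href) searches for a child tag named href rather than reading the HTML attribute, so it comes back None; HTML attributes need dictionary-style access (tag['href']). A minimal illustration with a made-up snippet mirroring the question's markup:

```python
from bs4 import BeautifulSoup

# Made-up HTML mirroring the structure scraped in the question
html = '<h6 class="project-title"><a href="/projects/42/demo">Demo</a></h6>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("h6", {"class": "project-title"}).find("a")

print(link.href)     # None: this looks for a child <href> tag, not the attribute
print(link["href"])  # /projects/42/demo
```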