How to scrape all the home page text content of a website?


Problem description

So I am new to web scraping, and I want to scrape all the text content of only the home page.

This is my code, but it is not working correctly.

from bs4 import BeautifulSoup
import requests


# Fetch the home page and parse the returned HTML
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")

# find_all() with no arguments returns every tag in the parsed document
full_text = soup.find_all()

print(full_text)

When I print full_text it gives me a lot of HTML content, but not all of it: when I Ctrl+F for "traiteurcheminfaisant@hotmail.com", the email address that is on the home page (in the footer), it is not found in full_text.

Thanks for your help!

Answer

A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple GET request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with JavaScript.

If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load, and then parse the fully loaded source code. For this, the most common tool is Selenium. It can be a bit tricky to set up the first time, since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):

from bs4 import BeautifulSoup
from selenium import webdriver

import time

# Point executable_path at your local geckodriver binary
driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)  # crude wait for asynchronous content to finish loading

# Grab the fully rendered source, then close the browser
source = driver.page_source
driver.quit()

soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()

print(full_text)
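As a side note, once you have the fully loaded source, soup.get_text() is usually a simpler way to pull out just the visible text than find_all(), which returns tag objects rather than plain text. Here's a minimal sketch of that idea on an inline HTML snippet (the snippet is made-up stand-in markup, not the site's actual HTML):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fully loaded page source
html = """
<html><body>
  <h1>Traiteur Chemin Faisant</h1>
  <p>Welcome to our home page.</p>
  <footer>traiteurcheminfaisant@hotmail.com</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# get_text() collapses the document to its visible text;
# separator and strip keep the output one clean line per element
text = soup.get_text(separator="\n", strip=True)
print(text)
```

With the real page, you would pass driver.page_source in place of the inline snippet, and the footer email address should then show up in the extracted text.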

