Is it possible to scrape a webpage without using third-party libraries in Python?

Problem Description

I am trying to understand how Beautiful Soup works in Python. I have used Beautiful Soup and lxml in the past, but now I am trying to implement a script that can read data from a given webpage without any third-party libraries. However, the xml module doesn't seem to have many options and throws many errors. Is there any other library with good documentation for reading data from a web page? I am not using these scripts on any particular website; I am just trying to read from public pages and news blogs.
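
For context, a rough standard-library-only attempt might look like the sketch below. It uses only urllib.request and html.parser (nothing third-party), and the URL is just a placeholder rather than any particular site:

# Minimal sketch using only the standard library: urllib.request fetches the
# page, html.parser.HTMLParser collects the visible text.
from html.parser import HTMLParser
from urllib.request import urlopen


class TextCollector(HTMLParser):
    """Accumulates page text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


# The URL below is only a placeholder.
html = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
parser = TextCollector()
parser.feed(html)
print("\n".join(parser.parts))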

Recommended Answer

Third party libraries exist to make your life easier. Yes, of course you could write a program without them (the authors of the libraries had to). However, why reinvent the wheel?

Your best options are BeautifulSoup and Scrapy. However, if you're having trouble with BeautifulSoup, I wouldn't try Scrapy.

Perhaps you can get by with just the plain text from the website?

from bs4 import BeautifulSoup

# html_doc holds the raw HTML of the page you want to read
soup = BeautifulSoup(html_doc, 'html.parser')
pagetxt = soup.get_text()  # drops the tags and keeps only the text
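
If html_doc is not already in hand, it could be fetched first with the standard library, for example (the URL is only a placeholder):

from urllib.request import urlopen

# Fetch the raw HTML before handing it to BeautifulSoup.
html_doc = urlopen("https://example.com/").read().decode("utf-8", errors="replace")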

Then you can be done with all external libraries and just work with plain text. However, if you need to do something more complicated, HTML is something you really should use a library to manipulate. There is just too much that can go wrong.
