Python Web Crawlers and "getting" html source code
Question
So my brother wanted me to write a web crawler in Python (self-taught). I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading through the Python library, but I have a few problems:
1. The httplib.HTTPConnection and request concepts are new to me, and I don't understand whether they download an HTML script, something like a cookie, or an instance. If you do both of those, do you get the source for a website page? And what terms would I need to know to modify the page and return the modified page?
Just for background, I need to download a page and replace any img with ones I have.
It would also be nice if you guys could tell me your opinion of 2.7 versus 3.1.
Recommended answer
Use Python 2.7; it has more third-party libraries at the moment. (See below.)
I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
To parse the HTML, have a look at BeautifulSoup.
BTW: what exactly do you want to do:
Just for background, I need to download a page and replace any img with ones I have
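For that goal, here is a rough sketch of swapping every img source in a page that has already been downloaded. It is only one possible approach and makes some assumptions: it is written for Python 3, it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs, and the names ImgSrcRewriter, replace_images and "mine.png" are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit HTML as parsed, except that every <img> gets a new src.

    Hypothetical helper class, not part of any library.
    """

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _emit_start(self, tag, attrs, closer):
        if tag != "img":
            # Not an image: copy the tag exactly as it appeared in the source.
            self.out.append(self.get_starttag_text())
            return
        parts = ["<img"]
        seen_src = False
        for name, value in attrs:
            if name == "src":
                value, seen_src = self.new_src, True
            if value is None:
                parts.append(" " + name)  # bare attribute without a value
            else:
                parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        if not seen_src:
            parts.append(' src="%s"' % escape(self.new_src, quote=True))
        self.out.append("".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " />")

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so end tags come out in lowercase.
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_images(html_text, new_src):
    """Return html_text with every <img> pointing at new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Passing the downloaded page source and the path of your own image to replace_images would return the modified page as a string, which you can then save or serve. Real-world pages can be messier than this handles (for example images set through CSS or scripts), so treat it as a starting point.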
It's 2014 now: most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.
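If you do move to Python 3, note that urllib2 no longer exists there; its urlopen lives in urllib.request. Below is a minimal sketch of the earlier example in Python 3, using only the standard library rather than python-requests. The function name fetch_page_source and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_page_source(url, default_charset="utf-8"):
    """Return the body of url as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Use the charset from the Content-Type header when one is sent;
        # otherwise fall back to default_charset (an assumption).
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")
```

Calling fetch_page_source("http://google.de") should give you the page's HTML source as a string, which is the same thing response.read() returned in the Python 2 example, only decoded to text.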