Python Web Crawlers and "getting" html source code
Problem description
My brother wanted me to write a web crawler in Python (self-taught). I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few problems:

1. The httplib.HTTPConnection and request concepts are new to me, and I don't understand whether they download an HTML script like a cookie or an instance. If you do both of those, do you get the source of a website page? And what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have.

It would also be nice if you could tell me your opinion of 2.7 and 3.1.
Solution
Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2, which lets you fetch web resources comfortably. Example:

    import urllib2

    # Open the URL and read the raw HTML of the page
    response = urllib2.urlopen("http://google.de")
    page_source = response.read()
For parsing the HTML, have a look at BeautifulSoup.

BTW, what exactly do you want to do? You wrote:

"Just for background, I need to download a page and replace any img with ones I have"
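
In case it helps, here is a rough sketch of that goal (point every img on a page at an image of your own). It assumes Python 3 and uses only the standard library's html.parser, so that it runs without installing anything; BeautifulSoup would make this shorter. The names ImgSrcRewriter and replace_images are made up for this example, and the sketch only rewrites existing src attributes.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit an HTML document with every <img> src replaced (illustrative helper)."""

    def __init__(self, new_src):
        # convert_charrefs=False leaves entities such as &amp; in the text untouched
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _rebuild_img(self, attrs, closing):
        # Rebuild the tag from its parsed attributes, swapping in the new src
        parts = ["img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._rebuild_img(attrs, ">"))
        else:
            self.out.append(self.get_starttag_text())  # copy other tags verbatim

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._rebuild_img(attrs, " />"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_images(page_source, new_src):
    """Return page_source with every existing img src set to new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(page_source)
    parser.close()
    return "".join(parser.out)
```

Calling replace_images(page_source, "my_picture.png") on the downloaded text returns the modified page as a string, which you can then write to a file or serve yourself. The file name here is only a placeholder.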
Edit: It's 2014 now, and most of the important libraries have been ported, so you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.
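
If you do go with Python 3, note that urllib2 no longer exists under that name; its functionality lives in urllib.request. A minimal sketch of the same fetch follows. The helper name fetch_page_source and the UTF-8 fallback are my own assumptions, not something the library prescribes.

```python
import urllib.request


def fetch_page_source(url, default_charset="utf-8"):
    """Download a URL and return its body decoded to text (illustrative helper)."""
    with urllib.request.urlopen(url) as response:
        raw_bytes = response.read()
        # Use the charset declared in the Content-Type header, if there is one;
        # the UTF-8 fallback is an assumption made for this sketch
        charset = response.headers.get_content_charset() or default_charset
    return raw_bytes.decode(charset, errors="replace")
```

For example, fetch_page_source("http://google.de") should return the page's HTML as a str, where the Python 2 version returned raw bytes.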