Python Web Crawlers and "getting" html source code


Question

So my brother wants me to write a web crawler in Python (self-taught). I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems. First, httplib.HTTPConnection and the concept of a request are new to me, and I don't understand whether it downloads something like an HTML script or a cookie, or an instance. If you do both of those, do you get the source of a web page? And what terms would I need to know to modify the page and return the modified version?

Just for background, I need to download a page and replace any img with ones I have.

And it would be nice if you guys could tell me your opinion of 2.7 and 3.1.

Recommended answer

Use Python 2.7; it currently has more third-party libraries. (See below.)

I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably. Example:

import urllib2  # Python 2 standard library

# Open the URL and read the response body, i.e. the page's HTML source
response = urllib2.urlopen("http://google.de")
page_source = response.read()

To parse the code, have a look at BeautifulSoup.
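BeautifulSoup is a third-party package, so to give a feel for what "parsing" means for a crawler without installing anything, here is a rough sketch that uses only the Python 3 standard library (html.parser and urllib.parse). It assumes Python 3 rather than 2.7, and the names LinkCollector and extract_links are made up for illustration; BeautifulSoup would likely make this shorter and more forgiving of broken markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))


def extract_links(page_source, base_url):
    """Return the links found in page_source, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(page_source)
    collector.close()
    return collector.links
```

A crawler would typically feed each returned link back into its download step, keeping a set of already visited URLs so it does not loop forever.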

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have
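If that really is the whole task, one possible approach (a sketch, not the only way) is to walk the HTML and rewrite the src attribute of every img tag while copying everything else through. The version below assumes Python 3 and uses only the standard library; ImgReplacer and replace_images are hypothetical names, and the sketch only rewrites an existing src attribute and drops processing instructions, so treat it as a starting point.

```python
from html import escape
from html.parser import HTMLParser


class ImgReplacer(HTMLParser):
    """Re-emits the HTML it is fed, pointing every <img> at new_src."""

    def __init__(self, new_src):
        # convert_charrefs=False so entities like &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _rebuild_img(self, attrs, self_closing=False):
        parts = ["<img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src  # swap in our own image
            if value is None:
                parts.append(" " + name)  # attribute without a value
            else:
                parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        parts.append(" />" if self_closing else ">")
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._rebuild_img(attrs))
        else:
            self.out.append(self.get_starttag_text())  # copy tag verbatim

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._rebuild_img(attrs, self_closing=True))
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_images(page_source, new_src):
    """Return page_source with every img src replaced by new_src."""
    replacer = ImgReplacer(new_src)
    replacer.feed(page_source)
    replacer.close()
    return "".join(replacer.out)
```

Calling replace_images(page_source, "my_picture.png") on the downloaded source would give you the modified page as a string, which you could then write to a file or serve yourself. The file name here is just a placeholder.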

It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.
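For anyone following the Python 3 advice: urllib2 does not exist there, and its functionality lives in urllib.request instead. A minimal sketch of the earlier download step in Python 3 might look like the following. The helper name fetch_page_source, the 10-second timeout, and the UTF-8 fallback are my own assumptions, not anything the libraries require. With python-requests installed, the rough equivalent is requests.get(url).text.

```python
from urllib.request import urlopen


def fetch_page_source(url, timeout=10):
    """Return the body at url as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Usage would be something like page_source = fetch_page_source("http://google.de"), after which the string can be handed to whatever parser you prefer.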
