Making a basic web scraper in Python with only built-in libraries

Problem description

I'm learning Python and trying to make a web scraper without any 3rd party libraries, so that the process isn't simplified for me and I know what I am doing. I looked through several online resources, but all of them have left me confused about certain things.

The html looks something like this,

<html>
<head>...</head>
<body>
    *lots of other <div> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div> tags*
</body>
</html>

I want the scraper to extract the <div class = "want"...>*content*</div> and save that into a html file.

I have a very basic idea of how I need to go about this.

import urllib
from urllib import request
#import re
#from html.parser import HTMLParser

response = urllib.request.urlopen("http://website.com")
html = response.read()

# Somehow extract the wanted data from html into `data`

f = open('page.html', 'w')
f.write(data)
f.close()

Recommended answer

The standard library comes with a variety of Structured Markup Processing Tools, which you can use for parsing the HTML and then searching it to extract your div.

There's a whole lot of choices there. What do you use?

html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It's a very nice and very powerful API, and there's tons of documentation and sample code all over the web to get you started, and a lot of experts using it on a daily basis who can help you with your problems. If it turns out that etree can't parse your HTML, you will have to use something else… but try it first.
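If ElementTree does turn out to be too strict for a real page (unclosed tags such as <br> are enough to make it fail), a fallback built on html.parser could look roughly like the sketch below. The class name DivExtractor and the sample page are made up for illustration, and the code only copies the raw markup of the first matching div; it is a starting point, not a full HTML serializer. The rest of this answer sticks with ElementTree.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the raw markup of the first <div> whose class list contains wanted_class.

    Hypothetical helper for illustration; not part of the standard library.
    """

    def __init__(self, wanted_class):
        # convert_charrefs=False keeps entities such as &amp; as they were written
        super().__init__(convert_charrefs=False)
        self.wanted_class = wanted_class
        self.depth = 0      # nesting depth of <div> tags inside the target div
        self.parts = []     # pieces of markup collected so far
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and self.wanted_class in classes:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the nesting depth
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


# Made-up sample page; the unclosed <br> would not survive an XML parser
page = """<html><body>
<div class="other">skip me</div>
<div class = "want" style="font-size:12px">spam &amp; eggs<br>
<div class="subdiv1">inner</div>
</div>
<div class="other">skip me too</div>
</body></html>"""

parser = DivExtractor("want")
parser.feed(page)
parser.close()
data = "".join(parser.parts)
print(data)
```

Counting nested div tags is what lets the parser tell the closing tag of the wanted div apart from the closing tags of the divs inside it.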

For example, with a few minor fixes to your snipped HTML so it's actually valid, and so there's actually some text worth getting out of your div:

<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div /> tags*
</div>
</body>
</html>

You can use code like this (I'm assuming you know, or are willing to learn, XPath):

from xml.etree import ElementTree

tree = ElementTree.fromstring(page)  # page holds the HTML text shown above
mydiv = tree.find('.//div[@class="want"]')

Now you've got a reference to the div with class "want". You can get its direct text with this:

print(mydiv.text)
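Put together, the lookup might be sketched as follows. The sample page is a shortened stand-in for the cleaned-up HTML (the placeholder contents and the extra "other" div are made up), and the variable names direct_text, all_text and child_tags are only for illustration.

```python
from xml.etree import ElementTree

# Shortened, well-formed stand-in for the page; real pages are usually messier
page = """<html>
<head><title>demo</title></head>
<body>
<div class="other">not this one</div>
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class ="subform">a form</form>
<div class = "subdiv1" >first</div>
<div class = "subdiv2" >second</div>
</div>
</body>
</html>"""

tree = ElementTree.fromstring(page)
mydiv = tree.find('.//div[@class="want"]')

# find() returns None when nothing matches, so check before using the result
if mydiv is None:
    raise ValueError('no <div class="want"> in the page')

direct_text = mydiv.text.strip()            # text before the first child element
all_text = "".join(mydiv.itertext())        # text of the div and all its descendants
child_tags = [child.tag for child in mydiv] # direct children, in document order

print(direct_text)
print(child_tags)
```

Note that the [@class="want"] predicate compares the whole attribute value, so it would not match a div declared as class="want big".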

But if you want to extract the whole subtree, that's even easier:

data = ElementTree.tostring(mydiv, encoding='unicode')  # str rather than bytes, so f.write(data) works on a file opened with 'w'

If you want to wrap that up in a valid <html> and <body> and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using a simple tree API: you create a head and a body to put in the html, then stick the div in the body, then tostring the html, and that's about it.
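A minimal sketch of that wrapping step, assuming mydiv has already been located; here it is parsed from a small made-up fragment so the example runs on its own. The title text is likewise just an illustration, and page.html is the file name from the question.

```python
from xml.etree import ElementTree

# Stand-in for the div found earlier; parsed from a fragment to keep this self-contained
fragment = '<div class="want">spam spam spam<div class="subdiv1">first</div></div>'
mydiv = ElementTree.fromstring(fragment)

# Build the surrounding document with the tree API
html = ElementTree.Element("html")
head = ElementTree.SubElement(html, "head")
title = ElementTree.SubElement(head, "title")
title.text = "Extracted div"
body = ElementTree.SubElement(html, "body")
body.append(mydiv)

# encoding="unicode" returns a str instead of bytes;
# method="html" serializes with HTML rules rather than XML ones
data = ElementTree.tostring(html, encoding="unicode", method="html")

with open("page.html", "w", encoding="utf-8") as f:
    f.write(data)
```

To drop the div itself and keep only its contents, one option is to copy mydiv.text onto body.text and extend body with the children of mydiv instead of appending mydiv.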
