Parsing Web Page's Search Results With Python


Question


I recently started working on a program in Python that lets the user easily conjugate any verb. To do this, I am using the urllib module to open the corresponding conjugation web page. For example, the verb "beber" would have the web page:

http://www.spanishdict.com/conjugate/beber


To open the page, I use the following Python code:

source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()


This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:

soup = BeautifulSoup(source)


I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:

<tr>
  <td class="verb-pronoun-row">
    yo
  </td>
  <td class="">
    bebo
  </td>
  <td class="">
    bebí
  </td>
  <td class="">
    bebía
  </td>
  <td class="">
    bebería
  </td>
  <td class="">
    beberé
  </td>
</tr>


What am I doing wrong? I am no expert at Python or web parsing in general, so it may be a simple problem.


Here is my complete code (I used the "++++++" line to separate the two outputs):

import urllib
from bs4 import BeautifulSoup

source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)

print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)

Answer


Your problem may be with encoding. I think that bs4 works with utf-8, while your machine has a different encoding set as its default (one that contains Spanish letters). So urllib requests the page in your default encoding; that's okay, so the data is there in the source and it even prints out fine, but when you pass it to the utf-8-based bs4, those characters are lost. Try looking for a way to set a different encoding in bs4 and, if possible, set it to your default. This is just a guess though, so take it with a grain of salt.


I recommend using regular expressions. I have used them for all my web crawlers. Whether this works for you depends on how dynamic the website is, but that problem exists even when you use bs4. You just write all your re patterns manually and let them do the magic. You would have to work with bs4 in a similar way when looking for the information you want.

