BeautifulSoup,你把我的 HTML 放在哪里? [英] BeautifulSoup, where are you putting my HTML?

查看:35
本文介绍了BeautifulSoup,你把我的 HTML 放在哪里?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用 BS4 和 python2.7.这是我的代码的开始(感谢 root):

I'm using BS4 with python2.7. Here's the start of my code (Thanks root):

from bs4 import BeautifulSoup
import urllib2

f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)

当我打印 html 时,其内容与在 chrome 中查看的页面来源相同.然而,当我打印汤时,它会切掉整个身体,留下这个(头部标签的内容):

When I print html, its contents are the same as the source of the page viewed in chrome. When I print soup however, it cuts out all the entire body and leaves me with this (the contents of the head tag):

<!DOCTYPE html>

<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
        var webRoot = 'http://yify-torrents.com/';
        var IsLoggedIn = 0  </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html> 

我哪里出错了?!

推荐答案

我遇到了同样的问题,这解决了我的问题:

I had the same problem and this solved my problem:

soup = BeautifulSoup(html, 'html5lib')

您需要安装 html5lib:

You need to install html5lib:

pip install html5lib

easy_install html5lib

您可以在此处阅读有关 Beautiful Soup 的不同解析器(优缺点)的更多信息:

You can read more about different parsers (pros and cons) for Beautiful Soup here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

这篇关于BeautifulSoup,你把我的 HTML 放在哪里?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆