BeautifulSoup, where are you putting my HTML?

Question

I'm using BS4 with Python 2.7. Here's the start of my code (thanks, root):

from bs4 import BeautifulSoup
import urllib2

# Fetch the page; no parser is named, so Beautiful Soup picks one itself
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
soup = BeautifulSoup(html)

When I print html, its contents are the same as the page source viewed in Chrome. When I print soup, however, it cuts out the entire body and leaves me with this (the contents of the head tag):

<!DOCTYPE html>

<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
        var webRoot = 'http://yify-torrents.com/';
        var IsLoggedIn = 0  </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html> 

Where am I going wrong?!

Recommended answer

I had the same problem, and telling Beautiful Soup to use the html5lib parser solved it:

soup = BeautifulSoup(html, 'html5lib')
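
As a minimal sketch of the fix above, the snippet below names the parser explicitly and then checks that a body element survived parsing. The helpers `make_soup` and `has_body` and the sample `page` string are hypothetical names introduced for illustration, not part of the original answer. The fallback to `html.parser` is only a convenience for when html5lib is not installed; on an old Python the built-in parser may be the very one that drops the body, so treat the fallback as an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="html5lib", fallback="html.parser"):
    """Parse html with the preferred parser, falling back if it is not installed."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser library is missing.
        return BeautifulSoup(html, fallback)


def has_body(soup):
    """Return True when the parsed tree contains a <body> element."""
    return soup.body is not None


# Hypothetical sample document standing in for the downloaded page.
page = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = make_soup(page)
print(has_body(soup))
```

If `has_body` returns False for a page whose raw source clearly contains a body, that suggests the parser gave up part way through, which is the symptom described in the question.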

You need to install html5lib:

pip install html5lib

or

easy_install html5lib

You can read more about different parsers (pros and cons) for Beautiful Soup here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
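
To see how the parsers differ in practice, one rough approach is to feed the same invalid fragment to every parser that happens to be installed and compare the resulting trees. The function name `compare_parsers` and the tuple `CANDIDATE_PARSERS` are hypothetical, chosen for this sketch; the fragment `<a></p>` is the one the Beautiful Soup documentation uses to illustrate parser differences. Which parsers appear in the result depends on what is installed locally.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parser names accepted by Beautiful Soup; "html.parser" ships with Python.
CANDIDATE_PARSERS = ("html5lib", "lxml", "html.parser")


def compare_parsers(markup, parsers=CANDIDATE_PARSERS):
    """Return {parser_name: serialized tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        results[name] = str(soup)
    return results


# An invalid fragment: a closing </p> with no matching opening tag.
results = compare_parsers("<a></p>")
for name, tree in results.items():
    print("%s: %s" % (name, tree))
```

Running the same comparison on the raw html from the question would be one way to find out which installed parser keeps the body intact.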