Extracting title from HTML not working

Views: 194 | Posted: 2018/6/22 20:43:59 | Tags: python, html, python-3.x, beautifulsoup


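Question

I'm running text analytics on a large number of novels downloaded from Gutenberg. I want to keep as much metadata as I can, so I'm downloading the books as HTML and converting them to text later. My problem is extracting the metadata from the HTML files, in particular the title of each novel.

So far I'm using BeautifulSoup to generate the text files and extract the title. For an example text, Jane Eyre, my code is as follows: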

    from bs4 import BeautifulSoup

    ### Opens html file
    html = open("filepath/Jane_Eyre.htm")

    ### Cleans html file
    soup = BeautifulSoup(html, 'lxml')

    title_data = soup.title.string

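However, when I do this, I get the following error:

    AttributeError: 'NoneType' object has no attribute 'string'

The title tag is definitely there in the original HTML; when I open the file, this is what I see in the first few lines: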

    <!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
    <title>Jane Eyre</title>
    <style type="text/css">

Does anyone have any suggestions as to what I'm doing wrong here?

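Solution

The following approach works for extracting titles from the HTML of Gutenberg ebooks. It fetches a subject listing page and collects the text of every span element whose class is title.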

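    >>> from urllib.request import Request, urlopen
    >>> from bs4 import BeautifulSoup
    >>> url = 'http://www.gutenberg.org/ebooks/subject/99'
    >>> req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    >>> webpage = urlopen(req).read()
    >>> soup = BeautifulSoup(webpage, "html.parser")
    >>> required = soup.find_all("span", {"class": "title"})
    >>> x1 = []
    >>> for i in required:
    ...     x1.append(i.get_text())
    ...
    >>> for i in x1:
    ...     print(i)
    ...
    Sort Alphabetically
    Sort by Release Date
    Great Expectations
    Jane Eyre: An Autobiography
    Les Misérables
    Oliver Twist
    Anne of Green Gables
    David Copperfield
    The Secret Garden
    Anne of the Island
    Anne of Avonlea
    A Little Princess
    Kim
    Anne's House of Dreams
    Heidi
    The Mysteries of Udolpho
    Of Human Bondage
    The Secret Garden
    Daddy-Long-Legs
    Les misérables Tome I: Fantine (French)
    Jane Eyre
    Rose in Bloom
    Further Chronicles of Avonlea
    The Children of the New Forest
    Oliver Twist; or, The Parish Boy's Progress. Illustrated
    The Personal History of David Copperfield
    Heidi
    >>>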
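
The session above scrapes a listing page rather than a locally saved book. For a local file like the one in the question, the error arises because `soup.title` evaluates to `None` whenever the parsed tree contains no `<title>` element, so accessing `.string` on it fails. The following is a minimal sketch that makes that case explicit instead of raising. It does not diagnose why the title went missing in the asker's setup. The names `extract_title` and `sample` are hypothetical, and the sketch assumes the built-in `"html.parser"` backend rather than `lxml`, purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup


def extract_title(markup, parser="html.parser"):
    """Return the text of the first <title> tag, or None if there is none.

    `markup` may be a string, bytes, or an open file handle.
    (Hypothetical helper, not part of BeautifulSoup.)
    """
    soup = BeautifulSoup(markup, parser)
    title_tag = soup.title  # None when the parsed tree has no <title>
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)


# Inline stand-in for the first lines of the file shown in the question.
sample = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
</head><body></body></html>"""

print(extract_title(sample))
print(extract_title("<p>no title here</p>"))
```

To apply this to a saved book, one option is to pass an open handle, for example `with open("filepath/Jane_Eyre.htm", "rb") as fh: title = extract_title(fh)`. Opening in binary mode leaves encoding detection to BeautifulSoup. If the helper returns `None` with one parser, trying another (for example `parser="lxml"`, if installed) can show whether the behaviour is parser-specific.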

