Extracting title from HTML not working


Problem description


I'm performing some text analytics on a large number of novels downloaded from Gutenberg. I want to keep as much metadata as I can, so I'm downloading the books as HTML and converting them to text later. My problem is extracting the metadata from the HTML files, in particular the title of each novel.

As of now, I'm using BeautifulSoup to generate the text files and extract the title. For an example text of Jane Eyre, my code is as follows:

from bs4 import BeautifulSoup

### Opens html file
html = open("filepath/Jane_Eyre.htm")

### Cleans html file
soup = BeautifulSoup(html, 'lxml')

title_data = soup.title.string

However, when I do this, I get the following error:

AttributeError: 'NoneType' object has no attribute 'string'

The title tag is definitely there in the original html; when I open the file this is what I see in the first few lines:

<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
<style type="text/css">

Does anyone have any suggestion as to what I'm doing wrong here?

Solution

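The following approach extracts titles from a Gutenberg listing page: it downloads the page at the URL below and collects the text of every span element whose class is "title". Note that the first two entries in the output ("Sort Alphabetically" and "Sort by Release Date") are page controls that share that class, not book titles.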

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.gutenberg.org/ebooks/subject/99'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required = soup.find_all("span", {"class": "title"})
>>> x1 = []
>>> for i in required:
...     x1.append(i.get_text())
...
>>> for i in x1:
...     print(i)
...
Sort Alphabetically
Sort by Release Date
Great Expectations
Jane Eyre: An Autobiography
Les Misérables
Oliver Twist
Anne of Green Gables
David Copperfield
The Secret Garden
Anne of the Island
Anne of Avonlea
A Little Princess
Kim
Anne's House of Dreams
Heidi
The Mysteries of Udolpho
Of Human Bondage
The Secret Garden
Daddy-Long-Legs
Les misérables Tome I: Fantine (French)
Jane Eyre
Rose in Bloom
Further Chronicles of Avonlea
The Children of the New Forest
Oliver Twist; or, The Parish Boy's Progress. Illustrated
The Personal History of David Copperfield
Heidi
>>>
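
The session above reads titles from a listing page, whereas the question is about the <title> tag of a single saved book file. As a cross-check that does not depend on BeautifulSoup or lxml, the title can also be pulled out with the standard library's html.parser module. This is a swapped-in technique, not the approach from the answer, and only a minimal sketch: the names TitleParser and extract_title are hypothetical, and the sample string is a shortened copy of the header shown in the question. It returns None instead of raising when no title is found, which makes the "missing title" case explicit.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while we are between <title> and </title>
        self._done = False       # True once the first title has been closed
        self._parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is matched as well.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None      # None when no (non-empty) title was found


def extract_title(html_text):
    """Return the document title as a string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Shortened copy of the file header quoted in the question.
sample = """<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
<title>Jane Eyre</title>
</head>
<body><p>Chapter I</p></body>
</html>"""

print(extract_title(sample))
print(extract_title("<html><body>no title here</body></html>"))
```

To try this on a saved book, read the file's contents into a string first and pass that string to extract_title. If this finds a title where soup.title is None, that points at how the file is being opened or parsed rather than at the HTML itself.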
