How to solve a problem parsing an HTML file with Cyrillic symbols?
Problem description
I have an HTML file with span elements:
<html>
<body>
<span class="one">Text</span>some text</br>
<span class="two">Привет</span>Текст на русском</br>
</body>
</html>
To get some of the text, I use:
# -*- coding:cp1251 -*-
import lxml
from lxml import html
filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
tree = html.fromstring(source)
fread.close()
tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK
print "name: ",tags[0].text
print "value: ",tags[0].tail
tags = tree.xpath('//span[@class="two" and text()="Привет"]') #This False
print "name: ",tags[0].text
print "value: ",tags[0].tail
This prints:
name: Text
value: some text
Traceback: ... in line `tags = tree.xpath('//span[@class="two" and text()="Привет"]')`
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes
How can I solve this problem?
Recommended answer
lxml
(As observed, this is a bit dodgy across system encodings: it apparently doesn't work properly on Windows XP, though it did on Linux.)
I got it to work by decoding the source string with tree = html.fromstring(source.decode('utf-8')):
# -*- coding:cp1251 -*-
import lxml
from lxml import html
filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
tree = html.fromstring(source.decode('utf-8'))
fread.close()
tags = tree.xpath('//span[@class="one" and text()="Text"]') #This OK
print "name: ",tags[0].text
print "value: ",tags[0].tail
tags = tree.xpath(u'//span[@class="two" and text()="Привет"]') #This is now OK too; the u prefix avoids the ValueError for a non-ASCII byte string
print "name: ",tags[0].text
print "value: ",tags[0].tail
This means that the actual tree consists entirely of unicode objects. If you only make the XPath parameter a unicode string, without decoding the source, it finds 0 matches.
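For readers on Python 3, the same idea can be sketched as follows. This is only an illustration, not the code from the answer: the raw bytes are built inline as a stand-in for the contents of t.html, the file is assumed to be saved as cp1251, and the invalid </br> tags are written as <br>.

```python
# Python 3 sketch; the original post targets Python 2.
from lxml import html

# Stand-in for the bytes read from t.html (assumption: saved as cp1251).
raw = (
    '<html><body>'
    '<span class="one">Text</span>some text<br>'
    '<span class="two">Привет</span>Текст на русском<br>'
    '</body></html>'
).encode('cp1251')

# Decode with the file's actual encoding before parsing,
# so the tree holds text rather than undecoded bytes.
tree = html.fromstring(raw.decode('cp1251'))

# Python 3 string literals are already Unicode, so no u'' prefix is needed.
tags = tree.xpath('//span[@class="two" and text()="Привет"]')
name, value = tags[0].text, tags[0].tail
print("name:", name)
print("value:", value)
```

Decoding before parsing and querying with a text string keeps both sides of the XPath comparison in Unicode, which is the condition the answer describes.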
I prefer to use BeautifulSoup for this sort of thing anyway. Here is my interactive session; I saved the file as cp1251.
>>> from BeautifulSoup import BeautifulSoup
>>> filename = '/tmp/cyrillic'
>>> fread = open(filename, 'r')
>>> source = fread.read()
>>> source # Scary
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n'
>>> source = source.decode('cp1251') # Let's try getting this right.
>>> source
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n'
>>> soup = BeautifulSoup(source)
>>> soup # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning.
<html>
<body>
<span class="one">Text</span>some text
<span class="two">Привет</span>Текст на русском
</body>
</html>
>>> soup.find('span', 'one').findNextSibling(text=True)
u'some text'
>>> soup.find('span', 'two').findNextSibling(text=True) # This looks a bit daunting ...
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c'
>>> print _ # ... but it's not, really. Just Unicode chars.
Текст на русском
>>> # Then you may also wish to get things by text:
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True)
Текст на русском
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation.
Finally, it may be worth trying source.decode('cp1251') instead of source.decode('utf-8') when you read the file from the filesystem. lxml may actually work then.
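A small standard-library sketch of why the choice of codec matters. It assumes the file really is saved as cp1251; the temporary file and the variable names are hypothetical stand-ins, not part of the original session.

```python
import io
import os
import tempfile

text = u'<span class="two">Привет</span>Текст на русском'

# Write a stand-in file encoded as cp1251 (hypothetical temporary path).
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(text.encode('cp1251'))

# Read the raw bytes back, as open(filename, 'r').read() does in Python 2.
with open(path, 'rb') as f:
    raw = f.read()

# Decoding cp1251 bytes as UTF-8 fails on the first Cyrillic byte pair.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec the file was saved in recovers the text.
decoded = raw.decode('cp1251')

# Alternatively, let io.open decode while reading (works on Python 2 and 3).
with io.open(path, encoding='cp1251') as f:
    from_file = f.read()

os.remove(path)
print(utf8_ok, decoded == text, from_file == text)
```

If the file were actually saved as UTF-8, the same pattern would apply with the codec names swapped; the point is only that the codec passed to decode should match how the file was written.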