Python: Searching for Unicode string in HTML with index/find returns wrong position

Problem description


I am trying to parse the number of results out of the HTML returned by a search query, but when I use find()/index() it seems to return the wrong position. The string I am searching for contains an accented character, so I search for it as a Unicode string.

A snippet of the HTML code being parsed:

<div id="WPaging_total">
  Aproximádamente 37 resultados.
</div>

and I search for it like this:

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)#len('Aproxim\xe1damente ')==16
print html[str_start+16:str_end] #works by changing 16 to 24

The print statement returns:

damente 37

When the expected result is:

37

It seems str_start doesn't point at the beginning of the string I am searching for, but 8 positions before it.

print html[str_start:str_start+5]

Outputs:

l">

The problem is hard to replicate though because it doesn't happen when using the code snippet, only when searching inside the entire HTML string. I could simply change str_start+16 to str_start+24 to get it working as intended, however that doesn't help me understand the problem. Is it a Unicode issue? Hopefully someone can shed some light on the issue.

Thank you.

LINK: http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1

SAMPLE CODE:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]

Solution

Your problem ultimately boils down to the fact that in Python 2.x, the str type represents a sequence of bytes while the unicode type represents a sequence of characters. Because one character can be encoded by multiple bytes, that means that the length of a unicode-type representation of a string may differ from the length of a str-type representation of the same string, and, in the same way, an index on a unicode representation of the string may point to a different part of the text than the same index on the str representation.
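A minimal sketch of that length and index difference, written so that it runs on both Python 2 and Python 3. The sample text comes from the snippet in the question; the variable names are only illustrative.

```python
from __future__ import print_function

# Text from the question's HTML snippet, as a sequence of characters.
text = u'Aproxim\xe1damente 37 resultados.'
# The same text as the sequence of bytes a server would send in UTF-8.
data = text.encode('utf-8')

print(len(text))  # number of characters
print(len(data))  # number of bytes; larger, because the accented "a" takes 2 bytes

# The same slice indices select different content in each representation.
print(repr(text[8:15]))
print(repr(data[8:15]))
```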

What's happening is that when you do str_start = html.index(u'Aproxim\xe1damente '), Python implicitly decodes the html byte string to unicode using the interpreter's default encoding, which on your system appears to be utf-8. (The stock default in Python 2 is ASCII, which is why on my PC I simply get a UnicodeDecodeError when I try to execute that line; some setting relating to text encoding must be different on your machine.) Consequently, if str_start is n, then u'Aproxim\xe1damente ' starts at the nth character of the HTML. However, html itself is still a str, so when you later use n+16 as a slice index to get the content after the (n+16)th character, what you actually get is the content after the (n+16)th byte. In this case the two are not equivalent, because earlier parts of the page contain accented characters that take up 2 bytes each when encoded in utf-8.
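The mismatch can be reproduced offline with a sketch like the one below. The page content is made up: the eight accented characters in the title are an assumption chosen only to mimic the offset of 8 reported in the question, and the real page may differ. The code runs on both Python 2 and Python 3.

```python
from __future__ import print_function

# Hypothetical page: eight 2-byte characters appear before the target text.
page_text = (u'<html><title>' + u'\xe9' * 8 + u'</title>'
             u'<div id="WPaging_total">Aproxim\xe1damente 37 resultados.</div>')
page_bytes = page_text.encode('utf-8')

needle = u'Aproxim\xe1damente '

# Position counted in characters (what the implicit decode produced).
char_index = page_text.index(needle)
# Position counted in bytes (what slicing a byte string actually needs).
byte_index = page_bytes.index(needle.encode('utf-8'))
print(byte_index - char_index)  # bytes added by the earlier accented characters

# The bug: a character index used to slice the byte string.
end = page_bytes.find(b' resultados', char_index + len(needle))
wrong = page_bytes[char_index + len(needle):end]
print(repr(wrong))

# The fix: search and slice the same (decoded) representation.
start = char_index + len(needle)
right = page_text[start:page_text.find(u' resultados', start)]
print(repr(right))
```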

The best solution would be simply to convert the html to unicode when you receive it. This small modification to your sample code will do what you want with no errors or weird behaviour:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read().decode('utf-8')  # decode the raw bytes to unicode once, right after reading

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end] 
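As a possible refinement, the hard-coded 16 can be replaced by len() of the marker, so the offset stays correct if the search string changes. The helper below is a hypothetical sketch, not part of the original answer: extract_between is an invented name, and the sample bytes stand in for conn.read() on the assumption that the page is served as UTF-8. It runs on both Python 2 and Python 3.

```python
from __future__ import print_function


def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # skip past the marker itself
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Stand-in for conn.read(): raw UTF-8 bytes of the snippet from the question.
raw = u'<div id="WPaging_total">\n  Aproxim\xe1damente 37 resultados.\n</div>'.encode('utf-8')

# Decode once, then search and slice only the decoded text.
html = raw.decode('utf-8')
count = extract_between(html, u'Aproxim\xe1damente ', u' resultados')
print(count)
```

Returning None for a missing marker avoids the ValueError that index() would raise, at the cost of requiring the caller to check the result.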
