ascii to latin1


Problem description

Hi,

I'm developing a Django-based intranet web server that has a search page.

Data contained in the database is mixed. Some of the words are
accented, some are not but should be. This is because the
collection of data began a long time ago, when ASCII was the only way to go.

The problem is that users have to search more than once for some words,
because the searched word may or may not be accented. If we consider
that some expressions can have several letters that can be accented, the
search effort is too much.

I've searched the net for some kind of solution but couldn't find one. I've
only found solutions for the opposite problem.

Example:
if the word searched is 'televisão', I want a search by any of
'televisao', 'televisão' or even 'télévisao' (this last one doesn't
exist in Portuguese) to be successful.

So, instead of only one search, several will be used.

Is there anything already coded, or will I have to do it all
myself?
Luis P. Mendes

Solution

Luis P. Mendes wrote:

Example:
if the word searched is 'televisão', I want a search by any of
'televisao', 'televisão' or even 'télévisao' (this last one doesn't
exist in Portuguese) to be successful.



The ICU library has the capability to transliterate strings via certain
rulesets. One such ruleset would transliterate all of the above to 'televisao'.
That transliteration could act as a normalization step akin to stemming.

There are one or two Python bindings out there. Google for PyICU. I don't recall
if it exposes the transliteration API or not.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco


Luis P. Mendes:

I'm developing a Django-based intranet web server that has a search page.

Data contained in the database is mixed. Some of the words are
accented, some are not but should be. This is because the
collection of data began a long time ago, when ASCII was the only way to go.

The problem is that users have to search more than once for some words,
because the searched word may or may not be accented. If we consider
that some expressions can have several letters that can be accented, the
search effort is too much.



I guess the best solution is to index all data in ASCII. That is, convert
a field to ASCII (from accented character to its unaccented constituent)
and index that.

Then, on a search, you also need to unaccent the search phrase, and match
it against the asciified index.

--
René Pijlman
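
The index-and-match idea in the reply above can be sketched roughly as follows. This is only an illustration, not code from the thread: it uses an in-memory SQLite database from the standard library rather than Django, and the table name `entry`, the column `word_key`, and the helpers `build_index` and `search` are hypothetical names chosen for the example. The unaccenting step assumes a Unicode-decomposition approach.

```python
import sqlite3
import unicodedata


def search_key(text):
    """Return text with combining marks (accents) removed."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def build_index(conn, words):
    """Store each word next to its unaccented key and index the key column."""
    # Hypothetical schema: the original word plus its ASCII-folded key.
    conn.execute("CREATE TABLE entry (word TEXT, word_key TEXT)")
    conn.execute("CREATE INDEX entry_key_idx ON entry (word_key)")
    conn.executemany(
        "INSERT INTO entry (word, word_key) VALUES (?, ?)",
        [(w, search_key(w)) for w in words],
    )


def search(conn, query):
    """Unaccent the query the same way, then match it against the key column."""
    rows = conn.execute(
        "SELECT word FROM entry WHERE word_key = ? ORDER BY word",
        (search_key(query),),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, ["televisão", "televisao", "rádio"])
print(search(conn, "télévisao"))  # finds both stored spellings
```

Because both the stored data and the query pass through the same `search_key` function, a single lookup covers every accented and unaccented spelling of a word.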


Luis P. Mendes wrote:

Hi,

I'm developing a Django-based intranet web server that has a search page.

Data contained in the database is mixed. Some of the words are
accented, some are not but should be. This is because the
collection of data began a long time ago, when ASCII was the only way to go.

The problem is that users have to search more than once for some words,
because the searched word may or may not be accented. If we consider
that some expressions can have several letters that can be accented, the
search effort is too much.

I've searched the net for some kind of solution but couldn't find one. I've
only found solutions for the opposite problem.

Example:
if the word searched is 'televisão', I want a search by any of
'televisao', 'televisão' or even 'télévisao' (this last one doesn't
exist in Portuguese) to be successful.

So, instead of only one search, several will be used.

Is there anything already coded, or will I have to do it all
myself?



You need to convert from latin1 to ASCII, not from ASCII to latin1. The
function below does that. Then you need to build the database index not on
the latin1 text but on the ASCII text. After that, convert the user input to
ASCII and search.

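import unicodedata

def search_key(s):
    # Decompose accented characters into base letter + combining mark (NFD).
    de_str = unicodedata.normalize("NFD", s)
    # Drop the combining marks (Unicode categories starting with 'M').
    return ''.join(cp for cp in de_str
                   if not unicodedata.category(cp).startswith('M'))

print(search_key(u"televisão"))
print(search_key(u"télévisao"))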

===== Result:
televisao
televisao
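
One caveat with the decomposition approach above: letters such as 'ø', 'ß' or 'æ' have no canonical decomposition, so NFD leaves them unchanged and the resulting key is not pure ASCII. The sketch below shows one possible way to handle this, together with lowercasing so the match is also case-insensitive. The function name `ascii_search_key` and the `FALLBACK` table are hypothetical and not part of the original answer, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical, deliberately incomplete fallback table for letters that
# have no canonical decomposition and therefore survive NFD unchanged.
FALLBACK = {
    "ø": "o", "Ø": "O",
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
    "đ": "d", "Đ": "D",
    "ł": "l", "Ł": "L",
}


def ascii_search_key(text):
    """Strip combining marks, apply the fallback table, and lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    pieces = []
    for ch in decomposed:
        if unicodedata.category(ch).startswith("M"):
            continue  # drop the accent itself
        pieces.append(FALLBACK.get(ch, ch))
    return "".join(pieces).lower()


print(ascii_search_key("Televisão"))
print(ascii_search_key("Søren"))
```

For Portuguese data alone the original `search_key` is probably enough, since the accented Portuguese letters all decompose under NFD. The fallback table only matters if the database also holds names or words from other languages.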

