python: dictionary of words and wordforms


Problem description


I have the following problem: I created a dictionary (German) of words and their corresponding lemmas. For example: "Lagerbestände", "Lager-bestand"; "Wohnhäuser", "Wohn-haus"; "Bahnhof", "Bahn-hof"

I now have a text and I want to look up the lemma of every word in it. A word may appear that is not in the dictionary, such as "Restbestände". But we already know the lemma of "bestände". So I want to take the first part of the word, which is unknown in dicti, join it to the lemmatized second part, and print the result (or return it). Example: "Restbestände" --> "Rest-bestand". ("bestand" is taken from the lemma of "Lagerbestände")

I coded the following:

for limit in range(1, len(Word)): 
    for k, v in dicti.iteritems():
        if re.search('[\w]*'+Word[limit:], k, re.IGNORECASE) != None:
            if '-' in v:
                tmp = v.find('-')
                end = v[tmp:]
                end = re.sub(ur'[-]',"", end)
                Word = Word[:limit] + '-' + end

But I have two problems:

  1. "&#10" is printed at the end of every word. How can I avoid this?
  2. The second part of the word is sometimes incorrect, so there must be a logical error.

In any case, how would you solve this?

Solution
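On the second problem, a likely cause is visible in the loop from the question: it does not stop after the first match, it reassigns `Word` while still iterating over suffixes of it, and the pattern is not anchored at the end of the key, so a very short suffix can match in the middle of an unrelated entry. A minimal sketch of one alternative follows. It assumes Python 3 (the question's code is Python 2) and assumes that the first part of each dictionary entry appears unchanged in the surface form, which holds for the three examples but not for German compounds in general. The names `build_tail_index` and `lemmatize` are hypothetical.

```python
def build_tail_index(lemma_dict):
    """Map the surface form of each last component to its lemma.

    For "Lagerbestände" -> "Lager-bestand" this stores "bestände" -> "bestand".
    Assumption: the head ("Lager") appears unchanged at the start of the form.
    """
    tails = {}
    for form, lemma in lemma_dict.items():
        head, sep, last = lemma.rpartition("-")
        if not sep:
            continue  # lemma has no hyphen, so it is not a compound
        head = head.replace("-", "")
        if form.lower().startswith(head.lower()):
            tails[form[len(head):].lower()] = last
    return tails


def lemmatize(word, lemma_dict, tails):
    """Return the known lemma, or build one from a known last component."""
    if word in lemma_dict:
        return lemma_dict[word]
    # Longest suffix first; return on the first hit instead of looping on.
    for limit in range(1, len(word)):
        last = tails.get(word[limit:].lower())
        if last is not None:
            return word[:limit] + "-" + last
    return word  # nothing known about this word


dicti = {
    "Lagerbestände": "Lager-bestand",
    "Wohnhäuser": "Wohn-haus",
    "Bahnhof": "Bahn-hof",
}
tails = build_tail_index(dicti)
result = lemmatize("Restbestände", dicti, tails)
```

Building the index once and comparing whole components avoids scanning every dictionary entry with a regular expression for every candidate suffix.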

"&#10" is printed at the end of every word. How can I avoid this?

You must use Unicode everywhere in your script. Everywhere, everywhere, everywhere.

Also, Python's regex functions accept the re.UNICODE flag, which you should always set. German letters such as ä, ö, ü and ß are outside the ASCII set, so a regex can sometimes get confused, for instance when matching r'\w'.
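As a side note, `&#10;` is the numeric character reference for a line feed (U+000A), so one plausible explanation is that each word still carries the trailing newline from the file it was read from. The sketch below assumes Python 3, where `str` is already Unicode text, and assumes the words arrive one per line; `read_words` is a hypothetical helper, not part of the original script.

```python
import io


def read_words(stream):
    """Yield one word per line, without the trailing newline."""
    for line in stream:
        word = line.strip()  # drops "\n" and any surrounding whitespace
        if word:
            yield word


# io.StringIO stands in for a real file; with a file on disk one would
# open it in text mode with an explicit encoding, e.g. encoding="utf-8".
sample = io.StringIO("Lagerbestände\nRestbestände\n")
words = list(read_words(sample))
```

Decoding once at the input boundary and stripping there keeps every later step working on clean Unicode strings.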
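The effect of the flag can be seen in a small sketch. It is written for Python 3, which is an assumption: there, `str` patterns are Unicode-aware by default, so `re.UNICODE` is redundant and `re.ASCII` reproduces the ASCII-only behaviour that Python 2 shows when the flag is missing.

```python
import re
import unicodedata

# Normalise to NFC so that "ä" is a single code point rather than
# "a" followed by a combining diaeresis.
word = unicodedata.normalize("NFC", "Lagerbestände")

# ASCII-only matching: "ä" is not a word character, so the token splits.
ascii_tokens = re.findall(r"\w+", word, re.ASCII)

# Unicode-aware matching: "ä" counts as a word character.
unicode_tokens = re.findall(r"\w+", word, re.UNICODE)
```

With ASCII-only matching the word is split around the umlaut, which is the kind of confusion the answer warns about.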
