Count letters in a text in the Welsh language


Question

How do I count the letters in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch?

print(len('Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'))

Says 58

Well if it was that easy I wouldn't be asking you, now would I?!

Wikipedia says (https://en.wikipedia.org/wiki/Llanfairpwllgwyngyll#Placename_and_toponymy)

The long form of the name is the longest place name in the United Kingdom and one of the longest in the world at 58 characters (51 "letters" since "ch" and "ll" are digraphs, and are treated as single letters in the Welsh language).

So I want to count that and get the answer 51.

Okey dokey.

print(len(['Ll','a','n','f','a','i','r','p','w','ll','g','w','y','n','g','y','ll','g','o','g','e','r','y','ch','w','y','r','n','d','r','o','b','w','ll','ll','a','n','t','y','s','i','l','i','o','g','o','g','o','g','o','ch']))
51

Yeh but that's cheating, obviously I want to use the word as input, not the list.

Wikipedia also says that the digraphs in Welsh are ch, dd, ff, ng, ll, ph, rh, th

https://en.wikipedia.org/wiki/Welsh_orthography#Digraphs

So off we go. Let's add up the length and then take off the double counting.

word='Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'
count=len(word)
print('starting with count of',count)
for index in range(len(word)-1):
  substring=word[index]+word[index+1]
  if substring.lower() in ['ch','dd','ff','ng','ll','ph','rh','th']:
    print('taking off double counting of',substring)
    count=count-1
print(count)

This gets me this far

starting with count of 58
taking off double counting of Ll
taking off double counting of ll
taking off double counting of ng
taking off double counting of ll
taking off double counting of ch
taking off double counting of ll
taking off double counting of ll
taking off double counting of ll
taking off double counting of ch
49

It appears that I've subtracted too many then. I'm supposed to get 51. Now one problem is that with the llll it has found 3 lls and taken off three instead of two. So that's going to need to be fixed. (Must not overlap.)
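The overlap problem on its own can be sketched as a left-to-right scan that consumes both characters of a digraph before moving on. This is only an illustration, not the asker's code: the names `DIGRAPHS` and `count_non_overlapping` are made up here, and the sketch deliberately does nothing about the ng ambiguity.

```python
# Digraph list as quoted from Wikipedia above.
DIGRAPHS = {'ch', 'dd', 'ff', 'ng', 'll', 'ph', 'rh', 'th'}


def count_non_overlapping(word):
    """Count letters, treating each digraph as one letter.

    Hypothetical helper: scans left to right and never reuses a
    character, so 'llll' is read as ll + ll.
    """
    count = 0
    index = 0
    while index < len(word):
        pair = word[index:index + 2].lower()
        if pair in DIGRAPHS:
            index += 2  # consume both characters of the digraph
        else:
            index += 1  # ordinary single letter
        count += 1
    return count


print(count_non_overlapping('llll'))  # 2
print(count_non_overlapping(
    'Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'))  # 50
```

With the overlap fixed the count is 50, not 49, so there is still one subtraction too many.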

And then there's another problem. The ng. Wikipedia didn't say anything about there being a letter "ng" in the name, but it's listed as one of the digraphs on the page I quoted above.

Wikipedia gives us some more clue here: "additional information may be needed to distinguish a genuine digraph from a juxtaposition of letters". And it gives the example of "llongyfarch" where the ng is just a "juxtaposition of letters", and "llong" where it is a digraph.

So it seems that 'Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch' is one of those words where the -ng- is just a "juxtaposition of letters".

And obviously there's no way that the computer can know that. So I'm going to have to give it that "additional information" that Wikipedia talks about.

So anyways, I decided to look in an online dictionary http://geiriadur.ac.uk/gpc/gpc.html and you can see that if you look up llongyfarch (the example from Wikipedia that has the "juxtaposition of letters") it displays it with a vertical line between the n and the g but if you look up "llong" then it doesn't do this.

So I've decided okay what we need to do is provide the additional information by putting a | in the input string like it does in the dictionary, just so that the algorithm knows that the ng bit is really two letters. But obviously I don't want the | itself to be counted as a letter.

So now I've got these inputs:

word='llong'
ANSWER NEEDS TO BE 3 (ll o ng)

word='llon|gyfarch'
ANSWER NEEDS TO BE 9 (ll o n g y f a r ch)

word='Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
ANSWER NEEDS TO BE 51 (Ll a n f a i r p w ll g w y n g y ll g o g e r y ch w y r n d r o b w ll ll a n t y s i l i o g o g o g o ch)

and still this list of digraphs:

['ch','dd','ff','ng','ll','ph','rh','th']

and the rules are going to be:

  1. ignore case

  2. if you see a digraph then count it as 1

  3. work from left to right so that llll is ll + ll, not l + ll + l

  4. if you see a | don't count it, but you can't ignore it completely, it is there to stop ng being a digraph

and I want it to count it as 51 and to do it for the right reasons, not just fluke it.

Now I am getting 51 but it is fluking it because it is counting the | as a letter (1 too high), and then it is taking off one too many with the llll (1 too low) - ERRORS CANCEL OUT

It is getting llong right (3).

It is getting llon|gyfarch wrong (10) - counting the | again

How can I fix it the right way?

Solution

Like many problems to do with strings, this can be done in a simple way with a regex.

>>> word = 'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
>>> import re
>>> pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)
>>> len(pattern.findall(word))
51
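To check that this reaches 51 for the right reasons, it can help to look at the matched tokens rather than only their count. The sketch below uses the same pattern; the wrapper name `welsh_letters` is an assumption added for illustration. `findall` returns non-overlapping matches scanning left to right, the digraph alternatives are tried before the single-letter class, and `|` is never matched, so it is not counted but still keeps n and g apart.

```python
import re

# Same pattern as above: digraphs first, then any single letter.
pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)


def welsh_letters(word):
    """Return the list of Welsh letters in word (hypothetical helper)."""
    return pattern.findall(word)


for word in ['llong',
             'llon|gyfarch',
             'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch']:
    letters = welsh_letters(word)
    print(len(letters), letters)
```

The first two inputs give 3 (`ll o ng`) and 9 (`ll o n g y f a r ch`), matching the answers the question asks for, and the full place name gives 51.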

The character class [^\W\d_] matches word characters that are not digits or underscores, i.e. letters, including those with diacritics.
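A small check of the diacritics claim, assuming Python 3 (where `str` patterns are Unicode-aware by default) and assuming the accented letters are stored as single precomposed code points. The example words and the name `count_welsh_letters` are not from the original answer.

```python
import re

pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)


def count_welsh_letters(word):
    """Count Welsh letters in word (hypothetical helper)."""
    return len(pattern.findall(word))


# \u0177 is y with circumflex, \u0175 is w with circumflex (precomposed).
print(count_welsh_letters('t\u0177'))    # 2
print(count_welsh_letters('d\u0175r'))   # 3
# Digits, underscores and the | marker are not letters.
print(count_welsh_letters('ll_1|ch'))    # 2
```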
