此代码如何在字符串中找到重复的字符? [英] How does this code find duplicate characters in a string?

查看:132
本文介绍了此代码如何在字符串中找到重复的字符?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

示例:给定一个字符串(在本示例中为char *word),您要查找重复的字符(字节).

Example: Given a string (in this example char *word) you want to find duplicate characters (bytes).

我想知道是否有人可以向我解释以下工作原理:

I wanted to know if someone can explain to me how the following works:

    int table[256] = {0};
    for (int i = 0; i < len; i++)  table[word[i]]++;

之后,您可以检查另一个循环是否重复:

After that you can check with another loop if duplicate or not like:

    for (int i = 0; i < len; i++) if (table[word[i]] > 1) { … }

这是如何工作的?我不明白为什么表中重复的字符> 1?

How does this work? I don't understand why duplicated chars are > 1 in the table?

推荐答案

将我的评论转换为半连贯答案.

第一个循环计算每个字节值在0..255范围内的出现次数(对于找到的每个字节值,它在字节计数中增加一个);重新扫描会发现字符串中出现多个字节的字节值-重复项的定义.

The first loop counts the number of occurrences of each byte value in the range 0..255 (it adds one to the count of the byte for each byte value it finds); the rescan finds the byte values with more than one occurrence in the string — the very definition of a duplicate.

循环都假设字符串是用单字节代码集而不是像UTF-8这样的多字节代码集编码的.他们还假定普通char是无符号类型(不常见;大多数x86平台具有带符号的普通char类型),或者字符串中没有任何值,其中word[i]为负(无重音字符).

The loops both assume that the string is encoded in a single-byte code set and not a multi-byte code set like UTF-8, for example. They also assume that either plain char is an unsigned type (not common; most x86 platforms have a signed plain char type) or that there are no values in the string where word[i] is negative (no accented characters).

因此,为了安全起见,代码应为:

For safety, therefore, the code should be:

for (int i = 0; i < len; i++)
    table[(unsigned char)word[i]]++;

for (int i = 0; i < len; i++)
{
    if (table[(unsigned char)word[i]] > 1)
    {
        …
    }
}

您可以使用word[i] & 0xFF代替演员表;它甚至具有更少的字符,但是我认为演员表更清晰(并且字符数是红色鲱鱼,请不要追逐它).请注意,无论普通char是带符号类型还是无符号类型,这两种变体(广播和掩码)都可以正常工作(尽管代码确实做出了最合理的假设,即CHAR_BIT是8而不是更大的数字—它不能再小).

You could use word[i] & 0xFF in place of the cast; it even has fewer characters, but I'd argue that the cast is clearer (and the character count is a red herring — please don't chase it). Note that both of these variants (cast and mask) work correctly regardless of whether the plain char is a signed or unsigned type (though the code does make the mostly reasonable assumption that CHAR_BIT is 8 and not some larger number — it can't be smaller).

但是为什么将它添加到找到的每个char值中,而不仅仅是添加到表中当前位置的那个char值中?当我打印出内存位置时,我从示例字符串"abcda"中看到,两个a在内存中是相同的.我认为在表数组中它们必须位于不同(但连续)的位置;为什么一样?

But why does it add to every char value it finds and not to just the one on the current position in the table? When I print out the memory location I see that from an example string "abcda" that both a are the same in memory. I thought in the table array they must be different (but continuous) locations; why is it the same?

当您在字符串中找到'a'时,会有一个与该字符关联的字节值,通常为97.因此,当计数代码读取word[i]word[i]的内容为'a'时,就好像数组中有97个(实际上,数组中有97个),因此第一个循环递增table[97] —读取abcda中的第一个a时将其从0更改为1,读取第二个a时将其从1更改为2.

When you find 'a' in a character string, there's a byte value, usually 97, associated with that character. So when the counting code reads word[i] and the content of word[i] is 'a', it's as if there was 97 in the array (in fact, there is 97 in the array), so the first loop increments table[97] — changing it from 0 to 1 when the first a in abcda is read, and from 1 to 2 when the second a is read.

当一切都说完之后,字符就是数字(但是有很多话要说,做到这一点就可以在一切都说完之前发生;幸运的是,您不必经历所有的大部分过程)时间).

Characters are just numbers, when all is said and done (but there's a heck of a lot of saying and doing that can happen before all is said and done; fortunately, you don't have to go through all that most of the time).

这篇关于此代码如何在字符串中找到重复的字符?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆