CS50 Speller Segmentation Fault Issue During Misspelled Words
Question
My code is causing a segmentation fault somewhere. I'm not entirely sure how. I don't think it's an issue with load, as the program begins listing off misspelled words before abruptly stopping and giving me the seg fault error.
// Implements a dictionary's functionality
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "dictionary.h"

#define HASHTABLE_SIZE 80000

unsigned int count = 0;

// Represents a node in a hash table
typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
}
node;

// Number of buckets in hash table
const unsigned int N = HASHTABLE_SIZE;

// Hash table
node *table[N];

// Returns true if word is in dictionary else false
bool check(const char *word)
{
    node *tmp = NULL;
    int ch = hash(word);
    int len = strlen(word);
    char w[len+1];
    for(int i = 0; i<len; i++)
    {
        w[i] = tolower(word[i]);
    }
    w[len] = '\0';
    tmp = table[ch];
    while(tmp->next != NULL)
    {
        if(strcmp(tmp->word, w) == 0)
        {
            return true;
        }
        if(tmp->next != NULL)
        {
            tmp = tmp->next;
        }
    }
    return false;
}

// Hashes word to a number
unsigned int hash(const char *word)
{
    int len = strlen(word);
    char key[len+1];
    for(int p = 0; p < len; p++)
    {
        key[p] = tolower(word[p]);
    }
    key[len] = '\0';
    unsigned int hash = 0;
    for (int i = 0, n = strlen(key); i < n; i++)
        hash = (hash << 2) ^ key[i];
    return hash % HASHTABLE_SIZE;
}

// Loads dictionary into memory, returning true if successful else false
bool load(const char *dictionary)
{
    FILE *file = fopen(dictionary, "r");
    if(file == NULL)
    {
        printf("could not open file.\n");
        fclose(file);
        return false;
    }
    char temp[LENGTH + 1];
    while(fscanf(file, "%s", temp) != EOF)
    {
        node *tmp = malloc(sizeof(node));
        strcpy(tmp->word, temp);
        unsigned int code = hash(temp);
        count++;
        if(table[code] == NULL)
        {
            table[code] = tmp;
        }
        else if(table[code] != NULL)
        {
            node *pointer = table[code];
            while(pointer->next != NULL)
            {
                tmp->next = table[code];
                table[code] = tmp;
            }
            //YOU ARE HERE
        }
    }
    return true;
}

// Returns number of words in dictionary if loaded else 0 if not yet loaded
unsigned int size(void)
{
    node* tmp = NULL;
    for(int i=0; i< N; i++ )
    {
        if(table[i]!=NULL)
        {
            tmp = table[i];
            while(tmp->next != NULL)
            {
                tmp = tmp->next;
                count++;
            }
        }
    }
    return count;
}

// Unloads dictionary from memory, returning true if successful else false
bool unload(void)
{
    node *tmp = NULL;
    node *del;
    for(int i = 0; i < N; i++)
    {
        tmp = table[i];
        while(tmp->next != NULL)
        {
            del = tmp;
            if(tmp->next != NULL)
            {
                tmp = tmp->next;
            }
            free(del);
        }
        return true;
    }
    return false;
}
When running the program, I receive this:
~/pset5/speller/ $ ./speller dictionaries/large keys/her.txt
MISSPELLED WORDS
MISSPELLED
WORDS
Jonze
INT
Segmentation fault
So it appears to be properly loading the dictionary and the text.
Answer
You have a few misconceptions with CS50 Speller. Specifically, the requirement of:
Your implementation of check must be case-insensitive. In other words, if foo is in dictionary, then check should return true given any capitalization thereof; none of foo, foO, fOo, fOO, Foo, FoO, FOo, and FOO should be considered misspelled.
What this means is that when you load the dictionary into the hash table, you must convert the dictionary word to lower-case before computing the hash. Otherwise, when you check(word) and convert a copy of word to lower-case, you would never compute the same hash if the original dictionary word was not converted to lower-case before hashing.
Your check(word) function isn't converting to lower-case before computing the hash either. This will cause you to miss the dictionary word which was hashed with its lower-case form. You segfault as well because you fail to check that tmp is not NULL before dereferencing tmp->next. But, you were on the right track with the basics of how to check a hash table otherwise.
Since you will convert to lower-case both before hashing and storing the dictionary word, and before hashing a copy of the word to check, it would make sense to use a simple string-to-lower function. Then you can reduce your check() function to:
// string to lower
char *str2lower (char *str)
{
    if (!str) return NULL;

    char *p = str;
    for (; *p; p++)
        if (isupper((unsigned char)*p))
            *p ^= ('A' ^ 'a');

    return str;
}

// Returns true if word is in dictionary else false
bool check(const char *word)
{
    char lcword[LENGTH+1];      /* make a copy of word from txt to convert to lc */
    size_t len = strlen (word); /* get length of word */
    unsigned h;

    if (len > LENGTH) {         /* validate word will fit */
        fprintf (stderr, "error: check() '%s' exceeds LENGTH.\n", word);
        return false;
    }

    memcpy (lcword, word, len+1);   /* copy word to lower-case word */
    h = hash (str2lower(lcword));   /* convert to lower-case then hash */

    for (node *n = table[h]; n; n = n->next)    /* now loop over list nodes */
        if (strcmp (lcword, n->word) == 0)      /* compare lower-case words */
            return true;

    return false;
}
Next, though not discussed in the problem set, you should not skimp on hash-table size. There are 143091 words in dictionaries/large. Ideally, you want to keep the load-factor of your hash table below 0.6 (no more than 60% of your buckets filled) to minimize collisions. I haven't tested the actual load-factor for your table, but I wouldn't want anything less than N == 8000.
Update: I did check, and with N == 131072 your load-factor loading the large dictionary using lh_strhash() would be 0.665, which is getting to the point where you would want to re-hash, but for your purposes it should be fine. (Notably, all the additional storage doesn't improve the load or check times by more than a hundredth of a second, which indicates the lists are reasonably efficient even when handling the additional collisions.)
Hash Function
You can experiment with several, but using /usr/share/dict/words (which is where large comes from) I have found the openSSL lh_strhash() hash function provides the minimum number of collisions while being quite efficient. You can implement your hash() function as a wrapper and try a number of different hashes quickly that way, e.g.
// openSSL lh_strhash
uint32_t lh_strhash (const char *s)
{
    uint64_t ret = 0, v;
    int64_t n = 0x100;
    int32_t r;

    if (!s || !*s) return (ret);

    for (; *s; s++) {
        v = n | (*s);
        n += 0x100;
        r = (int32_t)((v >> 2) ^ v) & 0x0f;
        ret = (ret << r) | (ret >> (32 - r));
        ret &= 0xFFFFFFFFL;
        ret ^= v * v;
    }

    return ((ret >> 16) ^ ret);
}

// Hashes word to a number
unsigned int hash (const char *word)
{
    return lh_strhash (word) % N;
}
Your load() function suffers from the same failure to convert to lower-case before hashing. You can't possibly permute and store every capitalization of every word in the dictionary in your hash table. Since you must perform a case-insensitive check(), it only makes sense to convert (to either upper or lower -- be consistent) before hashing and storing.
Further, there is no need to iterate to the last node of a bucket's list before inserting a new entry (that is quite inefficient). Instead, simply use a method called "forward-chaining" to insert the new node at the bucket address, moving what was there to the ->next pointer before setting the bucket to the address of the new node. That gives O(1) time insertions. For example:
// Loads dictionary into memory, returning true if successful else false
bool load (const char *dictionary)
{
    char word[MAXC];
    FILE *fp = fopen (dictionary, "r");

    if (!fp) {
        perror ("fopen-dictionary");
        return false;
    }

    while (fgets (word, MAXC, fp)) {
        unsigned h;
        size_t len;
        node *htnode = NULL;

        word[(len = strcspn(word, " \r\n"))] = 0;   /* trim \n or terminate at ' ' */
        if (len > LENGTH) {
            fprintf (stderr, "error: word '%s' exceeds LENGTH.\n", word);
            continue;
        }

        if (!(htnode = malloc (sizeof *htnode))) {
            perror ("malloc-htnode");
            return false;
        }

        h = hash (str2lower(word));
        memcpy (htnode->word, word, len+1); /* copy word to htnode->word */
        htnode->next = table[h];            /* insert node at table[h] */
        table[h] = htnode;                  /* use forward-chaining for list */
        htsize++;                           /* increment table size */
    }

    fclose (fp);

    return htsize > 0;
}
As for the hash-table size, just add a global to dictionary.c and increment it as done in load() above (that is the htsize variable). That makes the size() function as simple as:
// Hash table size
unsigned htsize;
...
// Returns number of words in dictionary if loaded else 0 if not yet loaded
unsigned int size (void)
{
    return htsize;
}
Your unload() is a bit convoluted, and will fail to free the allocated memory if there is a single node at table[i]. Instead, you can shorten your logic and accomplish what you need with:
// Unloads dictionary from memory, returning true if successful else false
bool unload(void)
{
    for (int i = 0; i < N; i++) {
        node *n = table[i];
        while (n) {
            node *victim = n;
            n = n->next;
            free (victim);
        }
    }
    htsize = 0;

    return true;
}
Example Use / diff Against the Keys
Creating a test/ directory and then redirecting output to files in the test/ directory will allow you to compare the results with the expected results:
$ ./bin/speller texts/bible.txt > test/bible.txt
The keys/ directory contains the output from the "staff" code. This implementation matches the output of the keys, but includes the timing information as well (that's not something you can change -- it is hardcoded in speller.c, which you cannot modify per the restrictions on the exercise):
$ diff -uNb keys/bible.txt test/bible.txt
--- keys/bible.txt 2019-10-08 22:35:16.000000000 -0500
+++ test/bible.txt 2020-09-01 02:09:31.559728835 -0500
@@ -33446,3 +33446,9 @@
WORDS MISSPELLED: 33441
WORDS IN DICTIONARY: 143091
WORDS IN TEXT: 799460
+TIME IN load: 0.03
+TIME IN check: 0.51
+TIME IN size: 0.00
+TIME IN unload: 0.01
+TIME IN TOTAL: 0.55
+
(Note: the -b option allows diff to "ignore changes in the amount of white space", so it will ignore changes in line endings, like DOS "\r\n" versus Linux '\n' line endings.)
The only differences between the code output and the files in the keys/ directory are the lines marked with a '+' sign in the first column (the last six lines), which show the timing information.
Memory Use / Error Check
All memory has been properly freed:
$ valgrind ./bin/speller texts/lalaland.txt > test/lalaland.txt
==10174== Memcheck, a memory error detector
==10174== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==10174== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==10174== Command: ./bin/speller texts/lalaland.txt
==10174==
==10174==
==10174== HEAP SUMMARY:
==10174== in use at exit: 0 bytes in 0 blocks
==10174== total heap usage: 143,096 allocs, 143,096 frees, 8,026,488 bytes allocated
==10174==
==10174== All heap blocks were freed -- no leaks are possible
==10174==
==10174== For counts of detected and suppressed errors, rerun with: -v
==10174== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Look things over and let me know if you have further questions.
If you are struggling with the details, this is the complete dictionary.c used, and I have added a loadfactor() function at the end so you can compute the load-factor for varying values of N if you are interested:
// Implements a dictionary's functionality
#include "dictionary.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <ctype.h>

// Represents a node in a hash table
typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
}
node;

// Number of buckets in hash table
#define N 131072
// Max Characters Per-Line of Input
#define MAXC 1024

// Hash table
node *table[N];
// Hash table size
unsigned htsize;

// string to lower
char *str2lower (char *str)
{
    if (!str) return NULL;

    char *p = str;
    for (; *p; p++)
        if (isupper((unsigned char)*p))
            *p ^= ('A' ^ 'a');

    return str;
}

// Returns true if word is in dictionary else false
bool check(const char *word)
{
    char lcword[LENGTH+1];      /* make a copy of word from txt to convert to lc */
    size_t len = strlen (word); /* get length of word */
    unsigned h;

    if (len > LENGTH) {         /* validate word will fit */
        fprintf (stderr, "error: check() '%s' exceeds LENGTH.\n", word);
        return false;
    }

    memcpy (lcword, word, len+1);   /* copy word to lower-case word */
    h = hash (str2lower(lcword));   /* convert to lower-case then hash */

    for (node *n = table[h]; n; n = n->next)    /* now loop over list nodes */
        if (strcmp (lcword, n->word) == 0)      /* compare lower-case words */
            return true;

    return false;
}

// openSSL lh_strhash
uint32_t lh_strhash (const char *s)
{
    uint64_t ret = 0, v;
    int64_t n = 0x100;
    int32_t r;

    if (!s || !*s) return (ret);

    for (; *s; s++) {
        v = n | (*s);
        n += 0x100;
        r = (int32_t)((v >> 2) ^ v) & 0x0f;
        ret = (ret << r) | (ret >> (32 - r));
        ret &= 0xFFFFFFFFL;
        ret ^= v * v;
    }

    return ((ret >> 16) ^ ret);
}

// Hashes word to a number
unsigned int hash (const char *word)
{
    return lh_strhash (word) % N;
}

// Loads dictionary into memory, returning true if successful else false
bool load (const char *dictionary)
{
    char word[MAXC];
    FILE *fp = fopen (dictionary, "r");

    if (!fp) {
        perror ("fopen-dictionary");
        return false;
    }

    while (fgets (word, MAXC, fp)) {
        unsigned h;
        size_t len;
        node *htnode = NULL;

        word[(len = strcspn(word, " \r\n"))] = 0;   /* trim \n or terminate at ' ' */
        if (len > LENGTH) {
            fprintf (stderr, "error: word '%s' exceeds LENGTH.\n", word);
            continue;
        }

        if (!(htnode = malloc (sizeof *htnode))) {
            perror ("malloc-htnode");
            return false;
        }

        h = hash (str2lower(word));
        memcpy (htnode->word, word, len+1); /* copy word to htnode->word */
        htnode->next = table[h];            /* insert node at table[h] */
        table[h] = htnode;                  /* use forward-chaining for list */
        htsize++;                           /* increment table size */
    }

    fclose (fp);

    return htsize > 0;
}

// Returns number of words in dictionary if loaded else 0 if not yet loaded
unsigned int size (void)
{
    return htsize;
}

// Unloads dictionary from memory, returning true if successful else false
bool unload(void)
{
    for (int i = 0; i < N; i++) {
        node *n = table[i];
        while (n) {
            node *victim = n;
            n = n->next;
            free (victim);
        }
    }
    htsize = 0;

    return true;
}

float loadfactor (void)
{
    unsigned filled = 0;

    for (int i = 0; i < N; i++)
        if (table[i])
            filled++;

    return (float)filled / N;
}