How to get rid of punctuation and check for spelling errors


Question

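  • Remove the punctuation
  • Split the text into words at newlines and spaces, then store the words in an array
  • Check the text file for spelling errors using the function in checkSpelling.m
  • Sum up the total number of errors in the article
  • A result with no suggestions is treated as no error, then return -1
  • If the total number of errors > 20, return 1
  • If the total number of errors <= 20, return -1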
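
The counting rule in the last three items can be sketched as follows. This is only an illustration: the sample `suggestion` list is made up, and it assumes that a 'no suggestion' result from checkSpelling is not counted as an error, which is one reading of the rule above.

```matlab
% Made-up results in the shape checkSpelling returns for each word:
% [] for a correct word, a cell array of suggestions for a misspelled one,
% or the string 'no suggestion'.
suggestion = {[], {'test', 'tset'}, 'no suggestion', []};

% Assumption: a 'no suggestion' result does not count as an error.
hasNoSuggestion = cellfun(@(s) ischar(s) && strcmp(s, 'no suggestion'), suggestion);
isError = ~cellfun(@isempty, suggestion) & ~hasNoSuggestion;
totalErrors = sum(isError);

% More than 20 errors gives 1, otherwise -1.
if totalErrors > 20
    feature19 = 1;
else
    feature19 = -1;
end
```

Note that the rule as stated uses > 20, while the f19.m script below compares with >= 20.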

I would like to check a paragraph for spelling errors, but I am having trouble getting rid of the punctuation. The problem may also have some other cause; it returns the error below:

My data2 file is:

checkSpelling.m

function suggestion = checkSpelling(word)

h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
  suggestion = []; %return empty if spelled correctly
else
  %If incorrect and there are suggestions, return them in a cell array
  if h.GetSpellingSuggestions(word).count > 0
      count = h.GetSpellingSuggestions(word).count;
      for i = 1:count
          suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
      end
  else
      %If incorrect but there are no suggestions, return this:
      suggestion = 'no suggestion';
  end

end
%Quit Word to release the server
h.Quit    

f19.m

for i = 1:1

data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')';  %read text file and store data in CharData
fclose(data2);

word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')

word_newLine = regexp(word_punctuation, '\n', 'split')

word = regexp(word_newLine, ' ', 'split')

[sizeData b] = size(word)

suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)

A19(i)=sum(~cellfun(@isempty,suggestion))

feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end

Recommended answer

Replace your regexprep call with:

word_punctuation=regexprep(CharData,'\W','\n');

Here \W matches every non-alphanumeric character (including spaces), and each match is replaced with a newline.

Then:

word = regexp(word_punctuation, '\n', 'split');

As you can see, you don't need to split on spaces (see above). But you can remove the empty cells:

word(cellfun(@isempty,word)) = [];
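
Put together, the three steps above can be tried on a small string. This is a minimal sketch; the sample text is made up and stands in for the contents read from the file.

```matlab
% Made-up sample text in place of the file contents.
CharData = sprintf('Hello, world!\nThis is a tset.');

% Replace every non-word character (punctuation, spaces, newlines) with a newline.
word_punctuation = regexprep(CharData, '\W', '\n');

% Split on newlines; consecutive separators leave empty cells behind.
word = regexp(word_punctuation, '\n', 'split');

% Drop the empty cells so only real words remain.
word(cellfun(@isempty, word)) = [];

disp(word)
```

The result is a 1-by-N cell array of strings, which is the shape that cellfun(@checkSpelling, word, 'UniformOutput', 0) expects.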

Everything worked for me. However, I have to say that your checkSpelling function is very slow. On every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
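
One way such a rewrite might look is sketched below. The function name checkSpellingAll is hypothetical, and the sketch assumes Windows with Microsoft Word installed, as the original function does. It opens Word once, checks every word in a loop, and quits once at the end.

```matlab
function suggestions = checkSpellingAll(words)
% checkSpellingAll  Hypothetical batch variant of checkSpelling.
%   words       - cell array of strings to check
%   suggestions - cell array of the same size; each cell holds [] for a
%                 correct word, a cell array of suggestions, or the
%                 string 'no suggestion'

h = actxserver('word.application');  % start Word once for all words
h.Documents.Add;                     % the spell checker needs an open document

suggestions = cell(size(words));     % cells default to [] (spelled correctly)
for k = 1:numel(words)
    if ~h.CheckSpelling(words{k})
        s = h.GetSpellingSuggestions(words{k});
        n = s.count;
        if n > 0
            % Misspelled with suggestions: collect them in a cell array
            list = cell(1, n);
            for i = 1:n
                list{i} = s.Item(i).get('name');
            end
            suggestions{k} = list;
        else
            % Misspelled but Word offers no suggestions
            suggestions{k} = 'no suggestion';
        end
    end
end

h.Quit;                              % release the server once, at the end
end
```

With this variant the cellfun call becomes a single call such as suggestion = checkSpellingAll(word);, and the later sum(~cellfun(@isempty,suggestion)) works unchanged.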

UPDATE

The only problem I see is that this also removes the quote character ' (as in I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or with any sequence of unused characters. Or, instead of \W, you can list all the non-alphanumeric characters to be removed in square brackets.
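
The temporary substitution could look like the sketch below. The sample text is made up, and the sketch assumes the input contains no underscores of its own, since any existing underscore would also come back as a quote.

```matlab
% Made-up sample text containing contractions.
CharData = 'I''m sure they don''t mind, really.';

% Protect the quote characters by turning them into underscores,
% which \W does not match.
protected = strrep(CharData, '''', '_');

% Replace the remaining non-word characters with newlines.
word_punctuation = regexprep(protected, '\W', '\n');

% Restore the quotes (assumes the text had no underscores of its own).
word_punctuation = strrep(word_punctuation, '_', '''');

% Split into words and drop the empty cells.
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];
```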

UPDATE 2

Another solution to the problem in the first UPDATE:

word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
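
As a quick check of this pattern, here is a minimal sketch on a made-up string. The character class keeps ASCII letters, digits, quotes and underscores, so anything else, including accented letters, acts as a separator.

```matlab
% Made-up sample text with contractions, punctuation and an underscore.
CharData = sprintf('I''m sure: they don''t mind!\nsnake_case stays whole.');

% Replace everything except letters, digits, quotes and underscores with a newline.
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');

% Split into words and drop the empty cells.
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];

disp(word)
```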
