How to Detect Whether Text Is Human-Readable?


Question

I am wondering whether there is a way to tell if a given text is human-readable. By human-readable I mean that it carries some meaning and is formatted like an article written by a person, or at least like output from translation software that is intended to be read by a human.

Here is the background: I recently built an app that lets users upload a short text to a database. Early in the deployment I noticed that some users consistently uploaded corrupted text because of an encoding problem. That problem was fixed later, but it left me wondering whether there is a way to catch non-human-readable text before serving it back to users.

Any advice will be appreciated. Covering other languages may make the scope too large, so for now let's limit the discussion to English only.

Recommended answer

You can try a language identification tool, or something similar.

Basically, you count the characters, or groups of characters (character n-grams), in the submitted text and compare their distribution with the distribution found in a collection of texts written in good English. (Make sure that this collection is representative of the expected input.)
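The comparison described above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses character bigrams, cosine similarity as the distance between distributions, and a tiny hard-coded `REFERENCE_TEXT` standing in for a real, representative corpus. The names `char_ngrams`, `cosine_similarity`, and `english_likeness` are hypothetical helpers, not part of any library.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumption: a placeholder reference sample. A real system would build the
# profile from a large collection of texts representative of expected input.
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog. this is a sample of "
    "ordinary english text, written so that common letter pairs such as "
    "th, he, in, er and an appear with roughly natural frequency. people "
    "read and write sentences like these every day."
)
REFERENCE_PROFILE = char_ngrams(REFERENCE_TEXT)


def english_likeness(text):
    """Score in [0, 1]; higher means the bigram distribution is closer to the reference."""
    return cosine_similarity(char_ngrams(text), REFERENCE_PROFILE)


if __name__ == "__main__":
    for sample in ("This is a short message written by a person.",
                   "xq9#zk@@vj!!qz~~"):
        print(f"{english_likeness(sample):.3f}  {sample!r}")
```

Any cutoff for rejecting a text would have to be tuned on real uploads; the score is only meaningful relative to the reference corpus it was built from.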

As a continuation of the n-gram approach, you might want to try a dictionary-based approach and check for the presence of 'stop words' (e.g. 'the', 'a', 'an', 'of') in the input text.
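A stop-word check along these lines might look like the sketch below. The word list goes beyond the four examples given in the answer, and both the list and the `threshold` value are assumptions chosen for illustration. `stop_word_ratio` and `looks_like_english` are hypothetical names.

```python
import re

# Assumption: a small illustrative list; 'the', 'a', 'an', 'of' come from the
# answer, the rest are added here and should be replaced by a fuller list.
STOP_WORDS = frozenset({
    "the", "a", "an", "of", "and", "to", "in", "is",
    "it", "that", "for", "on", "with", "as", "was", "be",
})


def stop_word_ratio(text):
    """Fraction of word tokens in the text that are known stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in STOP_WORDS for word in words) / len(words)


def looks_like_english(text, threshold=0.15):
    """Heuristic: treat text as readable English if enough tokens are stop words.

    The threshold is an assumed value and needs tuning on real data.
    """
    return stop_word_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_like_english("The cat sat on the mat"))
    print(looks_like_english("xq9 zk vj qz"))
```

Very short inputs can legitimately contain no stop words at all, so this check is probably best combined with the n-gram comparison rather than used alone.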
