Check whether a TEXT file contains any printable characters
Question
Hi,
My application deals with thousands of TEXT files in a folder, and I am working on a filter feature that removes blank files.
As of now, I check the byte length of each file and treat the file as blank if the length is below a threshold [just to accommodate files that contain only blank spaces]. This worked as expected, but I have now come across a few files that contain only line breaks and no printable characters.
My users want those files filtered out as well, along with the blank files.
Please suggest the best way to check whether a file contains a printable character. Alternatively, is there any way to find the byte size of a TEXT file excluding blank space?
Please note that performance is my primary concern, as I am dealing with thousands of TEXT files.
Recommended answer
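I see no better way than scanning every file character by character (byte by byte), in search of either the first printing character or the end of the file. Because the scan stops at the first printing character, a non-blank file is usually ruled out after only a few bytes have been examined.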
To achieve good performance, I'd recommend using a low-level API, such as the Read method of a FileStream object, with a large buffer for block transfers. (Benchmark and tune the buffer size.)
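The block-read scan described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name BlankFileCheck, the method ContainsPrintable, the 64 KB default buffer, and the rule that a "printable" byte is anything above the space character (0x20) other than DEL (0x7F) are all hypothetical choices, not part of the original answer. It also assumes ASCII, another single-byte encoding, or UTF-8; a UTF-16 file or a file with a byte-order mark would need different handling.

```csharp
using System;
using System.IO;

public static class BlankFileCheck
{
    // Assumption: a byte counts as printable when it is above the space
    // character (0x20) and is not DEL (0x7F). Tabs, line breaks, and other
    // control bytes are treated as blank.
    private static bool IsPrintableByte(byte b) => b > 0x20 && b != 0x7F;

    // Returns true as soon as one printable byte is found,
    // or false once the end of the file is reached without finding one.
    public static bool ContainsPrintable(string path, int bufferSize = 64 * 1024)
    {
        var buffer = new byte[bufferSize];

        using (FileStream stream = File.OpenRead(path))
        {
            int read;
            // Read returns 0 at end of file; otherwise the number of bytes read.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (IsPrintableByte(buffer[i]))
                    {
                        return true; // early exit: the file is not blank
                    }
                }
            }
        }

        return false;
    }
}
```

A caller could keep only the files for which ContainsPrintable(path) returns true. When thousands of files are processed, reusing a single buffer across calls instead of allocating one per file may be worth measuring.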
To check for the printing characters, you can rely on .NET utility functions like Char::IsControl, or implement your own lookup table that covers all 256 byte values.
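One possible shape for such a lookup table is sketched below. The class name PrintableTable and the method IsPrintable are hypothetical. The table is filled once from char.IsControl and char.IsWhiteSpace, on the assumption that each byte can be treated as the Latin-1 code point with the same value; under that assumption every byte from 0x80 to 0x9F, plus 0xA0, counts as blank, which may or may not suit a given encoding.

```csharp
using System;

public static class PrintableTable
{
    // One entry per possible byte value; true means "counts as printable".
    private static readonly bool[] Table = Build();

    private static bool[] Build()
    {
        var table = new bool[256];
        for (int b = 0; b < 256; b++)
        {
            // Assumption: each byte is interpreted as the Latin-1
            // code point with the same numeric value.
            char c = (char)b;
            table[b] = !char.IsControl(c) && !char.IsWhiteSpace(c);
        }
        return table;
    }

    // A single array lookup per byte, with no per-call classification logic.
    public static bool IsPrintable(byte b) => Table[b];
}
```

With this table, PrintableTable.IsPrintable((byte)'A') is true, while the bytes for a space, a line feed, and DEL (0x7F) all map to false. Whether the table is actually faster than a direct comparison is something to benchmark rather than assume.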