Pattern matching in thousands of files
Question
I have a regex pattern of words like welcome1|welcome2|changeme
... which I need to search for in thousands of files (the count varies between 100 and 8000), each ranging from 1 KB to 24 MB in size.
I would like to know if there's a faster way of pattern matching than what I have been trying.
Environment:
- jdk 1.8
- Windows 10
- Unix4j library
Here's what I have tried so far:
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
        .filter(FilePredicates.isFileAndNotDirectory())) {

    List<String> obviousStringsList = Strings_PASSWORDS.stream()
            .map(s -> ".*" + s + ".*").collect(Collectors.toList()); // because Unix4j apparently needs this

    Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));

    GrepOptions options = new GrepOptions.Default(GrepOption.count,
            GrepOption.ignoreCase,
            GrepOption.lineNumber,
            GrepOption.matchingFiles);

    Instant startTime = Instant.now();

    final List<Path> filesWithObviousStringss = stream
            .filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
            .collect(Collectors.toList());

    System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}
I get Time taken = 60 seconds, which makes me think I'm doing something really wrong.
I've tried different approaches with the stream, and on average every method takes about a minute to process my current folder of 6660 files.
Grep on msys2/mingw64 takes about 15 seconds, and exec('grep...') in node.js takes about 12 seconds, consistently.
I chose Unix4j because it provides a Java-native grep and clean code.
Is there a way to produce better results in Java that I'm sadly missing?
Answer
The main reason native tools can process such text files much faster is that they assume one particular charset, especially one with an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction is capable of supporting arbitrary charsets.
When we similarly assume a single charset with the properties named above, we can use low-level tools that may increase performance dramatically.
For such an operation, we define the following helper methods:
private static char[] getTable(Charset cs) {
    if(cs.newEncoder().maxBytesPerChar() != 1f)
        throw new UnsupportedOperationException("Not an 8 bit charset");
    byte[] raw = new byte[256];
    IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
    char[] table = new char[256];
    cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
      .decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
    for(int i = 0; i < 128; i++)
        if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
    return table;
}
and
private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
    try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
        long actualSize = fch.size();
        int size = (int)actualSize;
        if(size != actualSize) throw new UnsupportedOperationException("file too large");
        MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
        final class MappedCharSequence implements CharSequence {
            final int start, size;
            MappedCharSequence(int start, int size) {
                this.start = start;
                this.size = size;
            }
            public int length() {
                return size;
            }
            public char charAt(int index) {
                if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
                byte b = mbb.get(start + index);
                return b<0? table[b+256]: (char)b;
            }
            public CharSequence subSequence(int start, int end) {
                int newSize = end - start;
                if(start<0 || end < start || end-start > size)
                    throw new IndexOutOfBoundsException();
                return new MappedCharSequence(start + this.start, newSize);
            }
            public String toString() {
                return new StringBuilder(size).append(this).toString();
            }
        }
        return new MappedCharSequence(0, size);
    }
}
This allows mapping a file into virtual memory and projecting it directly to a CharSequence, without copy operations, assuming that the mapping can be done with a simple table and, for ASCII-based charsets, the majority of the characters do not even need a table lookup, as their numerical value is identical to the Unicode code point.
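As a quick sanity check of those charset assumptions, the table construction can be exercised on its own. The sketch below repeats the getTable logic from above so it compiles standalone; the class name TableDemo is just for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

public class TableDemo {
    // Same logic as the getTable helper above, repeated to keep this sketch self-contained.
    static char[] getTable(Charset cs) {
        if(cs.newEncoder().maxBytesPerChar() != 1f)
            throw new UnsupportedOperationException("Not an 8 bit charset");
        byte[] raw = new byte[256];
        IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
        char[] table = new char[256];
        cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
          .decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
        for(int i = 0; i < 128; i++)
            if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
        return table;
    }

    public static void main(String[] args) {
        char[] latin1 = getTable(StandardCharsets.ISO_8859_1);
        System.out.println((int) latin1['A']);   // ASCII maps to itself: 65
        System.out.println((int) latin1[0xE9]);  // Latin-1 byte 0xE9 is 'é' (U+00E9): 233
        try {
            getTable(StandardCharsets.UTF_8);    // multi-byte charset, must be rejected
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());  // Not an 8 bit charset
        }
    }
}
```

Single-byte charsets like ISO-8859-1 pass and give a 256-entry byte-to-char table, while a multi-byte charset such as UTF-8 is rejected up front.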
With these methods, you may implement the operation as:
// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
    Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
    long startTime = System.nanoTime();
    final List<Path> filesWithObviousStringss = stream//.parallel()
        .filter(path -> {
            try {
                return pattern.matcher(mapAsciiBasedText(path, table)).find();
            } catch(IOException ex) {
                throw new UncheckedIOException(ex);
            }
        })
        .collect(Collectors.toList());
    System.out.println("Time taken = "
        + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
This runs much faster than the normal text conversion, but still supports parallel execution.
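Uncommenting the .parallel() hint above spreads the per-file matching over the common fork-join pool. The following self-contained sketch shows the same stream shape; for brevity it reads each file with Files.readAllBytes instead of the mapped CharSequence, and the directory contents are made up for the demonstration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelScan {
    // Returns the regular files under dir whose content matches the pattern.
    static List<Path> scan(Path dir, Pattern pattern) throws IOException {
        try (Stream<Path> stream = Files.walk(dir).filter(Files::isRegularFile)) {
            return stream.parallel() // distribute per-file work across the fork-join pool
                .filter(p -> {
                    try { // stream lambdas cannot throw checked exceptions, hence the wrapping
                        return pattern.matcher(new String(Files.readAllBytes(p))).find();
                    } catch (IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scan");
        Files.write(dir.resolve("a.txt"), "user=welcome1".getBytes());
        Files.write(dir.resolve("b.txt"), "nothing here".getBytes());
        List<Path> hits = scan(dir, Pattern.compile("welcome1|welcome2|changeme"));
        System.out.println(hits.size()); // 1
    }
}
```

Whether parallelism pays off depends on the workload; with memory-mapped files the per-file work is mostly CPU-bound regex matching, which parallelizes well.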
Besides requiring an ASCII-based single-byte encoding, there's the restriction that this code doesn't support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn't add this complication unless really needed.
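If larger files ever did need support, one possible direction (not part of the original answer) is to scan the file in overlapping mapped windows, so that a match straddling a window boundary is still seen as long as it is no longer than the overlap. A sketch of the idea, with the byte-to-char step simplified to Latin-1 and the window sizes taken as parameters purely for illustration:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Pattern;

public class ChunkedScan {
    // Scans the file window by window; consecutive windows overlap by 'overlap'
    // bytes (assumes window > overlap), so any match no longer than the overlap
    // lies entirely within at least one window. For brevity each window is
    // copied into a StringBuilder as Latin-1; the zero-copy MappedCharSequence
    // from the answer could be plugged in instead.
    static boolean containsMatch(Path p, Pattern pattern, int window, int overlap)
            throws IOException {
        try (FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
            long size = fch.size();
            long step = (long) window - overlap;
            for (long pos = 0; pos < size; pos += step) {
                int len = (int) Math.min(window, size - pos);
                MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                StringBuilder sb = new StringBuilder(len);
                for (int i = 0; i < len; i++) sb.append((char) (mbb.get(i) & 0xFF));
                if (pattern.matcher(sb).find()) return true;
                if (pos + len >= size) break; // last window reached the end of the file
            }
        }
        return false;
    }
}
```

In practice the window would be hundreds of megabytes and the overlap a small bound on the longest possible match, so the rescanned overlap is negligible.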