Groovy/Java: Parallel processing of directory structure where each node is a list of subdirectories/files

Question

Here is my current problem:

I have a directory structure stored in cloud storage. Under the Root folder there are 1000+ subdirectories, and each of those has a single subdirectory under it. Each of those second-level subdirectories contains a single file. A simplified diagram looks something like this:

                      Root
       ________________|________________
      |         |             |         |
   FolderA   FolderB  ...  FolderY   FolderZ
      |         |             |         |
   Folder1   Folder2       Folder3   Folder4
      |         |             |         |
    FileA     FileB         FileC     FileD

Each node has two properties: type ("directory" or "file") and path (for example "/Root/FolderB"). The only way to retrieve these nodes is to call a method named listDirectory(path), which goes to the cloud and returns all the objects within that path. I need to find all the files and process them.

The problem is that, with the way it's structured, if I want to reach FileA I need to call listDirectory() three times (Root -> FolderA -> Folder1), which, as you can imagine, slows the whole thing down significantly.

I want to process this in parallel, but I can't seem to get it to work. I've tried doing it recursively using GParsPool.withPool with eachParallel(), but I found that combining parallel programming with recursion can be a dangerous (and expensive) slope. I've also tried doing it linearly by creating a synchronized list that holds the paths of all the directories each thread has visited. Neither approach worked or gave an efficient solution to this problem.

FYI, I can't change the listDirectory() method. Each call retrieves all the objects in that path.

TL;DR: I need a parallel way to process a cloud-storage file structure where the only way to get the folders/files is through a listDirectory(path) method.

Recommended answer

If caching the directory structure in memory with a daemon is not an option, and neither is building a one-time in-memory map of the storage structure that you keep current by hooking into every add, remove, and update operation on the storage, then you have to traverse the storage itself.

Assuming the storage structure is a tree (it usually is), and given how listDirectory() works, I think you are better off using breadth-first search to search it. Every directory on a given level can be listed independently of the others, so you can search one level at a time in parallel.

Your code could look something like this:

SearchElement.java - represents either a directory or a file

// Immutable value object for one entry (directory or file) in the storage tree.
public class SearchElement {

    private final String path;
    private final String name;

    public SearchElement(String path, String name) {
        this.path = path;
        this.name = name;
    }

    public String getPath() {
        return path;
    }

    public String getName() {
        return name;
    }

}

ElementFinder.java - a class that searches the storage; replace the listDirectory function with your own implementation

import java.util.ArrayList;
import java.util.Collection;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

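public class ElementFinder {
    // Root of the storage tree; the breadth-first search starts here.
    private static final SearchElement ROOT_DIRECTORY_PATH = new SearchElement("/", "");

    /**
     * Breadth-first search for an element with the given name.
     * All elements of one level are listed in parallel before the next level starts.
     */
    public Optional<SearchElement> find(String elementName) {
        Queue<SearchElement> currentLevelElements = new ConcurrentLinkedQueue<>();
        currentLevelElements.add(ROOT_DIRECTORY_PATH);

        // Shared across worker threads; holds a match once any thread finds one.
        AtomicReference<Optional<SearchElement>> wantedElement = new AtomicReference<>(Optional.empty());

        // isPresent() rather than isEmpty() so this also compiles on Java 8.
        while (!currentLevelElements.isEmpty() && !wantedElement.get().isPresent()) {
            // Thread-safe queue, because several threads add to it concurrently.
            Queue<SearchElement> nextLevelElements = new ConcurrentLinkedQueue<>();
            currentLevelElements.parallelStream().forEach(currentSearchElement -> {
                Collection<SearchElement> subDirectoriesAndFiles = listDirectory(currentSearchElement.getPath());

                subDirectoriesAndFiles.stream()
                        .filter(searchElement -> searchElement.getName().equals(elementName))
                        .findAny()
                        .ifPresent(element -> wantedElement.set(Optional.of(element)));

                // SearchElement does not record whether an entry is a file or a directory,
                // so files are queued too and will be passed to listDirectory on the next level.
                nextLevelElements.addAll(subDirectoriesAndFiles);
            });

            currentLevelElements = nextLevelElements;
        }

        return wantedElement.get();
    }

    // Placeholder: swap in the real call to your cloud storage.
    private Collection<SearchElement> listDirectory(String path) {
        return new ArrayList<>(); // replace me!
    }
}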
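
The class above stops at the first element whose name matches, while the question asks for every file. Below is a minimal sketch of the same level-by-level idea adapted to collect all file paths. The names ParallelFileCollector, Node, and collectFiles are hypothetical and not part of the original answer, and listDirectory is passed in as a function because its real signature and return type are not shown in the question. Node simply mirrors the "type" and "path" properties the question describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

class ParallelFileCollector {

    // Hypothetical node type with the two properties named in the question.
    static final class Node {
        final String type; // "directory" or "file"
        final String path;

        Node(String type, String path) {
            this.type = type;
            this.path = path;
        }

        boolean isDirectory() {
            return "directory".equals(type);
        }
    }

    // Walks the tree one level at a time. Every directory of the current level
    // is listed in parallel; files are collected, subdirectories form the next level.
    // The listDirectory parameter stands in for the real cloud call (an assumption).
    static List<String> collectFiles(String rootPath, Function<String, List<Node>> listDirectory) {
        Queue<String> files = new ConcurrentLinkedQueue<>();
        List<String> currentLevel = Collections.singletonList(rootPath);

        while (!currentLevel.isEmpty()) {
            Queue<String> nextLevel = new ConcurrentLinkedQueue<>();
            currentLevel.parallelStream().forEach(directory -> {
                for (Node child : listDirectory.apply(directory)) {
                    if (child.isDirectory()) {
                        nextLevel.add(child.path);
                    } else {
                        files.add(child.path);
                    }
                }
            });
            currentLevel = new ArrayList<>(nextLevel);
        }
        return new ArrayList<>(files);
    }

    public static void main(String[] args) {
        // In-memory stand-in for the cloud storage, for demonstration only.
        Map<String, List<Node>> fakeStorage = new HashMap<>();
        fakeStorage.put("/Root", Arrays.asList(
                new Node("directory", "/Root/FolderA"),
                new Node("directory", "/Root/FolderB")));
        fakeStorage.put("/Root/FolderA", Arrays.asList(new Node("directory", "/Root/FolderA/Folder1")));
        fakeStorage.put("/Root/FolderB", Arrays.asList(new Node("directory", "/Root/FolderB/Folder2")));
        fakeStorage.put("/Root/FolderA/Folder1", Arrays.asList(new Node("file", "/Root/FolderA/Folder1/FileA")));
        fakeStorage.put("/Root/FolderB/Folder2", Arrays.asList(new Node("file", "/Root/FolderB/Folder2/FileB")));

        List<String> found = collectFiles("/Root",
                path -> fakeStorage.getOrDefault(path, Collections.emptyList()));
        Collections.sort(found); // completion order across threads is not deterministic
        System.out.println(found);
    }
}
```

With the layout from the question, this needs three rounds of listing (Root, then the 1000+ first-level folders, then the 1000+ second-level folders), and the calls within each round run concurrently. One caveat: parallelStream() runs on the common ForkJoinPool, whose default parallelism is tied to the number of CPU cores, so for slow network calls a dedicated thread pool with more threads may be worth trying.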
