Java: ASCII random line file access with state


Question

Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that meets the following criteria?


  • Given an ASCII file of arbitrarily large size where each line is terminated by a \n
  • For each invocation of some method readLine(), read a random line from the file
  • For the life of the file handle, no call to readLine() should return the same line twice

Update:

  • All lines must eventually be read

Context: the file's contents are created from Unix shell commands that produce a directory listing of all paths contained within a given directory; there are anywhere from millions to a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into a file at creation time, that is an acceptable solution as well.

Recommended answer

If the number of files is truly arbitrary, tracking which files have already been processed could become a problem in terms of memory usage (or I/O time, if tracking in files instead of an in-memory list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.

I'd consider something along the lines of the following:


  1. Create n "bucket" files. n could be chosen based on the number of files and the available system memory. (If n is large, you could generate a subset of the n buckets at a time to keep the number of open file handles down.)
  2. Each file's name is hashed and goes into the appropriate bucket file, "sharding" the directory based on arbitrary criteria.
  3. Read in a bucket file's contents (just filenames) and process them as-is (randomness provided by the hashing mechanism), or pick rnd(n) and remove as you go, providing a bit more randomness.
  4. Alternatively, you could pad the lines and use the random access idea, removing indices/offsets from a list as they're picked.
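
Steps 1 to 3 could be sketched roughly as follows. This is only an illustration under several assumptions, not a tested production design: the class name BucketShuffle and the methods shard and drain are hypothetical, the hash is simply String.hashCode(), all n bucket files are kept open at once, each bucket is assumed to fit in memory, and the code needs Java 8 or later (for Math.floorMod and try-with-resources) rather than the Java 1.6 mentioned in the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: shards a path listing into bucket files, then shuffles each bucket. */
final class BucketShuffle {

    /** Steps 1-2: stream the listing once and hash every line into one of n bucket files. */
    static List<Path> shard(Path listing, Path bucketDir, int n) throws IOException {
        List<Path> buckets = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            for (int i = 0; i < n; i++) {
                Path p = bucketDir.resolve("bucket-" + i + ".txt");
                buckets.add(p);
                writers.add(Files.newBufferedWriter(p, StandardCharsets.US_ASCII));
            }
            // US_ASCII follows the question's premise; non-ASCII bytes would raise an exception.
            try (BufferedReader in = Files.newBufferedReader(listing, StandardCharsets.US_ASCII)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // floorMod keeps the index non-negative even for negative hash codes.
                    BufferedWriter w = writers.get(Math.floorMod(line.hashCode(), n));
                    w.write(line);
                    w.write('\n');
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
        return buckets;
    }

    /** Step 3: load a single bucket into memory and shuffle it for extra randomness. */
    static List<String> drain(Path bucket, Random rnd) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(bucket, StandardCharsets.US_ASCII));
        Collections.shuffle(lines, rnd);
        return lines;
    }
}
```

A caller would run shard once, shuffle the returned list of bucket paths, and then drain one bucket at a time. Because every line lands in exactly one bucket and each bucket is consumed once, no line is returned twice and every line is eventually read, provided the lines in the listing are themselves unique.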
