How to split a file into equal parts, without breaking individual lines?
Question
I was wondering if it is possible to split a file into equal parts (edit: all equal except for the last) without breaking lines. Using the split command in Unix, lines may be broken in half. Is there a way to, say, split a file into 5 equal parts, but have each part consist only of whole lines (it's no problem if one of the files is a little larger or smaller)? I know I could just calculate the number of lines, but I have to do this for a lot of files in a bash script. Many thanks!
Recommended answer
If you mean an equal number of lines, split has an option for this:
split --lines=75
If you need to know what that 75 should really be for N equal parts, it's:
lines_per_part = int((total_lines + N - 1) / N)
where total_lines can be obtained with wc -l.
See the following script for an example:
#!/usr/bin/env bash
# Configuration stuff
fspec=qq.c
num_files=6
# Work out lines per file (integer ceiling of total_lines / num_files).
total_lines=$(wc -l <"${fspec}")
((lines_per_file = (total_lines + num_files - 1) / num_files))
# Split the actual file, maintaining lines.
split --lines="${lines_per_file}" "${fspec}" xyzzy.
# Debug information
echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*
This outputs:
Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total
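Since the question mentions doing this for many files, the same calculation can be wrapped in a function and applied in a loop. The following is only a sketch: the function name split_into_parts, the choice of 5 parts, and the ".part." suffix are illustrative assumptions, not part of the original answer.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption): split one file into at most
# num_parts pieces made of whole lines.
split_into_parts() {
    local file=$1 num_parts=$2 prefix=$3
    local total_lines lines_per_part
    total_lines=$(wc -l <"$file")
    # Integer ceiling division, as in the formula above.
    lines_per_part=$(( (total_lines + num_parts - 1) / num_parts ))
    # Guard against --lines=0, which split rejects (e.g. for an empty file).
    (( lines_per_part > 0 )) || lines_per_part=1
    split --lines="$lines_per_part" "$file" "$prefix"
}

# Split every file named on the command line into 5 parts,
# writing pieces such as data.txt.part.aa, data.txt.part.ab, ...
for f in "$@"; do
    split_into_parts "$f" 5 "${f}.part."
done
```

Note that wc -l counts newline characters, so a file whose last line has no trailing newline is counted one line short, and the last piece may be slightly larger than expected.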
More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:
split --number=l/6 ${fspec} xyzzy.
(that's ell-slash-six, meaning lines, not one-slash-six).
That will give you files of roughly equal size, with no mid-line splits.
I mention that last point because it doesn't give you roughly the same number of lines in each file, but rather roughly the same number of characters.
So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split into five files, you most likely won't get four lines in every file.