清理大文本文件 [英] Cleaning a large text file

查看:311
本文介绍了清理大文本文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

导入到MS Access之前,我尝试过大量的脚本和方法来清理大文本文件。



文本文件是500k +行。一些行包含回车或换行符。这些在记事本中显示为方形符号。 (有趣的是,在Windows XP中,它们是正方形,但在Windows 2003中,它们不会出现在记事本中,而是将文本打破到下一行/行。



每个字段都应该因此,我需要一种从文件中删除所有这些的方法。



文本文件内容示例:

  FIELD_NAME1 | FIELD_NAME2 | FIELD_NAME3 
约翰|他喜欢食物| 1002
杰克|他吃食物| 1004
杰克|他吃食物和[] []喜欢游泳| 1003

1)一个解决方案是阅读文件和修复行。但是让这个工作有困难。通常,您只会根据以下行中的错误,意识到该行是错误的。



2)另一个是将文本文件分割成较小的位。然后使用查找和替换。一旦被清理 - 粘在一起成MS Access。



有没有一个简单的方法呢?



任务只需要运行几次,所以自动化并不重要。



由dmuk添加的分析输出,然后由Tony Dallimore编辑 / p>

有关此分析输出的说明,请参阅我的(Tony Dallimore)答案。我没想到可以找到这么长的一串控制字符(由例如44个空白行引起)。我已经将这些长字符串包含在第1列中,以提高可读性。

  String |文件|线|文件|行
13 10 | 1 | 1 | 376 | 626
9 | 1 | 2299 | 375 | 3524
9 9 | 3 | 6106 | 67 | 6111
9 9 9 9 | 6 | 1916 | 53 | 1492
9 9 9 | 6 | 1917 | 53 | 1493
9 9 9 9 9 | 42 | 1266 | 42 | 1266
10 | 69 | 1524 | 240 | 4885
10 10 | 69 | 3577 | 222 | 4651
13 10 13 10 | 71 | 3697 | 374 | 3258
13 10 10 | 80 | 5440 | 240 | 4166
13 10 13 10 13 | 81 | 2657 | 290 | 2094
10 13 10 | | | |
13 10 13 10 13 | 81 | 2662 | 215 | 1802
10 | | | |
13 10 13 10 10 | 86 | 2082 | 86 | 6914
10 10 10 | 88 | 1314 | 221 | 4754
9 10 | 94 | 246 | 94 | 246
13 10 13 10 13 | 126 | 1699 | 126 | 1699
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | 143 | 2078 | 143 | 2078
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 | | | |
10 10 10 10 | 182 | 1846 | 188 | 2663
10 10 10 10 10 | 195 | 3320 | 195 | 3320
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 | | | |
13 10 13 10 13 | 198 | 4223 | 198 | 4223
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 | 198 | 4223 | 198 | 4223
10 10 10 10 10 | 213 | 5449 | 213 | 5449
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 10 10 10 10 | | | |
10 | | | |
13 10 13 10 13 | 278 | 788 | 278 | 788
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 13 | | | |
10 13 10 13 10 | | | |
13 10 13 10 | | | |


解决方案

简介



最初出现的问题是额外的回车。第一个解决方案(已被删除)搜索单个CR并将其删除。这没有任何有益的效果,所以很明显,问题不是额外的回车。我提供了下面的分析代码,以便我们可以适当地评估真实的场景。该分析程序的输出被添加到原始问题。对此输出的回顾显示,真正的问题是:




  • 大量的空行。




基于这些发现的修订解决方案在分析代码下面。



分析



您需要将以下代码包含在模块中。该例程需要一个名为DiagInfo的工作表。



代码循环从输入文件读取约1 Mb的块。它将每个块分成具有用作行终止符的任何控制字符的行。它每个块创建一个输出文件。



在例程的顶部附近,您会发现:

 '######根据需要替换名称
FileInNameRoot =TestSplitLine In
FileOutNameRoot =TestSplitLine Out

输入文件是: FileInNameRoot& .txt



输出文件命名为: FileOutNameRoot& 001.txt FileOutNameRoot& 002.txt FileOutNameRoot& 003.txt等。



如果你愿意,你可以从1 Mb更改块大小。该程序的速度非常快,块大小为1,000,000,但是可以获得十倍的输出文件。我发现1 Mb给我可以用NotePad轻松访问的文件。



输出如下:

  000001 FIELD_NAME1 | FIELD_NAME2 | FIELD_NAME3 13 10 
000002约翰|他喜欢食物| 1002 13 10
000003杰克|他吃食物| 1004 13 10
000004杰克|他吃食物13
000005喜欢游泳| 1003 13 10
000006约翰|他喜欢食物| 1002 13 10
000007杰克|他吃食物| 1004 13 10
000008杰克|他吃的食物和20 27 0 4

前七个字符是后面跟着的行号空间。一条线以任何控制字符结束。来自输入文件的显示字符输出不变。每个控制字符输出为空格,后跟其代码值。大多数行以13 10(CR LF)终止,但第4行以13(CR)终止,第8行由20 27 0 4(DC4 ESC NUL EOT)终止。



工作表DiagInfo如下所示:

 第一个最后
字符串文件行文件行
13 10 1 1 66 5786
13 1 4 66 5666
20 27 0 4 1 8 66 5670

列A包含由例程发现的每个不同的控制字符串。列B和C包含第一次出现的文件和行号。列D和E包含最后一次出现的文件和行号。



该例程使用工作表DiagInfo作为粗略进度指示符,最后一行显示当前输出文件号和最后一行的数字是100的倍数。使用我的63Mb测试文件,例程花了2分钟。



这将告诉我们我们正在处理什么并允许我们相应地进行规划。

  Option Explicit 
Sub AnalyseFileAndSplitIntoBlocks()

Dim Block As String
Dim BlockLen As Long
Dim CtrlChr As Long
Dim CtrlChrStg As String
Dim FileIn As Object
Dim FileInNameRoot As String
Dim FileOut As Object
Dim FileOutNameRoot As String
Dim Found As Boolean
Dim FSO As Object
Dim LineOut As String
Dim NumFileOut As Long
Dim NumLine As Long
Dim PathCrnt As String
Dim PosCrnt As Long
Dim PosStart As Long
Dim RowDiagCrnt As Long
Dim RowDiagNext As Long
Dim StartTime As Single
Dim TrailingFromLastBlock As String

StartTime = Timer

'#### ##根据需要替换名称
FileInNameRoot =TestSplitLine In
FileOutNameRoot =TestSplitLine Out

带有工作表(DiagInfo)
。激活
.Cells.EntireRow.Delete
.Range(B1:C1)。合并
带.Range(B1)
.Value =First
.Horizo​​ntalAlignment = xlCenter
结束
.Range(D1:E1)。合并
带.Range(D1)
.Value =Last
。 Horizo​​ntalAlignment = xlCenter
End with
.Range(A2)。Value =String
.Range(B2)。Value =File
.Range C2)Value =Line
.Range(D2)。Value =File
.Range(E2)。Value =Line
.Range (B2:E2)。Horizo​​ntalAlignment = xlRight
.Range(A1:E2)。Font.Bold = True
RowDiagNext = 3
.Cells(RowDiag接下来,1)。选择
结束
ActiveWindow.FreezePanes = False
ActiveWindow.FreezePanes = True

PathCrnt = ActiveWorkbook.Path
设置FSO = CreateObject(Scripting.FileSystemObject)
BlockLen = 1000000

设置FileIn = FSO.OpenTextFile(PathCrnt& \& FileInNameRoot& .txt,1,0)
'1 =读取。 0 = ASCII文件

NumFileOut = 0
TrailingFromLastBlock =

尽管FileIn.AtEndOfStream<> True
Block = TrailingFromLastBlock& FileIn.read(BlockLen)
Do While True
'确保块不会在控制字符串的中间分割
If(Right(Block,1)<或Right ,1)= Chr(127))和_
FileIn.AtEndOfStream<> True Then
'块的最后一个字符是一个控制字符。获取另一个
Block = Block& FileIn.read(1)
Else
退出Do
结束如果
循环

与Worksheets(DiagInfo)
NumFileOut = NumFileOut + 1
.Cells(RowDiagNext,2).Value = NumFileOut
NumLine = 1
.Cells(RowDiagNext,3).Value = NumLine
End with

设置FileOut = FSO.CreateTextFile(PathCrnt&\& FileOutNameRoot&& _
右(000& NumFileOut,3)&.txt,True ,False)
'True =可以覆盖。 False = ASCII

PosStart = 1'从第一行开始
PosCrnt = 1
尽管PosCrnt <= Len(块)
如果Mid(Block,PosCrnt ,1) 或_
中(块,PosCrnt,1)= Chr(127)然后
'找到一个控制字符。
LineOut = Mid(Block,PosStart,PosCrnt - PosStart)
'构建控制字符的显示字符串和
'任何后续控制字符。
CtrlChrStg =
Do While True
CtrlChrStg = CtrlChrStg& & Asc(Mid(Block,PosCrnt,1))
PosCrnt = PosCrnt + 1
如果PosCrnt> Len(Block)然后
'这个块完成
Exit Do
End If
如果Mid(Block,PosCrnt,1) 或_
中(块,PosCrnt,1)= Chr(127)然后
'另一个控制字符
Else
'下一行的第一个显示字符
退出Do
结束If
循环
'在工作表中搜索控件字符串DiagInfo
带有工作表(DiagInfo)
Found = False
对于RowDiagCrnt = 3 To RowDiagNext - 1
如果.Cells(RowDiagCrnt,1).Value = CtrlChrStg然后
Found = True
退出
结束如果
下一个
如果未找到则
'以前未知的控制字符串
RowDiagCrnt = RowDiagNext
RowDiagNext = RowDiagNext + 1
.Cells(RowDiagNext,1)。选择
。单元格(RowDiagCrnt,1).Value ='& CtrlChrStg
'第一次出现
.Cells(RowDiagCrnt,2).Value = NumFileOut
.Cells(RowDiagCrnt,3).Value = NumLine
End If
'Last发生
.Cells(RowDiagCrnt,4).Value = NumFileOut
.Cells(RowDiagCrnt,5).Value = NumLine
End with
FileOut.writeline Right(00000& ; NumLine,6)& & _
LineOut& CtrlChrStg
PosStart = PosCrnt'当前行的开始
NumLine = NumLine + 1
如果NumLine Mod 100 = 0然后
与Worksheets(DiagInfo)
.Cells (RowDiagNext,2).Value = NumFileOut
.Cells(RowDiagNext,3).Value = NumLine
End with
End If
Else
PosCrnt = PosCrnt + 1
结束如果
循环
FileOut.Close
'保存下一行的尾随字符
TrailingFromLastBlock = Mid(Block,PosStart,Len(Block) - PosStart + 1)
循环

FileIn.Close

与Worksheets(DiagInfo)
.Cells(RowDiagNext,2).Value =
.Cells(RowDiagNext,3).Value =
.Cells(3,1)。选择
.Cells.Columns.AutoFit
结束

Debug.Print Timer - StartTime

End Sub

修订解决方案



评论ana裂解输出显示真正的问题是:




  • 大量的空行。

  • 额外的行饲料。



文本中还有标签,但提问者决定这些不是问题,而是被保留。提问者希望删除空白行,换行替换为空格。



下面的例程读取100,000个字节的块中的输入文件。与更新长字符串相关联的重要开销。有限的实验表明,100,000是可接受的妥协。如果块的最后一个字符是控制字符,则例程循环向块添加另一个字符,直到最后一个字符不是控制字符。这样可以确保不会在两个块之间分割控制字符序列。例程首先循环使用 CR LF 替换 CR LF CR LF ,直到没有空行。然后,例程将寻找 CR 之前的 LF 。找到的任何空格都被替换。在一个63 Mb文件中,大量的空白行和额外的 LF s,例程花了22秒完成任务。



唯一需要更改的语句位于例程的顶部。

  Option Explicit 
Sub RemoveUnwantedCtrlChars ()

Dim Block As String
Dim BlockLen As Long
Dim FileIn As Object
Dim FileInName As String
Dim FileOut As Object
Dim FileOutName As String
Dim FSO As Object
Dim PathCrnt As String
Dim PosCRLF As Long
Dim PosLF As Long
Dim PosLastCRLF As Long
Dim PosLastLF As Long
Dim StartTime As Single

StartTime = Timer

'##这假定输入文件位于同一个文件夹
'##该工作簿包含此宏。
PathCrnt = ActiveWorkbook.Path

'######根据需要替换名称。
FileInName =TestSplitLine In.txt
FileOutName =TestSplitLine Out.txt

设置FSO = CreateObject(Scripting.FileSystemObject)
BlockLen = 100000

设置FileIn = FSO.OpenTextFile(PathCrnt&\& FileInName,1,0)
'1 =读取。 0 = ASCII文件

设置FileOut = FSO.CreateTextFile(PathCrnt&\& FileOutName,True,False)
'True =可以覆盖。 False = ASCII

尽管FileIn.AtEndOfStream<> True
Block = FileIn.Read(BlockLen)
Do While True
'确保块不分割在一个字符串的控制字符串的中间
If(Right(Block,1)< ;或右(块,1)= Chr(127))和_
FileIn.AtEndOfStream<> True Then
'块的最后一个字符是一个控制字符。获取另一个
'字符
Block = Block& FileIn.Read(1)
Else
退出Do
结束如果
循环
'删除所有空行
Do While InStr(1,Block,vbCr & vbLf& vbCr& vbLf) 0
Block = Replace(Block,vbCr& vbLf& vbCr& vbLf,vbCr& vbLf)
循环
'查找所有孤立LF并替换为
PosLF = 1
PosCRLF = 1
Do While True
PosLastLF = PosLF
PosLastCRLF = PosCRLF
PosLF = InStr(PosLF,Block,vbLf)
PosCRLF = InStr(PosCRLF,Block,vbCr& vbLf)
如果PosLF = 0然后
'此块中没有更多LF
退出Do
ElseIf PosCRLF& 0和PosLF> PosCRLF然后
'有LF LF的LF。不需要任何操作
PosLF = PosLF + 1
PosCRLF = PosLF
Else
'有一个孤独的LF
Block = Mid(Block,1,PosLF - 1)& ; &中间(块,PosLF + 1)
'将CRLF指针移回到替换的位置LF
PosCRLF = PosLF
如果
循环
PosLF = 1
FileOut.write块
循环

FileIn.Close
FileOut.Close

Debug.Print计时器 - StartTime

End Sub


I have tried numerous scripts and approaches to clean a large text file before importing into MS Access.

The text file is 500k+ lines. Some lines contain 'carriage returns' or 'line breaks'. These are displayed as square symbols in Notepad. (Interestingly in Windows XP they are squares, but in Windows 2003 they don't appear in Notepad but do break text onto the next line/row.

Each field should have no occurrences of these. Therefore I need a way of removing all of these from the file.

Example of text file contents:

 FIELD_NAME1|FIELD_NAME2                         |FIELD_NAME3
 John       |He likes food                       |1002
 Jake       |He eats food                        |1004
 Jake       |He eats food and [][] likes swimming|1003

1) One solution was to read through the file and repair rows. However difficulty in getting this to work. Typically you only realise the row is erroneous based on errors in following rows.

2) Another is to split the text file into smaller bits. Then use find and replace. Once cleansed - stick back together into MS Access.

Is there a simple way to do this?

This task only has to be run a couple of times so automation is not crucial.

Analysis output added by dmuk and then editted by Tony Dallimore

See my (Tony Dallimore) answer for an explanation of this analysis output. I had not expected such long string of control characters (caused by, for example, 44 blank lines) to be found. I have wrapped these long strings in column 1 to improve readability.

String         |       File    |       Line    |       File    |       Line
 13 10         |       1       |       1       |       376     |       626
 9             |       1       |       2299    |       375     |       3524
 9 9           |       3       |       6106    |       67      |       6111
 9 9 9 9       |       6       |       1916    |       53      |       1492
 9 9 9         |       6       |       1917    |       53      |       1493
 9 9 9 9 9     |       42      |       1266    |       42      |       1266
 10            |       69      |       1524    |       240     |       4885
 10 10         |       69      |       3577    |       222     |       4651
 13 10 13 10   |       71      |       3697    |       374     |       3258
 13 10 10      |       80      |       5440    |       240     |       4166
 13 10 13 10 13|       81      |       2657    |       290     |       2094
 10 13 10      |               |               |               |
 13 10 13 10 13|       81      |       2662    |       215     |       1802
 10            |               |               |               |
 13 10 13 10 10|       86      |       2082    |       86      |       6914
 10 10 10      |       88      |       1314    |       221     |       4754
 9 10          |       94      |       246     |       94      |       246
 13 10 13 10 13|       126     |       1699    |       126     |       1699
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|       143     |       2078    |       143     |       2078
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10   |               |               |               |
 10 10 10 10   |       182     |       1846    |       188     |        2663
 10 10 10 10 10|       195     |       3320    |       195     |        3320
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10   |               |               |               |
 13 10 13 10 13|       198     |       4223    |       198     |       4223
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10      |       198     |       4223    |       198     |       4223
 10 10 10 10 10|       213     |       5449    |       213     |       5449
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10 10 10 10 10|               |               |               |
 10            |               |               |               |
 13 10 13 10 13|       278     |       788     |       278     |       788
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10 13|               |               |               |
 10 13 10 13 10|               |               |               |
 13 10 13 10   |               |               |               |

解决方案

Introduction

It appeared at first that the problem was extra carriage returns. The first solution (which has been deleted) searched for lone CRs and removed them. This had no beneficial effect so it became clear that the problem was not extra carriage returns. I provided the analysis code below so we could properly assess the true sitiation. The output from this analysis routine was added to the original question. Review of this output revealed the true problems were:

  • A large number of blank lines.
  • extra line feeds.

A revised solution based on these findings is below the analysis code.

Analysis

You need to include the code below in a module. The routine requires a worksheet named "DiagInfo".

The code loops reading blocks of approximately 1 Mb from the input file. It splits each block into lines with any control character serving as a line terminator. It creates one output file per block.

Near the top of the routine, you will find:

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

The input file is: FileInNameRoot & ".txt".

The output files are named: FileOutNameRoot & " 001.txt", FileOutNameRoot & " 002.txt", FileOutNameRoot & " 003.txt", etc.

You can change the block size from 1 Mb if you wish. The routine is very slighly faster with a block size of 1,000,000 but you get ten times more output files. I find 1 Mb gives me files that can be accessed easily with NotePad.

The output looks like:

000001 FIELD_NAME1|FIELD_NAME2|FIELD_NAME3  13 10
000002 John|He likes food|1002  13 10
000003 Jake|He eats food|1004  13 10
000004 Jake|He eats food and  13
000005 likes swimming|1003 13 10
000006 John|He likes food|1002  13 10
000007 Jake|He eats food|1004  13 10
000008 Jake|He eats food and  20 27 0 4

The first seven characters are the line number followed by space. A line is ended by any control character. The display characters from the input file are output unchanged. Each control character is output as space followed by its code value. Most lines are terminated by 13 10 (CR LF), but line 4 is terminated by 13 (CR) and line 8 is terminated by 20 27 0 4 (DC4 ESC NUL EOT).

The worksheet "DiagInfo" looks like:

               First          Last  
String      File   Line    File   Line
 13 10        1       1     66    5786
 13           1       4     66    5666
 20 27 0 4    1       8     66    5670

Column A contains every different string of control characters found by the routine. Columns B and C contain the file and line number of the first occurrence. Columns D and E contain the file and line number of the last occurrence.

The routine uses worksheet "DiagInfo" as a crude progress indicator with the last line showing the current output file number and the last line number that was a multiple of 100. With my 63Mb test file, the routine took 2 minutes.

This will tell us what we are dealing with and allow us to plan accordingly.

Option Explicit 
Sub AnalyseFileAndSplitIntoBlocks()

  Dim Block As String
  Dim BlockLen As Long
  Dim CtrlChr As Long
  Dim CtrlChrStg As String
  Dim FileIn As Object
  Dim FileInNameRoot As String
  Dim FileOut As Object
  Dim FileOutNameRoot As String
  Dim Found As Boolean
  Dim FSO As Object
  Dim LineOut As String
  Dim NumFileOut As Long
  Dim NumLine As Long
  Dim PathCrnt As String
  Dim PosCrnt As Long
  Dim PosStart As Long
  Dim RowDiagCrnt As Long
  Dim RowDiagNext As Long
  Dim StartTime As Single
  Dim TrailingFromLastBlock As String

  StartTime = Timer

  ' ###### Replace names as required
  FileInNameRoot = "TestSplitLine In"
  FileOutNameRoot = "TestSplitLine Out"

  With Worksheets("DiagInfo")
    .Activate
    .Cells.EntireRow.Delete
    .Range("B1:C1").Merge
    With .Range("B1")
      .Value = "First"
      .HorizontalAlignment = xlCenter
    End With
    .Range("D1:E1").Merge
    With .Range("D1")
      .Value = "Last"
      .HorizontalAlignment = xlCenter
    End With
    .Range("A2").Value = "String"
    .Range("B2").Value = "File"
    .Range("C2").Value = "Line"
    .Range("D2").Value = "File"
    .Range("E2").Value = "Line"
    .Range("B2:E2").HorizontalAlignment = xlRight
    .Range("A1:E2").Font.Bold = True
    RowDiagNext = 3
    .Cells(RowDiagNext, 1).Select
  End With
  ActiveWindow.FreezePanes = False
  ActiveWindow.FreezePanes = True

  PathCrnt = ActiveWorkbook.Path
  Set FSO = CreateObject("Scripting.FileSystemObject")
  BlockLen = 1000000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "\" & FileInNameRoot & ".txt", 1, 0)
  '  1 = Read.  0 = ASCII file

  NumFileOut = 0
  TrailingFromLastBlock = ""

  Do While FileIn.AtEndOfStream <> True
    Block = TrailingFromLastBlock & FileIn.read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
                                         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character.  Get another
        Block = Block & FileIn.read(1)
      Else
        Exit Do
      End If
    Loop

    With Worksheets("DiagInfo")
      NumFileOut = NumFileOut + 1
      .Cells(RowDiagNext, 2).Value = NumFileOut
      NumLine = 1
      .Cells(RowDiagNext, 3).Value = NumLine
    End With

    Set FileOut = FSO.CreateTextFile(PathCrnt & "\" & FileOutNameRoot & " " & _
                            Right("000" & NumFileOut, 3) & ".txt", True, False)
    ' True = Can overwrite.  False = ASCII

    PosStart = 1        ' Start of first line
    PosCrnt = 1
    Do While PosCrnt <= Len(Block)
      If Mid(Block, PosCrnt, 1) < " " Or _
         Mid(Block, PosCrnt, 1) = Chr(127) Then
        ' Have found a control character.
        LineOut = Mid(Block, PosStart, PosCrnt - PosStart)
        ' Build display string of control character and
        ' any subsequent control characters.
        CtrlChrStg = ""
        Do While True
          CtrlChrStg = CtrlChrStg & " " & Asc(Mid(Block, PosCrnt, 1))
          PosCrnt = PosCrnt + 1
          If PosCrnt > Len(Block) Then
            ' This block finished
            Exit Do
          End If
          If Mid(Block, PosCrnt, 1) < " " Or _
             Mid(Block, PosCrnt, 1) = Chr(127) Then
            ' Another control character
          Else
            ' First display character of next line
            Exit Do
          End If
        Loop
        ' Search for control character string in worksheet DiagInfo
        With Worksheets("DiagInfo")
          Found = False
          For RowDiagCrnt = 3 To RowDiagNext - 1
            If .Cells(RowDiagCrnt, 1).Value = CtrlChrStg Then
              Found = True
              Exit For
            End If
          Next
          If Not Found Then
            ' Previously unknown string of control characters
            RowDiagCrnt = RowDiagNext
            RowDiagNext = RowDiagNext + 1
            .Cells(RowDiagNext, 1).Select
            .Cells(RowDiagCrnt, 1).Value = "'" & CtrlChrStg
            ' First occurrence
            .Cells(RowDiagCrnt, 2).Value = NumFileOut
            .Cells(RowDiagCrnt, 3).Value = NumLine
          End If
          ' Last occurrence
          .Cells(RowDiagCrnt, 4).Value = NumFileOut
          .Cells(RowDiagCrnt, 5).Value = NumLine
        End With
        FileOut.writeline Right("00000" & NumLine, 6) & " " & _
                                                     LineOut & CtrlChrStg
        PosStart = PosCrnt          ' Start of current line
        NumLine = NumLine + 1
        If NumLine Mod 100 = 0 Then
          With Worksheets("DiagInfo")
           .Cells(RowDiagNext, 2).Value = NumFileOut
           .Cells(RowDiagNext, 3).Value = NumLine
          End With
        End If
      Else
        PosCrnt = PosCrnt + 1
      End If
    Loop
    FileOut.Close
    ' Save trailing characters for next line
    TrailingFromLastBlock = Mid(Block, PosStart, Len(Block) - PosStart + 1)
  Loop

  FileIn.Close

  With Worksheets("DiagInfo")
    .Cells(RowDiagNext, 2).Value = ""
    .Cells(RowDiagNext, 3).Value = ""
    .Cells(3, 1).Select
    .Cells.Columns.AutoFit
  End With

  Debug.Print Timer - StartTime

End Sub

Revised solution

Review of the analysis output revealed the true problems were:

  • A large number of blank lines.
  • extra line feeds.

There were also tabs within the text but the questioner decided these were not a problem and were to be retained. The questioner wanted the blank lines removed and the line feeds replaced by spaces.

The routine below reads the input files in blocks of 100,000 bytes. There are significant overheads associated with updating long strings. Limited experimentation suggests that 100,000 is an acceptable compromise. If the last character of a block is a control character then the routine loops adding another character to the block until the last character is not a control character. This ensure that no sequence of control characters is split across two blocks. The routine first loops replacing CR LF CR LF by CR LF until there are no blank lines. The routine then looks for LFs not preceeded by CRs. Any that are found are replaced by spaces. On a 63 Mb file with a large number of blank lines and extra LFs, the routine took 22 seconds to complete its task.

The only statements requiring change are at the top of the routine.

Option Explicit
Sub RemoveUnwantedCtrlChars()

  Dim Block As String
  Dim BlockLen As Long
  Dim FileIn As Object
  Dim FileInName As String
  Dim FileOut As Object
  Dim FileOutName As String
  Dim FSO As Object
  Dim PathCrnt As String
  Dim PosCRLF As Long
  Dim PosLF As Long
  Dim PosLastCRLF As Long
  Dim PosLastLF As Long
  Dim StartTime As Single

  StartTime = Timer

  ' ## This assumes the input file is in the same folder
  ' ## as the workbook containing this macro.
  PathCrnt = ActiveWorkbook.Path

  ' ###### Replace names as required.
  FileInName = "TestSplitLine In.txt"
  FileOutName = "TestSplitLine Out.txt"

  Set FSO = CreateObject("Scripting.FileSystemObject")
  BlockLen = 100000

  Set FileIn = FSO.OpenTextFile(PathCrnt & "\" & FileInName, 1, 0)
  '  1 = Read.  0 = ASCII file

  Set FileOut = FSO.CreateTextFile(PathCrnt & "\" & FileOutName, True, False)
  ' True = Can overwrite.  False = ASCII

  Do While FileIn.AtEndOfStream <> True
    Block = FileIn.Read(BlockLen)
    Do While True
      ' Ensure block not split in middle of a string of control characters
      If (Right(Block, 1) < " " Or Right(Block, 1) = Chr(127)) And _
                                         FileIn.AtEndOfStream <> True Then
        ' The last character of block is a control character.  Get another
        ' character
        Block = Block & FileIn.Read(1)
      Else
        Exit Do
      End If
    Loop
    ' Remove all blank lines
    Do While InStr(1, Block, vbCr & vbLf & vbCr & vbLf) <> 0
      Block = Replace(Block, vbCr & vbLf & vbCr & vbLf, vbCr & vbLf)
    Loop
    ' Find all lone LFs and replace by " "
    PosLF = 1
    PosCRLF = 1
    Do While True
      PosLastLF = PosLF
      PosLastCRLF = PosCRLF
      PosLF = InStr(PosLF, Block, vbLf)
      PosCRLF = InStr(PosCRLF, Block, vbCr & vbLf)
      If PosLF = 0 Then
        ' No more LFs in this block
        Exit Do
      ElseIf PosCRLF <> 0 And PosLF > PosCRLF Then
        ' Have LF of CR LF.  No action required
        PosLF = PosLF + 1
        PosCRLF = PosLF
      Else
        ' Have a lone LF
        Block = Mid(Block, 1, PosLF - 1) & " " & Mid(Block, PosLF + 1)
        ' Move CRLF pointer back to position of replaced LF
        PosCRLF = PosLF
      End If
    Loop
    PosLF = 1
    FileOut.write Block
  Loop

  FileIn.Close
  FileOut.Close

  Debug.Print Timer - StartTime

End Sub

这篇关于清理大文本文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆