如何读取没有列线或正确缩进的文本文件 [英] How to read a text file without column lines or properly indented
问题描述
我有一个文本文件,它是从选举卷pdf文件创建的,其中3个人数据放在同一行。
我想得到每个来自文本文件的人员数据。我面临的问题是,在人行的名称之后,如果名称太长,则一行保持为空。在文本文件中假设下一行有第三人名,那么如何让读者识别数据是第三人并相应地分配它。
任何帮助这将是样本数据:
I have a text file which is created from electoral roll pdf file in which 3 persons data is placed on a same line.
I would like to get each persons data from the text file. Problem I m facing here is that after the name of person line, one line is kept empty incase the name is too long. Here in text file suppose 3rd persons name comes on next line then how to make reader identify that the data is of 3rd person and assign it accordingly.
Any help on this matter will be helfull.
Here is the sample data:
1 EPIC NO: XYZZ989898 2 EPIC NO: XYZZ989898 3 EPIC NO: XYZZ989898
Name : abcd xyz Name : abcd xyz Name : abcd lmno
xyz
Husband's abcdefghijklm xyz Father's abcd xyz Father's abcd xyz
Name: Name: Name:
House No: - House No: - House No: -
Age: 44 Sex: Female Age: 24 Sex: Male Age: 21 Sex: Female
这里是编码,我必须指定2个变量值(r和l),以允许读者将其识别为列的边界,但它不起作用,因为不同的文件将具有不同的缩进,我将不得不一次又一次地指定它。
here is the coding which i have to specify 2 variables value (r and l) to allow reader to identify it as boundries for columns but it is not working as different files will have different indentations and i'll have to specify it again and again.
If System.IO.File.Exists(FILE_NAME) = True Then
Dim objReader As New System.IO.StreamReader(FILE_NAME)
Dim lines As String() = IO.File.ReadAllLines(FILE_NAME)
Dim rsltstr As String = ""
Dim rsltstr1 As String = ""
Dim rsltstr2 As String = ""
Dim SrNo As Integer
Dim SrNo1 As Integer
Dim SrNo2 As Integer
Dim EPICNo As String
Dim EPICNo1 As String
Dim EPICNo2 As String
Dim age As Integer
Dim age1 As Integer
Dim age2 As Integer
Dim sex As String = ""
Dim sex1 As String = ""
Dim sex2 As String = ""
Dim nm As String = ""
Dim nm1 As String = ""
Dim nm2 As String = ""
Dim hno As String
Dim hno1 As String
Dim hno2 As String
Dim space() As Char = {" "}
Dim st() As Char = {"E ", "S ", "M ", "Q ", "R ", "# "}
For i = 0 To lines.Count - 1
If lines(i).Contains("EPIC NO") Then
If lines(i).Length > r Then
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.Replace("EPIC NO", "")
rsltstr = rsltstr.Replace(":", ">")
Dim sridinfo As String() = rsltstr.Split(">")
sridinfo(0) = sridinfo(0).TrimStart(space)
sridinfo(0) = sridinfo(0).TrimStart(st)
SrNo = sridinfo(0).TrimEnd(space)
sridinfo(1) = sridinfo(1).TrimStart(space)
EPICNo = sridinfo(1).TrimEnd(space)
rsltstr1 = rsltstr1.Replace("EPIC NO", "")
rsltstr1 = rsltstr1.Replace(":", ">")
Dim sridinfo1 As String() = rsltstr1.Split(">")
sridinfo1(0) = sridinfo1(0).TrimStart(space)
sridinfo1(0) = sridinfo1(0).TrimStart(st)
SrNo1 = sridinfo1(0).TrimEnd(space)
sridinfo1(1) = sridinfo1(1).TrimStart(space)
EPICNo1 = sridinfo1(1).TrimEnd(space)
rsltstr2 = rsltstr2.Replace("EPIC NO", "")
rsltstr2 = rsltstr2.Replace(":", ">")
Dim sridinfo2 As String() = rsltstr2.Split(">")
sridinfo2(0) = sridinfo2(0).TrimStart(space)
sridinfo2(0) = sridinfo2(0).TrimStart(st)
SrNo2 = sridinfo2(0).TrimEnd(space)
sridinfo2(1) = sridinfo2(1).TrimStart(space)
EPICNo2 = sridinfo2(1).TrimEnd(space)
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
i = i + 1
If Not lines(i).Contains("Name") Then
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.TrimStart(space)
nm = rsltstr.TrimEnd(space)
rsltstr1 = rsltstr1.TrimStart(space)
nm1 = rsltstr1.TrimEnd(space)
rsltstr2 = rsltstr2.TrimStart(space)
nm2 = rsltstr2.TrimEnd(space)
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
i = i + 1
End If
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.Replace("Name", "")
rsltstr = rsltstr.Replace(":", "")
rsltstr = rsltstr.TrimStart(space)
nm = nm + " " + rsltstr.TrimEnd(space)
nm = nm.TrimStart(space)
nm = nm.TrimEnd(space)
rsltstr1 = rsltstr1.Replace("Name", "")
rsltstr1 = rsltstr1.Replace(":", "")
rsltstr1 = rsltstr1.TrimStart(space)
nm1 = nm1 + " " + rsltstr1.TrimEnd(space)
nm1 = nm1.TrimStart(space)
nm1 = nm1.TrimEnd(space)
rsltstr2 = rsltstr2.Replace("Name", "")
rsltstr2 = rsltstr2.Replace(":", "")
rsltstr2 = rsltstr2.TrimStart(space)
nm2 = nm2 + " " + rsltstr2.TrimEnd(space)
nm2 = nm2.TrimStart(space)
nm2 = nm2.TrimEnd(space)
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
i = i + 1
If Not lines(i) = "" Then
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.TrimStart(space)
nm = nm + " " + rsltstr.TrimEnd(space)
nm = nm.TrimStart(space)
nm = nm.TrimEnd(space)
rsltstr1 = rsltstr1.TrimStart(space)
nm1 = nm1 + " " + rsltstr1.TrimEnd(space)
nm1 = nm1.TrimStart(space)
nm1 = nm1.TrimEnd(space)
rsltstr2 = rsltstr2.TrimStart(space)
nm2 = nm2 + " " + rsltstr2.TrimEnd(space)
nm2 = nm2.TrimStart(space)
nm2 = nm2.TrimEnd(space)
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
End If
i = i + 1
Do While Not lines(i).Contains("House No")
i = i + 1
Loop
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.Replace("House No", "")
rsltstr = rsltstr.Replace(":", "")
rsltstr = rsltstr.TrimStart(space)
rsltstr = rsltstr.TrimEnd(space)
If rsltstr.Length > 10 Then
hno = rsltstr.Substring(rsltstr.Length - 10)
Else
hno = rsltstr
End If
rsltstr1 = rsltstr1.Replace("House No", "")
rsltstr1 = rsltstr1.Replace(":", "")
rsltstr1 = rsltstr1.TrimStart(space)
rsltstr1 = rsltstr1.TrimEnd(space)
If rsltstr1.Length > 10 Then
hno1 = rsltstr1.Substring(rsltstr1.Length - 10)
Else
hno1 = rsltstr1
End If
rsltstr2 = rsltstr2.Replace("House No", "")
rsltstr2 = rsltstr2.Replace(":", "")
rsltstr2 = rsltstr2.TrimStart(space)
rsltstr2 = rsltstr2.TrimEnd(space)
If rsltstr2.Length > 10 Then
hno2 = rsltstr2.Substring(rsltstr2.Length - 10)
Else
hno2 = rsltstr2
End If
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
i = i + 1
If Not lines(i).Contains("Age") Then
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.TrimStart(space)
rsltstr = rsltstr.TrimEnd(space)
If rsltstr.Length > 10 Then
hno = hno + " " + rsltstr.Substring(rsltstr.Length - 10)
hno = hno.TrimStart(space)
hno = hno.TrimEnd(space)
Else
hno = hno + " " + rsltstr
hno = hno.TrimStart(space)
hno = hno.TrimEnd(space)
End If
rsltstr1 = rsltstr1.TrimStart(space)
rsltstr1 = rsltstr1.TrimEnd(space)
If rsltstr1.Length > 10 Then
hno1 = hno1 + " " + rsltstr1.Substring(rsltstr1.Length - 10)
hno1 = hno1.TrimStart(space)
hno1 = hno1.TrimEnd(space)
Else
hno1 = hno1 + " " + rsltstr1
hno1 = hno1.TrimStart(space)
hno1 = hno1.TrimEnd(space)
End If
rsltstr2 = rsltstr2.TrimStart(space)
rsltstr2 = rsltstr2.TrimEnd(space)
If rsltstr2.Length > 10 Then
hno2 = hno2 + " " + rsltstr2.Substring(rsltstr2.Length - 10)
hno2 = hno2.TrimStart(space)
hno2 = hno2.TrimEnd(space)
Else
hno2 = hno2 + " " + rsltstr2
hno2 = hno2.TrimStart(space)
hno2 = hno2.TrimEnd(space)
End If
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
i = i + 1
End If
If lines(i).Contains("Age") Then
If lines(i).Length < l Then
rsltstr = lines(i).Substring(0, lines(i).Length)
Else
rsltstr = lines(i).Substring(0, l)
End If
If lines(i).Length > l And lines(i).Length < r Then
rsltstr1 = lines(i).Substring(l, (lines(i).Length - l))
ElseIf lines(i).Length >= r Then
rsltstr1 = lines(i).Substring(l, l)
End If
If lines(i).Length > r Then
rsltstr2 = lines(i).Substring(lines(i).Length - (lines(i).Length - r))
End If
rsltstr = rsltstr.Replace("Age", "")
rsltstr = rsltstr.Replace(":", "")
rsltstr = rsltstr.Replace("Sex", "-")
Dim agsx As String() = rsltstr.Split("-")
agsx(0) = agsx(0).TrimStart(space)
age = agsx(0).TrimEnd(space)
agsx(1) = agsx(1).TrimStart(space)
agsx(1) = agsx(1).TrimEnd(space)
If agsx(1).Contains("Female") And agsx(1).Length >= 6 Then
sex = "Female"
Else
sex = "Male"
End If
rsltstr1 = rsltstr1.Replace("Age", "")
rsltstr1 = rsltstr1.Replace(":", "")
rsltstr1 = rsltstr1.Replace("Sex", "-")
Dim agsx1 As String() = rsltstr1.Split("-")
agsx1(0) = agsx1(0).TrimStart(space)
age1 = agsx1(0).TrimEnd(space)
agsx1(1) = agsx1(1).TrimStart(space)
agsx1(1) = agsx1(1).TrimEnd(space)
If agsx1(1).Contains("female") And agsx1(1).Length >= 6 Then
sex1 = "Female"
Else
sex1 = "Male"
End If
rsltstr2 = rsltstr2.Replace("Age", "")
rsltstr2 = rsltstr2.Replace(":", "")
rsltstr2 = rsltstr2.Replace("Sex", "-")
Dim agsx2 As String() = rsltstr2.Split("-")
agsx2(0) = agsx2(0).TrimStart(space)
age2 = agsx2(0).TrimEnd(space)
agsx2(1) = agsx2(1).TrimStart(space)
agsx2(1) = agsx2(1).TrimEnd(space)
If agsx2(1).Contains("female") And agsx2(1).Length >= 6 Then
sex2 = "Female"
Else
sex2 = "Male"
End If
rsltstr = ""
rsltstr1 = ""
rsltstr2 = ""
End If
next
To convert pdf to txt file i use pdf2text pilot software.
To convert pdf to txt file i use pdf2text pilot software.
推荐答案
As Richard MacCutchan says, its a matter of reading each line one at a time ... and the good bit is, you have the keywords to tell you where the information should be - so you need to write a parser/state machine looking for the ’tokens’ which would be for example ’Name :’. You can extract the name data from the name line, but I would also be marking/noting the start column positions. The End column positions are the next Token Start position minus 1, remembering <eol> (end-of-line) is also a token
The Name second line isnt hard - there are no ’House No:’ tokens on that line, so it must be the continuation part of the Name line - so, you have the column positions for the name data, the first and second columns are empty, the third column contains ’xyz’, which must be added to the data for the third name
I think you need a generalised set of routines to handle this, in the guise of a parser/state machine
As Richard MacCutchan says, its a matter of reading each line one at a time ... and the good bit is, you have the keywords to tell you where the information should be - so you need to write a parser/state machine looking for the 'tokens' which would be for example 'Name :'. You can extract the name data from the name line, but I would also be marking/noting the start column positions. The End column positions are the next Token Start position minus 1, remembering <eol> (end-of-line) is also a token
The Name second line isnt hard - there are no 'House No:' tokens on that line, so it must be the continuation part of the Name line - so, you have the column positions for the name data, the first and second columns are empty, the third column contains 'xyz', which must be added to the data for the third name
I think you need a generalised set of routines to handle this, in the guise of a parser/state machine
这篇关于如何读取没有列线或正确缩进的文本文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!