How to find all occurrences of a specific string in long text


Question

I have some long text (e.g. information about many books) in a single string, all on one line.

I want to find just the ISBNs (only the number; each number is preceded by the characters ISBN). I found code that extracts the number at the first position. The problem is how to loop over the whole text. Can I use a StreamReader for this example? Thank you for your answers.

Example:

Sub Main()
    Dim getLiteratura As String = "'Author 1. Name of book 1. ISBN 978-80-251-2025-5.', 'Author 2. Name of Book 2. ISBN 80-01-01346.', 'Author 3. Name of book. ISBN 80-85849-83.'"
    Dim test As Integer = getLiteratura.IndexOf("ISBN")
    Dim getISBN As String = getLiteratura.Substring(test + 5, getLiteratura.IndexOf(".", test + 1) - test - 5)

    Console.Write(getISBN)
    Console.ReadKey()
End Sub

Recommended answer

Since you can pass a start position to the IndexOf method, you can loop through the string by starting each search where the previous iteration left off. For instance:

Dim getLiteratura As String = "'Author 1. Name of book 1. ISBN 978-80-251-2025-5.', 'Author 2. Name of Book 2. ISBN 80-01-01346.', 'Author 3. Name of book. ISBN 80-85849-83.'"
Dim isbns As New List(Of String)()
Dim position As Integer = 0
While position <> -1
    position = getLiteratura.IndexOf("ISBN", position)
    If position <> -1 Then
        Dim endPosition As Integer = getLiteratura.IndexOf(".", position + 1)
        If endPosition <> -1 Then
            isbns.Add(getLiteratura.Substring(position + 5, endPosition - position - 5))
        End If
        position = endPosition
    End If
End While
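The same loop can be generalized to find every occurrence of any substring, not just the ISBN marker. The following is a minimal sketch rather than part of the original answer: the module name IndexOfDemo and the helper FindAllIndexes are hypothetical, and it assumes an ordinal (exact, culture-independent) comparison is what you want.

```vb
Imports System
Imports System.Collections.Generic

Module IndexOfDemo
    ' Hypothetical helper: returns the start index of every occurrence of value in text.
    Public Function FindAllIndexes(text As String, value As String) As List(Of Integer)
        Dim result As New List(Of Integer)()
        ' An empty search string would match at every position, so bail out early.
        If String.IsNullOrEmpty(value) Then Return result

        Dim position As Integer = text.IndexOf(value, StringComparison.Ordinal)
        While position <> -1
            result.Add(position)
            ' Resume the search just past the occurrence that was found.
            position = text.IndexOf(value, position + value.Length, StringComparison.Ordinal)
        End While
        Return result
    End Function

    Public Sub Main()
        Dim data As String = "'Author 1. Name of book 1. ISBN 978-80-251-2025-5.', 'Author 2. Name of Book 2. ISBN 80-01-01346.'"
        For Each index As Integer In FindAllIndexes(data, "ISBN")
            Console.WriteLine(index)
        Next
    End Sub
End Module
```

Advancing by value.Length skips overlapping matches; if overlapping occurrences matter for your data, advance by 1 instead.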

If the data is already all loaded into a string, that IndexOf loop is about as efficient a method as you are likely to find. However, it is not very readable or flexible. If those things concern you more than raw efficiency, you may want to consider using a regular expression (this requires Imports System.Text.RegularExpressions):

For Each i As Match In Regex.Matches(getLiteratura, "ISBN (?<isbn>.*?)\.")
    isbns.Add(i.Groups("isbn").Value)
Next

As you can see, not only is it much easier to read, it is also configurable. You could store the pattern externally in a resource, configuration file, database, etc.
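To illustrate the configurability point, here is a small sketch in which the pattern is a parameter instead of a hard-coded literal. The helper ExtractAll and the module name PatternDemo are hypothetical, and the stricter pattern (digits and hyphens only) is an assumption about what the ISBNs look like, not something the original answer specifies.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module PatternDemo
    ' Hypothetical helper: the pattern and group name are passed in, so they
    ' could just as well come from a config file, resource, or database.
    Public Function ExtractAll(text As String, pattern As String, groupName As String) As List(Of String)
        Dim results As New List(Of String)()
        For Each m As Match In Regex.Matches(text, pattern)
            results.Add(m.Groups(groupName).Value)
        Next
        Return results
    End Function

    Public Sub Main()
        Dim data As String = "'Author 1. Name of book 1. ISBN 978-80-251-2025-5.', 'Author 2. Name of Book 2. ISBN 80-01-01346.'"
        ' Assumed variant: only digits and hyphens count as part of the number,
        ' so the match ends at the first character that is neither.
        Dim pattern As String = "ISBN (?<isbn>[0-9-]+)"
        For Each isbn As String In ExtractAll(data, pattern, "isbn")
            Console.WriteLine(isbn)
        Next
    End Sub
End Module
```

Swapping in a different pattern changes what is extracted without touching the loop itself.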

If the data isn't already all loaded into a string, and efficiency is of the utmost concern, you may want to look into using a stream reader so that you only load a small subset of the data into memory at once. That logic would be a bit more complicated, but still not overly difficult.

Here's a simple example of how you could do it via a StreamReader:

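' Requires: Imports System.IO, System.Text, System.Text.RegularExpressions
' "stream" is assumed to be an already-open Stream containing the text.
Dim isbns As New List(Of String)()
Using reader As StreamReader = New StreamReader(stream)
    Dim builder As New StringBuilder()
    Dim isbnRegEx As New Regex("ISBN (?<isbn>.*?)\.")
    While Not reader.EndOfStream
        ' Read one character at a time; Read() returns -1 at the end of the stream.
        Dim charValue As Integer = reader.Read()
        If charValue <> -1 Then
            builder.Append(Convert.ToChar(charValue))
            Dim matches As MatchCollection = isbnRegEx.Matches(builder.ToString())
            If matches.Count <> 0 Then
                For Each i As Match In matches
                    isbns.Add(i.Groups("isbn").Value)
                Next
                ' A complete record was found, so reset the buffer.
                builder.Clear()
            End If
        End If
    End While
End Using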

As you can see, in that example, as soon as a match is found, it adds it to the list and then clears out the builder, which is being used as a buffer. That way, the amount of data held in memory at one time is never more than the size of one "record".
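For experimenting with the stream approach without a real file, the sketch below wraps the sample text in a MemoryStream. It is a variation under stated assumptions, not the original answer's code: the names StreamDemo, ReadIsbns, and source are hypothetical, and it runs the regex only when a period arrives, on the assumption that a record always ends with one.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module StreamDemo
    ' Hypothetical helper: reads the stream one character at a time and
    ' collects every ISBN it sees, keeping only a small buffer in memory.
    Public Function ReadIsbns(source As Stream) As List(Of String)
        Dim isbns As New List(Of String)()
        Dim isbnRegEx As New Regex("ISBN (?<isbn>.*?)\.")
        Dim builder As New StringBuilder()
        Using reader As New StreamReader(source)
            Dim charValue As Integer = reader.Read()
            While charValue <> -1
                builder.Append(Convert.ToChar(charValue))
                ' The pattern ends with a period, so a match can only complete
                ' right after one has been appended.
                If charValue = AscW("."c) Then
                    Dim m As Match = isbnRegEx.Match(builder.ToString())
                    If m.Success Then
                        isbns.Add(m.Groups("isbn").Value)
                        builder.Clear()
                    End If
                End If
                charValue = reader.Read()
            End While
        End Using
        Return isbns
    End Function

    Public Sub Main()
        Dim sample As String = "'Author 1. Name of book 1. ISBN 978-80-251-2025-5.', 'Author 2. Name of Book 2. ISBN 80-01-01346.'"
        ' A MemoryStream stands in for a file or network stream here.
        Using source As New MemoryStream(Encoding.UTF8.GetBytes(sample))
            For Each isbn As String In ReadIsbns(source)
                Console.WriteLine(isbn)
            Next
        End Using
    End Sub
End Module
```

Note that the buffer is only cleared on a match, so input with long stretches containing no ISBN would still grow it; trimming the buffer in that case is left out of this sketch.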

Since, based on your comments, you're having trouble getting it to work properly, here is a full working sample which outputs just the ISBN numbers, without any of the surrounding characters. Just create a new VB.NET console application and paste in the following code:

Imports System.Text.RegularExpressions

Module Module1
    Public Sub Main()
        Dim data As String = "'Author 1. Name of book 1. ISBN 978-80-251-2025-5.', 'Author 2. Name of Book 2. ISBN 80-01-01346.', 'Author 3. Name of book. ISBN 80-85849-83.'"
        For Each i As String In GetIsbns(data)
            Console.WriteLine(i)
        Next
        Console.ReadKey()
    End Sub

    Public Function GetIsbns(data As String) As List(Of String)
        Dim isbns As New List(Of String)()
        For Each i As Match In Regex.Matches(data, "ISBN (?<isbn>.*?)\.")
            isbns.Add(i.Groups("isbn").Value)
        Next
        Return isbns
    End Function
End Module
