How to extract text within a string of text


Problem description


I have a simple problem that I'm hoping to resolve without using VBA, but if that's the only way it can be solved, so be it.

I have a file with multiple rows (all one column). Each row has data that looks something like this:

1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom

2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate

etc...

What I want is some way to extract the strings of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers; for others it'll be just one.

I'd like the output to look something like this:

297848936,297338191

168010496

etc...

Solution

Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts inside parentheses) and joins them with whatever separator string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/

You would call it like this, assuming that first string is in A1:

=RegexExtract(A1,"gi[|](\d+)[|]")

Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question, this would give you this result:

297848936, 297338191
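If you want to sanity-check the pattern outside Excel first, here is a minimal sketch in Python. It uses Python's `re` module rather than the VBScript regex engine, so treat it only as a check of the pattern, not as part of the answer; the name `extract_gi` is made up for illustration, and the two rows are copied from the question.

```python
import re

# Same pattern as in the worksheet formula:
# the literal "gi|", a captured run of digits, then another "|".
GI_PATTERN = re.compile(r"gi[|](\d+)[|]")


def extract_gi(row, separator=", "):
    """Return every captured gi number found in row, joined by separator."""
    return separator.join(GI_PATTERN.findall(row))


rows = [
    "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom",
    "2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate",
]

for row in rows:
    print(extract_gi(row))
```

Note that "gb|23343|" is not picked up, because the pattern insists on the "gi|" prefix. A row with no gi numbers yields an empty string.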

Just fill the formula down the column and you're all done! Here is the function:

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

    Dim allMatches As Object
    Dim RE As Object
    Dim i As Long, j As Long
    Dim result As String

    ' Late-bound VBScript regex object, so no library reference is needed.
    Set RE = CreateObject("vbscript.regexp")
    RE.pattern = extract_what
    RE.Global = True    ' find every match, not just the first
    Set allMatches = RE.Execute(text)

    ' Append each captured sub-group, prefixed with the separator.
    For i = 0 To allMatches.count - 1
        For j = 0 To allMatches.Item(i).submatches.count - 1
            result = result & (separator & allMatches.Item(i).submatches.Item(j))
        Next
    Next

    ' Strip the leading separator that was added before the first sub-group.
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If

    RegexExtract = result

End Function
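The loop over sub-matches means the function is not limited to a single capture group. As a rough illustration of that behaviour, here is a Python sketch that mirrors the same logic. It is an analogue written for this note, not a translation the VBA depends on; `regex_extract` and the second pattern are hypothetical, and Python's regex syntax differs from VBScript's in some details.

```python
import re


def regex_extract(text, extract_what, separator=", "):
    """Join every captured sub-group of every match, in order of appearance."""
    parts = []
    for match in re.finditer(extract_what, text):
        # groups() holds one entry per pair of parentheses in the pattern;
        # a group that did not participate in the match becomes "".
        parts.extend(match.groups(default=""))
    return separator.join(parts)


sample = "1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom"

# One group per match: only the gi numbers.
print(regex_extract(sample, r"gi[|](\d+)[|]"))

# Two groups per match: the gi number and the database tag that follows it.
print(regex_extract(sample, r"gi[|](\d+)[|](\w+)[|]"))
```

With two groups in the pattern, each match contributes two items to the output, in the same way the inner loop over submatches does in the VBA function.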
