Parsing HTML in an Excel cell into multiple cells
Question
I have some really messy-looking data in Excel: around 100 rows, each cell holding a block of HTML tags.
Here are two examples:
<ul class=""list-unstyled"">
<li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://google.org/"" target=""_blank"">Website</a></li>
<li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://www.google.com/"" target=""_blank"">Website 2</a></li>
<li><span title=""Product"" class=""glyphicon glyphicon-search text-gray""></span> <a href=""http://amazon.com"" target=""_blank"">Product</a></li>
<li><span title=""Product"" class=""glyphicon glyphicon-search text-gray""></span> <a href=""https://amazon.de/"" target=""_blank"">Product 2</a></li>
<li><span title=""Tags"" class=""glyphicon glyphicon glyphicon-tag text-gray""></span>
<small><span class=""label label-warning"">Available</span></small>
<small><span class=""label label-warning"">Country</span></small>
</li>
</ul>
or
<ul class=""list-unstyled"">
<li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://google.org/"" target=""_blank"">Website</a></li>
<li><span title=""Website"" class=""glyphicon glyphicon-link text-gray""></span> <a href=""https://www.google.com/"" target=""_blank"">Website 2</a></li>
<li><span title=""Product"" class=""glyphicon glyphicon-search text-gray""></span> <a href=""http://amazon.com"" target=""_blank"">Product</a></li>
<li><span title=""Tags"" class=""glyphicon glyphicon glyphicon-tag text-gray""></span>
<small><span class=""label label-warning"">Not Available</span></small>
<small><span class=""label label-warning"">State</span></small>
</li>
</ul>
My goal is to create a table that looks like this:
| Website 1 | Website 2 | Website 3 | Product 1 | Product 2 | Product 3 | Available | Country |
|---------------------|---------------------|-----------|-------------------|--------------------|-----------|---------------|---------|
| https://google.org/ | https://google.com/ | | http://amazon.com | https://amazon.de/ | | Available | Country |
| https://google.org/ | https://google.com/ | | http://amazon.com | | | Not Available | State |
I honestly have no clue how to approach this challenge. Any suggestions?
Recommended answer
The approach: write a function that takes the HTML code as a string parameter and returns a dictionary whose keys match your table headers. The body of the function is:
Function ParseHTML(ByVal str As String) As Scripting.Dictionary
    'str is passed ByVal because it is consumed piece by piece below
    Set ParseHTML = New Scripting.Dictionary
    Dim txt As String
    Dim website As Long: website = 0
    Dim product As Long: product = 0
    Dim i As Long: i = 0
    'note: the fixed offsets used below (+8, -11, +10, -13) assume attribute
    'values are wrapped in doubled quotes (""), exactly as in the sample data
    Do While True
        'get all text between <li> and </li> tags
        'then extract all data from it: title attribute and link
        txt = Mid(str, InStr(1, str, "<li>") + 4, InStr(1, str, "</li>") - InStr(1, str, "<li>") - 4)
        'select which case it is: website, product or tags
        Select Case Mid(txt, InStr(1, txt, "title") + 8, InStr(1, txt, "class") - InStr(1, txt, "title") - 11)
            Case "Website"
                website = website + 1
                'here you extract the link
                ParseHTML.Add "Website " & website, Mid(txt, InStr(1, txt, "<a href") + 10, InStr(1, txt, "target") - InStr(1, txt, "<a href") - 13)
            Case "Product"
                product = product + 1
                'here you extract the link
                ParseHTML.Add "Product " & product, Mid(txt, InStr(1, txt, "<a href") + 10, InStr(1, txt, "target") - InStr(1, txt, "<a href") - 13)
            Case "Tags"
                'if we reached Tags, all websites and products are done; tags need different processing
                Exit Do
        End Select
        'delete processed text
        str = Mid(str, InStr(1, str, "</li>") + 5)
    Loop
    'your table has 3 columns each for websites and products, so pad the missing ones with empty strings
    For i = website + 1 To 3
        ParseHTML.Add "Website " & i, ""
    Next i
    For i = product + 1 To 3
        ParseHTML.Add "Product " & i, ""
    Next i
    'now txt is the interior of the last <li></li> tag and we focus on what is
    'between the <small> and </small> tags
    'we don't need the str variable anymore, so we can re-use it
    str = Mid(txt, InStr(1, txt, "<small>") + 7, InStr(1, txt, "</small>") - InStr(1, txt, "<small>") - 7)
    'take the text after the opening <span ...> tag, dropping the 7 characters of the trailing </span>
    ParseHTML.Add "Available", Mid(str, InStr(1, str, ">") + 1, Len(str) - InStr(1, str, ">") - 7)
    'remove processed part of html
    txt = Mid(txt, InStr(1, txt, "</small>") + 8)
    'take care of last <small> tag
    str = Mid(txt, InStr(1, txt, "<small>") + 7, InStr(1, txt, "</small>") - InStr(1, txt, "<small>") - 7)
    ParseHTML.Add "Country", Mid(str, InStr(1, str, ">") + 1, Len(str) - InStr(1, str, ">") - 7)
End Function
So, to sum up, the function returns a dictionary with the keys "Website 1", "Website 2", "Website 3", "Product 1", "Product 2", "Product 3", "Available" and "Country".
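To check the function before running it over real data, a small test routine can feed it one hand-built string and print the resulting dictionary to the Immediate window. This is only a sketch: `TestParseHTML` is a hypothetical name, the sample HTML is shortened, and it assumes `ParseHTML` is in the same project, the Microsoft Scripting Runtime reference is set, and the cell text uses doubled quotes as in the examples.

```vba
Sub TestParseHTML()
    ' Hypothetical helper, not part of the original answer.
    ' q holds two literal quote characters, mimicking the doubled
    ' quotes ("") that appear in the sample cell contents.
    Dim q As String: q = """"""

    ' Shortened sample: one Website item followed by the Tags item.
    Dim html As String
    html = "<ul class=" & q & "list-unstyled" & q & ">" & _
           "<li><span title=" & q & "Website" & q & " class=" & q & "glyphicon" & q & "></span> " & _
           "<a href=" & q & "https://google.org/" & q & " target=" & q & "_blank" & q & ">Website</a></li>" & _
           "<li><span title=" & q & "Tags" & q & " class=" & q & "glyphicon" & q & "></span>" & _
           "<small><span class=" & q & "label" & q & ">Available</span></small>" & _
           "<small><span class=" & q & "label" & q & ">Country</span></small></li></ul>"

    Dim d As Object
    Set d = ParseHTML(html)

    ' Print every key/value pair; empty values show up as [].
    Dim k As Variant
    For Each k In d.Keys
        Debug.Print k & " = [" & d(k) & "]"
    Next k
End Sub
```

With this input, the routine should print eight lines, one per key, with "Website 1" holding https://google.org/ and the unused website and product slots left empty.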
Now, with that function in place, it's easy to fill the table you want. Here's one way of doing it:
Sub ProcessHTML()
    'determine last row in A column
    Dim lastRow As Long: lastRow = Cells(Rows.Count, 1).End(xlUp).Row
    Dim dict As Scripting.Dictionary
    Dim i As Long
    Dim j As Long
    For i = 2 To lastRow
        'parse HTML code with our function
        Set dict = ParseHTML(Cells(i, 1).Value)
        For j = 2 To 9
            'write corresponding values from dictionary to cells in table
            Cells(i, j).Value = dict(Cells(1, j).Value)
        Next j
        'get rid of object
        Set dict = Nothing
    Next i
End Sub
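It works with a sheet arranged as the code expects: the HTML sits in column A starting at row 2, and the headers are in row 1, columns B to I (the original answer showed a screenshot of the table already filled).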
It's very important that the header cells contain exactly these names, because the macro uses each header as the dictionary key to look up.
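Since the header text has to match the dictionary keys exactly, one option is to let a macro write the headers instead of typing them. The following is a minimal sketch; `WriteHeaders` is a hypothetical helper name, and it assumes the active sheet is the one holding the HTML, with row 1, columns B to I free for headers.

```vba
Sub WriteHeaders()
    ' Hypothetical helper: writes the eight headers into B1:I1 so that
    ' they match the keys returned by ParseHTML character for character.
    Dim headers As Variant
    headers = Array("Website 1", "Website 2", "Website 3", _
                    "Product 1", "Product 2", "Product 3", _
                    "Available", "Country")

    Dim j As Long
    For j = LBound(headers) To UBound(headers)
        ' The first array element goes to column 2 (B), the next to C, and so on.
        Cells(1, j - LBound(headers) + 2).Value = headers(j)
    Next j
End Sub
```

Run it once before `ProcessHTML`; using `LBound` keeps the column arithmetic correct whether or not the module sets `Option Base 1`.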
Important
Before running anything, go to Tools -> References in your VBA editor and, in the window that pops up, tick Microsoft Scripting Runtime.
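If setting the reference is not an option, a dictionary can also be created with late binding. The sketch below is an alternative, not what the answer's code does: `NewDictionary` and `LateBindingDemo` are hypothetical names, and to use this style with the code above you would have to declare the dictionary variables and the function's return type `As Object` instead of `As Scripting.Dictionary`.

```vba
' Late-bound sketch: works without a reference to Microsoft Scripting Runtime.
Function NewDictionary() As Object
    ' CreateObject resolves the class at run time instead of compile time.
    Set NewDictionary = CreateObject("Scripting.Dictionary")
End Function

Sub LateBindingDemo()
    Dim d As Object
    Set d = NewDictionary()
    d.Add "Website 1", "https://google.org/"
    ' Print the number of entries and whether the key is present.
    Debug.Print d.Count, d.Exists("Website 1")
End Sub
```

The trade-off is that late binding gives up IntelliSense and compile-time checking for the dictionary's members.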