Reading an HTML page using LibreOffice Basic

Views: 74 Published: 2021/7/17 18:45:54 Tags: html screen-scraping libreoffice-calc libreoffice-basic

Question

I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that will read the name of a noble House of Westeros from a cell (e.g. Stark), and output the Words of that House by looking it up on the relevant page on A Wiki of Ice and Fire. In pseudocode, it should work like this:

Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox-body"" // Finds the info box for the page.
Read Each Row in the table until Row begins Words
Read the contents of the next <td> tag, and return this as a string.

My problem is with the second line: I don't know how to read an HTML file. How should I do this in LibreOffice Basic?

Recommended answer

There are two main issues with this.

1. Performance: Your UDF will need to fetch the HTTP resource for every cell in which it is used.

2. HTML: Unfortunately, there is no HTML parser in OpenOffice or LibreOffice, only an XML parser. That is why we cannot parse the HTML directly in the UDF and have to fall back on string searching.

This will work, but it is slow and not very universal:

Public Function FETCHHOUSE(sHouse as String) as String

   sURL = "http://awoiaf.westeros.org/index.php/House_" & sHouse

   ' Open the page over HTTP and wrap it in a text stream
   oSimpleFileAccess = createUNOService("com.sun.star.ucb.SimpleFileAccess")
   oInpDataStream = createUNOService("com.sun.star.io.TextInputStream")
   on error goto falseHouseName
   oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sURL))
   on error goto 0
   ' With an empty delimiter list, readString reads up to the end of the stream
   dim delimiters() as long
   sContent = oInpDataStream.readString(delimiters(), false)
   oInpDataStream.closeInput()

   ' Cut out the infobox table
   lStartPos = instr(1, sContent, "<table class=" & chr(34) & "infobox infobox-body")
   if lStartPos = 0 then
     FETCHHOUSE = "no infobox on page"
     exit function
   end if
   lEndPos = instr(lStartPos, sContent, "</table>")
   sTable = mid(sContent, lStartPos, lEndPos-lStartPos + 8)

   ' Cut out the rest of the table row that contains "Words"
   lStartPos = instr(1, sTable, "Words")
   if lStartPos = 0 then
     FETCHHOUSE = "no Words on page"
     exit function
   end if
   lEndPos = instr(lStartPos, sTable, "</tr>")
   sRow = mid(sTable, lStartPos, lEndPos-lStartPos + 5)

   ' Find the opening <td ...> tag in that row with a regular expression
   oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
   oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
   oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
   oOptions.searchString = "<td[^<]*>"
   oTextSearch.setOptions(oOptions)
   oFound = oTextSearch.searchForward(sRow, 0, Len(sRow))
   if oFound.subRegExpressions = 0 then
     FETCHHOUSE = "Words header but no Words content on page"
     exit function
   end if
   ' endOffset is 0-based, instr and mid are 1-based, hence the + 1
   lStartPos = oFound.endOffset(0) + 1
   lEndPos = instr(lStartPos, sRow, "</td>")
   sWords = mid(sRow, lStartPos, lEndPos-lStartPos)

   FETCHHOUSE = sWords
   exit function

   ' Reached only if the page could not be opened
   falseHouseName:
   FETCHHOUSE = "House name does not exist"

End Function
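
One possible way around the performance issue is to not use the function in cell formulas at all, and instead run a macro once that writes the results into the sheet as plain text. The following is only a minimal sketch, not part of the tested code above. FillHouseWords is a made-up name, and it assumes the house names are in column A of the first sheet, starting in row 1 with no gaps, that the results should go into column B, and that FETCHHOUSE is in the same library.

```basic
' Hypothetical helper, an untested sketch.
' Assumption: house names are in column A of the first sheet, starting at
' row 1 without gaps, and FETCHHOUSE is defined in the same library.
Sub FillHouseWords
   Dim oSheet As Object
   Dim oCell As Object
   Dim lRow As Long

   oSheet = ThisComponent.Sheets.getByIndex(0)
   lRow = 0   ' the API counts rows from 0, so this is row 1
   Do
      oCell = oSheet.getCellByPosition(0, lRow)   ' column A
      If oCell.String = "" Then Exit Do            ' stop at the first empty cell
      ' Store the result as plain text in column B of the same row
      oSheet.getCellByPosition(1, lRow).String = FETCHHOUSE(oCell.String)
      lRow = lRow + 1
   Loop
End Sub
```

Each page is then fetched once per macro run, rather than whenever the cells are recalculated. The drawback is that the results do not update by themselves when a house name changes; you have to run the macro again.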

A better way would be to get the needed information from a Web API offered by the wiki, if there is one. Do you know the people behind the wiki? If so, you could suggest this to them.

Greetings

Axel
