How to scrape a particular text from PDF using Selenium with VBA

Views: 85 | Published: 2021/5/5 20:06:05 | Tags: excel, vba, selenium, web-scraping

Question

I am working on an automation project that opens a browser, visits a URL, logs in, clicks a few links, and finally clicks a link that opens a PDF file in the browser itself. Now I want to get one line of the PDF into Excel as a string.

I used the code below, courtesy of its author on GitHub. With it I can only scrape the first line of the PDF. The PDFs I work with are dynamic: sometimes the information I need is on the 5th line, sometimes on the 25th, and so on.

I hope I have explained this clearly; pardon any errors.

Private Sub Handle_PDF_Chrome()
    Dim driver As New ChromeDriver
    driver.Get "http://static.mozilla.com/moco/en-US/pdf/mozilla_privacypolicy.pdf"

    ' Return the first line using the plugin API (asynchronous).
    ' The script selects all text in the PDF viewer, asks for the selection,
    ' and hands the first line of the reply to Selenium's callback.
    Const JS_READ_PDF_FIRST_LINE_CHROME As String = _
        "addEventListener('message',function(e){" & _
        " if(e.data.type=='getSelectedTextReply'){" & _
        "  var txt=e.data.selectedText;" & _
        "  callback(txt && txt.match(/^.+$/m)[0]);" & _
        " }" & _
        "});" & _
        "plugin.postMessage({type:'initialize'},'*');" & _
        "plugin.postMessage({type:'selectAll'},'*');" & _
        "plugin.postMessage({type:'getSelectedText'},'*');"

    ' Assert the first line
    Dim firstline
    firstline = driver.ExecuteAsyncScript(JS_READ_PDF_FIRST_LINE_CHROME)
    Assert.Equals "Websites Privacy Policy", firstline

    driver.Quit
End Sub

Solution

Assuming your code works, you need to change the regex and the index.

The regex becomes

[^\r\n]+

to retrieve all lines (ignoring empty lines) when matched with the g flag. The resulting array is zero-based, so you index with 4 to get line 5.
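To make the indexing concrete, here is a minimal sketch of the same regex applied outside the browser. The names nthLine and sampleText are hypothetical, and the sample text is invented to stand in for e.data.selectedText. Unlike the one-liner in the answer, it returns null instead of throwing when the text is missing or has too few lines.

```javascript
// Invented sample text standing in for e.data.selectedText.
// It mixes \r\n and \n line endings and contains empty lines.
const sampleText =
  "Title\r\n\r\nIntro\nSection A\n\nSection B\r\nTarget line\nFooter";

// Return the n-th non-empty line (1-based), or null when the text is
// missing or has fewer than n non-empty lines.
function nthLine(txt, n) {
  // With the g flag, match returns every run of non-newline characters,
  // or null when there is no match at all.
  const lines = (txt || "").match(/[^\r\n]+/g) || [];
  return n >= 1 && n <= lines.length ? lines[n - 1] : null;
}

console.log(nthLine(sampleText, 5));  // Target line
console.log(nthLine(sampleText, 25)); // null
```

Note that empty lines are not counted, so "line 5" here means the fifth non-empty line of the selected text.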

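Regex explanation: [^\r\n] matches any single character other than a carriage return or a line feed, and + repeats it one or more times, so each match is one non-empty line. The full script: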

addEventListener('message', function(e) {
  if (e.data.type == 'getSelectedTextReply') {
    var txt = e.data.selectedText;
    callback(txt && txt.match(/[^\r\n]+/g)[4]);
  }
});
plugin.postMessage({type:'initialize'},'*');
plugin.postMessage({type:'selectAll'},'*');
plugin.postMessage({type:'getSelectedText'},'*');
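Since the question says the wanted line moves around (sometimes the 5th, sometimes the 25th), a fixed index may not be enough. One possible variation, not part of the original answer, is to pick the line by a keyword it is known to contain. The sketch below assumes such a keyword exists; findLine, pdfText and the label "Invoice No" are hypothetical names used only for illustration.

```javascript
// Return the first non-empty line that contains `keyword`,
// or null if the text is missing or no line contains it.
function findLine(txt, keyword) {
  const lines = (txt || "").match(/[^\r\n]+/g) || [];
  const hit = lines.find(line => line.includes(keyword));
  return hit === undefined ? null : hit;
}

// Invented sample text; "Invoice No" is a hypothetical label.
const pdfText =
  "Header\n\nCustomer: ACME\r\nInvoice No: 12345\nTotal: 10.00";

console.log(findLine(pdfText, "Invoice No")); // Invoice No: 12345
console.log(findLine(pdfText, "Missing"));    // null
```

Inside the message listener, the same logic would replace the indexed match, for example by defining findLine in the script and calling callback(findLine(e.data.selectedText, 'Invoice No')).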
