How to scrape a particular text from PDF using Selenium with VBA


Problem description

I am working on an automation project that opens a browser, visits a URL, logs in, clicks a few links, and finally clicks a link that opens a PDF file in the browser itself. Now I want to copy one line from the PDF into Excel (as a string).

I have used the code below, courtesy of its author on GitHub. With it I can only scrape the first line of the PDF. The PDFs I work with are dynamic: sometimes the information I need is on the 5th line, sometimes on the 25th, and so on.

I hope I have explained this clearly; pardon any errors.

Private Sub Handle_PDF_Chrome()
    Dim driver As New ChromeDriver
    driver.Get "http://static.mozilla.com/moco/en-US/pdf/mozilla_privacypolicy.pdf"

    ' Return the first line using the plugin API (asynchronous).
    Const JS_READ_PDF_FIRST_LINE_CHROME As String = _
        "addEventListener('message',function(e){" & _
        " if(e.data.type=='getSelectedTextReply'){" & _
        "  var txt=e.data.selectedText;" & _
        "  callback(txt && txt.match(/^.+$/m)[0]);" & _
        " }" & _
        "});" & _
        "plugin.postMessage({type:'initialize'},'*');" & _
        "plugin.postMessage({type:'selectAll'},'*');" & _
        "plugin.postMessage({type:'getSelectedText'},'*');"

    ' Assert the first line
    Dim firstline
    firstline = driver.ExecuteAsyncScript(JS_READ_PDF_FIRST_LINE_CHROME)
    Assert.Equals "Websites Privacy Policy", firstline

    driver.Quit
End Sub

Solution

Assuming your code works, you need to change the regex and the index.

The regex becomes

[^\r\n]+

to retrieve all lines (ignoring empty lines) when used with the g flag. You then index with 4 to get line 5, because the array of matches is zero-based.
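To see the indexing at work outside the browser, here is a minimal sketch that runs the same pattern on a made-up string. The sample text is an assumption standing in for e.data.selectedText; it is not taken from the actual PDF.

```javascript
// Sample text standing in for e.data.selectedText (made-up content).
// It mixes \r\n and \n line endings and contains one empty line.
const txt = "Websites Privacy Policy\r\n\r\nLine two\nLine three\r\nLine four\nLine five";

// With the g flag, match() returns every run of characters that contains
// no CR or LF, i.e. every non-empty line.
const lines = txt.match(/[^\r\n]+/g);

// Arrays are zero-based, so index 4 is the fifth non-empty line.
const fifthLine = lines[4];
console.log(fifthLine);
```

Because empty lines never produce a match, the index counts non-empty lines only.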

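Regex explanation: [^\r\n]+ matches one or more consecutive characters that are neither a carriage return nor a line feed. With the g flag, match returns every such run as an array, so each element is one non-empty line.

The script passed to ExecuteAsyncScript then becomes:

addEventListener('message', function(e) {
  if (e.data.type == 'getSelectedTextReply') {
    var txt = e.data.selectedText;
    callback(txt && txt.match(/[^\r\n]+/g)[4]);
  }
});
plugin.postMessage({type:'initialize'},'*');
plugin.postMessage({type:'selectAll'},'*');
plugin.postMessage({type:'getSelectedText'},'*');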
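Since the question says the wanted line moves around (5th line in one PDF, 25th in another), a fixed index may not be enough. One possible variation, sketched below, is to pick the line by a keyword it is known to contain instead of by position. The helper name findLine, the keyword "Invoice No", and the sample text are all hypothetical and only illustrate the idea.

```javascript
// Hypothetical helper: return the first non-empty line that contains
// `keyword`, or null when the text is empty or no line matches.
function findLine(text, keyword) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const lines = (text || "").match(/[^\r\n]+/g) || [];
  const hit = lines.find(function (line) {
    return line.includes(keyword);
  });
  return hit === undefined ? null : hit;
}

// Sample text standing in for e.data.selectedText (made-up content).
const sample = "Header\r\n\r\nCustomer: ACME\nInvoice No: 12345\nTotal: 10.00";
console.log(findLine(sample, "Invoice No"));
```

Inside the message listener this would replace the indexed call, for example callback(findLine(e.data.selectedText, 'Invoice No')), with the function definition included in the same script string that VBA passes to ExecuteAsyncScript. Returning null rather than throwing lets the VBA side decide what to do when the line is missing.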
