How to scrape a particular text from PDF using Selenium with VBA
Problem description
I am working on an automation project that opens a browser, visits a URL, logs in, clicks a few links, and finally clicks a link that opens a PDF file in the browser itself. Now I want to get one line from the PDF into Excel (as a string).
I have used the code below, courtesy of its author on GitHub. With this code I am only able to scrape the first line of the PDF. The PDF I use is dynamic: sometimes the information I need is on the 5th line, sometimes on the 25th line, and so on.
I hope I have explained this clearly; please pardon any errors.
Private Sub Handle_PDF_Chrome()
    Dim driver As New ChromeDriver
    driver.Get "http://static.mozilla.com/moco/en-US/pdf/mozilla_privacypolicy.pdf"

    ' Return the first line using the plugin API (asynchronous).
    Const JS_READ_PDF_FIRST_LINE_CHROME As String = _
        "addEventListener('message',function(e){" & _
        " if(e.data.type=='getSelectedTextReply'){" & _
        "  var txt=e.data.selectedText;" & _
        "  callback(txt && txt.match(/^.+$/m)[0]);" & _
        " }" & _
        "});" & _
        "plugin.postMessage({type:'initialize'},'*');" & _
        "plugin.postMessage({type:'selectAll'},'*');" & _
        "plugin.postMessage({type:'getSelectedText'},'*');"

    ' Assert the first line
    Dim firstline
    firstline = driver.ExecuteAsyncScript(JS_READ_PDF_FIRST_LINE_CHROME)
    Assert.Equals "Websites Privacy Policy", firstline

    driver.Quit
End Sub
Assuming your code works, you need to change the regex and the index.
The regex becomes
[^\r\n]+
which, used with the g flag, retrieves all lines (ignoring empty lines). You then index with 4 to get line 5, because the resulting array is zero-based.
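Regex explanation: [^\r\n]+ matches one or more consecutive characters that are neither a carriage return (\r) nor a line feed (\n), so each match is one non-empty line. The g flag makes match() return every such line as an array instead of only the first one.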
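As a quick check of what the pattern returns, here is a minimal sketch that can be run outside the browser. The sample text is made up for illustration and only stands in for the text the PDF viewer would hand back.

```javascript
// Made-up sample text standing in for the PDF's selected text.
// It mixes \r\n and \n line endings and contains blank lines.
const sampleText = "Title\r\n\r\nFirst paragraph\nSecond paragraph\r\n\nThird\nFourth";

// With the g flag, match() returns every run of characters that
// contains no CR or LF, so blank lines never appear in the result.
const lines = sampleText.match(/[^\r\n]+/g);

console.log(lines.length); // 5
console.log(lines[4]);     // Fourth (the 5th non-empty line)
```

Note that blank lines are skipped, so index 4 is the fifth non-empty line, which may differ from the fifth line as it appears on the page.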
The script passed to driver.ExecuteAsyncScript then becomes:
addEventListener('message', function(e) {
  if (e.data.type == 'getSelectedTextReply') {
    var txt = e.data.selectedText;
    callback(txt && txt.match(/[^\r\n]+/g)[4]);
  }
});
plugin.postMessage({type:'initialize'}, '*');
plugin.postMessage({type:'selectAll'}, '*');
plugin.postMessage({type:'getSelectedText'}, '*');
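Since the question says the wanted line moves around (line 5 in one PDF, line 25 in another), a fixed index may not be enough. One possible variation, sketched here under the assumption that the wanted line always contains a known keyword, is to search the lines instead of indexing them. The helper name findLine, the keyword "Order ID", and the sample text are all hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: return the first non-empty line that contains
// `keyword`, or null when the text is empty or no line matches.
function findLine(text, keyword) {
  // Fall back to an empty array so a null match() result cannot throw.
  const lines = (text || "").match(/[^\r\n]+/g) || [];
  const hit = lines.find(function (line) {
    return line.includes(keyword);
  });
  return hit === undefined ? null : hit;
}

// Made-up sample text standing in for the PDF's selected text.
const pageText = "Report\r\n\r\nDate: 2020-01-01\nOrder ID: 12345\nTotal: 10";
console.log(findLine(pageText, "Order ID")); // Order ID: 12345
```

Inside the message listener, the indexed call would then be swapped for something like callback(findLine(e.data.selectedText, 'Order ID')), with the function definition included in the same script string. Returning null rather than throwing lets the VBA side decide what to do when the keyword is missing.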