Need to get HTML source as a string with CEFPython
Problem description
I am trying to get the HTML source of a web URL as a string using CEFPython.
I want to crawl the main frame's source content and get it as a string:
def save_screenshot(browser):
    # Browser object provides GetUserData/SetUserData methods
    # for storing custom data associated with browser. The
    # "OnPaint.buffer_string" data is set in RenderHandler.OnPaint.
    buffer_string = browser.GetUserData("OnPaint.buffer_string")
    if not buffer_string:
        raise Exception("buffer_string is empty, OnPaint never called?")
    mainFrame = browser.GetMainFrame()
    print("Main frame is ", mainFrame)
    # print("buffer string" ,buffer_string)
    # visitor object
    visitorObj = cef_string()
    temp = mainFrame.GetSource(visitorObj).GetString()
    print("temp : ", temp)
    visitorText = mainFrame.GetText(temp)
    siteHTML = mainFrame.GetSource(visitorText)
    print("siteHTML is ", siteHTML)
Problem: The code returns nothing for siteHTML.
Recommended answer
Your mainFrame.GetSource(visitor) call is asynchronous: the source is delivered later to the visitor object rather than returned by the call. Therefore you cannot call GetString() on its result.
This is the way to do it. Unfortunately, you need to think in an asynchronous manner:
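class Visitor(object):
    # GetSource() calls Visit() with the HTML once it is available.
    def Visit(self, value):
        print("This is the HTML source:")
        print(value)

# "browser" is the existing CEFPython browser object.
myvisitor = Visitor()
mainFrame = browser.GetMainFrame()
mainFrame.GetSource(myvisitor)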
One more thing to beware of: the visitor object myvisitor in the above example is passed to GetSource() as a weak reference. In other words, you must keep that object alive until the source has been passed back. If you put the last three lines of the above snippet in a function, you have to make sure the function does not return until the job is done.
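One way to satisfy the keep-alive requirement without blocking inside a function is to let a longer-lived object own the visitor. The following is a minimal sketch under stated assumptions, not the answer author's code: the class names SourceCollector and LoadHandler are hypothetical, and it assumes the handler is registered with browser.SetClientHandler() and receives OnLoadingStateChange, as in the cefpython screenshot example the question's code appears to be based on.

```python
class SourceCollector(object):
    """Visitor that stores the HTML it receives (hypothetical helper)."""

    def __init__(self):
        self.html = None

    def Visit(self, value):
        # Called later with the page source, not during GetSource().
        self.html = value
        print("Received %d characters of HTML" % len(value))


class LoadHandler(object):
    """Client handler that owns the visitor (hypothetical name)."""

    def __init__(self):
        # Strong reference: the visitor lives as long as this handler,
        # so the weak reference held by GetSource() stays valid.
        self.visitor = SourceCollector()

    def OnLoadingStateChange(self, browser, is_loading, **_):
        # Request the source only once the page has finished loading.
        if not is_loading:
            browser.GetMainFrame().GetSource(self.visitor)


# Assumed wiring, shown as comments because it needs a running CEF browser:
#   handler = LoadHandler()
#   browser.SetClientHandler(handler)
#   ... after the message loop has delivered the source:
#   siteHTML = handler.visitor.html
```

Because the handler itself must also stay referenced for as long as the browser is in use, the visitor it holds is still alive when the source arrives, and the string is then available as handler.visitor.html.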