Decoding class names on Facebook through Selenium
Question
I noticed that Facebook has some odd class names that look computer-generated. What I don't know is whether these classes stay constant over time or change at some interval. Maybe someone with experience of this can answer. The only thing I can see is that when I exit Chrome and open it again the names are still the same, so at least they don't change with every browser session.
So I'd guess the best way to scrape Facebook is to anchor on elements of the user interface and assume the structure is always the same. For example, to get the address from the About section, something like this:
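from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("C:/chromedriver.exe"))
driver.get("https://www.facebook.com/pg/Burma-Superstar-620442791345784/about/?ref=page_internal")
# wait some time
# Anchor on visible UI text ("FIND US", "Get Directions") instead of generated class names
address_elements = driver.find_elements(By.XPATH, "//span[text()='FIND US']/../following-sibling::div//button[text()='Get Directions']/../../preceding-sibling::div[1]/div/span")
for item in address_elements:
    print(item.text)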
Recommended answer
You are largely correct. Facebook is built with ReactJS, which is evident from the presence of the following keywords and tags within the HTML DOM:
{"react_render":true,"reflow":true}
["React-prod"]
["ReactDOM-prod"]
ReactComposerTaggerType:{r:["t5r69"],be:1}
So the dynamically generated class names are bound to change at some interval.
The solution is to use static attributes, such as visible text, to construct a dynamic locator strategy.
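As a rough illustration of that idea, the sketch below builds XPath locators from visible text rather than from generated class names. The helper names (`xpath_literal`, `by_visible_text`, `nth_following`) are made up for this example and are not part of Selenium; the functions only assemble strings, which can then be passed to Selenium with `By.XPATH`.

```python
def xpath_literal(text: str) -> str:
    """Quote text for use inside an XPath 1.0 expression."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote kinds present: XPath 1.0 has no escape character,
    # so split on single quotes and rejoin the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def by_visible_text(tag: str, text: str) -> str:
    """Locate an element by its whitespace-normalized visible text."""
    return f"//{tag}[normalize-space()={xpath_literal(text)}]"


def nth_following(anchor_xpath: str, tag: str, n: int) -> str:
    """Select the n-th element with the given tag that follows the anchor."""
    return f"{anchor_xpath}//following::{tag}[{n}]"


if __name__ == "__main__":
    anchor = by_visible_text("span", "FIND US")
    print(nth_following(anchor, "span", 2))
```

A locator built this way still breaks if the visible label or the page layout changes, so it reduces the dependence on generated class names without removing the fragility altogether.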
To retrieve the first line of the address just below the text FIND US, use WebDriverWait together with the expected_conditions method visibility_of_element_located(). With WebDriverWait, expected_conditions (as EC), and By imported, you can use the following solution:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[normalize-space()='FIND US']//following::span[2]"))).text)
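For completeness, here is a minimal sketch of the same explicit wait wrapped in a function, with the imports it needs. The names `FIND_US_XPATH` and `first_address_line` are hypothetical, and the sketch assumes a `driver` that has already loaded the page's About tab. The Selenium imports sit inside the function only so that the locator constant can be reused without Selenium installed.

```python
# XPath taken from the answer: second <span> following the "FIND US" label
FIND_US_XPATH = "//span[normalize-space()='FIND US']//following::span[2]"


def first_address_line(driver, timeout=20):
    """Wait until the address element is visible, then return its text.

    Raises selenium.common.exceptions.TimeoutException if the element
    does not become visible within `timeout` seconds.
    """
    # Imported here (an assumption of this sketch, not a requirement)
    # so the module can be loaded even where Selenium is absent.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.XPATH, FIND_US_XPATH))
    )
    return element.text.strip()
```

Usage would look like `print(first_address_line(driver))` after `driver.get(...)`. Waiting for visibility rather than mere presence matters here because `.text` returns only text that is actually rendered.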
Note: Scraping Facebook violates section 3.2.3 of their Terms of Service, and you may be questioned or even end up in Facebook Jail. Use the Facebook Graph API instead.
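As a hedged sketch of what the Graph API route might look like, the snippet below only builds a request URL for a Page's `location` field. Several things here are assumptions: that the number at the end of the page URL in the question (620442791345784) is the Page ID, that you hold a valid access token with permission to read the Page, and that an unversioned request is acceptable. The function name `page_location_url` is made up for this example.

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"


def page_location_url(page_id: str, access_token: str) -> str:
    """Build a Graph API URL that asks for a Page's `location` field."""
    # `fields` limits the response to the requested fields;
    # `access_token` authenticates the request.
    query = urlencode({"fields": "location", "access_token": access_token})
    return f"{GRAPH_BASE}/{page_id}?{query}"


if __name__ == "__main__":
    # "YOUR_ACCESS_TOKEN" is a placeholder, not a working credential.
    print(page_location_url("620442791345784", "YOUR_ACCESS_TOKEN"))
```

Fetching that URL with any HTTP client should return JSON if the token has the required access; which permissions are needed depends on the app and is not covered here.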