Get element with a randomized class name


Question

It looks like the class name of the <img> elements on Instagram's web page changes every day. Right now it is FFVAD, and tomorrow it will be something else. For example (I made it shorter, links are long):

<img class="FFVAD" alt="Tag your best friend" decoding="auto" style="" sizes="293px" src="https://scontent-lax3-2.cdninstagram.com/vp/0436c00a3ac9428b2b8c977b45abd022/5BAB3EBC/t51.2885-15/s640x640/sh0.08/e35/33110483_592294374461447_8669459880035221504_n.jpg">

That means I have to keep fixing the script and hardcoding the new class name in order to be able to scrape the web page.

var = driver.find_elements_by_class_name('FFVAD')

Somebody told me that I could use img.get_attribute('class') to find the class name and store it for later. But I still don't understand how to do this, so that Selenium or Beautiful Soup could grab the class name from the HTML tag and store or parse it later.

All I have now is this. It's a little dirty, and not right, but the idea is there.

import time
import selenium.webdriver as webdriver

url = 'https://www.instagram.com/kitties'
scroll_delay = 2  # seconds to wait after each scroll (assumed value; not defined in the original)

driver = webdriver.Firefox()
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Hardcoded class name: this is the part that breaks when the name changes
    imgs_dedupe = driver.find_elements_by_class_name('FFVAD')

    for img in imgs_dedupe:
        posts = img.get_attribute('class')
        print(posts)

    # Scroll to the bottom and stop once the page no longer grows
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

When I run it, I get the output below. There are 3 images on the page, so the class name is printed 3 times.

python tag_print.py 
FFVAD
FFVAD
FFVAD

Solution

You're currently searching for the element by a hardcoded class name.

If the class name is randomized, you cannot hardcode it any longer. You have to either:

  • Search the element by some other characteristics (e.g. element hierarchy, some other attributes, etc; XPath can do that)

    In [10]: driver.find_elements_by_xpath('//article//img')
    Out[10]:
    [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="55c48964-8cd0-4472-b35b-214a5a9bfbf7")>,
     <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="b7f7c8a4-e343-49ca-b416-49f72e67ae07")>,
     <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="728f6148-6a03-4c9a-9933-36859d65eb51")>]
    

    • You can also search by the element's visual characteristics: size, visibility, position. This cannot be done by XPath alone, though; you'll have to get all <img> tags and inspect each one with JS by hand.
      (See the example below, because it's long.)
  • Learn this class name somehow from other page logic (it must be present somewhere else if the page's logic itself can find and use it, and that logic must be found by something else, etc etc)

    In this case, the class name is a part of a local variable in the renderImage function, so it's only salvageable via DOM by exploring its AST. The function itself is buried somewhere inside webpack machinery (it seems to pack all resources into a few global objects with one-letter names). Alternatively, you can read all included JS files as raw data and look for the definition of renderImage in them. So, in this case, it's disproportionately hard, though still theoretically possible.
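The first option also covers the original question of how to learn the current class name at run time: locate one image by its place in the hierarchy, read its class attribute, and reuse that value. The following is a minimal sketch that runs without a browser, using only Python's standard library on a made-up, well-formed HTML fragment. The markup, the PAGE constant and the current_class name are assumptions for illustration, not Instagram's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (an assumption,
# not Instagram's real HTML). The class name is treated as unknown.
PAGE = """
<html><body>
  <img class="logo" src="logo.png"/>
  <article>
    <div><a href="/p/1"><img class="FFVAD" src="1.jpg"/></a></div>
    <div><a href="/p/2"><img class="FFVAD" src="2.jpg"/></a></div>
    <div><a href="/p/3"><img class="FFVAD" src="3.jpg"/></a></div>
  </article>
</body></html>
"""

root = ET.fromstring(PAGE)

# Locate the images by hierarchy only: any <img> below an <article>.
images = root.findall(".//article//img")

# Read the class name from the first match instead of hardcoding it.
current_class = images[0].get("class")

# Reuse the discovered name, e.g. to select every element carrying it.
same_class = root.findall(".//*[@class='{}']".format(current_class))

print(current_class)
print([img.get("src") for img in same_class])
```

With Selenium, the same idea would be to call img.get_attribute('class') on an element returned by the XPath lookup. Once the elements can be found structurally, though, the class name is often no longer needed at all. Note also that Selenium 4.x releases dropped the find_elements_by_* helpers; there the lookup is written driver.find_elements(By.XPATH, '//article//img'), with By imported from selenium.webdriver.common.by.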


Example of getting elements by visual characteristics

On any page whatsoever, this would find 3 images of the same size, located side by side (this is the way they are at https://www.instagram.com/kitties).

Since HTMLElements can't be passed to Python directly (at least, I couldn't find any way to), we need to pass some unique IDs instead to locate them by, such as unique XPaths.

(The JS code could probably be more elegant; I don't have much experience with the language.)

In [22]: script = """
  //https://stackoverflow.com/questions/2661818/javascript-get-xpath-of-a-node/43688599#43688599
  function getXPathForElement(element) {
      const idx = (sib, name) => sib 
          ? idx(sib.previousElementSibling, name||sib.localName) + (sib.localName == name)
          : 1;
      const segs = elm => !elm || elm.nodeType !== 1 
          ? ['']
          : elm.id && document.querySelector(`#${elm.id}`) === elm
              ? [`id("${elm.id}")`]
              : [...segs(elm.parentNode), `${elm.localName.toLowerCase()}[${idx(elm)}]`];
      return segs(element).join('/');
  }

  //https://plainjs.com/javascript/styles/get-the-position-of-an-element-relative-to-the-document-24/
  function offsetTop(el){
    return window.pageYOffset + el.getBoundingClientRect().top;
  }

  var expected_images=3;
  var found_groups=new Map();
  for (const e of document.getElementsByTagName('img')) {
    let group_id = e.offsetWidth + "x" + e.offsetHeight;
    if (!(found_groups.has(group_id))) found_groups.set(group_id,[]);
    found_groups.get(group_id).push(e);
  }
  for (const [k,v] of found_groups) {
    if (v.length != expected_images) {found_groups.delete(k);continue;}
    var offset_top = offsetTop(v[0]);
    for (const e of v){
      let _c_oft = offsetTop(e);
      if (_c_oft !== offset_top){
        found_groups.delete(k);
        break;
      }
    }
  }
  if (found_groups.size != 1) {
    console.log(found_groups);
    throw 'Unexpected pattern of images after filtering';
  }

  var found_group = found_groups.values().next().value;


  var result = [];
  for (const e of found_group) {
    result.push(getXPathForElement(e));
  }
  return result;
"""

In [23]: d.execute_script(script)
Out[23]:
[u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[1]/a[1]/div[1]/div[1]/img[1]',
 u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[2]/a[1]/div[1]/div[1]/img[1]',
 u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[3]/a[1]/div[1]/div[1]/img[1]']

In [27]: [d.find_element_by_xpath(xp) for xp in _]
Out[27]:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="55c48964-8cd0-4472-b35b-214a5a9bfbf7")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="b7f7c8a4-e343-49ca-b416-49f72e67ae07")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="728f6148-6a03-4c9a-9933-36859d65eb51")>]
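The filtering logic of the script can be expressed in Python as well. The sketch below works on plain dictionaries so that it runs without a browser. The find_image_row name, the dictionary format and the sample measurements are made up for illustration. With Selenium, the width, height and top values could presumably be taken from each element's size and location properties.

```python
from collections import defaultdict


def find_image_row(images, expected_images=3):
    """Return the single group of same-sized images that share one top offset.

    `images` is a list of dicts with 'width', 'height' and 'top' keys
    (a hypothetical format chosen for this sketch).
    """
    # Group the images by their rendered size.
    groups = defaultdict(list)
    for img in images:
        groups[(img["width"], img["height"])].append(img)

    # Keep only groups of the expected size whose members sit on one line.
    candidates = [
        group for group in groups.values()
        if len(group) == expected_images
        and all(img["top"] == group[0]["top"] for img in group)
    ]

    if len(candidates) != 1:
        raise ValueError("Unexpected pattern of images after filtering")
    return candidates[0]


# Made-up measurements: one avatar plus a row of three thumbnails.
sample = [
    {"id": "avatar", "width": 150, "height": 150, "top": 40},
    {"id": "img1", "width": 293, "height": 293, "top": 400},
    {"id": "img2", "width": 293, "height": 293, "top": 400},
    {"id": "img3", "width": 293, "height": 293, "top": 400},
]

row = find_image_row(sample)
print([img["id"] for img in row])
```

As in the JS version, anything other than exactly one matching group is treated as an error rather than guessed at, so a change in the page layout fails loudly instead of silently returning the wrong images.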
