No data collected when I extract info from a website using xpath

Problem description

I need to extract information from a website. The information sits inside the following markup:

<div class="accordion-block__question">
<div class="accordion-block__text">Server</div></div>
...
<div class="block__col"><b>Country</b></div>

Running

try:
    # Country
    c = driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]").get_attribute('textContent')
    country.append(c)
except:
    country.append("Error")

I end up with a df that contains only "Error" values. I'm interested in all the fields (though fixing just one would be great), including the Trustscore (a number), but I don't know whether it is possible to get it. I'm using Selenium with the Chrome web driver. The website is https://www.scamadviser.com/check-website.

CODE

This is the entire code:

def scam(df):
    chrome_options = webdriver.ChromeOptions()

    trust=[]
    country = [] 
    isp_country = [] 
        
    query=df['URL'].unique().tolist() 
    driver=webdriver.Chrome('mypath',chrome_options=chrome_options)
    
    for x in query:
        
        wait = WebDriverWait(driver, 10)
        response=driver.get('https://www.scamadviser.com/check-website/'+x)
        
        try: 
            wait = WebDriverWait(driver, 30)
            # missing trustscore

            # Country
            c=driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]")).get_attribute('innerText')
            country.append(c)  

            # ISP country
            ic=driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'ISP')]").get_attribute('innerText')
            isp_country.append(ic)
        
        except: 
            # missing trustscore
            country.append("Error")
            isp_country.append("Error")
            

    # Create dataframe
    dict = {'URL': query, 'Trustscore':trust, 'Country': country, 'ISP': isp_country} 
    df=pd.DataFrame(dict)

    driver.quit()
    
    return df

You can try it with, for example, df['URL'] equal to

stackoverflow.com
gitHub.com

Solution

You are looking for innerText, not textContent.

Code:

try:
    # Country
    c = driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]").get_attribute('innerText')
    print(c)
    country.append(c)
except:
    country.append("Error")

Update 1:

If the locator you already use is correct, scroll the element first:

driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", driver.find_element_by_xpath("//div[contains(@class,'block__col') and contains(text(),'Country')]"))

Or try both options with this XPath:

//div[contains(@class,'block__col')]/b[text()='Country']
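A possible reason the original locator finds nothing: in the markup shown in the question, the word Country is a text node of the <b> child, not of the <div> itself, so text() evaluated on the <div> has nothing to match. The sketch below illustrates this with Python's standard-library xml.etree.ElementTree on a hand-written fragment modelled on the question's snippet. The fragment and the "United States" value are assumptions, not the live page, and ElementTree supports only a subset of XPath, so the expressions differ slightly from the Selenium ones.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment modelled on the markup in the question
# (an assumption, not the real page source).
FRAGMENT = """
<root>
  <div class="block__col"><b>Country</b></div>
  <div class="block__col">United States</div>
</root>
"""

root = ET.fromstring(FRAGMENT)
label_div = root.find(".//div[@class='block__col']")

# The <div> has no direct text node: its only child is <b>.
own_text = label_div.text
# The text belongs to the <b> element inside it.
child_text = label_div.find("b").text
# Text of the whole subtree, roughly what textContent would report.
subtree_text = "".join(label_div.itertext())

# Matching on the <b> element's own text finds the label.
matches = root.findall(".//div[@class='block__col']/b[.='Country']")

print(repr(own_text), repr(child_text), repr(subtree_text), len(matches))
```

Under these assumptions, moving the text test from the <div> to the <b> child is what makes the second XPath match.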

Update 2:

try:
    wait = WebDriverWait(driver, 30)
    # missing trustscore

    # Country
    time.sleep(2)
    ele = driver.find_element_by_xpath("//div[contains(@class,'block__col')]/b[text()='Country']")
    driver.execute_script("arguments[0].scrollIntoView(true);", ele)
    country.append(ele.get_attribute('innerText'))

    time.sleep(2)
    # ISP country
    ic = driver.find_element_by_xpath("//div[contains(@class,'block__col')]/b[text()='ISP']")
    # scroll the ISP element itself into view (the original scrolled ele again)
    driver.execute_script("arguments[0].scrollIntoView(true);", ic)
    isp_country.append(ic.get_attribute('innerText'))
except:
    country.append("Error")
    isp_country.append("Error")
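The snippet above creates a WebDriverWait but still relies on fixed time.sleep calls. An explicit wait instead polls a condition until it succeeds or a timeout expires. The following is a minimal standard-library sketch of that polling idea only; wait_until, its parameters, and the simulated find_country lookup are hypothetical names for illustration, not Selenium's API or implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper for illustration; raises TimeoutError on expiry.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page: the "element" only appears on the third lookup.
attempts = {"count": 0}


def find_country():
    attempts["count"] += 1
    return "Country" if attempts["count"] >= 3 else None


found = wait_until(find_country, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

With Selenium, this role is played by WebDriverWait(driver, 20).until(...) combined with an expected condition such as EC.visibility_of_element_located, which returns as soon as the element is visible rather than after a fixed delay.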

Update 3:

To get the Country name under Company data, use this XPath:

//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div

Also, make sure of a few things before using this XPath:

  1. Launch the browser in full-screen mode.
  2. Scroll using JS, and then use scroll into view or an Actions chain.

Code:

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.maximize_window()
time.sleep(2)
driver.execute_script("window.scrollTo(0, 1000)")
time.sleep(2)
driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']"))))

# now use the mentioned xpath
company_data_country_name = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div")))
print(company_data_country_name.text)
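If several fields are needed (the question mentions Country and ISP), the XPath used above could be generated from a section title and a label instead of being typed out each time. A small sketch, assuming other fields follow the same layout as the Company data / Country example, which the source does not confirm; value_xpath is a hypothetical helper name.

```python
def value_xpath(section: str, label: str) -> str:
    """Build the XPath of the value cell next to a <b>label</b> in a section.

    Hypothetical helper; assumes every field uses the same layout as the
    'Company data' / 'Country' example. Quotes in arguments are not escaped.
    """
    return (
        f"//div[text()='{section}']/../following-sibling::div"
        f"/descendant::b[text()='{label}']/../following-sibling::div"
    )


country_xpath = value_xpath("Company data", "Country")
print(country_xpath)
```

The returned string can be passed to EC.visibility_of_element_located((By.XPATH, country_xpath)) in place of the literal. Any other section and label pair would need checking against the actual page first.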
