If condition for adding information into a dataframe


Problem description

I need to create a dataframe with the following columns:

WEB | Country | Organisation

I'm extracting this information from a website. However, some of the domains I look up have no information on that website, which causes problems when I update the dataframe. Unfortunately, the code can only handle one website at a time, otherwise a captcha appears. The code below gives an idea of the output for a single website:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

element = []
country = []
organisation = []
frame_dict = {}

# Run with one site at a time: first 'stackoverflow.com', then 'livevsfox.ca'
websites = ['stackoverflow.com']

for x in websites:  # the loop is kept so that more sites can be handled in future
    element.append(x)

    chrome_options = webdriver.ChromeOptions()
    driver = webdriver.Chrome('path')  # 'path' stands for the chromedriver location

    try:
        driver.get('website/' + x)  # 'website/' stands for the lookup site's address
        wait = WebDriverWait(driver, 30)
        driver.execute_script("window.scrollTo(0, 1000)")

        try:
            # Heading element (updated after an answer from another post and the comment below)
            error = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.selection div.container h2")))
        except Exception:
            continue  # nothing is appended to country/organisation for this site

        # Country
        c = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Country']/../following-sibling::div"))).text
        country.append(c)

        # Organisation
        try:
            org = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Organisation']/../following-sibling::div"))).text
            organisation.append(org)
        except Exception:
            organisation.append("Data not available")

    except Exception:
        break

    finally:
        driver.quit()  # always close the browser, also after continue or break

frame_dict.update({'WEB': element, 'Organisation': organisation, 'Country': country})
df = pd.DataFrame.from_dict(frame_dict)

The code should do the following:

  • for x = stackoverflow.com (just an example of a working URL): open Chrome; if there is information, extract the organisation and country; if there is not, add 'Missing' to the dataframe; exit Chrome;
  • for x = livevsfox.ca: open Chrome; if there is information, extract the organisation and country; if there is not, add 'Missing' in the Organisation and Country columns; exit Chrome.

The expected output would then be:

WEB                      Country      Organisation
stackoverflow.com          US       Stack Exchange, Inc.
livevsfox.ca             Missing       Missing

In fact, livevsfox.ca returns the following message:

Sorry, livevsfox.ca could not be found or reached (error code 404)

This message does not appear when I look up stackoverflow.com. Since stackoverflow.com has a Country and an Organisation, I can add that information to the dataframe, but I can't do the same for livevsfox.ca. I think a possible solution could be the following:

  • check whether the h2 element contains the message above ("Sorry, x could not be found or reached (error code 404)"): this would mean that no information was detected for that web;
  • if the web has no information, add Missing (or NA, up to you) to the dataframe;
  • otherwise, the web has information (Owner and Country) to be added to the dataframe.
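The three steps above can be sketched without a browser as a small decision function. This is only an illustration: the names `NOT_FOUND`, `has_no_information` and `values_for` are hypothetical, and the pattern assumes the error heading always follows the wording quoted above.

```python
import re

# Assumed wording of the error heading:
# "Sorry, <site> could not be found or reached (error code 404)"
NOT_FOUND = re.compile(r"^Sorry, .+ could not be found or reached")


def has_no_information(heading_text):
    """Return True when the h2 text is the 'could not be found' message."""
    return bool(NOT_FOUND.search(heading_text.strip()))


def values_for(heading_text, country=None, organisation=None):
    """Pick the (country, organisation) pair to store for one site."""
    if has_no_information(heading_text):
        return ("Missing", "Missing")
    # Fall back to 'Missing' for any single field that was not scraped
    return (country or "Missing", organisation or "Missing")
```

With Selenium, `heading_text` would be the `.text` of the h2 element, and `country` / `organisation` the values scraped from the page when the heading is not the error message.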

I hope you can provide some help.

Recommended answer

I have found a solution to this problem.

First, I detect the h2 element as follows:

  message = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,"section.section div.container h2"))).text

Then I check whether message contains specific text, for example:

if 'Sorry,' in message:

If it does, I append 'Missing' to the lists that I will later add to the dataframe:

 organisation.append('Missing') 
 country.append('Missing')

Code:

# This snippet runs inside the loop over websites, so continue is valid here
try:
    message = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section.section div.container h2"))).text
    if 'Sorry,' in message:
        organisation.append('Missing')
        country.append('Missing')
except Exception:
    continue
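One way to keep the three columns the same length is to build one complete row per site and only then create the dataframe. The sketch below is an assumption-laden illustration, not part of the original answer: `COLUMNS`, `make_row` and `to_dataframe` are hypothetical names, the empty string stands in for whatever heading a working site returns, and the scraped values are passed in by hand instead of coming from Selenium.

```python
COLUMNS = ["WEB", "Country", "Organisation"]


def make_row(web, message, country=None, organisation=None):
    """Build one complete row so every column gets exactly one value per site."""
    if "Sorry," in message:
        # Error heading found: there is nothing to scrape for this site
        country, organisation = "Missing", "Missing"
    return {
        "WEB": web,
        "Country": country if country else "Missing",
        "Organisation": organisation if organisation else "Missing",
    }


def to_dataframe(rows):
    """Turn the collected rows into a dataframe (pandas is imported only here)."""
    import pandas as pd
    return pd.DataFrame(rows, columns=COLUMNS)


# Values taken from the expected output in the question
rows = [
    make_row("stackoverflow.com", "", "US", "Stack Exchange, Inc."),
    make_row("livevsfox.ca", "Sorry, livevsfox.ca could not be found or reached (error code 404)"),
]
```

Calling `to_dataframe(rows)` should then give one line per site, with 'Missing' in both columns for livevsfox.ca. Because each site contributes a whole row, a site that fails cannot leave one list shorter than the others.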
