Extract the information in a div class to a json object (or data frame)


Problem Description


For each row in the table on this page, I would like to click on the ID (e.g. the ID of row 1 is 270516746) and extract/download the information (which does NOT have the same headers for each row) into some form of python object, ideally either a json object, or a dataframe (json is probably easier).

I've gotten to the point where I can get to the table I want to pull down:

import os
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import sys

driver = webdriver.Chrome()
driver.get('http://mahmi.org/explore.php?filterType=&filter=&page=1')

# find the table with ID, Sequence, Bioactivity and Similarity
element = driver.find_elements_by_css_selector('table.table-striped tr')
for row in element[1:2]:  # change this, only for testing
    id, seq, bioact, sim = row.text.split()

    # now I've made a list of each row's ID, sequence, bioactivity and similarity.
    # click on each ID to get the full data of each
    print(id)
    button = driver.find_element_by_xpath('//button[text()="270516746"]')  # this is one example, hard-coded
    button.click()

    # then pull down all the info to a json file?
    full_table = driver.find_element_by_xpath('.//*[@id="source-proteins"]')
    print(full_table)

And then I'm stuck on what's probably the very last step: I can't find how to call something like '.to_json()' or '.to_dataframe()' once the button has been clicked in the line above.

If someone could advise I would appreciate it.
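For reference, one way to avoid the hard-coded button XPath above would be to build it from each row's own ID; a minimal sketch, assuming each row's text splits into exactly the four fields used above:

for row in element[1:]:
    row_id, seq, bioact, sim = row.text.split()
    # build the XPath from the row's own ID instead of hard-coding "270516746"
    button = driver.find_element_by_xpath(f'//button[text()="{row_id}"]')
    button.click()
    # (the modal opened by the click may need to be dismissed before the next row)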

Update 1: Deleted and incorporated into above.

Update 2: Further to the suggestion below to use BeautifulSoup, my issue is how to navigate to the 'modal-body' class of the pop-up window and then use BeautifulSoup:

    # then pull down all the info to a json file?
    full_table = driver.find_element_by_class_name("modal-body")
    soup = BeautifulSoup(full_table, 'html.parser')
    print(soup)

returns the error:

    soup = BeautifulSoup(full_table,'html.parser')
  File "/Users/kela/anaconda/envs/selenium_scripts/lib/python3.6/site-packages/bs4/__init__.py", line 287, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'WebElement' has no len()
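The error is because BeautifulSoup expects a string of markup, not a Selenium WebElement. A minimal fix, sketched under that assumption, is to hand it the element's HTML via get_attribute:

from bs4 import BeautifulSoup

full_table = driver.find_element_by_class_name("modal-body")
inner_html = full_table.get_attribute('innerHTML')  # the modal's markup as a string
soup = BeautifulSoup(inner_html, 'html.parser')
print(soup)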

Update 3: Then I tried to scrape the page using ONLY beautifulsoup:

from bs4 import BeautifulSoup 
import requests

url = 'http://mahmi.org/explore.php?filterType=&filter=&page=1'
html_doc = requests.get(url).content
soup = BeautifulSoup(html_doc, 'html.parser')
container = soup.find("div", {"class": "modal-body"})
print(container)

and it prints:

<div class="modal-body">
<h4><b>Reference information</b></h4>
<p>Id: <span id="info-ref-id">XXX</span></p>
<p>Bioactivity: <span id="info-ref-bio">XXX</span></p>
<p><a id="info-ref-seq">Download sequence</a></p><br/>
<h4><b>Source proteins</b></h4>
<div id="source-proteins"></div>
</div>

But this is not the output that I want, as it's not printing the nested layers of data (e.g. there is more info beneath the source-proteins div, which requests sees as empty because that content is loaded by JavaScript after the page loads).
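Since the modal's contents are injected by JavaScript, one workaround (a sketch combining the two attempts above) is to let Selenium render the page and then parse the live DOM from driver.page_source:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://mahmi.org/explore.php?filterType=&filter=&page=1')

# click an ID so the browser's JavaScript fills in the modal
driver.find_element_by_xpath('//button[text()="270516746"]').click()

# page_source reflects the DOM after the click, unlike requests' static HTML
# (a short wait may be needed here for the XHR to finish)
soup = BeautifulSoup(driver.page_source, 'html.parser')
container = soup.find("div", {"class": "modal-body"})
print(container)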

Update 4: When I add the following to the original code above (before the updates):

import json

full_table = driver.find_element_by_class_name("modal-body")
with open('test_outputfile.json', 'w') as output:
    json.dump(full_table, output)

The output is 'TypeError: Object of type 'WebElement' is not JSON serializable', which I'm trying to figure out now.
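This is the same underlying issue as in Update 2: json.dump needs plain Python data, not a WebElement. A sketch that dumps strings extracted from the element instead:

import json

full_table = driver.find_element_by_class_name("modal-body")
# serialize strings pulled out of the element, not the element itself
data = {'html': full_table.get_attribute('innerHTML'),
        'text': full_table.text}
with open('test_outputfile.json', 'w') as output:
    json.dump(data, output)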

Update 5: Trying to copy this approach, I added:

full_div = driver.find_element_by_css_selector('div.modal-body')
for element in full_div:
    new_element = element.find_element_by_css_selector('<li>Investigation type: metagenome</li>')
    print(new_element.text)

(where I added the li element just to see if it would work), but I get the error:

Traceback (most recent call last):
  File "scrape_mahmi.py", line 28, in <module>
    for element in full_div:
TypeError: 'WebElement' object is not iterable
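find_element (singular) returns a single WebElement, which is not iterable; find_elements (plural) returns a list. The selector also has to be a CSS expression, not a literal HTML fragment. A corrected sketch:

# find_elements (plural) returns an iterable list
full_divs = driver.find_elements_by_css_selector('div.modal-body')
for element in full_divs:
    # a CSS selector, not an HTML snippet
    for li in element.find_elements_by_css_selector('li'):
        print(li.text)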

Update 6: I tried looping through ul/li elements, because I saw that what I wanted was li text embedded in a ul, inside a li, inside a ul, inside a div; so I tried:

html_list = driver.find_elements_by_tag_name('ul')
for each_ul in html_list:
    items = each_ul.find_elements_by_tag_name('li')
    for item in items:
        next_ul = item.find_elements_by_tag_name('ul')
        for each_ul in next_ul:
            next_li = each_ul.find_elements_by_tag_name('li')
            for each_li in next_li:
                print(each_li.text)

There's no error for this, I just get no output.
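A likely reason for the empty output is timing: the modal's list items are filled in by an XHR after the click, so the loops run before the elements exist. A sketch that waits for them (assuming the li elements land inside the modal-body div):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the modal's list items to appear
wait = WebDriverWait(driver, 10)
items = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.modal-body li'))
)
for item in items:
    print(item.text)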

Solution

I do not know if you found the answer, but I was talking about the approach where Selenium is not required: you can hit the XHR that loads each peptide's details into the modal box. Be careful though, this is just a rough outline; you still need to put the items into a JSON dump or whichever format you like. Here is my approach.

import pandas as pd
import requests
from xml.etree import ElementTree as et


url = "http://mahmi.org/explore.php?filterType=&filter=&page=1"
html = requests.get(url).content

# read_html parses every <table> on the page; the peptide table is the last one
df_list = pd.read_html(html)
df = df_list[-1]

headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}

pep_ids = df['ID'].tolist()
#pep_ids = ['270516746','268297434'] ## You can use this first to check output

# this is the XHR endpoint the modal box calls, one request per peptide ID
base_url = 'http://mahmi.org/api/peptides/sourceProteins/'
for pep_id in pep_ids:
    final_url = base_url + str(pep_id)
    page = requests.get(final_url, headers=headers)
    # the endpoint returns XML; walk the tree and print every tag/text pair
    tree = et.fromstring(page.content)
    for child in tree.iter('*'):
        print(child.tag, child.text)
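To finish the JSON dump the question asked for, one option is to convert each XML response with xmltodict; a sketch reusing pep_ids, base_url and headers from above, with a hypothetical output filename:

import json
import xmltodict

# collect each peptide's source-protein record as nested dicts, keyed by ID
results = {}
for pep_id in pep_ids:
    page = requests.get(base_url + str(pep_id), headers=headers)
    # xmltodict turns the XML response into dicts/lists that json.dump accepts
    results[pep_id] = xmltodict.parse(page.content)

with open('peptides.json', 'w') as output:
    json.dump(results, output, indent=2)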
