Scrape the text below a header inside a list item into a column with BeautifulSoup and pandas

Views: 50 Published: 2020/9/20 8:11:01 Tags: python html pandas web-scraping beautifulsoup

Problem description

I am trying to scrape and store some items using BeautifulSoup and pandas. The code below only partially works. As you can see it scrapes 'Engine426/425 HP' whereas I only want the string '426/425 HP' to be stored in the 'engine' column. I would like to scrape all 4 h5 strings in the HTML below (Please refer to the desired output below). I hope someone can help me out, thanks!

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

main_url = "https://www.example.com/"

def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return(soup)

soup = getAndParseURL(main_url)

engine = []

engine.append(soup.find("ul", class_ = re.compile('list-inline lot-breakdown-list')).li.text)

scraped_data = pd.DataFrame({'engine': engine})

scraped_data.head()

              engine
0   Engine426/425 HP

HTML

<div class="lot-breakdown">
    <ul class="list-inline lot-breakdown-list">
        <li>
            <h5>Engine</h5>426/425 HP</li>
        <li>
            <h5>Trans</h5>Automatic</li>
        <li>
            <h5>Color</h5>Alpine White</li>
        <li>
            <h5>Interior</h5>Black</li>
    </ul>
</div>

Desired output

scraped_data[['engine', 'trans', 'color', 'interior']] = pd.DataFrame([['426/425 HP', 'Automatic', 'Alpine White', 'Black']], index=scraped_data.index)
scraped_data

              engine        trans          color  interior
0         426/425 HP    Automatic   Alpine White     Black

Recommended answer

You can achieve this in several ways:

    from bs4 import BeautifulSoup, NavigableString
    import pandas as pd
    import requests

    main_url = "https://www.example.com/"

    def getAndParseURL(url):
        result = requests.get(url)
        soup = BeautifulSoup(result.text, 'html.parser')
        return soup

    soup = getAndParseURL(main_url)
    items = soup.select('ul[class="list-inline lot-breakdown-list"] li')

    # Other ways to get only the li's own text, skipping the text of the <h5> child:
    #for li in items:
         #x = li.find(string=True, recursive=False)  # first direct text node of the li (whitespace-only if the markup is indented)
         #y = ' '.join(t for t in li.contents if type(t) == NavigableString).strip()  # join all direct text nodes; the wanted text is a NavigableString

    # The wanted text is the last child of each li. li.contents[1] only works
    # when there is no whitespace between <li> and <h5>; with the indented HTML
    # shown in the question, contents[1] is the <h5> tag itself.
    lis_e = [li.contents[-1].strip() for li in items]

    # These lists were not defined in the original snippet.
    engine, trans, color, interior = [], [], [], []
    engine.append(lis_e[0])
    trans.append(lis_e[1])
    color.append(lis_e[2])
    interior.append(lis_e[3])

    # Column names follow the desired output in the question ('trans', not 'transmission').
    scraped_data = pd.DataFrame({'engine': engine, 'trans': trans, 'color': color, 'interior': interior})
    scraped_data
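
A variation on the same idea, as a minimal sketch rather than part of the original answer: instead of relying on the position of each list item, it uses the `<h5>` text as the column name, so the code does not depend on the order of the four items. The helper name `breakdown_to_row` and the `HTML` constant are hypothetical; the markup is copied from the question so the sketch runs without a network request.

```python
import pandas as pd
from bs4 import BeautifulSoup

# HTML copied from the question (assumption: the live page has the same shape).
HTML = """
<div class="lot-breakdown">
    <ul class="list-inline lot-breakdown-list">
        <li>
            <h5>Engine</h5>426/425 HP</li>
        <li>
            <h5>Trans</h5>Automatic</li>
        <li>
            <h5>Color</h5>Alpine White</li>
        <li>
            <h5>Interior</h5>Black</li>
    </ul>
</div>
"""

def breakdown_to_row(html):
    """Map each lowercased <h5> label to the text that follows it in its <li>."""
    soup = BeautifulSoup(html, "html.parser")
    row = {}
    for li in soup.select("ul.lot-breakdown-list li"):
        header = li.find("h5")
        if header is None:
            continue  # skip list items that have no label
        label = header.get_text(strip=True).lower()
        # Only the direct text nodes of the <li>, so the <h5> text is excluded.
        value = "".join(li.find_all(string=True, recursive=False)).strip()
        row[label] = value
    return row

row = breakdown_to_row(HTML)
scraped_data = pd.DataFrame([row])  # one row, one column per label
print(scraped_data)
```

When scraping several pages, collecting one such dictionary per page and passing the whole list to `pd.DataFrame` would give one row per page.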
