Python - Extracting data between specific comment nodes with BeautifulSoup 4

Question

I'm looking to pick out specific data from a website, such as prices and company info. Luckily, the website designer has added lots of marker comments such as

<!-- Begin Services Table -->
' desired data
<!-- End Services Table -->

What kind of code would I need in order for BS4 to return the strings between the given comments?

import requests
from bs4 import BeautifulSoup

url = "http://www.100ll.com/searchresults.php?clear_previous=true&searchfor=" + 'KPLN' + "&submit.x=0&submit.y=0"

response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

text_list = soup.find(id="framediv").find_all(text=True)
start_index = text_list.index(' Begin Fuel Information Table ') + 1
end_index = text_list.index(' End Fuel Information Table ')
for item in text_list[start_index:end_index]:
    print(item)

This is the website in question:

http://www.100ll.com/showfbo.php?HashID=cf5f18404c062da6fa11e3af41358873

Recommended answer

If you want to select the table element that follows one of those specific comments, you can select all the comment nodes, filter them by the desired text, and then select the next sibling table element:

import requests
from bs4 import BeautifulSoup, Comment

# `url` is the same URL defined in the question.
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

# Collect every comment node in the document.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for comment in comments:
    # Comment text keeps its surrounding whitespace, so strip before comparing.
    if comment.strip() == 'Begin Services Table':
        table = comment.find_next_sibling('table')
        print(table)
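
The comment-then-sibling lookup above can be tried offline before pointing it at the live page. The following is a minimal sketch, not the answerer's code: `SAMPLE_HTML` is made-up markup standing in for the real site, `table_after_comment` is a hypothetical helper name, and the built-in `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup standing in for the real page (an assumption, not site data).
SAMPLE_HTML = """
<div id="framediv">
<!-- Begin Services Table -->
<table><tr><td>Fuel</td><td>5.25</td></tr></table>
<!-- End Services Table -->
</div>
"""


def table_after_comment(soup, marker):
    """Return the first <table> sibling after the comment whose stripped text equals marker."""
    comment = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == marker
    )
    if comment is None:
        return None  # no such comment in the document
    return comment.find_next_sibling("table")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = table_after_comment(soup, "Begin Services Table")

# Pull the cell text out of the matched table.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```

Returning `None` when the marker is missing lets the caller decide how to handle a page whose layout has changed, rather than failing inside the helper.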

Alternatively, if you want to get all the data between those two comments, you can find the first comment and then iterate over its next siblings until you reach the closing comment:

import requests
from bs4 import BeautifulSoup, Comment

# `url` is the same URL defined in the question.
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

data = []

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment.strip() == 'Begin Services Table':
        # Walk the following siblings and stop at the closing comment.
        for node in comment.next_siblings:
            if isinstance(node, Comment) and node.strip() == 'End Services Table':
                break
            data.append(node)

print(data)
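
The same sibling walk can be wrapped in a small function and checked against inline markup. This is only a sketch under stated assumptions: `SAMPLE_HTML` is invented for illustration, `nodes_between_comments` is a hypothetical helper name, and it assumes both comments share the same parent element, since `next_siblings` never leaves that parent.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Hypothetical markup standing in for the real page (an assumption, not site data).
SAMPLE_HTML = """
<div>
<!-- Begin Services Table -->
<p>Hangar</p>
<p>Catering</p>
<!-- End Services Table -->
<p>Unrelated</p>
</div>
"""


def nodes_between_comments(soup, start, end):
    """Collect the sibling nodes between two comments, matched by stripped text."""
    begin = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == start
    )
    if begin is None:
        return []  # opening comment not found
    nodes = []
    for node in begin.next_siblings:
        if isinstance(node, Comment) and node.strip() == end:
            break  # reached the closing comment
        nodes.append(node)
    return nodes


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
nodes = nodes_between_comments(soup, "Begin Services Table", "End Services Table")

# The walk also yields whitespace-only strings; keep just the element nodes.
texts = [n.get_text(strip=True) for n in nodes if isinstance(n, Tag)]
print(texts)
```

Filtering on `Tag` at the end drops the newline strings that sit between elements, which is usually what is wanted when only the tabular data matters.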
