How to extract the parent HTML tag in Python by matching a string

Question

I need to extract the parent tags in HTML by matching a string in the HTML. That is, I have many raw HTML sources, and each one contains the text "VIN:" followed by some characters. This text appears inside different containers in each source, such as `<ul>` or `<div>`.

I then need to extract all the values that appear alongside the "VIN:" string, which means I need to get its parent tag.

For example,

<div class="class1">

                            Stock Number:
                            Z2079
                            <br>
                            **VIN:
                            2T2HK31UX9C110701**
                            <br>
                            Model Code:
                            9424
                            <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Here the VIN sits inside a `<div>`. My other HTML sources also contain a VIN, but in different formats.

These values have to be extracted in Python.

Is there an efficient way to extract the parent tag in Python by matching the string?

Solution

I would strongly recommend going with BeautifulSoup on this; it provides some incredibly convenient functionality for parsing HTML. Here, for example, is how I would go about finding every text node that contains "VIN", regardless of letter case:

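from bs4 import BeautifulSoup  # BeautifulSoup 4
soup = BeautifulSoup(your_html_here, "html.parser")
# Keep every text node that contains "vin", ignoring letter case.
vins = soup.find_all(string=lambda s: s is not None and "vin" in s.lower())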

From there, you simply walk through that collection, grab each node's parent, grab said parent's contents, and parse them as you see fit:

for v in vins:
    parent_html = v.parent.contents
    # more code here
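
As a rough end-to-end sketch of the two steps above, the following applies them to the sample markup from the question (with the `**` emphasis markers removed). The function name `fields_near_vin` and the "label: value" splitting rule are illustrative assumptions, not part of the original answer; it also assumes BeautifulSoup 4 is installed (older releases use `text=` instead of `string=`) and that each field is a direct text child of the parent tag.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Sample markup from the question, with the bold markers stripped.
HTML = """
<div class="class1">
    Stock Number:
    Z2079
    <br>
    VIN:
    2T2HK31UX9C110701
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
"""


def fields_near_vin(html):
    """Hypothetical helper: for each text node containing 'VIN:', return the
    parent tag's name and the 'label: value' pairs found directly inside it."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Find text nodes that mention "vin:" in any letter case.
    for node in soup.find_all(string=re.compile(r"vin\s*:", re.IGNORECASE)):
        parent = node.parent
        fields = {}
        for child in parent.children:
            # Skip tags such as <br> and <img>; keep only text nodes.
            if not isinstance(child, NavigableString):
                continue
            text = " ".join(child.split())  # collapse newlines and indentation
            if ":" in text:
                label, value = text.split(":", 1)
                fields[label.strip()] = value.strip()
        results.append((parent.name, fields))
    return results


for tag_name, fields in fields_near_vin(HTML):
    print(tag_name, fields)
```

Because the lookup starts from the text node and climbs to `node.parent`, the same function should work whether the VIN lives in a `<div>`, a `<li>`, or another container; only the field-splitting rule would need adjusting for sources that lay out the values differently.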
