Extracting a specific string out of an HTML document


Question

I need to extract only specific strings from an offline HTML document and write them cleanly into a *.txt file.

So, for example, let's assume that this is a section of the HTML file:

    <span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>

I need to get this:

   001.00 SPL
   543.00 SPL
   056.00 SPL
   228.00 SPL

Could you please help me with this? Thanks.

Recommended answer

Use an HTML parser like BeautifulSoup.

Example:

from bs4 import BeautifulSoup as bs
import re

markup = '''<span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>'''

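# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs(markup, 'html.parser')
# Match ids made of "dataView" followed by digits; the original pattern
# [dataView]\d+ was a character class and matched far more than intended
tags = soup.find_all('span', id=re.compile(r'^dataView\d+$'))
for t in tags:
    print(t.text)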

Result:


001.00 SPL
543.00 SPL
056.00 SPL
228.00 SPL






Next step: write to a .txt file:

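import csv

# On Python 3 the csv module expects a text-mode file opened with newline=''
with open('output.txt', 'w', newline='') as fou:
    # A space delimiter keeps each row in the requested "001.00 SPL" form
    csv_writer = csv.writer(fou, delimiter=' ')
    for tag in tags:
        # Use the loop variable (the original referenced the stale name t)
        split_on_whitespace = tag.text.split()
        csv_writer.writerow(split_on_whitespace)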
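
Since the question is about an offline HTML file, the whole task can also be sketched end to end with only the Python standard library, which may be useful when BeautifulSoup is not installed. This is an alternative to the answer above, not part of it: it uses `html.parser.HTMLParser` instead of BeautifulSoup, and the names `SpanTextCollector`, `extract_values`, and `html_file_to_txt` are hypothetical helpers introduced for illustration. It assumes the target spans are not nested inside one another and that the file is UTF-8 encoded.

```python
import re
from html.parser import HTMLParser

# Ids of interest: "dataView" followed by one or more digits
ID_PATTERN = re.compile(r'^dataView\d+$')


class SpanTextCollector(HTMLParser):
    """Collects the text of every <span> whose id matches ID_PATTERN.

    Assumption: matching spans are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.values = []      # extracted strings, in document order
        self._buffer = None   # text pieces of the span being read, else None

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get('id') or ''
        if tag == 'span' and ID_PATTERN.match(span_id):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._buffer is not None:
            self.values.append(''.join(self._buffer).strip())
            self._buffer = None


def extract_values(html_text):
    """Return the text of all matching spans found in html_text."""
    parser = SpanTextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.values


def html_file_to_txt(html_path, txt_path):
    """Read an HTML file and write one extracted value per line."""
    with open(html_path, encoding='utf-8') as src:
        values = extract_values(src.read())
    with open(txt_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(values) + '\n')
    return values
```

A call such as `html_file_to_txt('page.html', 'output.txt')` (both file names are placeholders) would then write each extracted value on its own line.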
