Repeat text extraction with Python


Problem description

I have the following code, which I would like to use to extract the text between <font color='#FF0000'> and </font>. It works, but it extracts only one unit (the first one), whereas I would like to extract every text unit between these tags. I tried doing this with a bash loop, but it didn't work.

import os

directory_path = 'C:\\My_folder\\tmp'

for files in os.listdir(directory_path):
    print(files)
    path_for_files = os.path.join(directory_path, files)
    text = open(path_for_files, mode='r', encoding='utf-8').read()

    starting_tag = '<font color='
    ending_tag = '</font>'

    # str.find() returns the index of the first match only,
    # so this slice captures a single unit per file.
    result = text[text.find(starting_tag):text.find(ending_tag)]

    results_dir = 'C:\\My_folder\\tmp'
    results_file = files[:-4] + 'txt'
    path_for_files = os.path.join(results_dir, results_file)
    open(path_for_files, mode='w', encoding='UTF-8').write(result)

Solution

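You could use Beautiful Soup's CSS selectors. soup.select() returns a list of every matching element, so looping over it yields all the units rather than only the first.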

>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
 foobar 
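
The snippet above works on a single string. Here is a minimal sketch of how the same selector might be folded back into the per-file loop from the question. The function names extract_font_texts and extract_directory are hypothetical, introduced only for illustration. The sketch also makes a few assumptions that are not in the original answer: it uses the built-in 'html.parser' instead of 'lxml', strips surrounding whitespace from each unit, writes one unit per line, and expects the results directory to be different from the source directory.

```python
import os

from bs4 import BeautifulSoup


def extract_font_texts(html, color="#FF0000"):
    """Return the text of every <font> element whose color attribute equals `color`."""
    # Assumption: 'html.parser' (bundled with Python) is used here; the answer uses 'lxml'.
    soup = BeautifulSoup(html, "html.parser")
    # select() returns all matching elements, not just the first one.
    return [tag.get_text(strip=True) for tag in soup.select(f'font[color="{color}"]')]


def extract_directory(source_dir, results_dir):
    """Write one .txt file per input file, with one extracted unit per line."""
    for name in os.listdir(source_dir):
        source_path = os.path.join(source_dir, name)
        if not os.path.isfile(source_path):
            continue  # skip sub-directories
        with open(source_path, mode="r", encoding="utf-8") as handle:
            units = extract_font_texts(handle.read())
        # splitext copes with extensions of any length, unlike slicing off four characters.
        results_name = os.path.splitext(name)[0] + ".txt"
        results_path = os.path.join(results_dir, results_name)
        with open(results_path, mode="w", encoding="utf-8") as handle:
            handle.write("\n".join(units))
```

Calling extract_directory('C:\\My_folder\\tmp', results_dir), where results_dir is an existing folder of your choosing, would then process every file in the source folder. Keeping the two folders separate avoids re-reading the generated .txt files on a second run.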
