Repeat text extraction with Python
Problem description
I have the following code, which I would like to use to extract the text between <font color='#FF0000'> and </font>. It works, but it only extracts one unit (the first one), whereas I would like to extract every text unit between these tags. I tried to do this with a bash loop, but it didn't work.
import os

directory_path = 'C:\\My_folder\\tmp'
for files in os.listdir(directory_path):
    print(files)
    path_for_files = os.path.join(directory_path, files)
    text = open(path_for_files, mode='r', encoding='utf-8').read()
    starting_tag = '<font color='
    ending_tag = '</font>'
    # slice from the first starting tag up to the first ending tag
    ground = text[text.find(starting_tag):text.find(ending_tag)]
    results_dir = 'C:\\My_folder\\tmp'
    results_file = files[:-4] + 'txt'
    path_for_files = os.path.join(results_dir, results_file)
    open(path_for_files, mode='w', encoding='UTF-8').write(ground)
Solution
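You could use Beautiful Soup's CSS selectors. soup.select() returns every matching element, so looping over the result visits all of the units rather than only the first.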
>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
foobar
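For context, the question's code stops at one unit because str.find() returns the position of the first occurrence only. If installing Beautiful Soup is not an option, a standard-library alternative is re.findall with a non-greedy pattern. This is a different technique from the answer above and only a minimal sketch: the names FONT_RE and extract_units are hypothetical, and it assumes the font tags are not nested inside each other.

```python
import re

# Non-greedy match of everything between an opening <font color=...> tag
# and the next </font>; DOTALL lets a unit span several lines.
# Assumption: <font> elements are not nested inside one another.
FONT_RE = re.compile(r"<font color=[^>]*>(.*?)</font>", re.IGNORECASE | re.DOTALL)


def extract_units(text):
    """Return the text of every <font color=...>...</font> unit, in order."""
    return [unit.strip() for unit in FONT_RE.findall(text)]


sample = "foo <font color='#FF0000'> foobar </font> bar <font color='#FF0000'>baz</font>"
print(extract_units(sample))  # ['foobar', 'baz']
```

Inside the loop from the question, something like '\n'.join(extract_units(text)) could then be written to the results file in place of the single slice.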