Repeat text extraction with Python

Views: 320 Published: 2016/8/4 9:06:28 Tags: python xml bash loops text-extraction


Problem description

I have the following code, which I would like to use to extract the text between <font color='#FF0000'> and </font>. It works, but it extracts only one unit (the first one), whereas I would like to extract every text unit between these tags. I tried to do this with a bash loop, but it didn't work.

import os

directory_path = 'C:\\My_folder\\tmp'

for files in os.listdir(directory_path):
    print(files)
    path_for_files = os.path.join(directory_path, files)
    text = open(path_for_files, mode='r', encoding='utf-8').read()

    starting_tag = '<font color='
    ending_tag = '</font>'

    # Slice from the first opening tag up to the first closing tag
    result = text[text.find(starting_tag):text.find(ending_tag)]

    results_dir = 'C:\\My_folder\\tmp'
    results_file = files[:-4] + 'txt'

    # Write the extracted text next to the source files
    path_for_files = os.path.join(results_dir, results_file)
    open(path_for_files, mode='w', encoding='UTF-8').write(result)

Solution

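You could use Beautiful Soup's CSS selectors. select() returns every matching element, not just the first one, so the loop visits each <font> tag with that color.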

>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
 foobar
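
To run this over every file in the folder, here is a minimal sketch that avoids extra dependencies by swapping Beautiful Soup for the standard library's html.parser. The names FontTextCollector, extract_font_texts and extract_directory, and the suffixes filter, are my own for illustration and not from the original post. The '.html'/'.htm' default is an assumption about the input files.

```python
import os
from html.parser import HTMLParser


class FontTextCollector(HTMLParser):
    """Collect the text of every <font> element whose color attribute matches."""

    def __init__(self, color):
        super().__init__()
        self.color = color.lower()
        self.depth = 0      # > 0 while inside a matching <font> element
        self.buffer = []    # text pieces of the element currently being read
        self.results = []   # one entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag != 'font':
            return
        if self.depth:
            self.depth += 1  # nested <font> inside a matching element
        elif (dict(attrs).get('color') or '').lower() == self.color:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'font' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_font_texts(html, color='#FF0000'):
    """Return the stripped text of all matching <font> elements, in order."""
    parser = FontTextCollector(color)
    parser.feed(html)
    parser.close()
    return parser.results


def extract_directory(directory_path, results_dir, suffixes=('.html', '.htm')):
    """Write one .txt file per input file, one extracted unit per line.

    The suffixes filter is an assumption; adjust it to the real file types.
    """
    for name in os.listdir(directory_path):
        source = os.path.join(directory_path, name)
        if not os.path.isfile(source) or not name.lower().endswith(suffixes):
            continue
        with open(source, mode='r', encoding='utf-8') as handle:
            texts = extract_font_texts(handle.read())
        target = os.path.join(results_dir, os.path.splitext(name)[0] + '.txt')
        with open(target, mode='w', encoding='utf-8') as handle:
            handle.write('\n'.join(texts))
```

Calling extract_directory('C:\\My_folder\\tmp', 'C:\\My_folder\\tmp') would mirror the paths in the question. The suffix filter keeps the generated .txt files from being picked up as input on a second run. If Beautiful Soup is installed, the body of extract_font_texts could instead return the text of each element from soup.select('font[color="#FF0000"]'), as in the answer above.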
