Parsing every HTML file in a folder with BeautifulSoup (web scraping)


Problem description

My task is to read every HTML file from a directory. The condition is to determine whether each file contains the tags

(1) <strong>OO</strong>  
(2) <strong>QQ</strong>

then
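A minimal sketch of the task as described, iterating the directory and checking each file for those two tags. The folder name "output" and the parser choice are assumptions, not taken from the question:

import os
from bs4 import BeautifulSoup

folder = "output"  # hypothetical directory holding the HTML files

for filename in os.listdir(folder):
    if not filename.endswith(".html"):
        continue
    with open(os.path.join(folder, filename), encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    # check whether this file contains <strong>OO</strong> or <strong>QQ</strong>
    has_oo = soup.find("strong", string="OO") is not None
    has_qq = soup.find("strong", string="QQ") is not None
    print(filename, has_oo, has_qq)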

Recommended answer

Your write call is nested inside the for loop, which is why multiple lines end up in your index.txt. Just move the write out of the loop and accumulate all the participant text in a variable parti_names, like this:

import re

# soup, filename, title, ticker, d_date and find_participant come from the
# asker's existing script
participants = soup.find(find_participant)
parti_names = ""
for parti in participants.find_next_siblings("p"):
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    parti_names += parti.get_text(strip=True) + ","
    print(parti.get_text(strip=True))

# write a single line per file, outside the loop
indexFile = open('index.txt', 'a+')
indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
indexFile.close()

Update:

You can work with basename to get the file name:

from os.path import basename

# you can call it directly with basename
print(basename("C:/Users/.../output/100107-.html"))

Output:

100107-.html
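If you iterate the folder with glob, basename can be combined with it in the same way. A short sketch, assuming the HTML files live in an output/ directory:

import glob
from os.path import basename

# hypothetical folder; prints e.g. 100107-.html for each file found
for path in glob.glob("output/*.html"):
    print(basename(path))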
