BeautifulSoup: parse every HTML file in a folder (web scraping)


Problem description

My task is to read every HTML file in a directory and check whether each file contains the following tags:

(1) <strong>OO</strong>  
(2) <strong>QQ</strong>

Then

Solution

Your write call is nested inside the for loop, which is why multiple lines end up in index.txt. Move the write out of the loop and accumulate all the parti text in a variable, parti_names, like this:

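import re

# soup, find_participant, filename, title, ticker and d_date come from the code in the question
participants = soup.find(find_participant)
parti_names = ""
# Collect each following <p> until the one whose <strong> text matches "Operator"
for parti in participants.find_next_siblings("p"):
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    parti_names += parti.get_text(strip=True) + ","
    print(parti.get_text(strip=True))

# Write one line per file, after the loop has finished
with open('index.txt', 'a+') as indexFile:
    indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')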
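
The same idea, accumulating inside the loop and writing once afterwards, can be sketched without BeautifulSoup so that it runs on its own. This is only an illustration under stated assumptions, not the asker's code: build_index_row and append_row are hypothetical helper names, the paragraph texts are plain strings standing in for parti.get_text(strip=True), and the sample values in the demo are placeholders. It swaps string concatenation for a list with ", ".join and the standard csv module.

```python
import csv
import io


def build_index_row(filename, title, date, paragraphs, stop_marker="Operator"):
    """Collect participant names until the stop marker, then return one row.

    `paragraphs` is assumed to be a list of already-extracted paragraph texts.
    """
    names = []
    for text in paragraphs:
        if stop_marker in text:
            break  # same role as the `break` on the Operator paragraph
        names.append(text.strip())
    # Joining a list avoids the trailing comma left by `+= text + ","`
    return [filename, title, date, ", ".join(names)]


def append_row(stream, row):
    """Write exactly one line for the file; call this once, outside the loop."""
    csv.writer(stream, lineterminator="\n").writerow(row)


if __name__ == "__main__":
    # io.StringIO stands in for open("index.txt", "a", newline="")
    buffer = io.StringIO()
    # Placeholder values, not taken from any real file
    row = build_index_row("100107-.html", "Example title", "example date",
                          ["Alice", "Bob", "Operator", "Carol"])
    append_row(buffer, row)
    print(buffer.getvalue(), end="")
```

Because the names field itself contains commas, csv.writer quotes it, which keeps each file on a single well-formed line of the index.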

Update:

You can use basename to get the file name from a full path:

from os.path import basename

# basename returns the last component of the path, i.e. the file name
print(basename("C:/Users/.../output/100107-.html"))

Output:

100107-.html
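
To tie basename back to the original task of visiting every HTML file in a folder, a rough sketch might look like the following. The helper names list_html_files, contains_strong and scan_folder are hypothetical, and the substring test is a deliberately naive stand-in so the example needs only the standard library. With BeautifulSoup you would parse each file and use something like soup.find("strong", string=label) instead.

```python
import tempfile
from os.path import basename
from pathlib import Path


def list_html_files(folder):
    """Return the sorted paths of every .html file directly inside `folder`."""
    return sorted(str(p) for p in Path(folder).glob("*.html"))


def contains_strong(html, label):
    """Naive stand-in check for <strong>label</strong> (an HTML parser is more robust)."""
    return f"<strong>{label}</strong>" in html


def scan_folder(folder, labels=("OO", "QQ")):
    """Map each file's base name to a {label: found} dict."""
    results = {}
    for path in list_html_files(folder):
        with open(path, encoding="utf-8") as handle:
            html = handle.read()
        # basename strips the directory part, leaving e.g. "100107-.html"
        results[basename(path)] = {
            label: contains_strong(html, label) for label in labels
        }
    return results


if __name__ == "__main__":
    # Demo on a throwaway folder with made-up file contents
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.html").write_text("<p><strong>OO</strong></p>", encoding="utf-8")
        Path(tmp, "b.html").write_text("<p>no match</p>", encoding="utf-8")
        for name, found in scan_folder(tmp).items():
            print(name, found)
```

Keying the results by base name rather than full path matches what the answer writes into index.txt as filename.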
