beautifulsoup: parse every HTML file in a folder (web scraping)
Problem description
(1)< strong> OO< / strong>
(2)< strong> QQ< / strong>
然后
解决方案你的
write
函数嵌套在 for
循环中,这就是为什么你写了多行到 index.txt
,只需将 write
从循环中移出并将所有parti文本放入一个变量 parti_names
像这样: 参与者= soup.find(find_participant)
parti_names如果parti.find(strong,text = re.compile(r(运算符))):$ b $ =
为参与者parti中的参与者.find_next_siblings(p):
b break b $ b parti_names + = parti.get_text(strip = True)+,
print parti.get_text(strip = True)
indexFile = open('index.txt ','a +')
indexFile.write(文件名+','+ title.get_text(strip = True)+ ticker.get_text(strip = True)+','+ d_date.get_text(strip = True) +','+ parti_names +'\ n')
indexFile.close()
<强> Upda您可以使用 basename
来获取文件名称:
来自os.path import basename
#你可以用basename直接调用它
print(basename( C:/ Users /.../ output / 100107-.html))
输出:
100107-.html
My task is to read every HTML file in a directory. The conditions are to find whether each file contains the tags
(1) <strong>OO</strong>
(2) <strong>QQ</strong>
Then
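A minimal sketch of that setup, walking a directory and testing each file for the two tags (the folder name "output" and the parser choice are assumptions, not from the question):

from os import listdir
from os.path import join
from bs4 import BeautifulSoup

folder = "output"  # hypothetical folder holding the .html files
for name in listdir(folder):
    if not name.endswith(".html"):
        continue
    with open(join(folder, name), encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    # True if the file contains <strong>OO</strong> / <strong>QQ</strong>
    has_oo = soup.find("strong", text="OO") is not None
    has_qq = soup.find("strong", text="QQ") is not None
    print(name, has_oo, has_qq)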
Solution
Your write call is nested in the for loop; that's why you write multiple lines to your index.txt. Just move the write out of the loop and collect all the participant text in a variable, parti_names, like this:
participants = soup.find(find_participant)
parti_names = ""
for parti in participants.find_next_siblings("p"):
    # stop once the "Operator" heading is reached
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    parti_names += parti.get_text(strip=True) + ","
    print(parti.get_text(strip=True))

# a single write, outside the loop, so index.txt gets one line per file
indexFile = open('index.txt', 'a+')
indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + parti_names + '\n')
indexFile.close()
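An equivalent sketch, under the same variables as the answer above, collects the names in a list and joins them once; the with statement also closes index.txt automatically:

parti_list = []
for parti in participants.find_next_siblings("p"):
    if parti.find("strong", text=re.compile(r"(Operator)")):
        break
    parti_list.append(parti.get_text(strip=True))

# join once, write once; the file is closed when the block exits
with open('index.txt', 'a+') as indexFile:
    indexFile.write(filename + ', ' + title.get_text(strip=True) + ticker.get_text(strip=True) + ', ' + d_date.get_text(strip=True) + ', ' + ",".join(parti_list) + '\n')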
Update:
You can work with basename to get the file name:
from os.path import basename
# you can call it directly with basename
print(basename("C:/Users/.../output/100107-.html"))
Output:
100107-.html
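To tie this back to the directory walk, a small sketch pairs glob with basename (the folder name "output" is an assumption, standing in for the elided path above):

from glob import glob
from os.path import basename

# "output" is a hypothetical folder holding the .html files
for path in glob("output/*.html"):
    filename = basename(path)  # e.g. 100107-.html
    print(filename)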