Iterate through multiple files and append text from HTML using Beautiful Soup
Question
I have a directory of 46 downloaded HTML files. I am trying to iterate through each one, read its contents, strip the HTML, and append only the text to a text file. However, nothing gets written to my text file, and I'm not sure where I'm going wrong.
import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (path)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        f.close()
-----update----
I've updated my code as below, but the text file still doesn't get created.
import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
-----update 2-----
Ah, I realized my directory path was wrong, so now I have:
import os
import glob
from bs4 import BeautifulSoup
path = "c:\\users\\me\\downloads\\"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
When I run this, I get this error:
Traceback (most recent call last):
  File "C:\Users\Me\Downloads\bsoup.py", line 11, in <module>
    myfile.write(soup)
TypeError: must be str, not BeautifulSoup
I fixed this last error by changing
    myfile.write(soup)
to
    myfile.write(soup.get_text())
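The TypeError comes up because a text file's write() accepts only str, while soup is a BeautifulSoup object. Here is a minimal sketch of the difference. The HTML snippet is made up for illustration, and the "html.parser" backend is my assumption, since the question does not name a parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# get_text() returns a plain str with the tags stripped,
# so it can be passed to myfile.write().
text = soup.get_text()

# str(soup) is also a str, but it still contains the markup.
markup = str(soup)
```

So get_text() is the call to use when only the text should end up in the output file.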
-----update 3-----
It's working properly now. Here's the working code:
import os
import glob
from bs4 import BeautifulSoup
path = "c:\\users\\me\\downloads\\"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()
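Recommended answer
You are not actually reading the HTML file: markup = (infile) only stores the file path, so BeautifulSoup parses the path string instead of the file's contents. Open and read the file first (webpage below stands for the path of the HTML file). This should work: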
soup=BeautifulSoup(open(webpage,'r').read(), 'lxml')
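Putting the pieces together, the whole task could look roughly like the sketch below. The function name append_html_text and its parameters are hypothetical, not from the question or the answer. I also default to the built-in "html.parser" rather than 'lxml' so that nothing beyond bs4 needs to be installed, and I assume the files are UTF-8:

```python
import glob
import os

from bs4 import BeautifulSoup


def append_html_text(src_dir, out_path, parser="html.parser"):
    """Append the text of every *.html file in src_dir to out_path.

    Hypothetical helper for illustration. Returns the number of files processed.
    """
    count = 0
    # Open the output once; the with-block closes it, so no explicit
    # close() call is needed.
    with open(out_path, "a", encoding="utf-8") as out:
        for html_path in sorted(glob.glob(os.path.join(src_dir, "*.html"))):
            # Read the file's contents. Passing the path string itself would
            # make BeautifulSoup parse the path as if it were markup.
            with open(html_path, "r", encoding="utf-8", errors="replace") as f:
                soup = BeautifulSoup(f.read(), parser)
            out.write(soup.get_text())
            out.write("\n")  # keep the text of separate files apart
            count += 1
    return count
```

With the directory from the question, the call would be append_html_text("c:\\users\\me\\downloads\\", "example.txt"). A return value of 0 means the glob pattern matched nothing, which is worth checking first if the output file stays empty.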