Trying to collect data from local files using BeautifulSoup
Question
I want to run a python script to parse html files and collect a list of all the links with a target="_blank"
attribute.
I've tried the following, but it's not getting anything from bs4. The docs say SoupStrainer takes args in the same way as findAll etc., so should this work? Am I missing some stupid error?
import os
import sys
from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path

def main():
    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link

if __name__ == "__main__":
    sys.exit(main())
Solution
I think you need something like this:

    if path.endswith(".html"):
        # Open the file and pass the file object: BeautifulSoup treats a plain
        # string argument as markup, not as a path to parse.
        htmlfile = open(path)
        for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
            print link