Trying to collect data from local files using BeautifulSoup

Problem Description

I want to run a Python script to parse HTML files and collect a list of all the links with a target="_blank" attribute.

I've tried the following, but it's not getting anything from bs4. The SoupStrainer docs say it takes args in the same way as findAll etc., so should this work? Am I missing some stupid error?

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path

def main():

    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")

    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link

if __name__ == "__main__":
    sys.exit(main())

Solution

I think you need something like this. BeautifulSoup treats a plain string as markup, so passing the path just parses the filename itself; you need to open the file and pass the file object instead:

if path.endswith(".html"):
    htmlfile = open(path)  # open the HTML file itself; dirpath is just the directory
    for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
        print link
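
For reference, here is a minimal end-to-end sketch with that fix applied. It is one way to do it, not the only one: it drops the unipath dependency in favour of plain os.path, uses a placeholder "templates" directory, and explicitly requests the html.parser backend (newer bs4 versions warn if no parser is named).

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer

# Parse only tags carrying target="_blank"; everything else is skipped,
# which also makes parsing faster on large files.
ONLY_BLANK = SoupStrainer(target="_blank")

def main():
    # Placeholder location for the templates; adjust to your layout.
    here = os.path.dirname(os.path.realpath(__file__))
    templatedir = os.path.join(here, "templates")

    for dirpath, dirs, files in os.walk(templatedir):
        for f in files:
            path = os.path.join(dirpath, f)
            if not path.endswith(".html"):
                continue
            # Hand BeautifulSoup the open file object, not the path string.
            with open(path) as htmlfile:
                soup = BeautifulSoup(htmlfile, "html.parser", parse_only=ONLY_BLANK)
            for link in soup.find_all(target="_blank"):
                print(link)

if __name__ == "__main__":
    sys.exit(main())

Calling find_all on the strained soup, rather than iterating the soup directly, makes the selection explicit and keeps the loop robust even if stray strings survive the strainer.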
