Loading more content in a webpage and issues writing to a file


Problem Description

I am working on a web scraping project that involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from those links and storing it in a text file.

I am currently stuck on two issues.

  1. Only the first few links are scraped. I am unable to extract the links from the other pages (the website has a "load more" button), and I don't know how to use the XHR object in the code.
  2. The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information and stores it in the text file. It does not go through all the links from the beginning. I cannot figure out where I have gone wrong in terms of file handling and f.seek(0).

    from pprint import pprint
    import requests
    import lxml
    import csv
    import urllib2
    from bs4 import BeautifulSoup
    
    def get_url_for_search_key(search_key):
        base_url = 'http://www.marketing-interactive.com/'
        response = requests.get(base_url + '?s=' + search_key)
        soup = BeautifulSoup(response.content, "lxml")
        return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
        results = soup.findAll('a', {'rel': 'bookmark'})
    
    for r in results:
        if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
            newlinks.append(r["href"])
    
    pprint(get_url_for_search_key('digital advertising'))
    with open('ctp_output.csv', 'w+') as f:
        f.write('\n'.join(get_url_for_search_key('digital advertising')))
        f.seek(0)  
    

    Reading the CSV file, scraping the respective content and storing it in a .txt file

    with open('ctp_output.csv', 'rb') as f1:
        f1.seek(0)
        reader = csv.reader(f1)
    
        for line in reader:
            url = line[0]       
            soup = BeautifulSoup(urllib2.urlopen(url))
    
            with open('ctp_output.txt', 'a+') as f2:
                for tag in soup.find_all('p'):
                    f2.write(tag.text.encode('utf-8') + '\n')
    

Solution

Regarding your second problem, your file mode is wrong: you'll need to convert w+ to a+. In addition, your indentation is off.

with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)

    for line in reader:
        url = line[0]       
        soup = BeautifulSoup(urllib2.urlopen(url))

        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')

The + suffix will create the file if it doesn't exist. However, w+ will erase all existing contents before writing on each iteration, whereas a+ will append to the file if it exists, or create it if it does not.
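To make the difference concrete, here is a minimal sketch (the file names are made up for illustration):

for word in ['first', 'second', 'third']:
    with open('demo_w.txt', 'w+') as f:   # truncated before every write
        f.write(word + '\n')
# demo_w.txt now contains only "third"

for word in ['first', 'second', 'third']:
    with open('demo_a.txt', 'a+') as f:   # created if missing, appended to otherwise
        f.write(word + '\n')
# demo_a.txt contains "first", "second" and "third"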

For your first problem, there's no option but to switch to something that can automate clicking browser buttons and the like. You'd have to look at Selenium. The alternative is to search for that button manually, extract the URL from its href or text, and then make a second request. I leave that to you.
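If you go the Selenium route, a rough sketch could look like the following. Note that the "Load More" link text and the sleep-based wait are assumptions about the page; inspect marketing-interactive.com to find the real selector for the button.

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def get_all_urls_for_search_key(search_key):
    driver = webdriver.Firefox()  # or webdriver.Chrome()
    driver.get('http://www.marketing-interactive.com/?s=' + search_key)

    # Keep clicking the "load more" button until it is no longer found.
    # The partial link text used here is an assumption about the button's label.
    while True:
        try:
            driver.find_element_by_partial_link_text('Load More').click()
        except NoSuchElementException:
            break
        time.sleep(2)  # crude wait for the XHR-loaded content to appear

    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()
    return [a['href'] for a in soup.find_all('a', {'rel': 'bookmark'})]

The resulting list can then be written to ctp_output.csv exactly as you do now.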
