How to iterate column in CSV output on this beautifulsoup Python script?


Problem description


I have a beautifulsoup Python script that looks for href links in a component on a website and outputs those links line-by-line to a CSV file. I'm planning on running the script every day via a cron job, and I'd like to add a second column in the CSV labeled "Number of times seen". So when the script runs, if it finds a link already in the list, it would just add to the number in that column. For example, if it's the second time it has seen a particular link, it would be "N+1", or just 2, in that column. But if it's the first time the Python script saw that link, it would just add the link to the bottom of the list. I'm not sure how to attack this as I'm pretty new to Python.

I've developed the Python script to scrape the links from the component on all of the pages in an XML sitemap. However, I'm not sure how to increment the "Number of times seen" column in the CSV output as the cron job runs the script every day. I don't want the file to be overwritten; I only want the "Number of times seen" column to increment, or, if it's the first time the link was seen, for the link to be put at the bottom of the list.

Here's the Python script that I have so far:

sitemap_url = 'https://www.lowes.com/sitemap/navigation0.xml'

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
import time

# def get_urls(url):
page = requests.get(sitemap_url)
soup = BeautifulSoup(page.content, 'html.parser')
links = [element.text for element in soup.findAll('loc')]
# return links
print('Found {:,} URLs in the sitemap! Now beginning crawl of each URL...'\
        .format(len(links)))     

csv_file = open('cms_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['hrefs', 'Number of times seen:'])

for i in tqdm(links):
    #print("beginning of crawler code")
    r = requests.get(i)
    data = r.text

    soup = BeautifulSoup(data, 'lxml')

    all_a = soup.select('.carousel-small.seo-category-widget a')
    for a in all_a:
        hrefs = a['href']
        print(hrefs)
        csv_writer.writerow([hrefs, 1])

csv_file.close()

Current state: Every time the script runs, the "Number of times seen:" column in the CSV output is overwritten.

Desired state: I want the "Number of times seen:" column to increment whenever the script finds a link it has seen in a previous crawl; if it's the first time that link has been seen, I want it to say "1" in this column of the CSV.

Thanks a ton for your help!!

Solution

So, this isn't actually a question about bs4, but more about how to handle data structures in Python.

Your script lacks the part that loads the data you already know. One way to go about this would be to build a dict that has all your hrefs as keys and the counts as values.

So given a CSV with rows like this...

href,seen_count
https://google.com/1234,4
https://google.com/3241,2

... you first need to build the dict

csv_list = list(open("cms_scrape.csv", "r", encoding="utf-8"))
# we skip the first line, since it holds your header and not data
csv_list = csv_list[1:]

# now we convert this to a dict
hrefs_dict = {}
for line in csv_list:
    url, count = line.split(",")
    # remove linebreak from count and convert to int
    count = int(count.strip())
    hrefs_dict[url] = count
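
As an aside, on the very first cron run cms_scrape.csv won't exist yet, so the open() above would raise a FileNotFoundError. A minimal sketch that guards against that, letting the standard csv module do the parsing:

import csv
import os

hrefs_dict = {}
# on the first run there is no file yet, so we start with an empty dict
if os.path.exists("cms_scrape.csv"):
    with open("cms_scrape.csv", "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            if not row:  # skip any blank lines
                continue
            url, count = row
            hrefs_dict[url] = int(count)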

That yields a dict like this:

{ 
  "https://google.com/1234": 4,
  "https://google.com/3241": 2
}

Now you can check whether each href you come across exists as a key in this dict. If it does, increase the count by one. If not, insert the href into the dict and set the count to 1.

To apply this to your code, I'd suggest you scrape the data first and write to the file once all scraping is completed. Like so:

for i in tqdm(links):
    #print("beginning of crawler code")
    r = requests.get(i)
    data = r.text
    soup = BeautifulSoup(data, 'lxml')
    all_a = soup.select('.carousel-small.seo-category-widget a')
    for a in all_a:
        href = a['href']
        print(href)
        # if href is a key in hrefs_dict, increase the value by one
        if href in hrefs_dict:
            hrefs_dict[href] += 1
        # else insert it into hrefs_dict and set the count to 1
        else:
            hrefs_dict[href] = 1
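
As a side note, dict.get with a default value collapses that if/else into a single line; this is functionally equivalent:

    # dict.get returns 0 for hrefs we haven't seen before
    hrefs_dict[href] = hrefs_dict.get(href, 0) + 1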

Now, when the scraping is done, go through every entry in the dict and write it to your file. It's generally recommended that you use a context manager when writing to files (so the file is closed properly even if you forget to close it or an exception is raised). The "with" statement takes care of both opening and closing the file:

# newline='' avoids blank rows on Windows when writing with csv.writer
with open('cms_scrape.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['hrefs', 'Number of times seen:'])

    # loop through the hrefs_dict
    for href, count in hrefs_dict.items():
        csv_writer.writerow([href, count])

If you don't actually have to use a CSV file for this, I'd suggest using JSON or pickle instead. That way you can read and store the dict without needing to convert back and forth to CSV.
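
For example, a minimal sketch of the JSON variant (the filename cms_scrape.json is just an assumption for illustration):

import json
import os

counts_path = 'cms_scrape.json'  # hypothetical filename for the stored counts

# load the counts from the previous run, or start fresh on the first run
if os.path.exists(counts_path):
    with open(counts_path, 'r', encoding='utf-8') as f:
        hrefs_dict = json.load(f)
else:
    hrefs_dict = {}

# ... the scraping loop above updates hrefs_dict ...

# store the updated counts for the next cron run
with open(counts_path, 'w', encoding='utf-8') as f:
    json.dump(hrefs_dict, f, indent=2)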

I hope this solves your problems...
