Unable to write extracted items properly in an Excel file?


Problem description

I've written some Python code to parse titles and links from a webpage. First, I parse the links in the left sidebar, then follow each link and scrape the documents listed on that page. This part works flawlessly. I want to save the documents from each link on a separate sheet in a single Excel file. The script does create several sheets, taking each sheet name from part of the heading variable. The problem is that when the data is saved, only the last record from each page ends up in its sheet instead of the full set of records. Here is the script I tried:

import requests
from lxml import html
from pyexcel_ods3 import save_data

web_link = "http://www.wiseowl.co.uk/videos/"
main_url = "http://www.wiseowl.co.uk"

def get_links(page):

    response = requests.Session().get(page)
    tree = html.fromstring(response.text)
    data = {}
    titles = tree.xpath("//ul[@class='woMenuList']//li[@class='woMenuItem']/a/@href")
    for title in titles:
        if "author" not in title and "year" not in title:
            get_docs(data, main_url + title)

def get_docs(data, url):

    response = requests.Session().get(url)
    tree = html.fromstring(response.text)

    heading = tree.findtext('.//h1[@class="gamma"]')

    for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = item.findtext('.//a')
        link = item.xpath('.//a/@href')[0]
        # print(title, link)
        data.update({heading.split(" ")[-4]: [[(title)]]})
    save_data("mth.ods", data)

if __name__ == '__main__':
    get_links(web_link)

Solution

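Each call to data.update() with the same key replaces the value already stored under that key, so every title on a page overwrites the previous one and only the last record survives in each sheet.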
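A minimal sketch of the overwrite, assuming three made-up titles and a hypothetical sheet name "Excel" in place of the scraped values:

```python
# Hypothetical titles standing in for the scraped ones.
titles = ["Video 1", "Video 2", "Video 3"]

data = {}
for title in titles:
    # The key is the same on every pass, so each update
    # discards the row list stored by the previous pass.
    data.update({"Excel": [[title]]})

print(data)  # {'Excel': [['Video 3']]}
```

Only the row from the final iteration is left, which matches the symptom described in the question.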

To fix this, replace this line:

data.update({heading.split(" ")[-4]: [[(title)]]})

with this (it's a bit ugly, but it works):

data[heading.split(" ")[-4]] = data.get(heading.split(" ")[-4], []) + [[(title)]]
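If the one-line fix feels too dense, the same accumulation can probably be written with dict.setdefault. This is only a sketch: add_row is a hypothetical helper, and the sheet names and titles are made up rather than taken from the site.

```python
def add_row(data, sheet_name, row):
    """Append one row to the named sheet, creating the sheet on first use."""
    # setdefault returns the existing row list, or stores and returns a new one.
    data.setdefault(sheet_name, []).append(row)
    return data


data = {}
# Hypothetical (sheet name, title) pairs standing in for the scraped values.
scraped = [("Excel", "Video 1"), ("Excel", "Video 2"), ("VBA", "Video 3")]
for sheet_name, title in scraped:
    add_row(data, sheet_name, [title])

print(data)
```

In the original script this would mean calling add_row(data, heading.split(" ")[-4], [title]) inside the loop. Since link is already extracted there, the row could also be [title, link] if both columns are wanted.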
