Unable to write extracted items properly in an Excel file?

Views: 148 | Posted: 2017/9/4 1:20:28 | Tags: python excel xpath web-scraping lxml


Problem description

I've written some Python code to parse titles and links from a webpage. First, I parse the links in the left sidebar, then follow each link and scrape the documents from the page it leads to. That part works flawlessly. I want to save the documents from the different links on separate sheets of a single Excel file. The script does create several sheets, using a portion of the heading variable as each sheet name. The problem is that when the data is saved, only the last record of each page ends up in its sheet instead of the full set of records. Here is the script I tried:
import requests
from lxml import html
from pyexcel_ods3 import save_data

web_link = "http://www.wiseowl.co.uk/videos/"
main_url = "http://www.wiseowl.co.uk"

def get_links(page):

    response = requests.Session().get(page)
    tree = html.fromstring(response.text)
    data = {}
    titles = tree.xpath("//ul[@class='woMenuList']//li[@class='woMenuItem']/a/@href")
    for title in titles:
        if "author" not in title and "year" not in title:
            get_docs(data, main_url + title)

def get_docs(data, url):

    response = requests.Session().get(url)
    tree = html.fromstring(response.text)

    heading = tree.findtext('.//h1[@class="gamma"]')

    for item in tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']"):
        title = item.findtext('.//a')
        link = item.xpath('.//a/@href')[0]
        # print(title, link)
        data.update({heading.split(" ")[-4]: [[(title)]]})
    save_data("mth.ods", data)

if __name__ == '__main__':
    get_links(web_link)

Solution
When you update the data dict with a key it already holds, the previous value stored under that key is replaced, so each sheet ends up with only its last row.
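
A minimal sketch of that overwrite, with no scraping involved. The titles and the sheet name "Excel" are made up for illustration; they stand in for the scraped values and for heading.split(" ")[-4]:

```python
# Hypothetical titles standing in for the scraped video names.
titles = ["Part 1", "Part 2", "Part 3"]

data = {}
for title in titles:
    # The key is the same on every pass, so update() replaces
    # the stored value instead of adding a row to it.
    data.update({"Excel": [[title]]})

print(data)  # {'Excel': [['Part 3']]}
```

Only the row from the final iteration is left, which matches the symptom described in the question.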

You can fix this if you replace this line:  
data.update({heading.split(" ")[-4]: [[(title)]]})
With this (it's a bit ugly, but it works):
data[heading.split(" ")[-4]] = data.get(heading.split(" ")[-4], []) + [[(title)]]
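
If the one-liner feels too dense, the same accumulation can probably be written with dict.setdefault. This is only a sketch of an alternative, not the answer's own code: add_row is a hypothetical helper, and the sheet names and titles are invented sample data.

```python
def add_row(data, sheet_name, row):
    """Append one row to the named sheet, creating the sheet on first use."""
    # setdefault returns the existing list for sheet_name, or stores
    # and returns a new empty list when the key is missing.
    data.setdefault(sheet_name, []).append(row)


data = {}
samples = [("Excel", "Part 1"), ("Excel", "Part 2"), ("VBA", "Intro")]
for sheet_name, title in samples:
    add_row(data, sheet_name, [title])

print(data)  # {'Excel': [['Part 1'], ['Part 2']], 'VBA': [['Intro']]}
```

In the original script, the call inside the loop would then be add_row(data, heading.split(" ")[-4], [title]). Since data is shared across pages, it should also be enough to call save_data once, after the loop in get_links has finished, rather than at the end of every get_docs call.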


                        