导出 pandas 数据框,文本列包含utf-8文本和URL到Excel [英] Export Pandas data frame with text column containg utf-8 text and URLs to Excel

查看:356
本文介绍了导出 pandas 数据框,文本列包含utf-8文本和URL到Excel的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的熊猫数据帧由每个tweet(300.000行)的推文和元数据组成。我的一些同事需要在Excel中处理这些数据,这就是为什么我需要导出它。



我想使用 .to_csv .to_excel 这些都是由熊猫提供的,但我无法使其正常工作。



当我使用 .to_csv 时,我的问题是它在数据框的文本部分保持失败。我已经玩过不同的分隔符,但文件从来没有100%对齐。文本列似乎包含混淆Excel的选项卡,管道字符等。

  df.to_csv('test.csv' sep ='\t',encoding ='utf-8')

当我尝试使用 .to_excel 连同 xlsxwriter 引擎我遇到了一个不同的问题,这就是我的文本列包含很多网址(我想)。 xlswriter 尝试制作这些URL的特殊点击链接,而不是将其作为字符串处理。我已经找到了一些关于如何规避这些信息的信息,但是我再也无法让它工作。



应该使用以下代码来禁用功能,我认为是造成麻烦:

  workbook = xlsxwriter.Workbook(filename,{'strings_to_urls':False})

但是,当使用 to_excel 在将数据框加载到Excel文件之前,要调整此工作簿对象的设置。



简而言之,如何从Pandas数据导出具有极大变化文本的列框架到Excel了解的东西?



编辑:
示例:

  @geertwilderspvv @telegraaf ach Wilders toch,nep-voorzitter van een nep-partij met maaréénlid,\\\
zeur niet over nep-premier of parlement!

所以在这种情况下显然是我的数据线路刹车。我会尝试找到更多的例子。



edit2:

 <?xml version =1.0encoding =UTF-8standalone =yes?> 
< recoveryLog xmlns =http://schemas.openxmlformats.org/spreadsheetml/2006/main>< logFileName> error047600_01.xml< / logFileName>< summary> Er zijn fouten aangetroffen in bestand C :\Users\Guy Mahieu\Desktop\Vu ipython notebook\pandas_simple.xlsx< / summary>< removedRecords summary =Hier volgt een lijst van verwijderde records:>< removedRecord> Verwijderde records:Formule货车

荷兰语翻译:



错误被发现在文件。以下是删除记录的列表:已删除的记录:部分的公式/xl/worksheets/sheet1.xml

解决方案

不能认为目前可以通过Pandas API传递XlsxWriter构造函数,但可以解决 strings_to_url 问题,如下所示:

  import pandas as pd 

df = pd.DataFrame({'Data':['http://python.org']})

#使用XlsxWriter作为引擎创建一个Pandas Excel writer。
writer = pd.ExcelWriter('pandas_simple.xlsx',engine ='xlsxwriter')

#不要将类似url的字符串转换为urls。
writer.book.strings_to_urls = False

#将数据框转换为XlsxWriter Excel对象。
df.to_excel(writer,sheet_name ='Sheet1')

#关闭Pandas Excel编写器并输出Excel文件。
writer.save()

更新:在最近的版本的熊猫,您可以直接将XlsxWriter构造函数选项传递给 ExcelWriter(),而不需要设置 writer.book.strings_to_urls 间接地:

  writer = pd.ExcelWriter('pandas_simple.xlsx',
engine ='xlsxwriter',
options = {'strings_to_urls':False})


My Pandas data frame consists of Tweets and meta data of each tweet (300.000 rows). Some of my colleagues need to work with this data in Excel which is why I need to export it.

I wanted to use either .to_csv or .to_excel which are both provided by Pandas but I can't get it to work properly.

When I use .to_csv my problem is that it keeps failing in the text part of the data frame. I've played around with different separators but the file is never 100% aligned. The text column seems to contain tabs, pipe characters etc. which confuses Excel.

df.to_csv('test.csv', sep='\t', encoding='utf-8')

When I try to use .to_excel together with the xlsxwriter engine I'm confronted with a different problem, which is that my text column contains to many URLs (I think). xlswriter tries to make special clickable links of these URLs instead of just handling them as strings. I've found some information on how to circumvent this but, again, I can't get it to work.

The following bit of code should be used to disable the function that I think is causing trouble:

workbook = xlsxwriter.Workbook(filename, {'strings_to_urls': False})

However, when using to_excel I can't seem to adjust this setting of the Workbook object before I load the data frame into the Excel file.

In short how do I export a column with wildly varying text from a Pandas data frame to something that Excel understands?

edit: example:

@geertwilderspvv @telegraaf ach Wilders toch, nep-voorzitter van een nep-partij met maar één lid, \nzeur niet over nep-premier of parlement!

So in this case It is obviously a line brake that is my data. I will try to find some more examples.

edit2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error047600_01.xml</logFileName><summary>Er zijn fouten aangetroffen in bestand C:\Users\Guy Mahieu\Desktop\Vu ipython notebook\pandas_simple.xlsx</summary><removedRecords summary="Hier volgt een lijst van verwijderde records:"><removedRecord>Verwijderde records: Formule van het onderdeel /xl/worksheets/sheet1.xml</removedRecord></removedRecords></recoveryLog>

Translation of Dutch stuff:

Errors were found in "file". Here follows a list of removed records: removed records: formula of the part /xl/worksheets/sheet1.xml

解决方案

I don't think it is currently possible to pass XlsxWriter constructor options via the Pandas API but you can workaround the strings_to_url issue as follows:

import pandas as pd

df = pd.DataFrame({'Data': ['http://python.org']})

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')

# Don't convert url-like strings to urls.
writer.book.strings_to_urls = False

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

Update: In recent version of Pandas you can pass XlsxWriter constructor options to ExcelWriter() directly and you do not need to set writer.book.strings_to_urls indirectly:

writer = pd.ExcelWriter('pandas_simple.xlsx', 
                        engine='xlsxwriter', 
                        options={'strings_to_urls': False})

这篇关于导出 pandas 数据框,文本列包含utf-8文本和URL到Excel的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆