Clean one column from long and big data set

Views: 81 | Published: 2020/9/20 20:00:13 | Tags: python pandas data-cleaning bigdata


Question

I am trying to clean only one column in a long and big data set. The data has 18 columns and more than 10k rows per file, spread across hundreds of CSV files, and I want to clean only one of those columns.

Input (only a few of the fields from the full list):

userLocation,   userTimezone,   Coordinates,
India,          Hawaii,    {u'type': u'Point', u'coordinates': [73.8567, 18.5203]}
California,     USA     
          ,     New Delhi,  
Ft. Sam Houston,Mountain Time (US & Canada),{u'type': u'Point', u'coordinates': [86.99643, 23.68088]}
Kathmandu,Nepal, Kathmandu, {u'type': u'Point', u'coordinates': [85.3248024, 27.69765658]}

Full input file: Dropbox link

Code:

    import pandas as pd

    # pandas is imported as pd, and the reader function is read_csv
    data = pd.read_csv('input.csv')

    # Names of all 18 columns in the file
    df = ['tweetID', 'tweetText', 'tweetRetweetCt', 'tweetFavoriteCt',
          'tweetSource', 'tweetCreated', 'userID', 'userScreen',
          'userName', 'userCreateDt', 'userDesc', 'userFollowerCt',
          'userFriendsCt', 'userLocation', 'userTimezone', 'Coordinates',
          'GeoEnabled', 'Language']

    # The one column to clean
    df0 = ['Coordinates']

The other columns should be written to the output unchanged. How should I proceed from here?

Output:

userLocation,   userTimezone, Coordinate_one, Coordinate_two,
India,          Hawaii,         73.8567, 18.5203
California,     USA     
          ,     New Delhi,  
Ft. Sam Houston,Mountain Time (US & Canada),86.99643, 23.68088
Kathmandu,Nepal, Kathmandu, 85.3248024, 27.69765658

The simplest possible suggestion, or a pointer to an example, would be very helpful.

Recommended answer

There are several problems here.

The file is not a simple CSV and is not parsed appropriately by your assumed data = pd.read_csv('input.csv').
The 'Coordinates' field seems to be a JSON-like string.
There are NaNs in that same field.
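To illustrate the last two points, here is a minimal sketch of parsing a single cell. It assumes the values look like the sample in the question, that is, a Python dict literal with `u'...'` strings. Strict JSON parsing with `json.loads` would reject that form because of the single quotes and `u` prefixes, so the sketch uses `ast.literal_eval` instead. The helper name `parse_coordinates` is hypothetical and not part of the original question or answer.

```python
import ast
import math


def parse_coordinates(cell):
    """Return the two coordinate values from one cell, or (None, None)."""
    # Missing values show up as float NaN when read through pandas.
    if isinstance(cell, float) and math.isnan(cell):
        return (None, None)
    # Anything that is not a non-empty string cannot be parsed.
    if not isinstance(cell, str) or not cell.strip():
        return (None, None)
    try:
        # literal_eval only accepts Python literals, so it does not run code.
        value = ast.literal_eval(cell.strip())
    except (ValueError, SyntaxError):
        return (None, None)
    if not isinstance(value, dict):
        return (None, None)
    pair = value.get('coordinates')
    if not isinstance(pair, (list, tuple)) or len(pair) != 2:
        return (None, None)
    return (pair[0], pair[1])


sample = "{u'type': u'Point', u'coordinates': [73.8567, 18.5203]}"
print(parse_coordinates(sample))
print(parse_coordinates(float('nan')))
```

Returning a pair of `None` values for anything unparseable is a design choice for this sketch; it keeps the row instead of dropping it, which matches the empty coordinate cells in the desired output.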

This is what I've done so far. You'll want to do some work of your own to parse this file more appropriately.

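    import ast

    import pandas as pd

    df1 = pd.read_csv('./Turkey_28.csv')

    # Coordinates as a Series indexed by tweetID
    coords = df1[['tweetID', 'Coordinates']].set_index('tweetID')['Coordinates']

    # Drop NaNs, then parse each string (literal_eval is a safer stand-in for eval)
    coords = coords.dropna().apply(ast.literal_eval)
    coords = coords[coords.apply(type) == dict]

    def get_coords(x):
        # Split the two-element 'coordinates' list into two named values
        return pd.Series(x['coordinates'], index=['Coordinate_one', 'Coordinate_two'])

    coords = coords.apply(get_coords)

    # Join the new columns with the original rows that have coordinates
    df2 = pd.concat([coords, df1.set_index('tweetID').reindex(coords.index)], axis=1)

    print(df2.head(2).T)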

tweetID                                         714602054988275712
Coordinate_one                                             23.2745
Coordinate_two                                             56.6165
tweetText        I'm at MK Appartaments in Dobele https://t.co/...
tweetRetweetCt                                                   0
tweetFavoriteCt                                                  0
tweetSource                                             Foursquare
tweetCreated                                   2016-03-28 23:56:21
userID                                                   782541481
userScreen                                            MartinsKnops
userName                                             Martins Knops
userCreateDt                                   2012-08-26 14:24:29
userDesc         I See Them Try But They Can't Do What I Do. Be...
userFollowerCt                                                 137
userFriendsCt                                                  164
userLocation                                        DOB Till I Die
userTimezone                                            Casablanca
Coordinates      {u'type': u'Point', u'coordinates': [23.274462...
GeoEnabled                                                    True
Language                                                        en
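The result above contains only the rows that have coordinates, while the desired output in the question also keeps rows where the field is empty. A minimal sketch of a variant that keeps every row is shown below. It assumes the 'Coordinates' field is quoted in the file so that `pd.read_csv` reads it as one string, and the function name `split_coordinates` and the inline sample data are hypothetical.

```python
import ast
import io

import pandas as pd


def split_coordinates(df, column='Coordinates'):
    """Return a copy of df with the column replaced by two numeric columns."""
    def parse(cell):
        # NaN and other non-string cells have nothing to parse.
        if not isinstance(cell, str):
            return (None, None)
        try:
            value = ast.literal_eval(cell.strip())
        except (ValueError, SyntaxError):
            return (None, None)
        if isinstance(value, dict):
            pair = value.get('coordinates')
            if isinstance(pair, (list, tuple)) and len(pair) == 2:
                return (pair[0], pair[1])
        return (None, None)

    pairs = [parse(cell) for cell in df[column]]
    out = df.drop(columns=[column])
    out['Coordinate_one'] = [p[0] for p in pairs]
    out['Coordinate_two'] = [p[1] for p in pairs]
    return out


# Tiny in-memory stand-in for one input file (hypothetical data).
sample_csv = io.StringIO(
    "tweetID,userLocation,Coordinates\n"
    "1,India,\"{u'type': u'Point', u'coordinates': [73.8567, 18.5203]}\"\n"
    "2,California,\n"
)
cleaned = split_coordinates(pd.read_csv(sample_csv))
print(cleaned)
```

For the hundreds of files mentioned in the question, the same function could be applied to each file in a loop over `glob.glob(...)`, writing each result with `DataFrame.to_csv`; the file pattern and output location would depend on your own layout.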
