我想使用 pandas 为Excel文件过滤数据 [英] I Want To Filter Data For Excel Files Using Pandas

查看:69
本文介绍了我想使用 pandas 为Excel文件过滤数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试为熊猫中的Excel文件过滤数据.基于列值,即字符串值.

I am trying to filter Data, for Excel Files in Pandas. Based on the Column Value i.e. String Value.

我已经尝试了以下方法来实现我想要的目标:-

I Have tried the following to achieve what I want :-

截至2019年8月8日,下面显示的最新代码

Latest Code shown Below as of 07/08/2019

In [13]: 

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

df[(df['Venue'].str.contains('[a-zA-Z]') &  (df['DISPLAY/'].str.contains('DISPLAY') &  df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C']  != 'LANC'))] 

Out[13]: 

         Venue      A/C     DISPLAY/                   Date     BID                    

475     SHAWBURY    DAK     DISPLAY     2008-07-24 00:00:00     188
476     SHAWBURY   SPIT     DISPLAY     2008-07-24 00:00:00     188
477     COTTESMORE SPIT     DISPLAY                     NaN     757                
478     COTTESMORE  DAK     DISPLAY                     NaN     757               
484     SUNDERLAND SPIT     DISPLAY                     NaN     333               
487   EAST FORTUNE SPIT     DISPLAY                     NaN     406               
489     WINDERMERE   HS     DISPLAY     2008-07-25 00:00:00     138
490     WINDERMERE  DAK     DISPLAY     2008-07-25 00:00:00     138
504      WIGTON    DHS      DISPLAY     2008-07-26 00:00:00     144
506     WINDERMERE   HS     DISPLAY     2008-07-26 00:00:00     138
507     WINDERMERE  DAK     DISPLAY     2008-07-26 00:00:00     138
508     SUNDERLAND   HS     DISPLAY                     NaN     333                
509     SUNDERLAND  DAK     DISPLAY                     NaN     333               

例如,我想将SUNDERLAND线路更改为一条线路,即:-

I want for example the SUNDERLAND Lines, to be changed to one line i.e. :-

SUNDERLAND DHS DISPLAY NaN 00:00:00

对于WINDERMERE,有两行,即:-

And for WINDERMERE two Lines i.e. :-

WINDERMERE DHS DISPLAY 2008-07-25 00:00:00
WINDERMERE DHS DISPLAY 2008-07-26 00:00:00

所以我基本上要说的是,从Jupyter Notebook中显示的Excel表数据中,我如何全部关闭:-

So what I am basically saying, from the Excel Table Data, displayed in Jupyter Notebook, how do I change all off :-

VENUE DAK DISPLAY 2008-09-10 00:00:00 
VENUE HS DISPLAY 2008-09-10 00:00:00

(上方和下方各行中的日期将是该行中显示的日期,而不是上方和下方的特定日期,但上面显示的WINDERMERE除外,这是正确的)

(The Dates in the rows above and below, will be whatever is shown in that row, not the specific date given above and the ones below, except the WINDERMERE ones, shown above which is correct)

VENUE HS DISPLAY 2008-09-06 00:00:00 
VENUE DAK DISPLAY 2008-09-06 00:00:00

VENUE DAK DISPLAY 2008-05-25 00:00:00 
VENUE SPIT DISPLAY 2008-05-25 00:00:00 

反之亦然,例如DAK和HS

And the other way around, like with DAK and HS

两条路线的地点相同,即SUNDERLAND,即:-

VENUE same both lines i.e. SUNDERLAND i.e. :-

SUNDERLAND 
SUNDERLAND

收件人:-

VENUE DHS DISPLAY 2008-09-10 00:00:00 

就像说的是VENUE WIGTON一样

like it says for the VENUE WIGTON

然后DAK和SPIT会:-

And the DAK and SPIT Ones to :-

VENUE DS DISPLAY 2008-09-10 00:00:00 

即肖伯里

如果可能的话,也可以将"Year Year"(年份)更改为2009,因为2008年是错误的日期.

Also if possible, to change the Year to 2009, in the date because 2008 is wrong.

并摆脱其他数据.

我已经注意到,现在在出价"列中.在我不想保留的数据中,出价编号在两行中都相同,即相同的VENUE.因此,我应该输入什么来删除出价编号只有1的行,即VENUE EAST FORTUNE.

I have noticed that in the BID Column, which I now include. In the Data I wan't to keep, the Bid Numbers are the same in both rows, i.e. the same VENUE . So what should I type, to remove rows, where there is only 1 of a Bid Number i.e. the VENUE EAST FORTUNE.

但是要保持竞标编号相同的行,即如图所示,两行直接位于彼此下方?除了"A/C列"中DHS或DSS的值是什么?

But keep the Rows, where the Bid Numbers are the same, i.e. both lines directly underneath each other as shown in the picture ? Except where in the A/C Column the values are DHS or DSS ?

Hi Benoit, here is the output I get, which is different to yours for some reason.

[[475, 'SHAWBURY', 'DAK', 'DISPLAY', '2008-07-24 00:00:00', 188],
[476, 'SHAWBURY', 'SPIT', 'DISPLAY', '2008-07-24 00:00:00', 188],
[477, 'COTTESMORE', 'SPIT', 'DISPLAY', None, 757],
[478, 'COTTESMORE', 'DAK', 'DISPLAY', None, 757],
[484, 'SUNDERLAND', 'SPIT', 'DISPLAY', None, 333],
[487, 'EAST FORTUNE', 'SPIT', 'DISPLAY', None, 406],
[489, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-25 00:00:00', 138],
[490, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-25 00:00:00', 138],
[504, 'WIGTON', 'DHS', 'DISPLAY', '2008-07-26 00:00:00', 144],
[506, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-26 00:00:00', 138],
[507, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-26 00:00:00', 138],
[508, 'SUNDERLAND', 'HS', 'DISPLAY', None, 333],
[509, 'SUNDERLAND', 'DAK', 'DISPLAY', None, 333]]

最新数据:-

Venue   A/C     DISPLAY/    Date    BID
25  SHUTTLEWORTH    DAK     DISPLAY     NaN     529
55  KEMBLE  DAK     DISPLAY     NaN     461
69  NORTHWICH   SPIT    DISPLAY     2008-05-10 00:00:00     514
72  POCKLINGTON     SPIT    DISPLAY     2009-05-10 00:00:00     821
75  BERLIN  DAK     DISPLAY     2008-05-12 00:00:00     587
78  MILDENHALL  SPIT    DISPLAY     2009-05-15 00:00:00     920
93  DUXFORD     HS  DISPLAY     NaN     611
103     CRANWELL    HS  DISPLAY     2008-05-20 00:00:00     44
145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610
151     CORBRIDGE   SPIT    DISPLAY     NaN     353
167     BRIDGEND-CNX    SPIT    DISPLAY     2008-05-31 00:00:00     527
173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132
179     NORTHOLT    SPIT    DISPLAY     2009-06-05 00:00:00     870
214     BRIZE NORTON    HS  DISPLAY     NaN     939
218     ROPLEY  HS  DISPLAY     2008-06-13 00:00:00     355
223     THWAITES    HS  DISPLAY     NaN     364
231     ROPLEY  HS  DISPLAY     NaN     355
240     COSFORD     HS  DISPLAY     2008-06-14 00:00:00     667
241     QUORN   HS  DISPLAY     NaN     314
244     COSFORD     DAK     DISPLAY     2008-06-14 00:00:00     NaN
260     REDHILL     SPIT    DISPLAY     NaN     686
269     KEMBLE  DAK     DISPLAY     NaN     316
270     KEMBLE  HS  DISPLAY     NaN     316
280     KEMBLE  SPIT    DISPLAY     2008-06-21 00:00:00     316
285     KEMBLE  DAK     DISPLAY     2008-06-21 00:00:00     316
317     BRUNTINGTHORPE  SPIT    DISPLAY     NaN     106
329     ELDWICK     SPIT    DISPLAY     NaN     430
333     PLYMOUTH HOE    DAK     DISPLAY     2008-06-28 00:00:00     528
340     OLD BUCKENHAM   HS  DISPLAY     NaN     424
344     COSFORD     SPIT    DISPLAY     2008-06-28 00:00:00     68
350     TAIN    SPIT    DISPLAY     NaN     433
355     WITTERING   SPIT    DISPLAY     2008-07-03 00:00:00     376
362     WADDINGTON  HS  DISPLAY     2008-07-04 00:00:00     666
372     ANNAN   DAK     DISPLAY     NaN     606
373     ANNAN   SPIT    DISPLAY     NaN     606
375     WADDINGTON  HS  DISPLAY     2008-07-05 00:00:00     666
389     SHAWBURY    SPIT    DISPLAY     2008-07-05 00:00:00     183
393     CRANWELL    SPIT    DISPLAY     2008-07-07 00:00:00     823
399     SWANSEA     HS  DISPLAY     NaN     335
403     DUXFORD LEGENDS     HS  DISPLAY     2008-07-11 00:00:00     37
405     YEOVILTON   HS  DISPLAY     NaN     549
407     WOODHOUSE   HS  DISPLAY     2008-07-11 00:00:00     545
429     CAPEL-LE-FERNE  HS  DISPLAY     2008-07-12 00:00:00     298
430     ARDINGLY(SUSSEX)    HS  DISPLAY     NaN     189
439     CAERNARFON  SPIT    DISPLAY     2008-07-18 00:00:00     481
445     PENARTH     DAK     DISPLAY     NaN     303
448     FAIRFORD(RIAT)  SPIT    DISPLAY     2008-07-19 00:00:00     625
455     CHOLMONDELEY CASTLE     SPIT    DISPLAY     NaN     494
459     HAMPSTHWAITE    DAK     DISPLAY     NaN     828
465     ODIHAM(CAS COMM)    SPIT    DISPLAY     2009-07-21 00:00:00     NaN
469     ODIHAM FAMS DAY     SPIT    DISPLAY     2008-07-22 00:00:00     6
470     HENLOW FAM DAY  SPIT    DISPLAY     NaN     146
475     SHAWBURY    DAK     DISPLAY     2008-07-24 00:00:00     188
476     SHAWBURY    SPIT    DISPLAY     2008-07-24 00:00:00     188
477     COTTESMORE  SPIT    DISPLAY     NaN     757
478     COTTESMORE  DAK     DISPLAY     NaN     757
484     SUNDERLAND  SPIT    DISPLAY     NaN     333
487     EAST FORTUNE    SPIT    DISPLAY     NaN     406
489     WINDERMERE  HS  DISPLAY     2008-07-25 00:00:00     138
490     WINDERMERE  DAK     DISPLAY     2008-07-25 00:00:00     138
504     WIGTON  DHS     DISPLAY     2008-07-26 00:00:00     144
506     WINDERMERE  HS  DISPLAY     2008-07-26 00:00:00     138
507     WINDERMERE  DAK     DISPLAY     2008-07-26 00:00:00     138
508     SUNDERLAND  HS  DISPLAY     NaN     333
509     SUNDERLAND  DAK     DISPLAY     NaN     333
511     AUDLEM  SPIT    DISPLAY     2008-07-26 00:00:00     706
524     LYNEHAM FAM DAY     SPIT    DISPLAY     NaN     662
525     MALVERN     SPIT    DISPLAY     NaN     26
527     DAMYNS HALL     SPIT    DISPLAY     2008-08-08 00:00:00     766
529     DUXFORD     SPIT    DISPLAY     2008-08-09 00:00:00     612
530     DAMYNS HALL     SPIT    DISPLAY     NaN     766
534     BLACKPOOL   HS  DISPLAY     NaN     698
540     EASTBOURNE  HS  DISPLAY     2008-08-13 00:00:00     407
543     EASTBOURNE  HS  DISPLAY     2008-08-14 00:00:00     407
546     EASTBOURNE  HS  DISPLAY     2008-08-15 00:00:00     407
548     ROUGHAM     SPIT    DISPLAY     2009-08-15 00:00:00     551
551     DUXFORD     DAK     DISPLAY     2008-08-15 00:00:00     613
552     ROUGHAM     DAK     DISPLAY     NaN     551
556     EASTBOURNE  HS  DISPLAY     2008-08-16 00:00:00     407
564     CROMER  SPIT    DISPLAY     2008-08-19 00:00:00     139
569     CRANWELL    SPIT    DISPLAY     2009-08-20 00:00:00     913
578     SHOREHAM    SPIT    DISPLAY     NaN     366
581     RAMSEY  SPIT    DISPLAY     2009-08-22 00:00:00     896
583     WHITBY  DAK     DISPLAY     2008-08-22 00:00:00     125
584     WHITBY  HS  DISPLAY     2008-08-22 00:00:00     125
586     WHITBY  HS  DISPLAY     2008-08-23 00:00:00     125
587     WHITBY  DAK     DISPLAY     2008-08-23 00:00:00     125
591     SHOREHAM    HS  DISPLAY     2008-08-23 00:00:00     366
605     TWINWOOD AIRFIELD   DAK     DISPLAY     2008-08-30 00:00:00     934
608     ROMSEY  DAK     DISPLAY     NaN     175
613     STONELEIGH PARK     HS  DISPLAY     NaN     NaN
614     DUNSFOLD    HS  DISPLAY     NaN     583
622     ROMSEY  DAK     DISPLAY     2008-08-31 00:00:00     175
623     ROMSEY  HS  DISPLAY     NaN     175
625     STONELEIGH PARK     HS  DISPLAY     NaN     563
629     TWINWOOD AIRFIELD   DAK     DISPLAY     NaN     934
633     PLYMOUTH HOE    SPIT    DISPLAY     2008-09-03 00:00:00     295
646     PORTRUSH    DAK     DISPLAY     2008-09-05 00:00:00     76
647     PORTRUSH    HS  DISPLAY     2008-09-05 00:00:00     76
652     DUXFORD     HS  DISPLAY     NaN     615
659     PORTRUSH    HS  DISPLAY     2008-09-06 00:00:00     76
660     PORTRUSH    DAK     DISPLAY     2008-09-06 00:00:00     76
667     GREAT YARMOUTH  HS  DISPLAY     NaN     655
678     ODIHAM  SPIT    DISPLAY     2008-09-09 00:00:00     137
686     GUERNSEY    DAK     DISPLAY     2008-09-10 00:00:00     582
687     GUERNSEY    HS  DISPLAY     2008-09-10 00:00:00     582
688     JERSEY  DAK     DISPLAY     2008-09-10 00:00:00     581
689     JERSEY  HS  DISPLAY     2008-09-10 00:00:00     581
711     WOODSTOCK(BLENHEIM)     HS  DISPLAY     2008-09-12 00:00:00     362
717     MORECAMBE   SPIT    DISPLAY     NaN     368
720     WOODSTOCK   DAK     DISPLAY     2008-09-13 00:00:00     362
721     WOODSTOCK   HS  DISPLAY     NaN     362
744     NORTHOLT    HS  DISPLAY     2008-09-17 00:00:00     532
746     GOODWOOD    HS  DISPLAY     2008-09-18 00:00:00     256
755     GOODWOOD    HS  DISPLAY     2008-09-19 00:00:00     256
756     NORTHOLT    HS  DISPLAY     2009-09-19 00:00:00     863
763     KEMBLE  HS  DISPLAY     NaN     330
766     KEMBLE  HS  DISPLAY     2008-09-20 00:00:00     330
774     SANICOLE    DAK     DISPLAY     NaN     69
776     GOODWOOD    HS  DISPLAY     2008-09-20 00:00:00     256
777     SANICOLE    HS  DISPLAY     NaN     69
790     SOUTHPORT   HS  DISPLAY     NaN     584
791     SOUTHPORT   DAK     DISPLAY     NaN     584
803     DUXFORD     SPIT    DISPLAY     2008-10-11 00:00:00     616

任何帮助将不胜感激

致谢

Eddie Winch

Eddie Winch

推荐答案

[更新]-这很奇怪,但是它尊重您要应用的规则

[Updated] - This is kin of weird but it respects the rules you want to apply

(这也有点怪,所以很有意义)

In [1]:
import pandas as pd
 
data = [
        [475, 'SHAWBURY', 'DAK', 'DISPLAY', '2008-07-24 00:00:00', 188],
        [476, 'SHAWBURY', 'SPIT', 'DISPLAY', '2008-07-24 00:00:00', 188],
        [477, 'COTTESMORE', 'SPIT', 'DISPLAY', None, 757],                
        [478, 'COTTESMORE', 'DAK', 'DISPLAY', None, 757],               
        [484, 'SUNDERLAND', 'SPIT', 'DISPLAY', None, 333],           
        [487, 'EAST FORTUNE', 'SPIT', 'DISPLAY', None, 406],             
        [489, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-25 00:00:00', 138],
        [490, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-25 00:00:00', 138],
        [504, 'WIGTON', 'DHS', 'DISPLAY', '2008-07-26 00:00:00', 144],
        [506, 'WINDERMERE', 'HS', 'DISPLAY', '2008-07-26 00:00:00', 138],
        [507, 'WINDERMERE', 'DAK', 'DISPLAY', '2008-07-26 00:00:00', 138],
        [508, 'SUNDERLAND', 'HS', 'DISPLAY', None, 333],                
        [509, 'SUNDERLAND', 'DAK', 'DISPLAY', None, 333]
       ]
df = pd.DataFrame(data, columns=['Index', 'Venue', 'A/C', 'DISPLAY', 'Date', 'BID']).set_index('Index')
df

Out [1]:

       Venue        A/C     DISPLAY     Date                    BID
Index                   
475    SHAWBURY     DAK     DISPLAY     2008-07-24 00:00:00     188
476    SHAWBURY     SPIT    DISPLAY     2008-07-24 00:00:00     188
477    COTTESMORE   SPIT    DISPLAY     None                    757
478    COTTESMORE   DAK     DISPLAY     None                    757
484    SUNDERLAND   SPIT    DISPLAY     None                    333
487    EAST FORTUNE SPIT    DISPLAY     None                    406
489    WINDERMERE   HS      DISPLAY     2008-07-25 00:00:00     138
490    WINDERMERE   DAK     DISPLAY     2008-07-25 00:00:00     138
504    WIGTON       DHS     DISPLAY     2008-07-26 00:00:00     144
506    WINDERMERE   HS      DISPLAY     2008-07-26 00:00:00     138
507    WINDERMERE   DAK     DISPLAY     2008-07-26 00:00:00     138
508    SUNDERLAND   HS      DISPLAY     None                    333
509    SUNDERLAND   DAK     DISPLAY     None                    333

2.操纵您的数据框

In [2] :
## Keep BID where we have at least 2 rows
test = df.groupby(by=['BID', 'Venue', 'DISPLAY']).count()
test = test[test['A/C']>1]
bids = test.reset_index().BID.tolist()

# Here if there is already `DHS` and `DS` in the column `A/C`, I want to keep them
df.loc[df['A/C']=='DHS', 'Aircraft'] = 'DHS'
df.loc[df['A/C']=='DS', 'Aircraft'] = 'DS'

# I keep 1 row for each bid that has at least 2 rows, and their Aircraft's value are updated
for bid in bids:
    df.loc[(df['BID']==bid) & (df['A/C']=='DAK'), 'Aircraft']= 'DHS' 
    df.loc[(df['BID']==bid) & (df['A/C']=='SPIT'), 'Aircraft'] = 'DS' 
    

df = df[df['Aircraft'].notnull()].drop(columns=['A/C'], axis=1)
data

Out [2]:

        Venue       DISPLAY     Date                BID     Aircraft
Index                   
475     SHAWBURY    DISPLAY     2008-07-24 00:00:00 188     DHS
476     SHAWBURY    DISPLAY     2008-07-24 00:00:00 188     DS
477     COTTESMORE  DISPLAY     None                757     DS
478     COTTESMORE  DISPLAY     None                757     DHS
484     SUNDERLAND  DISPLAY     None                333     DS
490     WINDERMERE  DISPLAY     2008-07-25 00:00:00 138     DHS
504     WIGTON      DISPLAY     2008-07-26 00:00:00 144     DHS
507     WINDERMERE  DISPLAY     2008-07-26 00:00:00 138     DHS
509     SUNDERLAND  DISPLAY     None                333     DHS

这篇关于我想使用 pandas 为Excel文件过滤数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆