Multiple values in single column of a pandas DataFrame
Question
I have some data that I'm parsing from XML to a pandas DataFrame. The XML data roughly looks like this:
<tracks>
<track name="trackname1" variants="1,2,3,4,5">
<variant var="1,2,3">
<leg time="21:23" route_id="5" stop_id="103" serial="1"/>
<leg time="21:26" route_id="5" stop_id="17" serial="2"/>
<leg time="21:30" route_id="5" stop_id="38" serial="3"/>
<leg time="20:57" route_id="8" stop_id="101" serial="1"/>
<leg time="21:01" route_id="8" stop_id="59" serial="2"/>
...
</variant>
<variant var="4,5">
... more leg elements
</variant>
</track>
<track name="trackname2" variants="1,2,3,4,5,6,7">
<variant var="1">
... more leg elements
</variant>
<variant var="2,3,4,5,7">
... more leg elements
</variant>
</track>
</tracks>
I'm importing this into pandas because I need to be able to join this data with other DataFrames and I need to be able to query for stuff like: "get all legs of variant 1 for route_id 5".
I'm trying to figure out how I would do this in a pandas DataFrame. Should I make a DataFrame that would look something like this:
track_name variants time route_id stop_id serial
"trackname1" "1,2,3" "21:23" "5" "103" "1"
"trackname1" "1,2,3" "21:26" "5" "17" "2"
...
"trackname1" "4,5" "21:20" "5" "103" "1"
...
"trackname2" "1" "20:59" "3" "45" "1"
... you get the point
If this is the way to go, how would I (efficiently) extract for example "all rows for variant 3 on route_id 5"? Note that this should give me all the rows that have 3 in the variant column list, not just the rows that only have "3" in the variants column.
Is there a different way of constructing the DataFrame that would make this easier? Should I be using something other than pandas?
Recommended answer
Assuming you have enough memory, your task would be easier if your DataFrame held one variant per row:
track_name variants time route_id stop_id serial
"trackname1" 1 "21:23" 5 103 1
"trackname1" 2 "21:23" 5 103 1
"trackname1" 3 "21:23" 5 103 1
"trackname1" 1 "21:26" 5 17 2
"trackname1" 2 "21:26" 5 17 2
"trackname1" 3 "21:26" 5 17 2
...
"trackname1" 4 "21:20" 5 103 1
"trackname1" 5 "21:20" 5 103 1
...
"trackname2" 1 "20:59" 3 45 1
Then you could find "all rows for variant 3 on route_id 5" with
df.loc[(df['variants']==3) & (df['route_id']==5)]
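As a rough sketch of how such a one-variant-per-row DataFrame might be built from the XML, the snippet below walks the tree with the standard library's xml.etree.ElementTree and emits one row per (leg, variant) pair. The sample XML is a trimmed, hypothetical version of the data in the question (the elided legs are left out), and the integer conversions assume every id attribute is numeric.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample shaped like the XML in the question; elided legs omitted.
XML = """
<tracks>
  <track name="trackname1" variants="1,2,3,4,5">
    <variant var="1,2,3">
      <leg time="21:23" route_id="5" stop_id="103" serial="1"/>
      <leg time="21:26" route_id="5" stop_id="17" serial="2"/>
      <leg time="20:57" route_id="8" stop_id="101" serial="1"/>
    </variant>
    <variant var="4,5">
      <leg time="21:20" route_id="5" stop_id="103" serial="1"/>
    </variant>
  </track>
</tracks>
"""

rows = []
root = ET.fromstring(XML)
for track in root.iter("track"):
    for variant in track.iter("variant"):
        # "1,2,3" -> [1, 2, 3]; assumes variant ids are always integers.
        variant_ids = [int(v) for v in variant.get("var").split(",")]
        for leg in variant.iter("leg"):
            # Repeat the leg once for every variant it belongs to.
            for v in variant_ids:
                rows.append({
                    "track_name": track.get("name"),
                    "variants": v,
                    "time": leg.get("time"),
                    "route_id": int(leg.get("route_id")),
                    "stop_id": int(leg.get("stop_id")),
                    "serial": int(leg.get("serial")),
                })

df = pd.DataFrame(rows)

# Vectorized integer comparison on both columns.
result = df.loc[(df["variants"] == 3) & (df["route_id"] == 5)]
```

Because every column is a plain integer or string, the same frame can also be merged with other DataFrames on variants or route_id without any string handling.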
If you pack many variants into one row, such as
"trackname1" "1,2,3" "21:23" "5" "103" "1"
then you could use
df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]
assuming that the variants are always single digits. If there are also 2-digit variants like "13" or "30", then you would need to pass a more complicated regex pattern to str.contains.
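One pattern that might work for multi-digit variants is to require the digit to be a whole comma-separated item, anchored by either the start/end of the string or a comma. This is only a sketch on made-up rows, and it assumes the packed strings contain no spaces around the commas.

```python
import pandas as pd

# Hypothetical packed data, including 2-digit variants that contain a "3".
df = pd.DataFrame({
    "variants": ["1,2,3", "13,30", "3", "4,5", "2,3,4"],
    "route_id": ["5", "5", "5", "5", "8"],
})

# Match "3" only as a complete item: preceded by start-of-string or a comma,
# followed by a comma or end-of-string. Non-capturing groups avoid the pandas
# warning about match groups in str.contains.
pattern = r"(?:^|,)3(?:,|$)"
mask = df["variants"].str.contains(pattern, regex=True)

result = df.loc[mask & (df["route_id"] == "5")]
```

Here "13,30" is rejected even though it contains the character "3", which is the case the plain substring test gets wrong.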
Alternatively, you could use apply to split each variant on commas:
df['variants'].apply(lambda x: "3" in x.split(','))
but this is very inefficient, since you would now be calling a Python function once for every row and doing string splitting plus a list-membership test, instead of a vectorized integer comparison.
Thus, to avoid a possibly complicated regex or a relatively slow call to apply, I think your best bet is to build the DataFrame with one integer variant per row.
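If the data has already been loaded in the packed form, one way to get to one integer variant per row is to split the string column into lists and expand it with DataFrame.explode (available in pandas 0.25 and later). The rows below are illustrative values taken from the tables above, and the integer casts assume all of these columns are numeric.

```python
import pandas as pd

# Packed form: several variants per row, everything stored as strings.
packed = pd.DataFrame({
    "track_name": ["trackname1", "trackname1", "trackname1"],
    "variants": ["1,2,3", "1,2,3", "4,5"],
    "time": ["21:23", "21:26", "21:20"],
    "route_id": ["5", "5", "5"],
    "stop_id": ["103", "17", "103"],
    "serial": ["1", "2", "1"],
})

long_df = (
    packed
    # "1,2,3" -> ["1", "2", "3"]
    .assign(variants=packed["variants"].str.split(","))
    # One row per list element; the other columns are repeated.
    .explode("variants")
    # Convert to integers so later filters are vectorized comparisons.
    .astype({"variants": int, "route_id": int, "stop_id": int, "serial": int})
    .reset_index(drop=True)
)

result = long_df.loc[(long_df["variants"] == 3) & (long_df["route_id"] == 5)]
```

The split and explode are paid once up front, after which every query is a plain integer comparison.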