Create a Pandas DataFrame from a txt file with a specific pattern

Question

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

The rows ending in "[edit]" are States, and the rows ending in "[number]" are Regions. I need to split these apart and repeat the State name for each Region Name that follows it.

Index          State          Region Name
0              Alabama        Auburn...
1              Alabama        Florence...
2              Alabama        Jacksonville...
...
8              Alaska         Fairbanks...
9              Arizona        Flagstaff...
10             Arizona        Tempe...

I am not sure how to split the text file on "[edit]", "[number]", or "(characters)" into the respective columns and repeat the State name for each Region Name. Can anyone give me a starting point?

Recommended answer

You can first use read_csv with the names parameter to create a DataFrame with a single column, Region Name. The separator must be a character that does NOT occur in the data (such as ;), so that each line is read as one value:

import pandas as pd
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

Then insert a new column State: extract the state name from the rows that contain the text [edit] and forward-fill it down to the following rows. After that, replace everything from ( to the end of the line in column Region Name with an empty string.

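# Capture the text before the literal "[edit]" and forward-fill it onto the region rows
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# Drop everything from " (" to the end of the line
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)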
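
Square brackets and parentheses are regex metacharacters, so a pattern meant to match the literal text [edit] needs them escaped. The following is a small sketch using Python's standard re module (the variable names are illustrative, not from the original answer) that shows how an unescaped [edit] behaves as a character class instead:

```python
import re

line_state = "Alabama[edit]"
line_region = "Auburn (Auburn University)[1]"

# Unescaped: [edit] is a character class matching ONE of e, d, i, t.
unescaped = re.compile(r"(.*)[edit]")
# Escaped: matches the literal six characters "[edit]".
escaped = re.compile(r"(.*)\[edit\]")

# The greedy group backtracks to the last e/d/i/t, so it captures too much.
bad_state = unescaped.search(line_state).group(1)
# A region line also contains those letters, so it matches by accident.
bad_region_matches = unescaped.search(line_region) is not None

good_state = escaped.search(line_state).group(1)
good_region = escaped.search(line_region)  # None: no literal "[edit]" here

print(bad_state, bad_region_matches, good_state, good_region)
```

The same escaping applies to the patterns passed to str.extract, str.replace and str.contains, since pandas hands them to the regex engine.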

Finally, remove the rows that contain the text [edit] using boolean indexing; the mask is created by str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
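
The steps above can be tried without a file on disk. The sketch below assumes a trimmed-down sample of the data held in a string, with io.StringIO standing in for 'filename.txt'; it also passes regex=True explicitly on the assumption that a recent pandas version is in use, where str.replace treats the pattern literally by default.

```python
import io

import pandas as pd

# A shortened sample of the input; io.StringIO stands in for the real file.
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""

# ";" never occurs in the data, so each line becomes a single value.
df = pd.read_csv(io.StringIO(sample), sep=";", names=["Region Name"])

# State name from the "[edit]" rows, forward-filled onto the region rows.
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
df.insert(0, "State", state)

# Drop the state header rows, then strip " (...)..." from the region names.
df = df[~df["Region Name"].str.contains(r"\[edit\]")].reset_index(drop=True)
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)

print(df)
```

Note that the Tuscaloosa line contains commas, which is why the default comma separator would not work here.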

If you need to keep the full values in Region Name, the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)
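
As an alternative that is not part of the original answer, the same State/Region pairing can be done with a plain loop before any DataFrame is built. This is a minimal sketch; parse_states is a hypothetical helper name, and it assumes every state line ends exactly in "[edit]".

```python
def parse_states(lines):
    """Return (state, region) pairs from lines in the State[edit] / region format."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("[edit]"):
            # A state header: remember it for the region rows that follow.
            state = line[: -len("[edit]")]
        else:
            rows.append((state, line))
    return rows


sample = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]

rows = parse_states(sample)
for row in rows:
    print(row)
```

The resulting list can then be passed to pd.DataFrame(rows, columns=['State', 'Region Name']); when reading from a file, the open file object can be given to parse_states directly, since it iterates line by line.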
