使用特定模式从txt文件创建Pandas DataFrame [英] Create Pandas DataFrame from txt file with specific pattern

查看:116
本文介绍了使用特定模式从txt文件创建Pandas DataFrame的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要基于以下结构基于文本文件创建Pandas DataFrame:

I need to create a Pandas DataFrame based on a text file based on the following structure:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

带有"[edit]"的行是州,而[number]行是地区.我需要分割以下内容,然后为每个区域名称"重复州名称.

The rows with "[edit]" are States and the rows [number] are Regions. I need to split the following and repeat the State name for each Region Name thereafter.

Index          State          Region Name
0              Alabama        Aurburn...
1              Alabama        Florence...
2              Alabama        Jacksonville...
...
9              Alaska         Fairbanks...
10             Alaska         Arizona...
11             Alaska         Flagstaff...

Pandas DataFrame

Pandas DataFrame

我不确定如何将基于"[edit]"和"[number]"或(字符)"的文本文件拆分为相应的列,并为每个区域名称重复状态名称.请任何人给我一个起点,以完成以下任务.

I not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State Name for each Region Name. Please can anyone give me a starting point to begin with to accomplish the following.

推荐答案

您可以首先name的="noreferrer"> read_csv ,用于使用列Region Name创建create DataFrame,分隔符是不包含在值中的值(例如;):

You can first read_csv with parameter name for create DataFrame with column Region Name, separator is value which is NOT in values (like ;):

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

然后 insert 新列State extract 文本[edit] replace (到列Region Name的末尾的所有值.

Then insert new column State with extract rows where text [edit] and replace all values from ( to the end to column Region Name.

df.insert(0, 'State', df['Region Name'].str.extract('(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '')

最后通过 boolean indexing删除文本[edit]的行,由 :

Last remove rows where text [edit] by boolean indexing, mask is created by str.contains:

df = df[~df['Region Name'].str.contains('\[edit\]')].reset_index(drop=True)
print (df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson

如果需要所有值的解决方案会更容易:

If need all values solution is easier:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract('(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains('\[edit\]')].reset_index(drop=True)
print (df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)

这篇关于使用特定模式从txt文件创建Pandas DataFrame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆