Create Pandas DataFrame from txt file with specific pattern
Problem description
I need to create a Pandas DataFrame from a text file with the following structure:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
The rows with "[edit]" are States and the rows with "[number]" are Regions. I need to split these apart and repeat the State name for each Region Name that follows it.
Index State Region Name
0 Alabama Auburn...
1 Alabama Florence...
2 Alabama Jacksonville...
...
8 Alaska Fairbanks...
9 Arizona Flagstaff...
10 Arizona Tempe...
I am not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State Name for each Region Name. Can anyone give me a starting point for this?
Recommended answer
You can first use read_csv with the names parameter to create a DataFrame with a single column, Region Name. The separator should be a value that does NOT occur in the data (like ;), so that each line is read as one value:
import pandas as pd
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
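The choice of separator can be checked on a few lines of the sample. This is a minimal sketch, assuming the text is held in an in-memory string instead of 'filename.txt' (the `sample` variable and the use of io.StringIO are illustrative, not part of the original answer). The Tuscaloosa line contains commas, so a comma separator would split it, while ";" keeps every line whole.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'filename.txt': a few lines from the question.
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""

# ';' never occurs in the data, so every line becomes a single value
# in the one column named by `names`.
df = pd.read_csv(io.StringIO(sample), sep=";", names=["Region Name"])
print(df.shape)
print(df["Region Name"].iloc[2])
```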
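Then insert a new column State: str.extract pulls the state name out of the rows that contain the text [edit], and ffill repeats it down the rows that follow. After that, use str.replace to remove everything from ( to the end of the string in column Region Name. The brackets and the parenthesis are regex metacharacters, so they are escaped with backslashes in the patterns.
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)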
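The extract, forward-fill and replace steps can be followed on a handful of rows. The sketch below assumes a small hand-built DataFrame in place of the file; the intermediate `state` variable is introduced only for illustration. Note that the patterns escape the brackets and the parenthesis, since an unescaped [edit] is a regex character class rather than the literal text.

```python
import pandas as pd

df = pd.DataFrame({"Region Name": [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]})

# Escaped brackets match the literal text "[edit]"; rows without it give NaN.
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False)

# ffill copies the last seen state name down onto the region rows below it.
df.insert(0, "State", state.ffill())

# Remove everything from " (" to the end of the string.
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)
print(df)
```

At this point the state header rows ("Alabama[edit]", "Alaska[edit]") are still present in Region Name; they are filtered out in the next step.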
Finally, remove the rows that contain the text [edit] by boolean indexing; the mask is created by str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
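The mask built by str.contains is worth checking in isolation, because the pattern is interpreted as a regular expression by default. The sketch below, on a hypothetical Series of already-cleaned names, compares an unescaped "[edit]" pattern with literal matching via regex=False, which is an alternative to escaping the brackets.

```python
import pandas as pd

s = pd.Series(["Alabama[edit]", "Auburn", "Troy", "Alaska[edit]", "Fairbanks"])

# Treated as a regex, "[edit]" is a character class: any of e, d, i or t.
as_class = s.str.contains("[edit]")

# Literal matching flags only the state header rows.
literal = s.str.contains("[edit]", regex=False)

print(as_class.tolist())  # [True, False, False, True, True]
print(literal.tolist())   # [True, False, False, True, False]
```

With the character-class reading, "Fairbanks" is flagged because it contains an "i", so a real region row would be dropped along with the headers.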
If you need to keep the full values (without removing the text in parentheses), the solution is simpler:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)
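Both variants can be wrapped in one function. This is only a sketch: the name `load_regions` and its `keep_details` flag are hypothetical, and the mask uses literal matching (regex=False) instead of an escaped pattern. It accepts a file path or a file-like object, so it can be tried on an in-memory sample.

```python
import io
import pandas as pd

def load_regions(source, keep_details=False):
    """Hypothetical helper: build the State / Region Name frame from `source`."""
    # One column per line; ';' is assumed not to occur in the data.
    df = pd.read_csv(source, sep=";", names=["Region Name"])
    # State name from the "[edit]" rows, repeated down the rows below.
    state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False)
    df.insert(0, "State", state.ffill())
    if not keep_details:
        # Strip the parenthesised universities and the trailing footnote marks.
        df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)
    # Drop the state header rows, matching "[edit]" literally.
    is_header = df["Region Name"].str.contains("[edit]", regex=False)
    return df[~is_header].reset_index(drop=True)

sample = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)
print(load_regions(io.StringIO(sample)))
print(load_regions(io.StringIO(sample), keep_details=True))
```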