Converting unordered list of tuples to pandas DataFrame

Problem description

I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where column names represent parts of the address (e.g. street, city, state) and rows represent each individual address I've extracted. For example:

Suppose I have a list of addresses:

addr = ['123 Pennsylvania Ave NW Washington DC 20008', 
        '652 Polk St San Francisco, CA 94102', 
        '3711 Travis St #800 Houston, TX 77002']

I then parse them using usaddress:

info = [usaddress.parse(loc) for loc in addr]

"info" is a list of lists of tuples that looks like this:

[[('123', 'AddressNumber'),
  ('Pennsylvania', 'StreetName'),
  ('Ave', 'StreetNamePostType'),
  ('NW', 'StreetNamePostDirectional'),
  ('Washington', 'PlaceName'),
  ('DC', 'StateName'),
  ('20008', 'ZipCode')],
 [('652', 'AddressNumber'),
  ('Polk', 'StreetName'),
  ('St', 'StreetNamePostType'),
  ('San', 'PlaceName'),
  ('Francisco,', 'PlaceName'),
  ('CA', 'StateName'),
  ('94102', 'ZipCode')],
 [('3711', 'AddressNumber'),
  ('Travis', 'StreetName'),
  ('St', 'StreetNamePostType'),
  ('#', 'OccupancyIdentifier'),
  ('800', 'OccupancyIdentifier'),
  ('Houston,', 'PlaceName'),

I would like each inner list (there are 3 lists within the object "info") to represent a row, with the second element of each tuple naming the column and the first element supplying the value. Note: the length of the inner lists will not always be the same, since not every address contains every piece of information.

Any help would be greatly appreciated!

Thanks

Recommended answer

Thank you for your responses! I ended up doing a completely different workaround as follows:

I checked the documentation to see all possible parse_tags from usaddress, created a DataFrame with all possible tags as columns, and one other column with the extracted addresses. Then I proceeded to parse and extract information from the columns using regex. Code below!

import re

import pandas as pd
import usaddress

parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
'SubaddressType','PlaceName','StateName','ZipCode']

addr = ['123 Pennsylvania Ave NW Washington DC 20008', 
        '652 Polk St San Francisco, CA 94102', 
        '3711 Travis St #800 Houston, TX 77002']

df = pd.DataFrame({'Addresses': addr})
# pd.concat returns a new DataFrame, so assign the result back to df
df = pd.concat([df, pd.DataFrame(columns=parse_tags)])

Then I created a new column, "Info", holding the string form of the usaddress parse result for each address:

df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))

Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!

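for colname in parse_tags:
    # Raw string avoids invalid-escape warnings; matching the quoted tag exactly
    # keeps 'StreetName' from also matching 'StreetNamePostType'.
    pattern = re.compile(r"\('(\S+)', '{}'\)".format(colname))
    df[colname] = df['Info'].apply(
        lambda x, p=pattern: p.search(x).group(1) if p.search(x) else "")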
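
To see what the pattern does, here is a small sketch that applies it to a single "Info" string built from the second sample address in the question. The helper name tokens_for is hypothetical and only for illustration; it returns every token for a tag, whereas the loop above keeps only the first.

```python
import re

# String form of one parse result, copied from the sample output in the question.
info_str = str([('652', 'AddressNumber'),
                ('Polk', 'StreetName'),
                ('St', 'StreetNamePostType'),
                ('San', 'PlaceName'),
                ('Francisco,', 'PlaceName'),
                ('CA', 'StateName'),
                ('94102', 'ZipCode')])


def tokens_for(tag, text):
    """Return every token labelled exactly `tag` in the stringified parse."""
    # re.escape is a precaution; the tag names here contain only letters.
    pattern = r"\('(\S+)', '" + re.escape(tag) + r"'\)"
    return re.findall(pattern, text)


street = tokens_for('StreetName', info_str)   # 'St' is tagged StreetNamePostType, so it is excluded
place = tokens_for('PlaceName', info_str)     # two tokens share this tag
missing = tokens_for('Recipient', info_str)   # tag absent from this address
print(street, place, missing)
```

Because two tokens carry the PlaceName tag, taking only the first match stores "San" rather than the full city name, which is worth keeping in mind when using the loop above.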

This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!
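
For comparison, here is a minimal sketch of a more direct route that skips the string and regex round trip. It is an alternative, not the method used in the answer above: each parsed address is collapsed into a dict keyed by tag, and pandas builds the frame from the list of dicts. The sample data is the first two parse results shown in the question, the helper name to_record is hypothetical, and joining repeated tags with a space is an assumption about the desired output.

```python
import pandas as pd

# Parsed output for the first two sample addresses, copied from the question.
info = [
    [('123', 'AddressNumber'), ('Pennsylvania', 'StreetName'),
     ('Ave', 'StreetNamePostType'), ('NW', 'StreetNamePostDirectional'),
     ('Washington', 'PlaceName'), ('DC', 'StateName'), ('20008', 'ZipCode')],
    [('652', 'AddressNumber'), ('Polk', 'StreetName'),
     ('St', 'StreetNamePostType'), ('San', 'PlaceName'),
     ('Francisco,', 'PlaceName'), ('CA', 'StateName'), ('94102', 'ZipCode')],
]


def to_record(parsed):
    """Collapse one list of (token, tag) pairs into a {tag: text} dict."""
    record = {}
    for token, tag in parsed:
        # Tokens sharing a tag are joined with a space, in order of appearance.
        record[tag] = f"{record[tag]} {token}" if tag in record else token
    return record


# Each dict becomes a row; tags missing from an address come out as NaN.
df = pd.DataFrame([to_record(parsed) for parsed in info])
print(df)
```

With real data, the list comprehension could take usaddress.parse(loc) for each address instead of the hard-coded sample, and the tags never seen in any address simply do not become columns.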
