I have some problems with data-cleaning

Views: 134 | Published: 2020/10/16 20:15:53 | Tags: python, pandas, dataframe, data-cleaning

Problem description

I have scraped a table from a Wikipedia page and I am going to clean the data next. I have converted the data into a pandas DataFrame, and now I have some problems cleaning it.

Here is the code I ran to scrape the table from the Wikipedia page:

import requests
import pandas as pd
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())

My_table = soup.find('table',{'class':'wikitable sortable'})
My_table

PostalCode=[]
for row in My_table.findAll('tr')[1:]:
    PostalCode_cell=row.findAll('td')[0]
    PostalCode.append(PostalCode_cell.text)    
print(PostalCode)

Borough=[]
for row in My_table.findAll('tr')[1:] :
    Borough_cell=row.findAll('td')[1]
    Borough.append(Borough_cell.text)   
print(Borough)

Neighbourhood=[]
for row in My_table.findAll('tr')[1:]:
    Neighbourhood_cell=row.findAll('td')[2]
    Neighbourhood_cell.text.rstrip('\n')
    Neighbourhood.append(Neighbourhood_cell.text)
print(Neighbourhood)

canada=pd.DataFrame({'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighbourhood})
canada.rename(columns = {'PostalCode':'PostalCode','Borough':'Borough','Neighborhood':'Neighborhood'}, inplace = True) 
canada

I have tried the groupby function, hoping to get the 2nd desired outcome, but it did not work out:

canada.groupby(['PostalCode', 'Borough'])

I have tried to drop the "Not assigned" value from the Borough:

canada=canada.Borough.drop("Not assigned",axis=0)

but it showed: "['Not assigned'] not found in axis"

Here are the expected results of my cleaned data:
1. Ignore cells with the value "Not assigned" in Borough.
2. Neighborhoods with the same PostalCode and Borough should appear on the same line, separated by commas.
3. If a cell has a Borough but a "Not assigned" Neighborhood, the Neighborhood will be the same as the Borough.

Also, I noticed that the table I scraped contains "\n" at the end of each value in Neighborhood. Is there any code I should add to the scraping process to get rid of it?

Many thanks for your help in advance.
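
On the trailing "\n" raised above: in the scraping loop, Neighbourhood_cell.text.rstrip('\n') returns a new string that is never stored, so the unstripped text is what gets appended. One possible way to trim each cell while scraping is BeautifulSoup's get_text(strip=True). The sketch below runs on a small inline HTML sample rather than the live page; SAMPLE_HTML and table_rows are hypothetical names introduced only for illustration, and the built-in "html.parser" is swapped in for 'lxml' so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page, so the sketch runs offline.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


def table_rows(html):
    """Return one list of stripped cell texts per data row of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        # get_text(strip=True) trims surrounding whitespace, including "\n"
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


rows = table_rows(SAMPLE_HTML)
print(rows)
```

Collecting all three cells per row in a single pass also avoids looping over the table three times, and the resulting list of rows can be passed to pd.DataFrame with a columns argument.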

Recommended answer

This feels a bit long-winded.

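import pandas as pd

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
canada = tables[0]
# Promote the first row to column names (assumes the header was parsed as a data row)
canada.columns = canada.iloc[0]
canada = canada.iloc[1:]
# Drop rows whose Borough is "Not assigned"; copy so later assignments do not warn
canada = canada[canada.Borough != 'Not assigned'].copy()
# Where Neighbourhood is "Not assigned", fall back to the Borough
mask = canada['Neighbourhood'] == 'Not assigned'
canada.loc[mask, 'Neighbourhood'] = canada.loc[mask, 'Borough']
canada['Location'] = canada.Borough + ', ' + canada.Neighbourhood
canada.drop(['Borough', 'Neighbourhood'], axis=1, inplace=True)
# reset_index returns a new DataFrame, so keep the result
canada = canada.reset_index(drop=True)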
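
The code above builds a combined Location column but does not merge rows that share a PostalCode and Borough (point 2 in the question). Calling canada.groupby(['PostalCode', 'Borough']) on its own only creates a GroupBy object; an aggregation still has to be applied to it. A minimal sketch of one way to do that follows, assuming the column names from the question and using a small made-up DataFrame rather than the scraped data.

```python
import pandas as pd

# Hypothetical rows in the shape of the scraped table (not real scraped output).
canada = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Point 1: drop rows with an unassigned Borough.
canada = canada[canada["Borough"] != "Not assigned"].copy()

# Point 3: fall back to the Borough where the Neighbourhood is unassigned.
mask = canada["Neighbourhood"] == "Not assigned"
canada.loc[mask, "Neighbourhood"] = canada.loc[mask, "Borough"]

# Point 2: one row per (PostalCode, Borough), neighbourhoods joined by commas.
grouped = (
    canada.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

The fallback for unassigned neighbourhoods runs before the grouping so that the joined string never contains the literal "Not assigned".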

Reference:

https://stackoverflow.com/a/49161313/6241235

Edit:

I think @bubble's point about a case-insensitive search is a good one, where they say canada = canada[canada.loc[:, 'Borough'].str.contains('Not assigned', case=False)], but I didn't think of that.
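
As quoted, that expression keeps the rows whose Borough contains "Not assigned"; to drop them, the mask would need to be inverted with ~. This is also why the drop call in the question failed: drop looks up index labels, not cell values, so "Not assigned" was not found in the axis. A small sketch of the case-insensitive filter, using made-up sample rows (the mixed-case spellings are an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample; the mixed-case spellings are made up for illustration.
canada = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A"],
    "Borough": ["Not assigned", "not Assigned", "North York"],
})

# str.contains(..., case=False) matches regardless of letter case;
# na=False treats missing values as non-matching
unassigned = canada["Borough"].str.contains("not assigned", case=False, na=False)

# ~ inverts the mask so the unassigned rows are dropped, not kept
assigned = canada[~unassigned].reset_index(drop=True)
print(assigned)
```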
