What is the best way to extract this data?


Question

Looking at the site, I would not expect to see an error, because each local-language (Yoruba) entry has its Meaning and Translation, and there are 220 Yoruba entries.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

res = requests.get('http://yoruba.unl.edu/yoruba.php-text=1a&view=0&uni=0&l=1.htm')
soup = BeautifulSoup(res.content,'html.parser')

# One list per output column
edu = {'Yoruba': [], 'Translation': [], 'Meaning': []}

# First loop: the text before the first <br> of each <p> is the Yoruba entry
for br in soup.select('p > br:nth-of-type(1)'):
    text = br.previous_sibling.strip()
    edu['Yoruba'].append(text)

# Second loop: the text before the second <br> is the translation
for br in soup.select('p > br:nth-of-type(2)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Translation'].append(text.strip())

# Third loop: the text before the third <br> is the meaning, with parentheses removed
for br in soup.select('p > br:nth-of-type(3)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Meaning'].append(re.sub(r'[()]', '', text.strip()))

df7 = pd.DataFrame(edu)

Error

ValueError: arrays must all be same length
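
pandas raises this error when the lists in the dict passed to `pd.DataFrame` do not all have the same length. One quick way to see which column came up short is to compare the list lengths before building the dataframe. The sketch below is only an illustration: `column_lengths` is a hypothetical helper, and the small `edu` dict holds placeholder values, not data from the site.

```python
def column_lengths(columns):
    """Return a dict mapping each column name to the length of its list."""
    return {name: len(values) for name, values in columns.items()}


# Placeholder data standing in for the scraped lists (not real site content)
edu = {
    'Yoruba': ['y1', 'y2', 'y3'],
    'Translation': ['t1', 't2'],
    'Meaning': ['m1', 'm2', 'm3'],
}

lengths = column_lengths(edu)
print(lengths)

# Names of the columns that are shorter than the longest one
longest = max(lengths.values())
short = [name for name, n in lengths.items() if n < longest]
print(short)
```

Running the same check on the real `edu` dict right before `pd.DataFrame(edu)` would show which of the three loops skipped entries.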

Recommended answer

Since the lists under the three keys have different lengths, I think the best way to address this is to pad the shorter lists to the length of the longest one (220 in this case). To do that, add the following right before creating your dataframe:

# In case you don't know it in advance, find the length of the longest list
length = max(len(values) for values in edu.values())
for k in edu:
    # This is where the padding happens; "NA" can be replaced with any other placeholder
    edu[k].extend(["NA"] * (length - len(edu[k])))

df7 = pd.DataFrame.from_dict(edu)  # since edu is a dictionary, I would use this method
df7

Let me know if this works.
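
As a variation on the padding idea, the same result can be sketched with `itertools.zip_longest` from the standard library, which fills in a placeholder once a shorter list runs out. The helper name `pad_columns` and the sample values are assumptions made for illustration. Keep in mind that padding at the end only equalizes the lengths: if an entry is missing in the middle of the page, the rows after it will still be shifted relative to the other columns.

```python
from itertools import zip_longest


def pad_columns(columns, fill="NA"):
    """Return a new dict in which every list is padded with `fill` to the longest length."""
    names = list(columns)
    # zip_longest walks all lists in parallel and substitutes `fill`
    # for any list that is already exhausted
    rows = zip_longest(*(columns[name] for name in names), fillvalue=fill)
    padded = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            padded[name].append(value)
    return padded


# Placeholder data standing in for the scraped lists (not real site content)
edu = {
    'Yoruba': ['y1', 'y2', 'y3'],
    'Translation': ['t1', 't2'],
    'Meaning': ['m1', 'm2', 'm3'],
}

padded = pad_columns(edu)
print(padded)
```

The lists in `padded` all have the same length, so the dict can be passed to `pd.DataFrame` without triggering the length error. The original `edu` dict is left unchanged.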
