Creating a year column in Pandas
Question
I'm trying to create a year column, with the year taken from the title column in my dataframe. This code works, but the column dtype is object. For example, in row 1 the year displays as [2013].
How can I do this, but with the column dtype as float?
import re

year_list = []
for i in range(title_length):
    year = re.findall(r'\d{4}', wine['title'][i])
    year_list.append(year)
wine['year'] = year_list
Here's the head of my dataframe:
country  designation   points  province  title                      year
Italy    Vulkà Bianco  87      Sicily    Nicosia 2013 Vulkà Bianco  [2013]
Recommended answer
Instead of re.findall, which returns a list of strings (hence the [2013] in your column), you may use str.extract():
wine['year'] = wine['title'].str.extract(r'\b(\d{4})\b')
Or, in case you only want to match years in the 1900s and 2000s:
wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b')
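Since the question asks for a float column, here is a minimal sketch of one way to convert the extracted strings. The two-row `wine` frame is a made-up sample based on the question's data, and `pd.to_numeric` followed by `astype(float)` is just one option, not necessarily what the original poster ended up using.

```python
import pandas as pd

# Hypothetical sample mirroring the question's dataframe (second row is invented)
wine = pd.DataFrame(
    {"title": ["Nicosia 2013 Vulkà Bianco", "Some Wine With No Vintage"]}
)

# expand=False returns a Series rather than a one-column DataFrame
year_str = wine["title"].str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

# Convert the extracted strings to numbers; rows without a match stay NaN,
# which a float column can represent
wine["year"] = pd.to_numeric(year_str).astype(float)

print(wine.dtypes)
```

Titles that contain no year end up as NaN, which is one reason a float dtype is convenient here.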
Note that the pattern passed to str.extract must contain at least one capturing group; its value is used to populate the new column. Only the first match in each string is considered, so you may need to make the pattern more specific if a title contains several 4-digit numbers.
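To illustrate the first-match behavior, the sketch below uses an invented `titles` Series (not from the question) and contrasts `str.extract` with `str.extractall`, which returns every match instead of only the first.

```python
import pandas as pd

# Invented titles: the first contains two 4-digit numbers, the second none
titles = pd.Series(["Reserve 1998 bottled 2001", "No year here"])

# With one capturing group, extract keeps only the first match per row
first = titles.str.extract(r"\b(\d{4})\b", expand=False)

# extractall returns all matches, indexed by (row label, match number)
every = titles.str.extractall(r"\b(\d{4})\b")

print(first)
print(every)
```

If the first 4-digit number in a title is not always the vintage, inspecting the `extractall` result is one way to see which rows need a stricter pattern.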
I suggest using word boundaries \b around the \d{4} pattern to match 4-digit chunks as whole words and avoid partial matches in strings like 1234567890.
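The effect of the word boundaries can be checked with a small comparison. The `samples` Series below is illustrative only: one title from the question plus an invented string containing a long digit run.

```python
import pandas as pd

# One real-looking title and one string with a 10-digit number
samples = pd.Series(["Nicosia 2013 Vulkà Bianco", "Lot 1234567890"])

# Without boundaries, the first four digits of a longer number still match
loose = samples.str.extract(r"(\d{4})", expand=False)

# With \b on both sides, only a standalone 4-digit token matches
strict = samples.str.extract(r"\b(\d{4})\b", expand=False)

print(loose.tolist())
print(strict.tolist())
```

In this sample the unbounded pattern picks up 1234 from the long number, while the bounded pattern leaves that row empty.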