Randomize date and month but preserve year and time interval


Problem description

I am dealing with big data spread across multiple files. This is part of a larger problem, but for simplicity I am breaking it into parts.

File 1 is stored in df1 and file 2 is stored in df2. I have around 12 files with 3 million records each.

Both df1 and df2 are related but stored as separate files.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'person_id': [1, 2, 3, 4, 5],
                        'date_birth': ['12/30/1961', '05/29/1967', '02/03/1957', '7/27/1959', '01/13/1971'],
                        'date_death': ['07/23/2017','05/29/2017','02/03/2015',np.nan,np.nan]})
df1['date_birth'] = pd.to_datetime(df1['date_birth'])
df1['date_death'] = pd.to_datetime(df1['date_death'])
df1['diff_birth_death'] = df1['date_death'] - df1['date_birth']
df1['diff_birth_death']=df1['diff_birth_death']/np.timedelta64(1,'D')


df2 = pd.DataFrame({'person_id': [1,1,1,2,3],
                    'visit_id':['A1','A2','A3','B1','B2'],
                    'diag_start': ['01/01/2012', '02/25/2017', '02/03/2015', '07/27/2016', '01/13/2011'],
                    'diag_end': ['05/03/2012','05/29/2017','03/03/2015','08/15/2016','02/13/2011']})
df2['diag_start'] = pd.to_datetime(df2['diag_start'])
df2['diag_end'] = pd.to_datetime(df2['diag_end'])
df2['diff_birth_diag_start'] = df2['diag_start'] - df1['date_birth']
df2['diff_birth_diag_end'] = df2['diag_end'] - df1['date_birth']
df2['diff_birth_diag_start']=df2['diff_birth_diag_start']/np.timedelta64(1,'D')
df2['diff_birth_diag_end']=df2['diff_birth_diag_end']/np.timedelta64(1,'D')

What I would like to do is:

1) Randomize/shift the day and month values but retain the year component and the time difference between events (between birth and death, between birth and diag_start, between birth and diag_end)

2) Find the date offset for each subject (the number of days to add or subtract) for which the condition above is satisfied

In the example below, I manually added the following offsets.

person_id 1 = -10 days (incorrect value; you will see below why it is incorrect)
person_id 2 = 10 days
person_id 3 = 100 days
person_id 4 = 20 days
person_id 5 = 125 days

I expect my output to look something like this:

df1: all correct. Day and month are shifted (year and interval are retained).

df2: the chosen offset was incorrect, leading to a change in year. The interval was maintained, but the year value changed.

Solution

As stated in the comments, what you want is to randomize two datetime objects subject to some restrictions:

  1. The start date must not be later than the end date
  2. The time interval between the start and end dates must remain the same after randomization
  3. The years of the start and end dates must remain the same (e.g. 2000-01-01 cannot become 1999-12-31)

To solve this, I find the range of shifts that is possible for the start date without changing its year, then the range of shifts that is possible for the end date, also without changing its year, and finally intersect the two to get the range of shifts that is valid for both dates. Any random value inside that final range leaves the years of both dates unchanged and, because the same shift is added to both, keeps the interval intact.
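To make the intersection concrete, here is a small worked sketch for the pair 2000-05-18 and 2001-08-20. The helper name `allowed_shift` is made up for illustration; it only restates the idea described above.

```python
import datetime as dt

def allowed_shift(d):
    """Return (min, max) timedeltas that keep `d` inside its own calendar year."""
    return (d.replace(month=1, day=1) - d, d.replace(month=12, day=31) - d)

start = dt.datetime(2000, 5, 18)
end = dt.datetime(2001, 8, 20)

start_min, start_max = allowed_shift(start)
end_min, end_max = allowed_shift(end)

# A shift is valid for both dates only if it lies in both ranges.
shift_min = max(start_min, end_min)
shift_max = min(start_max, end_max)

print(start_min.days, start_max.days)  # -138 227
print(end_min.days, end_max.days)      # -231 133
print(shift_min.days, shift_max.days)  # -138 133
```

Zero is always inside both ranges, so the intersection is never empty.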

I have created a function that implements this functionality. You pass it the start and end datetime objects, and it will return a tuple with those dates randomized according to the restrictions.

import datetime as dt
from random import random

def rand_date_diff_keep_year_and_interval(dt1, dt2):
    if dt1 > dt2:
        raise ValueError("dt1 must not be later than dt2")
    # Shifts that keep dt1 inside its own calendar year
    range1 = {
        "min": dt1.replace(month=1, day=1) - dt1,
        "max": dt1.replace(month=12, day=31) - dt1,
    }
    # Shifts that keep dt2 inside its own calendar year
    range2 = {
        "min": dt2.replace(month=1, day=1) - dt2,
        "max": dt2.replace(month=12, day=31) - dt2,
    }
    # Only shifts that lie in both ranges are valid for both dates
    intersection = {
        "min": max(range1["min"], range2["min"]),
        "max": min(range1["max"], range2["max"]),
    }
    # Uniform random timedelta inside the valid range (may include a fraction of a day)
    rand_change = random()*(intersection["max"] - intersection["min"]) + intersection["min"]
    return (dt1 + rand_change, dt2 + rand_change)

print(rand_date_diff_keep_year_and_interval(dt.datetime(2000, 1, 1), dt.datetime(2000, 12, 31)))
print(rand_date_diff_keep_year_and_interval(dt.datetime(2000, 5, 18), dt.datetime(2001, 8, 20)))
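
The function above draws a fractional offset, so the time of day changes along with the date. If whole-day shifts are preferred, one possible variant (not part of the original answer; the name `rand_whole_day_shift` is hypothetical) draws an integer number of days with `random.randint`, whose bounds are inclusive:

```python
import datetime as dt
import random

def rand_whole_day_shift(dt1, dt2):
    """Shift both dates by the same random whole number of days, keeping both years."""
    if dt1 > dt2:
        raise ValueError("dt1 must not be later than dt2")
    # replace() keeps the time of day, so these bounds are whole numbers of days.
    lo = max(dt1.replace(month=1, day=1) - dt1, dt2.replace(month=1, day=1) - dt2)
    hi = min(dt1.replace(month=12, day=31) - dt1, dt2.replace(month=12, day=31) - dt2)
    shift = dt.timedelta(days=random.randint(lo.days, hi.days))
    return dt1 + shift, dt2 + shift

a, b = dt.datetime(2000, 5, 18), dt.datetime(2001, 8, 20)
new_a, new_b = rand_whole_day_shift(a, b)
print(new_a, new_b)
```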

Pandas Solution

To work with a Pandas DataFrame, we need to adapt the previous code to operate on Series instead of single datetime objects. The logic stays almost the same, but now everything is done "series-wise", so to speak. Also, I used numpy.random to generate one random number per row, instead of creating a single random number and repeating it for all rows, which would be a lot less random.

import datetime as dt
import pandas as pd
import numpy.random as rnd

def series_rand_date_diff_keep_year_and_interval(sdt1, sdt2):
    if (sdt1 > sdt2).any():
        raise ValueError("dt1 must not be later than dt2")
    # Per-row shifts that keep each start date inside its own year
    range1 = {
        "min": sdt1.apply(lambda dt1: dt1.replace(month=1, day=1) - dt1),
        "max": sdt1.apply(lambda dt1: dt1.replace(month=12, day=31) - dt1),
    }
    # Per-row shifts that keep each end date inside its own year
    range2 = {
        "min": sdt2.apply(lambda dt2: dt2.replace(month=1, day=1) - dt2),
        "max": sdt2.apply(lambda dt2: dt2.replace(month=12, day=31) - dt2),
    }
    # Row-wise intersection of the two ranges
    intersection = {
        "min": pd.concat([range1["min"], range2["min"]], axis=1).max(axis=1),
        "max": pd.concat([range1["max"], range2["max"]], axis=1).min(axis=1),
    }
    # One uniform draw per row; reuse the input index so the arithmetic aligns row by row
    rand_change = pd.Series(rnd.uniform(size=len(sdt1)), index=sdt1.index)*(intersection["max"] - intersection["min"]) + intersection["min"]
    return (sdt1 + rand_change, sdt2 + rand_change)

df = pd.DataFrame([
        {"start": dt.datetime(2000, 1, 1), "end": dt.datetime(2000, 12, 31)},
        {"start": dt.datetime(2000, 5, 18), "end": dt.datetime(2001, 8, 20)},
    ])

df2 = df.copy()  # explicit copy, so the original frame is left untouched
df2["start"], df2["end"] = series_rand_date_diff_keep_year_and_interval(df["start"], df["end"])
print(df2.head())
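
With about 3 million records per file, as mentioned in the question, calling `Series.apply` with a Python lambda for every row may become slow. The following is a minimal vectorized sketch, not the original answer's code: it assumes both columns are free of NaT and share the same index, and it draws whole-day offsets. The names `day_bounds` and `vectorized_shift` are hypothetical.

```python
import numpy as np
import pandas as pd

def day_bounds(s):
    """Whole-day offsets (min, max) that keep each date of `s` in its own year."""
    doy = s.dt.dayofyear.to_numpy().astype(np.int64)
    days_in_year = 365 + s.dt.is_leap_year.to_numpy().astype(np.int64)
    return 1 - doy, days_in_year - doy

def vectorized_shift(s1, s2, seed=None):
    lo1, hi1 = day_bounds(s1)
    lo2, hi2 = day_bounds(s2)
    # Row-wise intersection of the two admissible ranges
    lo, hi = np.maximum(lo1, lo2), np.minimum(hi1, hi2)
    rng = np.random.default_rng(seed)
    days = rng.integers(lo, hi, endpoint=True)  # one draw per row, bounds inclusive
    shift = pd.Series(pd.to_timedelta(days, unit="D"), index=s1.index)
    return s1 + shift, s2 + shift

frame = pd.DataFrame({
    "start": pd.to_datetime(["2000-01-01", "2000-05-18"]),
    "end": pd.to_datetime(["2000-12-31", "2001-08-20"]),
})
new_start, new_end = vectorized_shift(frame["start"], frame["end"], seed=0)
print(pd.DataFrame({"start": new_start, "end": new_end}))
```

The bounds come from the day of the year rather than from `replace`, which gives the same whole-day limits without a Python-level loop.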

Multicolumn Pandas Solution

Looking again at the question, there are many columns in the sequence of events, all of them representing dates, and some of them containing NaT values (null dates). If we want the same restrictions to apply, keeping the relative distance between all events in the series without changing the year of any value, while also accepting NaT entries, we have to change a few things. Instead of listing the changes, let's go straight to the code:

import pandas as pd
import numpy.random as rnd
import numpy as np
from functools import reduce
from IPython.display import display  # built into Jupyter; imported so the code also runs as a script

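def manyseries_rand_date_diff_keep_year_and_interval(*sdts):
    # For every series: per-row shifts that keep each date inside its own year (NaT stays NaT)
    ranges = list(map(
        lambda sdt:
            {
                "min": sdt.apply(lambda dt: dt.replace(month=1,  day=1 ) - dt),
                "max": sdt.apply(lambda dt: dt.replace(month=12, day=31) - dt),
            },
        sdts
        ))
    # Row-wise intersection over all series; max/min skip NaT, so null dates add no constraint
    intersection = reduce(
        lambda range1, range2:
            {
                "min": pd.concat([range1["min"], range2["min"]], axis=1).max(axis=1),
                "max": pd.concat([range1["max"], range2["max"]], axis=1).min(axis=1),
            },
        ranges
        )
    # One uniform draw per row. The series are aligned on row index, so row i of every series gets the same shift
    rand_change = pd.Series(rnd.uniform(size=len(intersection["max"])), index=intersection["max"].index)*(intersection["max"] - intersection["min"]) + intersection["min"]
    return list(map(lambda sdt: sdt + rand_change, sdts))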

def setup_diffs(df1, df2):
    df1['diff_birth_death'] = df1['date_death'] - df1['date_birth']
    df1['diff_birth_death'] = df1['diff_birth_death']/np.timedelta64(1,'D')

    df2['diff_birth_diag_start'] = df2['diag_start'] - df1['date_birth']
    df2['diff_birth_diag_end'] = df2['diag_end'] - df1['date_birth']
    df2['diff_birth_diag_start'] = df2['diff_birth_diag_start']/np.timedelta64(1,'D')
    df2['diff_birth_diag_end'] = df2['diff_birth_diag_end']/np.timedelta64(1,'D')

df1 = pd.DataFrame({'person_id': [1, 2, 3, 4, 5],
                        'date_birth': ['12/30/1961', '05/29/1967', '02/03/1957', '7/27/1959', '01/13/1971'],
                        'date_death': ['07/23/2017', '05/29/2017', '02/03/2015', np.nan,      np.nan]})
df1['date_birth'] = pd.to_datetime(df1['date_birth'])
df1['date_death'] = pd.to_datetime(df1['date_death'])

df2 = pd.DataFrame({'person_id': [1,1,1,2,3],
                    'visit_id':['A1','A2','A3','B1','B2'],
                    'diag_start': ['01/01/2012', '02/25/2017', '02/03/2015', '07/27/2016', '01/13/2011'],
                    'diag_end': ['05/03/2012','05/29/2017','03/03/2015','08/15/2016','02/13/2011']})
df2['diag_start'] = pd.to_datetime(df2['diag_start'])
df2['diag_end'] = pd.to_datetime(df2['diag_end'])
setup_diffs(df1, df2)

display(df1)
display(df2)

series_list = manyseries_rand_date_diff_keep_year_and_interval(
    df1['date_birth'], df1['date_death'], df2['diag_start'], df2['diag_end'])
df1['date_birth'], df1['date_death'], df2['diag_start'], df2['diag_end'] = series_list
setup_diffs(df1, df2)

display(df1)
display(df2)

This time, I used Jupyter Notebook to better visualize the DataFrames (hence the calls to display).
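
One thing to keep in mind: both `setup_diffs` and the multicolumn function line rows up by index, so row i of df1 is paired with row i of df2 regardless of `person_id`. If a single offset per subject is wanted, as the question asks, a possible sketch is to intersect the admissible ranges of all events that belong to the same `person_id` and then map one draw back to every row. This is not the original answer's method. The helper names are hypothetical, offsets are whole days, and every person is assumed to have at least one non-null date.

```python
import numpy as np
import pandas as pd

def day_bounds(dates):
    """Whole-day (min, max) offsets keeping each date in its year; NaT yields NaN."""
    doy = dates.dt.dayofyear
    days_in_year = 365 + dates.dt.is_leap_year.astype(int)
    return 1 - doy, days_in_year - doy

def per_person_offsets(frames, key="person_id", seed=None):
    """`frames` is a list of (DataFrame, [date columns]) pairs that share `key`."""
    lows, highs = [], []
    for frame, cols in frames:
        for col in cols:
            lo, hi = day_bounds(frame[col])
            lows.append(lo.groupby(frame[key]).max())
            highs.append(hi.groupby(frame[key]).min())
    # Intersect the ranges of every event of the same person (NaN is skipped).
    lo = pd.concat(lows, axis=1).max(axis=1)
    hi = pd.concat(highs, axis=1).min(axis=1)
    rng = np.random.default_rng(seed)
    days = rng.integers(lo.to_numpy().astype(np.int64),
                        hi.to_numpy().astype(np.int64), endpoint=True)
    return pd.Series(pd.to_timedelta(days, unit="D"), index=lo.index)

fmt = "%m/%d/%Y"
df1 = pd.DataFrame({"person_id": [1, 2, 3, 4, 5],
                    "date_birth": ["12/30/1961", "05/29/1967", "02/03/1957", "07/27/1959", "01/13/1971"],
                    "date_death": ["07/23/2017", "05/29/2017", "02/03/2015", np.nan, np.nan]})
df2 = pd.DataFrame({"person_id": [1, 1, 1, 2, 3],
                    "visit_id": ["A1", "A2", "A3", "B1", "B2"],
                    "diag_start": ["01/01/2012", "02/25/2017", "02/03/2015", "07/27/2016", "01/13/2011"],
                    "diag_end": ["05/03/2012", "05/29/2017", "03/03/2015", "08/15/2016", "02/13/2011"]})
cols1, cols2 = ["date_birth", "date_death"], ["diag_start", "diag_end"]
for frame, cols in [(df1, cols1), (df2, cols2)]:
    for col in cols:
        frame[col] = pd.to_datetime(frame[col], format=fmt)

offsets = per_person_offsets([(df1, cols1), (df2, cols2)], seed=0)

shifted1, shifted2 = df1.copy(), df2.copy()
for col in cols1:
    shifted1[col] = df1[col] + df1["person_id"].map(offsets)
for col in cols2:
    shifted2[col] = df2[col] + df2["person_id"].map(offsets)
print(offsets)
```

For person_id 1 the only admissible offsets are 0 or 1 day: one visit starts on 01/01/2012, which rules out any negative shift, and the birth date 12/30/1961 allows at most one day forward. That is why the manually chosen -10 days changes a year.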

Hope this helps! Any comments and suggestions are welcome.
