Generate data by using an existing dataset as the base dataset
Question
I have a dataset of 100k unique records. To benchmark my code I need to test on 5 million unique records, but I don't want to generate purely random data. I would like to use the 100k records I have as the base dataset and generate the remaining data so that it is similar to the base, with unique values for certain columns. How can I do that using Python or Scala?
Below is the sample data:
latitude longitude step count
25.696395 -80.297496 1 1
25.699544 -80.297055 1 1
25.698612 -80.292015 1 1
25.939942 -80.341607 1 1
25.939221 -80.349899 1 1
25.944992 -80.346589 1 1
27.938951 -82.492018 1 1
27.944691 -82.48961 1 3
28.355484 -81.55574 1 1
Each latitude/longitude pair should be unique across the generated data, and I should also be able to set min and max values for these columns.
Recommended answer
You can easily generate normally distributed data using R by following these steps:
# Read the latitude and longitude columns from the CSV file
library(data.table)
data <- fread("data.csv", sep = ",", select = c("latitude", "longitude"))
# Remove duplicate and missing values
df <- data.frame(Lat = data$latitude, Lon = data$longitude)
df1 <- unique(df)
df2 <- na.omit(df1)
# Determine the mean and standard deviation of the latitude and longitude values
meanLat <- mean(df2$Lat)
meanLon <- mean(df2$Lon)
sdLat <- sd(df2$Lat)
sdLon <- sd(df2$Lon)
# Generate 1 million new records from an approximately normal distribution:
# the sum of 12 uniform(0, 1) draws minus 6 has mean 0 and standard deviation 1
n <- 1000000
newData <- data.frame(
  Lat = replicate(n, sum(runif(12)) - 6) * sdLat + meanLat,
  Lon = replicate(n, sum(runif(12)) - 6) * sdLon + meanLon
)
# Append the new records to the cleaned base records
finalData <- rbind(df2, newData)
finalData now contains both the original records and the newly generated ones.
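Since the question asks for Python, the same steps can be sketched with only the standard library. This is a rough sketch, not a tuned implementation: the function name generate_points, its parameters, the rounding to 6 decimals, and the retry limit are all assumptions made for illustration. Unlike the R steps, it also rejects pairs that fall outside the requested min/max range or that already exist, which is one possible way to meet the uniqueness requirement from the question.

```python
import random
import statistics

# Sample (latitude, longitude) pairs copied from the question.
base = [
    (25.696395, -80.297496), (25.699544, -80.297055), (25.698612, -80.292015),
    (25.939942, -80.341607), (25.939221, -80.349899), (25.944992, -80.346589),
    (27.938951, -82.492018), (27.944691, -82.48961), (28.355484, -81.55574),
]


def generate_points(base, n_new, lat_range, lon_range, decimals=6, seed=None):
    """Return the unique base pairs followed by n_new new unique pairs.

    Hypothetical helper: new pairs are drawn from normal distributions fitted
    to the base data and rejected if out of range or already present.
    """
    rng = random.Random(seed)
    result = list(dict.fromkeys(base))  # drop duplicates, keep original order
    seen = set(result)

    lats = [lat for lat, _ in result]
    lons = [lon for _, lon in result]
    mean_lat, sd_lat = statistics.mean(lats), statistics.stdev(lats)
    mean_lon, sd_lon = statistics.mean(lons), statistics.stdev(lons)

    target = len(result) + n_new
    attempts_left = 100 * n_new + 1000  # assumed guard against endless retries
    while len(result) < target:
        attempts_left -= 1
        if attempts_left < 0:
            raise RuntimeError("could not find enough unique in-range pairs")
        lat = round(rng.gauss(mean_lat, sd_lat), decimals)
        lon = round(rng.gauss(mean_lon, sd_lon), decimals)
        if not lat_range[0] <= lat <= lat_range[1]:
            continue  # latitude outside the requested min/max
        if not lon_range[0] <= lon <= lon_range[1]:
            continue  # longitude outside the requested min/max
        if (lat, lon) in seen:
            continue  # pair already exists, so it would not be unique
        seen.add((lat, lon))
        result.append((lat, lon))
    return result


# Example call with assumed bounds that cover the sample data.
points = generate_points(base, 1000, (24.0, 31.0), (-88.0, -79.0), seed=42)
```

The set lookup keeps the uniqueness check cheap, but a pure-Python loop will be slow for millions of rows; a vectorized library would probably be a better fit at the 5 million scale.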
Write the finalData data frame to a CSV file, which you can then read from Scala or Python.
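On the Python side, reading that file back might look like the sketch below. It assumes the CSV has a header row with the Lat and Lon column names used in the R data frame; read_points is a hypothetical helper, and an in-memory string stands in for the real file so the example is self-contained.

```python
import csv
import io


def read_points(handle):
    """Parse (Lat, Lon) pairs from an open CSV file or file-like object."""
    reader = csv.DictReader(handle)
    return [(float(row["Lat"]), float(row["Lon"])) for row in reader]


# In-memory stand-in for the exported file (assumed layout, sample values
# taken from the question).
sample_csv = io.StringIO(
    '"Lat","Lon"\n'
    "25.696395,-80.297496\n"
    "25.699544,-80.297055\n"
    "27.944691,-82.48961\n"
)
points = read_points(sample_csv)

# Quick sanity check that every latitude/longitude pair is unique.
all_unique = len(points) == len(set(points))
```

With a real export, the same function can be given a file opened with open(path, newline=""), where path is wherever the CSV was written.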