Generating dummy webshop data in R: Incorporating parameters when randomly generating transactions

Question

For a course I am currently taking, I am trying to build a dummy transaction, customer & product dataset to showcase a machine learning use case in a webshop environment as well as a financial dashboard; unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am experiencing severe difficulties in realizing it.

The idea is that I specify some parameters/rules (arbitrary/fictitious, but applicable for a demonstration of a certain clustering algorithm). I'm basically trying to hide a pattern to then re-find this pattern utilizing machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, attempting to show how identifying different customer types could be used for targeted marketing purposes.

I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I attempted to do so by assigning the number of transactions per customer and other characteristics to normal distributions; I am completely open to potential other ways to do this?

This is how far I have come. First, I build a table of customers:

# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseeker")   # labels match Parameters$CustomerType below
PropCustTypes <- c(.10, .45, .30, .15)   # Probability of being in each group.

set.seed(1)   # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000), 
  CustomerType = sample(CustomerTypes, size=10000,
                                  replace=TRUE, prob=PropCustTypes),
  NumBought = rnorm(10000,3,2)   # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0   # Floor NumBought at 0 (negative draws become 0)

Next, generate a table of products to choose from:

Products <- data.frame(
  ID=(1:50),
  DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 10   # Cap ProductPrice at 10$
Products[Products$DateReleased<as.Date("2013-04-10"),]$DateReleased <- as.Date("2013-04-10")   # Cap Releasedate to 1 year ago 

Now I would like to generate n transactions (the number is in the customer table above), based on the following parameters for each variable that is currently relevant.

Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
    stringsAsFactors=FALSE)

Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4    Dealseeker            0.6             0.05          0.35         12     0.10

The idea is that 'EarlyAdopters' would have (on average, normally distributed) 10% of transactions with a label 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'; these values need to exclude each other: one cannot be obtained via both a PartnerBlog and via a Search Engine in the final dataset. The options are:

ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
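
One way to keep the channels mutually exclusive is to draw exactly one label per transaction with `sample()`, weighted by the customer type's row of probabilities. The following is a minimal sketch under that assumption, not part of the original post: `ChannelProbs` and `draw_channel` are hypothetical names, and the probabilities are copied from the Parameters table above.

```r
ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Channel probabilities per customer type, copied from the Parameters table.
ChannelProbs <- matrix(
  c(0.10, 0.60, 0.30,
    0.40, 0.30, 0.30,
    0.50, 0.15, 0.35,
    0.60, 0.05, 0.35),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
    ObtainedBy
  )
)

# Hypothetical helper: draw one channel per transaction for a given customer type.
draw_channel <- function(type, n) {
  sample(ObtainedBy, size = n, replace = TRUE, prob = ChannelProbs[type, ])
}

set.seed(1)
channels <- draw_channel("EarlyAdopter", 1000)
prop.table(table(channels))   # observed shares should be close to 0.10 / 0.60 / 0.30
```

Because each transaction receives a single draw, a row can never carry two channels at once; the shares only match the target probabilities on average.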

Furthermore, I'd like to generate a discount variable that is normally distributed utilizing the above means. For simplicity, standard deviations may be mean/5.
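
As an illustration of the rule above, the sketch below draws one discount per transaction from a normal distribution with the customer type's mean and a standard deviation of mean/5. `MeanDiscount` and `draw_discount` are hypothetical names, and flooring the result at 0 is an added assumption (a negative discount would make little sense); none of this comes from the original post.

```r
# Mean discount per customer type, copied from the Parameters table.
MeanDiscount <- c(EarlyAdopter = 0, Pragmatists = 0,
                  Conservatives = 0.05, Dealseeker = 0.10)

# Hypothetical helper: one discount per transaction, sd = mean / 5, floored at 0.
# For types with mean 0 the sd is also 0, so rnorm() returns exactly 0.
draw_discount <- function(types) {
  m <- unname(MeanDiscount[types])
  pmax(rnorm(length(types), mean = m, sd = m / 5), 0)
}

set.seed(1)
types <- rep(names(MeanDiscount), each = 500)
discount <- draw_discount(types)
tapply(discount, types, mean)   # average discount per customer type
```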

Next, the trickiest part: I'd like to generate these transactions according to a few rules:


  • Somewhat evenly distributed over days, maybe slightly more during the weekend;
  • Spread out between 2006-2014.
  • Spreading out the # of transactions of customers over the years;
  • Customers cannot buy products that haven't been released yet.
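
Two of these rules (weekend weighting and no purchases before release) can be sketched by sampling from a vector of candidate days with per-day weights. This is only one possible approach under stated assumptions: the weekend factor of 1.3 is made up, and `all_days`, `day_weight` and `draw_purchase_dates` are hypothetical names that do not appear in the original post.

```r
# All candidate purchase days in the 2006-2014 window.
all_days <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# Assumed weights: weekend days are 1.3 times as likely as weekdays.
# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday.
is_weekend <- as.POSIXlt(all_days)$wday %in% c(0, 6)
day_weight <- ifelse(is_weekend, 1.3, 1)

# Hypothetical helper: draw n purchase dates on or after a product's release date.
draw_purchase_dates <- function(n, released) {
  ok <- which(all_days >= released)
  all_days[sample(ok, size = n, replace = TRUE, prob = day_weight[ok])]
}

set.seed(1)
dates <- draw_purchase_dates(5000, released = as.Date("2013-04-10"))
range(dates)                                   # never earlier than the release date
mean(as.POSIXlt(dates)$wday %in% c(0, 6))      # share of weekend purchases
```

The per-type Timeliness average is not handled here; it could be layered on top, for example by making `day_weight` also depend on the time elapsed since release.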

Other parameters

YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <-  1 # Same question? Likely dependent on YearlyMax

The result for CustomerID 2 would be:

Transactions <- data.frame(
    ID        = c(1,2),
    CustomerID = c(2,2), # The customer that bought the item.
    ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
    DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
    ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
    GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
    Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.    

Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00

I'm getting more and more confident in writing R code, but I'm having difficulties writing the code to keep the global parameters (daily distributions of transactions, yearly maximum of # transactions per customer) as well as the various linkages in line:


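  • Timeliness: how quickly people buy a product after its release
  • ReferredBy: how did this customer arrive at my website?
  • How much discount the customer has received (to illustrate how sensitive they are to discounts)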

This causes me to not know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I'm eager to solve this problem by means of R. I'll keep this post updated as I progress.

My current pseudocode:

  • Assign customers to customer types with sample()
  • Generate Customers$NumBought transactions
  • ... still thinking?

EDIT: Generating the transactions table, now I 'just' need to fill it with the right data:

Tr <- data.frame(
  ID = 1:sum(Customers$NumBought),
  CustomerID = NA,
  DateOfPurchase = NA,
  ReferredBy = NA,
  GrossPrice=NA,
  Discount=NA)


Recommended answer

Very roughly, set up a database of days and the number of visits on each day:

days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000)  # expected visits per day
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)

Then catalogue the visits:

    visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
    visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
    visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])
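
To try the two snippets above end to end, the X parameters need concrete values. The sketch below plugs in made-up numbers purely for illustration: the total number of visits and the purchase rates are assumptions, the type weights reuse the shares from the question, and the daily rate is taken as total visits divided by the number of days, which is my reading of the intent.

```r
set.seed(1)
nDays <- 8000
XtotalNumberOfVisits <- 40000             # assumed total number of visits
XmyWeights    <- c(.10, .45, .30, .15)    # customer type shares from the question
XpurchaseRate <- c(0.5, 0.3, 0.2, 0.1)    # assumed mean purchases per visit, by type

# One row per day, with a Poisson number of visits around the expected daily rate.
days <- data.frame(day = 1:nDays, customerRate = XtotalNumberOfVisits / nDays)
days$nVisits <- rpois(nDays, days$customerRate)

# One row per visit, each with a customer type and a Poisson number of purchases.
visits <- data.frame(id  = seq_len(sum(days$nVisits)),
                     day = rep(days$day, times = days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace = TRUE, prob = XmyWeights)
visits$nPurchases   <- rpois(nrow(visits), XpurchaseRate[visits$customerType])

head(visits)
```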

Any of the variables with X in front of them are parameters of your process. You'd similarly go on to generate a transactions database by parametrising the relative likelihood amongst objects available, according to the other columns you have. Or you can generate a visits database including a key to each product available at that day:

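   productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
   visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
   visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
   # number of products released up to each day (assumes the first release is on day 1,
   # so that there is exactly one entry per row of days)
   days$productsAvailable <- rep(1:nrow(productRelease), times=diff(c(productRelease$releaseDay, nrow(days)+1)))
   # repeat each visit once per product available on its day
   visits <- visits[rep(1:nrow(visits), times=days$productsAvailable[visits$day]),]
   # within each visit, number the rows 1..k; this indexes the rows of productRelease
   visits$prodID <- ave(visits$id, visits$id, FUN=seq_along)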

You can then decide on a function that gives you, for each row, a probability of the customer purchasing that item (based on day, customer, product), and then fill in the purchase by `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.
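
As a hedged illustration of such a function, the sketch below makes the purchase probability peak when a product's age in months matches the customer type's Timeliness value from the question, and decay exponentially away from it. The functional form, the constants 0.5 and 6, the 30-day month, the toy `visits` table and the name `purchase_prob` are all assumptions made for this example.

```r
set.seed(1)
# Toy visit-product rows; in practice these would come from the expanded visits table.
visits <- data.frame(
  day          = sample(1:8000, 1000, replace = TRUE),
  customerType = sample(4, 1000, replace = TRUE, prob = c(.10, .45, .30, .15)),
  releaseDay   = sample(1:8000, 1000, replace = TRUE)
)
visits <- visits[visits$day >= visits$releaseDay, ]   # keep released products only

# Typical months between release and purchase, from the question's Timeliness column.
timeliness <- c(1, 6, 12, 12)

# Hypothetical purchase probability: at most 0.5, highest when the product's age in
# months equals the customer type's timeliness, decaying exponentially otherwise.
purchase_prob <- function(day, releaseDay, customerType) {
  ageMonths <- (day - releaseDay) / 30
  0.5 * exp(-abs(ageMonths - timeliness[customerType]) / 6)
}

XmyProbability <- with(visits, purchase_prob(day, releaseDay, customerType))
visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability
table(visits$customerType, visits$didTheyPurchase)
```

Comparing a uniform draw against the probability is equivalent to one Bernoulli trial per row, so each row is decided independently of the others.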

Sorry, there are probably typos littered throughout this as I was typing it straight in, but hopefully this gives you an idea.
