Just for fun: Produce real names by nationality using probability


Question





Hi,

I am producing some test data.

I need:

Forename,
Surname,
Nationality,
Gender,
DoB,
Email Address,

I have lists of popular forenames by gender and nationality and surnames by nationality. I can produce all the details from a randomized combination of forename and surname.

What I would like to do is use the 'Frequency' column I have alongside my list of names.
How should I store the frequency with each name and use it to weight my random person generator?

Here is some code

private static Dictionary<string, double> _enMForenames = new Dictionary<string, double>
{
    {"Oliver", 1.939}, // about 499 more
};
private static Dictionary<string, double> _enFForenames = new Dictionary<string, double>
{
    {"Amelia", 1.638}, // about 499 more
};
private static Dictionary<string, double> _enSurnames = new Dictionary<string, double>
{
    {"SMITH", 0.0074062771845516}, // about 30k more
};





Thanks ^_^


As the title suggests. I don't have to use the frequency but it sounds like fun (well, to me anyway :)).


Edit: I have created my own dictionary for producing a weighted random type.
Will this work?

private class WeightedRandomLists<T> : Dictionary<T, double>
        {
            static Random rand = new Random((int)(DateTime.Now.Ticks % int.MaxValue));

            private static Dictionary<T, double> _totals;


            public T NextRandomItem
            {
                 get
                {
                    if (!this.Any())
                        return default(T);

                    if (_totals == null)
                    {

                        double total = Values.Sum();

                        double runningTotal = 0.0;
                        _totals = this.Select(kvp =>
                        {
                            runningTotal += (kvp.Value/total)/100; //because percent
                            return new KeyValuePair<T, double>(kvp.Key, runningTotal);
                        }).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
                    }

                    var random = rand.NextDouble();
                    var index = Array.BinarySearch(_totals.Values.ToArray(), random);
                    return _totals.Keys.Cast<T>().ToArray()[index];
                }
            }
        }

Recommended answer

I would store probabilities for each name:
var names = new [] {"Adam", "Barbara", "Cecilia", "Daniel", "Emily", "Felix"};
var probabilities = new [] {0.15, 0.2, 0.1, 0.35, 0.15, 0.05};



System.Random.NextDouble() returns a random value x, uniformly distributed on the interval 0 <= x < 1, so we can transform it to our custom distribution:

// running sum of probabilities
var thresholds = new [] {0.15, 0.35, 0.45, 0.8, 0.95, 1.0};

var random = new Random(); // create once and reuse for every draw
var randomValue = random.NextDouble();
var index = Array.BinarySearch(thresholds, randomValue);

if(index < 0)
{
    index = ~index;
}

var name = names[index];



Array.BinarySearch returns the index of the item if it is found; otherwise it returns the bitwise complement of the index of the next larger item (or of the array length if no element is larger).
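
With hundreds of names the thresholds would not be typed by hand. As a rough sketch only, the running sum could be computed from a name-to-weight dictionary like the ones in the question. The class name WeightedPicker and its members are hypothetical, not part of any library. Unlike the code in the question, it divides each weight by the total only (so the last threshold is 1.0), handles the negative result of Array.BinarySearch, and keeps the names in an array instead of relying on dictionary enumeration order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not from the original posts): returns items
// with a probability proportional to their weight.
public sealed class WeightedPicker<T>
{
    private readonly T[] _items;
    private readonly double[] _thresholds; // running sum of normalized weights
    private readonly Random _random;

    public WeightedPicker(IEnumerable<KeyValuePair<T, double>> weights, Random random = null)
    {
        // Assumption: entries with a zero or negative weight are simply ignored.
        var pairs = weights.Where(p => p.Value > 0).ToArray();
        if (pairs.Length == 0)
            throw new ArgumentException("At least one positive weight is required.", nameof(weights));

        double total = pairs.Sum(p => p.Value);
        _items = pairs.Select(p => p.Key).ToArray();
        _thresholds = new double[pairs.Length];

        double running = 0.0;
        for (int i = 0; i < pairs.Length; i++)
        {
            running += pairs[i].Value / total; // normalize so the sum is 1
            _thresholds[i] = running;
        }
        _thresholds[pairs.Length - 1] = 1.0; // guard against floating-point drift

        _random = random ?? new Random();
    }

    public T Next()
    {
        int index = Array.BinarySearch(_thresholds, _random.NextDouble());
        if (index < 0)
            index = ~index; // complement = index of the first larger threshold
        return _items[index];
    }
}
```

Under these assumptions, one picker per list (for example new WeightedPicker<string>(_enMForenames)) could be built once and Next() called for every generated person. The weights do not need to be percentages or sum to 1, because they are normalized in the constructor.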


One of the best candidates for storing just the frequency would be something like
System.Collections.Generic.Dictionary<string, uint>, where the first generic parameter is the name and the second is the frequency. (Not really a "frequency", just the number of occurrences, but this is what you really need.) You may also need to make the second parameter a class containing the frequency and/or other data (a reference type is more convenient here than a value-type struct, because an instance stored in the dictionary can be updated in place).

This class gives you O(1) time complexity for lookup by name. Since you need to update the frequency on each repeated name, this is what you want.
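
A minimal sketch of that counting step, assuming the input is a raw list of names with repeats (the post does not say where the data comes from). NameCounter and CountOccurrences are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NameCounter
{
    // Hypothetical helper: tallies how many times each name occurs.
    public static Dictionary<string, uint> CountOccurrences(IEnumerable<string> names)
    {
        var counts = new Dictionary<string, uint>();
        foreach (var name in names)
        {
            // TryGetValue leaves 'current' at 0 when the name is not present yet.
            counts.TryGetValue(name, out uint current);
            counts[name] = current + 1;
        }
        return counts;
    }
}
```

Each update is a hash lookup followed by a write, so the tally stays cheap even for a surname list with tens of thousands of entries.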

There is a lot that is unclear in your post, first of all where you get the input data, but this is not part of your question. ;-)

—SA

