Just for fun (part II): Produce real names by nationality using probability


Problem description






Hi,

See "Just for fun: Produce real names by nationality using probability" for full details leading up to this question.

I am trying to create an accurate (ish) random name generator. I will be using it to performance test my new system. I would like to weight the results of the generator by name popularity.

So far, I have three collections: Male UK Forenames, Female UK Forenames and UK Surnames. Each has a percentage of popularity:

private static Dictionary<string, double> _enMForenames = new Dictionary<string, double>
        {
            {"Oliver",1.939}, //about 499 more
        };
private static Dictionary<string, double> _enFForenames = new Dictionary<string, double>
        {
            {"Amelia",1.638}, //about 499 more
        };
private static Dictionary<string,double> _enSurnames = new Dictionary<string, double>
        {
            {"SMITH",0.0074062771845516}, //about 30k more
        };




Note that these collections are not complete (god help me if they were :S), so the percentages of all the items do not sum to 100% (I think the male names cover about 68% of the total?).

This is what I have so far for generating one person:

private static Person CreatePerson()
{
    Person newPerson = new Person();

    // Random.Next's upper bound is exclusive, so Next(0, 2) yields 0 or 1.
    bool male = rand.Next(0, 2) == 1;

    Dictionary<string, double> forenames;

    if (male)
    {
        forenames = _enMForenames;
        newPerson.Gender = rand.Next(0, 2) == 1 ? "M" : "Male";
    }
    else
    {
        forenames = _enFForenames;
        newPerson.Gender = rand.Next(0, 2) == 1 ? "F" : "Female";
    }

    //This is just rough atm but here is where I need to weight my selection
    newPerson.Forename = forenames.Keys.ElementAt(rand.Next(0, forenames.Count));

    //Surname will be done in a similar way here

    newPerson.Email = string.Format("{0}.{1}@{2}", newPerson.Forename, newPerson.Surname, "test.test");

    // Keep the offset between 1900-01-01 and today; an arbitrary int would overflow DateTime.
    DateTime earliest = new DateTime(1900, 1, 1);
    int days = rand.Next(0, (DateTime.Today - earliest).Days + 1);

    newPerson.DateOfBirth = earliest.AddDays(days);

    return newPerson;
}




So:
How do I use the percent to weight the random name selection?
and
Can I do this more efficiently for large sets of names (2 million)?

Thanks

Andy



Edit: these are my resources:
Forenames
http://www.behindthename.com/top/lists/england-wales/2013
Surnames
http://en.geneanet.org/genealogy/1/Surname.php

Solution

Based on Tomas's answer to your previous question, something like this should work:

public sealed class RandomNameGenerator
{
    private readonly string[] _names;
    private readonly double[] _thresholds;

    public RandomNameGenerator(IDictionary<string, double> source)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (source.Count == 0) throw new ArgumentException("No names specified.", "source");

        _names = new string[source.Count];
        _thresholds = new double[source.Count];

        int index = 0;
        double runningTotal = 0D;
        double totalWeight = source.Values.Sum();

        foreach (KeyValuePair<string, double> pair in source)
        {
            runningTotal += pair.Value;
            _thresholds[index] = runningTotal / totalWeight;
            _names[index] = pair.Key;
            index++;
        }
    }

    public string GetRandomName(Random random)
    {
        if (random == null) throw new ArgumentNullException("random");

        double n = random.NextDouble();
        int index = Array.BinarySearch(_thresholds, n);
        if (index < 0) index = ~index;

        return _names[index];
    }
}



Using the class should be fairly simple:

var names = new RandomNameGenerator(new Dictionary<string, double>
{
    { "Smith", 1.39 },
    { "Jones", 0.7 },
    { "Adams", 42 }
});

var random = new Random();
var result = Enumerable.Range(0, 10000)
    .Select(_ => names.GetRandomName(random))
    .GroupBy(n => n, (key, items) => new { Name = key, Count = items.Count() }, StringComparer.OrdinalIgnoreCase)
    .Dump(); // LINQPad extension method - use your own preferred display method.

/*
Computed thresholds:
Smith: 0.03152642
Jones: 0.04740304
Adams: 1

Output should be something similar to:
Smith: ~315
Jones: ~159
Adams: ~9526
*/


Let's assume you start by generating a uniformly distributed random floating-point number in the range 0 to 1. You need to convert it to your non-uniform (weighted) distribution. The weighted distribution is still driven by the uniform one; each name simply gets a range of a different width on the distribution mapping function. This function is very simple: you need an array storing ranges and names, and you search this array for the range that contains the random value.

This is the definition of the array element:

class NameDescriptor {
    internal NameDescriptor(string name, double low, double high) {
        this.Low = low;
        this.High = high;
        this.Name = name;
    }
    internal double Low { get; private set; }
    internal double High { get; private set; }
    internal string Name { get; private set; }
} //class NameDescriptor


Create an array of these elements, one per name. To populate the array, accumulate your probabilities into consecutive ranges within 0..1.
Say you have percentage values per name of 1.5, 3, 2… Then the Low and High values should be

0 to 0.015 (1.5/100)
0.015 to 0.045 (1.5/100 + 3/100)
0.045 to 0.065 (1.5/100 + 3/100 + 2/100)
...
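The accumulation above could be sketched roughly as follows. This is only an illustration under a few assumptions: `NameRange` and `NameRangeBuilder` are hypothetical names (a public stand-in for `NameDescriptor`), and the weights are divided by their own sum instead of by 100, because the question says the percentages do not add up to 100%.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical public stand-in for the NameDescriptor class above.
public sealed class NameRange
{
    public NameRange(string name, double low, double high)
    {
        Name = name;
        Low = low;
        High = high;
    }

    public string Name { get; private set; }
    public double Low { get; private set; }   // inclusive lower bound
    public double High { get; private set; }  // exclusive upper bound
}

public static class NameRangeBuilder
{
    // Turns (name, weight) pairs into consecutive ranges [Low, High) that cover 0..1.
    // Assumption: weights are normalized by their sum, so they need not total 100.
    public static NameRange[] Build(IEnumerable<KeyValuePair<string, double>> weights)
    {
        if (weights == null) throw new ArgumentNullException("weights");

        List<KeyValuePair<string, double>> list = weights.ToList();
        if (list.Count == 0) throw new ArgumentException("No names specified.", "weights");

        double total = list.Sum(p => p.Value);
        if (total <= 0) throw new ArgumentException("Weights must sum to a positive value.", "weights");

        var result = new NameRange[list.Count];
        double running = 0;
        for (int i = 0; i < list.Count; i++)
        {
            double low = running / total;      // previous range's upper bound
            running += list[i].Value;
            result[i] = new NameRange(list[i].Key, low, running / total);
        }
        return result;
    }
}
```

With weights 1.5, 3 and 2 the ranges come out proportional to those values (they only match 0.015, 0.045 and 0.065 exactly when the full list sums to 100).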



Now, your uniformly distributed value 0 to 1 will fall into one of these ranges. Find the instance of NameDescriptor in the array by this criterion. I did not devise the exact algorithm, but of course it should not be a slow linear search. The simplest choice is the pretty fast (but not the fastest possible) divide-by-two algorithm, that is, a binary search. It is fast because all your ranges are ordered. Roughly speaking, you start with the middle array index and check whether the value fits in that range. If it does not, determine whether the suitable range is to the left or to the right of the element you tried. This way, you halve the search space at each step. And so on…

When the array element is found, output its Name.
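A minimal sketch of that divide-by-two search, assuming the upper bounds (the High values) of the ranges have been copied into a sorted `double[]`. The class and method names here are made up for illustration; the built-in `Array.BinarySearch` could be used instead.

```csharp
using System;

public static class RangeSearch
{
    // upperBounds[i] is the exclusive upper bound of range i, in ascending order,
    // with the last entry expected to be 1.0. Returns the index of the range
    // that contains value, assuming 0 <= value < 1.
    public static int FindRangeIndex(double[] upperBounds, double value)
    {
        if (upperBounds == null) throw new ArgumentNullException("upperBounds");
        if (upperBounds.Length == 0) throw new ArgumentException("No ranges specified.", "upperBounds");

        int lo = 0;
        int hi = upperBounds.Length - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (value < upperBounds[mid])
                hi = mid;        // the match is range mid or somewhere to its left
            else
                lo = mid + 1;    // the match is strictly to the right of mid
        }
        return lo;
    }
}
```

Each iteration halves the remaining candidates, so a lookup over 30k surnames needs about 15 comparisons.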

—SA


I haven't tested the other proposed solutions, but I could imagine that they take "some time" when generating 2 million random names.

This solution should be a lot faster in exchange for consuming a lot more memory. Maybe it's necessary to sacrifice some accuracy in order to make it work memory-wise, but I assume that will not be necessary for your corpus of names.

1) Sum the weights of all names (e.g. weight = 859017 for "Smith"). Let's call this sum S.

2) Allocate an array with a length of S. Depending on the number of names, it should be an array of UInt16 or Int32.

You might need this in your app.config to allow objects larger than 2GB:

<runtime>
  <gcAllowVeryLargeObjects enabled="true" />
</runtime>


If you run into an OutOfMemoryException, divide each name's weight by an arbitrary number (e.g. 2). If the new weight would be 0, give it a value of 1 (theoretically; surely not necessary for your data). Start over with step 1. (This is where the loss of accuracy comes from.)
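The rescaling could look something like this. It is only a sketch: `WeightScaler` and `ScaleDown` are hypothetical names, and integer weights (raw occurrence counts) are assumed.

```csharp
using System;

public static class WeightScaler
{
    // Divides every weight by divisor using integer division and
    // clamps the result to at least 1 so that no name disappears.
    public static int[] ScaleDown(int[] weights, int divisor)
    {
        if (weights == null) throw new ArgumentNullException("weights");
        if (divisor <= 0) throw new ArgumentOutOfRangeException("divisor");

        var scaled = new int[weights.Length];
        for (int i = 0; i < weights.Length; i++)
        {
            scaled[i] = Math.Max(1, weights[i] / divisor);
        }
        return scaled;
    }
}
```

Clamping to 1 slightly over-represents very rare names, which is part of the accuracy loss mentioned above.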

3) Initialize that array (pseudo-code):

offset = 0
for nameIdx = 0 to names.length - 1
   for i = 0 to names[nameIdx].weight - 1
      array[offset + i] = nameIdx
   endfor
   offset += names[nameIdx].weight
endfor


4) Generate a random integer rnd with 0 <= rnd < S. The random name is then: names[array[rnd]]
Repeat step 4 as often as you want.
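For what it's worth, steps 1 to 4 might be put together in C# roughly like this. Treat it as a sketch under assumptions: `LookupTableNamePicker` is a made-up name, the weights are positive integers, zero-based indexing is used when filling the table, and the total is assumed to fit in an `int` (scale the weights down first if it does not).

```csharp
using System;

public sealed class LookupTableNamePicker
{
    private readonly string[] _names;
    private readonly int[] _table;   // one slot per unit of weight, holding a name index

    public LookupTableNamePicker(string[] names, int[] weights)
    {
        if (names == null) throw new ArgumentNullException("names");
        if (weights == null) throw new ArgumentNullException("weights");
        if (names.Length == 0 || names.Length != weights.Length)
            throw new ArgumentException("Need one positive weight per name.");

        // Step 1: sum the weights.
        long total = 0;
        foreach (int w in weights)
        {
            if (w <= 0) throw new ArgumentException("Weights must be positive.", "weights");
            total += w;
        }
        if (total > int.MaxValue)
            throw new ArgumentException("Total weight too large; scale the weights down first.", "weights");

        // Steps 2 and 3: allocate the table and fill it with name indexes.
        _names = names;
        _table = new int[(int)total];
        int offset = 0;
        for (int nameIdx = 0; nameIdx < weights.Length; nameIdx++)
        {
            for (int i = 0; i < weights[nameIdx]; i++)
            {
                _table[offset + i] = nameIdx;
            }
            offset += weights[nameIdx];
        }
    }

    public int TableLength
    {
        get { return _table.Length; }
    }

    // Step 4: Random.Next(n) returns a value in 0..n-1, so every slot is reachable.
    public string Pick(Random random)
    {
        if (random == null) throw new ArgumentNullException("random");
        return _names[_table[random.Next(_table.Length)]];
    }
}
```

Each pick is a single array read, so lookup cost does not depend on the number of names; the price is one table slot per unit of weight.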

