Run the perceptron algorithm on a hash map feature vector: Java


Problem description

I have the following code. It reads many files from a directory into a hash map; this is my feature vector. It's somewhat naive in the sense that it does no stemming, but that's not my primary concern right now. I want to know how I can use this data structure as the input to the perceptron algorithm. I guess we call this a bag of words, don't we?

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BagOfWords
{
    static Map<String, Integer> bag_of_words = new HashMap<>();

    public static void main(String[] args) throws IOException
    {
        String path = "/home/flavius/atheism";
        File file = new File(path);
        new BagOfWords().iterateDirectory(file);

        for (Map.Entry<String, Integer> entry : bag_of_words.entrySet())
        {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    private void iterateDirectory(File file) throws IOException
    {
        for (File f : file.listFiles())
        {
            if (f.isDirectory())
            {
                // recurse into the subdirectory itself, not the parent
                // (recursing on `file` here would loop forever)
                iterateDirectory(f);
            }
            else
            {
                // try-with-resources closes the reader even on an exception
                try (BufferedReader br = new BufferedReader(new FileReader(f)))
                {
                    String line;
                    while ((line = br.readLine()) != null)
                    {
                        String[] words = line.split(" "); // those are your words

                        for (String word : words)
                        {
                            // increment the count, starting from 0 for unseen words
                            bag_of_words.merge(word, 1, Integer::sum);
                        }
                    }
                }
            }
        }
    }
}

You can see that the path goes to a directory called 'atheism'; there's also one called 'sports'. I want to try to linearly separate these two classes of documents, and then try to separate the unseen test docs into either category.

How to do that? How should I conceptualize it? I'd appreciate a solid reference, a comprehensive explanation, or some kind of pseudocode.

I've not found many informative and lucid references on the web.

Recommended answer

Let's establish some vocabulary up front (I guess you are using the 20 Newsgroups dataset):


  • "Class label" is what you're trying to predict; in your binary case this is "atheism" vs. the rest
  • "Feature vector" is what you input to your classifier
  • "Document" is a single e-mail from the dataset
  • "Token" is a fraction of a document, usually a unigram/bigram/trigram
  • "Dictionary" is the set of "allowed" words for your vector

So the vectorization algorithm for bag of words usually follows these steps:


  1. Go over all the documents (across all class labels) and collect all the tokens; this is your dictionary, and its size is the dimensionality of your feature vector
  2. Go over all the documents again, and for each one:

    1. Create a new feature vector with the dimensionality of your dictionary (e.g. 200, for 200 entries in that dictionary)
    2. Go over all the tokens in that document and set the word count of each word (in this document) at that dimension of the feature vector

  • You now have a list of feature vectors that you can feed into your algorithm (a sketch follows the example below)

  • Example:

    Document 1 = ["I", "am", "awesome"]
    Document 2 = ["I", "am", "great", "great"]
    

    The dictionary would be:

    ["I", "am", "awesome", "great"]
    

    So the documents as vectors would look like:

    Document 1 = [1, 1, 1, 0]
    Document 2 = [1, 1, 0, 2]
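
A minimal sketch of those two passes in Java, assuming each document has already been tokenized into a List<String>; the method names buildDictionary and vectorize are mine, chosen for illustration:

import java.util.*;

public class Vectorizer
{
    // Pass 1: collect every token across all documents into an ordered dictionary.
    static List<String> buildDictionary(List<List<String>> documents)
    {
        Set<String> tokens = new LinkedHashSet<>(); // keeps first-seen order
        for (List<String> doc : documents)
        {
            tokens.addAll(doc);
        }
        return new ArrayList<>(tokens);
    }

    // Pass 2: one count per dictionary entry, in dictionary order.
    static int[] vectorize(List<String> document, List<String> dictionary)
    {
        int[] vector = new int[dictionary.size()];
        for (int i = 0; i < dictionary.size(); i++)
        {
            vector[i] = Collections.frequency(document, dictionary.get(i));
        }
        return vector;
    }

    public static void main(String[] args)
    {
        List<List<String>> docs = List.of(
            List.of("I", "am", "awesome"),
            List.of("I", "am", "great", "great"));

        List<String> dictionary = buildDictionary(docs); // [I, am, awesome, great]
        for (List<String> doc : docs)
        {
            System.out.println(Arrays.toString(vectorize(doc, dictionary)));
        }
        // prints [1, 1, 1, 0] and [1, 1, 0, 2], matching the example above
    }
}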
    

    And with that you can do all kinds of fancy math stuff and feed this into your perceptron.
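
To show how the perceptron itself could consume those vectors, here is a minimal sketch of the classic perceptron update rule. It assumes labels of +1 for 'atheism' and -1 for 'sports'; the learning rate of 1.0 and the epoch cap are illustrative choices, not part of the original answer:

import java.util.List;

public class Perceptron
{
    // One weight per dictionary entry, plus a bias term.
    double[] weights;
    double bias = 0.0;
    double learningRate = 1.0;

    Perceptron(int dimensionality)
    {
        weights = new double[dimensionality];
    }

    // Predict +1 ("atheism") or -1 ("sports") from a bag-of-words count vector.
    int predict(int[] features)
    {
        double activation = bias;
        for (int i = 0; i < weights.length; i++)
        {
            activation += weights[i] * features[i];
        }
        return activation >= 0 ? 1 : -1;
    }

    // Standard perceptron update rule: on a mistake, nudge the weights
    // towards the correct label. Repeat over the training set until no
    // mistakes are made (or a maximum number of epochs is reached).
    void train(List<int[]> vectors, List<Integer> labels, int maxEpochs)
    {
        for (int epoch = 0; epoch < maxEpochs; epoch++)
        {
            int mistakes = 0;
            for (int n = 0; n < vectors.size(); n++)
            {
                int[] x = vectors.get(n);
                int y = labels.get(n);
                if (predict(x) != y)
                {
                    for (int i = 0; i < weights.length; i++)
                    {
                        weights[i] += learningRate * y * x[i];
                    }
                    bias += learningRate * y;
                    mistakes++;
                }
            }
            if (mistakes == 0) break; // converged: the training data is linearly separated
        }
    }
}

An unseen test document would then be vectorized against the same dictionary (tokens not in the dictionary are simply ignored) and classified with predict.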
