Interesting NLP/machine-learning style project -- analyzing privacy policies


Question

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they take the user's location? Do they share or sell data with third parties?

I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:

First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies contain the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy includes taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.

The idea would be to keep developing these cues and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
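To make the cue idea concrete, here is a minimal sketch of how cue evidence could be combined the way a naive Bayes spam filter does, by adding log-likelihood ratios. The cue phrases, the probabilities, and the prior are all invented for illustration; real values would have to be estimated from hand-labeled policies. The function name `posterior` is hypothetical too.

```python
import math

# Hypothetical cues for the label "collects location". Each phrase maps to
# (P(phrase present | label true), P(phrase present | label false)).
# These numbers are made up; they are NOT estimated from real policies.
LOCATION_CUES = {
    "we will take your location": (0.30, 0.001),
    "location": (0.90, 0.40),
    "gps": (0.50, 0.05),
}

def posterior(text, cues, prior=0.5):
    """Combine cue evidence naive-Bayes style and return P(label | text)."""
    text = text.lower()
    log_odds = math.log(prior / (1.0 - prior))
    for phrase, (p_true, p_false) in cues.items():
        if phrase in text:
            # Cue present: add its log-likelihood ratio.
            log_odds += math.log(p_true / p_false)
        else:
            # Cue absent: absence is evidence as well.
            log_odds += math.log((1.0 - p_true) / (1.0 - p_false))
    return 1.0 / (1.0 + math.exp(-log_odds))

strong = posterior("We will take your location when you use the app.", LOCATION_CUES)
weak = posterior("Our office location is listed on the contact page.", LOCATION_CUES)
none = posterior("We only store your email address.", LOCATION_CUES)
```

Note that the cues here overlap (the long phrase contains the word "location"), which violates the independence assumption behind naive Bayes. Spam filters tolerate this in practice, but it means the resulting numbers are rankings more than calibrated probabilities.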

I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.

Recommended answer

> The idea would be to keep developing these cues and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.

This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents, typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
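As an illustration of the feature step, the sketch below computes word and bigram counts weighted by a plain tf-idf (count times log(N / document frequency)). The helper names `ngrams` and `tfidf`, the three toy documents, and their labels are assumptions made up for this example. Libraries typically use smoothed or normalized variants of this formula.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=2):
    """Return lowercase word n-grams of length 1..n_max found in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def tfidf(docs, n_max=2):
    """Map each document to {ngram: count * log(N / document frequency)}."""
    counts = [Counter(ngrams(doc, n_max)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{term: tf * math.log(n_docs / doc_freq[term])
             for term, tf in c.items()}
            for c in counts]

# Toy, hand-labeled "policies" (invented for illustration).
docs = [
    "We collect your location data.",
    "We share your data with third parties.",
    "We collect your email address.",
]
labels = [{"location"}, {"third_party_sharing"}, set()]

weights = tfidf(docs)
```

Terms that occur in every document (here "we" and "your") get weight zero, while terms specific to one document, such as "location", get the highest weight. That is the reason tf-idf tends to surface the kind of cues described in the question without anyone listing them by hand.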

Popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction, which trains one binary classifier per label.
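A minimal sketch of this setup, assuming scikit-learn as the toolkit (the answer does not name a library; this is one choice among several). The six example policies and the two label names are invented for illustration, and a real project would need far more labeled documents plus a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy, hand-labeled training data (made up for this example).
policies = [
    "We collect your precise location through GPS.",
    "We share your personal information with third parties.",
    "We collect your location and share it with advertising partners "
    "and other third parties.",
    "We store only your email address and never sell it.",
    "Your location is used to show nearby stores.",
    "We may sell your data to third parties.",
]
labels = [
    ("location",),
    ("third_party_sharing",),
    ("location", "third_party_sharing"),
    (),
    ("location",),
    ("third_party_sharing",),
]

# Turn the label sets into a 0/1 indicator matrix, one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# tf-idf weighted unigrams and bigrams, then one linear SVM per label (OvR).
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(policies, Y)

new_policy = ["We use GPS to determine your location."]
predicted = binarizer.inverse_transform(model.predict(new_policy))
```

Swapping `LinearSVC()` for `MultinomialNB()` from `sklearn.naive_bayes` would give the naive Bayes variant with the same pipeline, which makes it easy to compare the two on the same labeled set.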
