Interesting NLP/machine-learning style project: analyzing privacy policies

Question

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they collect the user's location? Do they share data with, or sell it to, third parties? And so on.

I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:

First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy includes taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
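One way to make the cue idea concrete is sketched below. It is only an illustration: the cue table and its probabilities are the made-up numbers from the paragraph above, and combining cues with a noisy-OR (one minus the product of the "miss" probabilities) is an assumption of this sketch, not something the plan specifies.

```python
# Hypothetical cue table: phrase -> assumed probability that the phrase alone
# indicates the policy collects the user's location (values are illustrative).
LOCATION_CUES = {
    "we will take your location": 1.0,
    "location": 0.25,
}

def cue_confidence(policy_text, cues):
    """Combine every matching cue with a noisy-OR: 1 - prod(1 - p_i)."""
    text = policy_text.lower()
    miss = 1.0  # probability that none of the matching cues is right
    for phrase, p in cues.items():
        if phrase in text:
            miss *= 1.0 - p
    return 1.0 - miss

print(cue_confidence("We will take your location.", LOCATION_CUES))           # 1.0
print(cue_confidence("Our office location is listed below.", LOCATION_CUES))  # 0.25
print(cue_confidence("We use cookies.", LOCATION_CUES))                       # 0.0
```

The second call shows the weakness of hand-assigned weights: "office location" has nothing to do with the user's location, yet it still earns 25%.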

The idea would be to keep developing these cues and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
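To illustrate the spam-filter analogy, here is a minimal sketch of a naive Bayes filter for a single characteristic ("collects location"), in which the cue weights are estimated from labeled examples rather than set by hand. The four training sentences, the function names, and the use of add-one smoothing are all assumptions made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_nb(docs, labels):
    """docs: list of strings; labels: list of bools (has the characteristic?)."""
    counts = {True: Counter(), False: Counter()}
    for doc, label in zip(docs, labels):
        counts[label].update(tokenize(doc))
    vocab = set(counts[True]) | set(counts[False])
    return counts, Counter(labels), vocab

def log_odds(model, text):
    """log P(True | text) - log P(False | text), with add-one smoothing."""
    counts, n_docs, vocab = model
    total_true = sum(counts[True].values()) + len(vocab)
    total_false = sum(counts[False].values()) + len(vocab)
    score = math.log(n_docs[True] / n_docs[False])  # class prior
    for word in tokenize(text):
        if word in vocab:  # words never seen in training are ignored
            p_true = (counts[True][word] + 1) / total_true
            p_false = (counts[False][word] + 1) / total_false
            score += math.log(p_true / p_false)
    return score

# Made-up training sentences, purely for illustration.
docs = [
    "We collect your precise location from your device.",
    "Location data is gathered when you use the app.",
    "We use cookies to remember your preferences.",
    "Your email address is used to send receipts.",
]
labels = [True, True, False, False]
model = train_nb(docs, labels)

print(log_odds(model, "We may collect location data") > 0)    # True
print(log_odds(model, "Cookies store your preferences") > 0)  # False
```

A positive score means the text looks more like the "collects location" examples than the others; each word's log ratio plays the role of a learned cue weight.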

I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.

Recommended answer

You wrote: "The idea would be to keep developing these cues and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited."

This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents, typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
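As a rough sketch of the feature-extraction step, the snippet below turns each document into word and bigram counts weighted by tf-idf. It uses one common tf-idf variant (raw count times log(N / document frequency)); the example sentences and function names are invented for illustration. Libraries such as scikit-learn ship ready-made versions of this (for example `TfidfVectorizer`), which would normally be preferable to hand-rolled code.

```python
import math
import re
from collections import Counter

def features(text):
    """Unigrams plus bigrams of the lower-cased text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    weight = term count in the document * log(N / number of documents
    containing the term). Terms present in every document get weight 0.
    """
    per_doc = [Counter(features(d)) for d in docs]
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in counts.items()}
            for counts in per_doc]

# Made-up policy sentences, purely for illustration.
docs = [
    "We share your data with third parties.",
    "We never share your data.",
    "We collect your location.",
]
vectors = tfidf_vectors(docs)
print(vectors[0]["we"])             # 0.0: "we" occurs in every document
print(vectors[0]["third parties"])  # log(3): occurs only in the first document
```

The resulting vectors are what a classifier would be trained on, together with the manually assigned labels.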

The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
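The one-vs.-rest construction can be sketched as follows: train one independent binary classifier per label, then report every label whose classifier fires. To keep the example dependency-free, the binary learner here is a simple perceptron over bag-of-words tokens, swapped in for naive Bayes or a linear SVM; the documents, label names, and function names are made up. In practice a library implementation (scikit-learn has `OneVsRestClassifier`, for instance) would be the more usual choice.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_perceptron(docs, targets, epochs=20):
    """Binary linear classifier on word presence; targets are booleans."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for doc, target in zip(docs, targets):
            tokens = tokenize(doc)
            predicted = sum(weights[t] for t in tokens) + bias > 0
            if predicted != target:  # update only on mistakes
                step = 1.0 if target else -1.0
                for t in tokens:
                    weights[t] += step
                bias += step
    return weights, bias

def train_ovr(docs, label_sets):
    """One binary model per label: 'has this label' vs. the rest."""
    all_labels = sorted(set().union(*label_sets))
    return {label: train_perceptron(docs, [label in s for s in label_sets])
            for label in all_labels}

def predict_ovr(models, text):
    tokens = tokenize(text)
    return {label for label, (w, b) in models.items()
            if sum(w.get(t, 0.0) for t in tokens) + b > 0}

# Made-up labeled policies, purely for illustration.
docs = [
    "We collect your location and share it with third parties.",
    "We collect your location to show nearby stores.",
    "We share your email with third parties.",
    "We use cookies to remember your preferences.",
]
label_sets = [{"location", "third_party"}, {"location"}, {"third_party"}, set()]
models = train_ovr(docs, label_sets)

print(sorted(predict_ovr(models, "We share your location with third parties.")))
# ['location', 'third_party']
```

Because each label gets its own classifier, a policy can receive zero, one, or several labels, which is what the multilabel setting requires.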
