How to use BERT for long sentences?

Problem description

I am trying to classify a given text as news, clickbait, or other. The texts I have for training are long; the distribution of lengths is shown here. Now, the question is: should I trim the text in the middle and make it 512 tokens long? But I even have documents of around 10,000 words, so won't I lose the gist by truncating? Or should I split my text into sub-texts of length 512? If so, a sub-text of one document may be similar to a sub-text of another, but the labels will be different. Doesn't that make the data noisy? Or should I just use bidirectional LSTMs here and pad to max_len?

Recommended answer

The answer to a question similar to yours can be found in the paper linked here.

Why do you think chunks of the same document will have different labels if you're talking about classification of texts as news or clickbait? You can chunk the text and follow the truncation approach proposed in How to Fine-Tune BERT for Text Classification?. The authors show that head+tail truncation delivers high accuracy. I used it several times thanks to the GitHub page and documentation, and got good results.

You can choose the truncation method with the flag --trunc_medium, with the following options (a minimal sketch of the head+tail idea follows the list):

  1. -2 means head-only (keep the first 512 tokens),
  2. -1 means tail-only (keep the last 512 tokens),
  3. 0 means head-half + tail-half (e.g., head256 + tail256),
  4. any other natural number k means head-k + tail-(512-k) (i.e., keep the first k and the last 512-k tokens).
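
The flag itself belongs to the linked repository. As a standalone illustration of the head-k + tail-(512-k) idea (option 4), a minimal sketch using the Hugging Face transformers tokenizer might look like this; the model name and head length are placeholders, not values from the repo:

```python
# Minimal sketch of head-k + tail-(512-k) truncation; this is an
# independent illustration, not the linked repository's code.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def head_tail_truncate(text, max_len=512, head=128):
    """Keep the first `head` and the last `max_len - head - 2` tokens;
    two slots are reserved for [CLS] and [SEP]."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    budget = max_len - 2                      # room for [CLS]/[SEP]
    if len(ids) > budget:
        ids = ids[:head] + ids[-(budget - head):]
    return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
```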

Then you can pool the results for the chunks, creating pooled embeddings for the long texts you have.
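
One way to realize this pooling is sketched below: mean-pooling the per-chunk [CLS] vectors. The chunk size, model name, and choice of mean-pooling are assumptions for illustration, not prescribed by the paper:

```python
# Sketch of chunk-and-pool: encode each ~510-token chunk separately,
# then average the per-chunk [CLS] vectors into one document embedding.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def pooled_document_embedding(text, chunk_len=510):
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
            out = model(torch.tensor([chunk]))
            cls_vectors.append(out.last_hidden_state[0, 0])  # [CLS] vector
    return torch.stack(cls_vectors).mean(dim=0)  # mean over chunks
```

The pooled vector can then be fed to any classifier head; max-pooling or attention-weighted pooling are common alternatives to the mean.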

Here I will also continue the discussion of state-of-the-art approaches to classifying long texts with BERT, referring to Big Bird (see the article). The researchers from Google build on the ideas of Longformer and Extended Transformer Construction. Basically, they propose combining the idea of Longformers with randomized attention, which reduces the quadratic dependency on sequence length to linear. You can try even 10,000-word texts. The approach is interesting; however, it requires an architecture with more layers.
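
For reference, the Hugging Face transformers library ships a Big Bird implementation. A minimal sketch of trying the public google/bigbird-roberta-base checkpoint (which accepts up to 4096 tokens; num_labels=3 is my assumption to match the news/clickbait/other task) could look like this:

```python
# Hedged sketch: a Big Bird sequence classifier via transformers.
# The classification head is randomly initialized and must be
# fine-tuned on the news/clickbait/other data before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base",
    num_labels=3,  # news / clickbait / other (assumption)
)

long_text = "word " * 5000  # placeholder for a long document
inputs = tokenizer(long_text, truncation=True, max_length=4096,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
```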

Please also check this Stack Overflow question.
