How to write an XQuery FLWOR expression to calculate the probability between words?

Problem description

I'm writing an XQuery FLWOR expression that returns every occurrence of the target word 'we' in an XML file, together with the word that comes next in the sentence in each case. I want to calculate the probability as a ratio: the number of times the successor word appears after the target word 'we', divided by the number of times the successor word appears overall.
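
Written as a formula, the ratio described above could look like the following. The notation is introduced here purely for illustration and is not part of the original post: C(we, s) stands for the number of times the successor word s directly follows 'we', and C(s) for the number of times s occurs anywhere in the file.

```latex
P(\text{we} \mid s) = \frac{C(\text{we},\, s)}{C(s)}
\qquad\text{e.g.}\qquad
P(\text{we} \mid \text{have}) = \frac{1}{2} = 0.5
```

The worked value counts only the short XML excerpt that follows, comparing word text rather than the hw attribute: 'have' directly follows 'we' once (sentence 5) and occurs twice overall (sentences 4 and 5). The figures for the full file will differ.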

Here is the XML file I am working on:

<u who="PS6H7">
<s n="3">
    <w c5="AV0" hw="well" pos="ADV">Well</w>
    <c c5="PUN">, </c>
    <w c5="AJ0" hw="good" pos="ADJ">good </w>
    <w c5="NN1" hw="afternoon" pos="SUBST">afternoon</w>
    <c c5="PUN">, </c>
    <w c5="PNI" hw="everybody" pos="PRON">everybody</w>
    <c c5="PUN">, </c>
    <w c5="PNP" hw="i" pos="PRON">I </w>
    <w c5="VVB" hw="think" pos="VERB">think </w>
    <w c5="PNP" hw="we" pos="PRON">we</w>
    <w c5="VHD" hw="have" pos="VERB">'d </w>
    <w c5="AV0" hw="well" pos="ADV">better </w>
    <w c5="VVI" hw="get" pos="VERB">get </w>
    <w c5="VVN" hw="start" pos="VERB">started</w>
    <c c5="PUN">.</c>
</s>

<s n="4">
    <w c5="PNP" hw="we" pos="PRON">We </w>
    <w c5="VVD" hw="look" pos="VERB">looked </w>
    <w c5="AV0" hw="so" pos="ADV">so </w>
    <w c5="AJ0" hw="thin" pos="ADJ">thin </w>
    <w c5="PRP" hw="on" pos="PREP">on </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN1" hw="ground" pos="SUBST">ground</w>
    <c c5="PUN">, </c>
    <w c5="PNP" hw="i" pos="PRON">I </w>
    <w c5="VVD" hw="think" pos="VERB">thought </w>
    <w c5="PNP" hw="we" pos="PRON">we</w>
    <w c5="VM0" hw="would" pos="VERB">'d </w>
    <w c5="VVI" hw="sit" pos="VERB">sit </w>
    <w c5="CJC" hw="and" pos="CONJ">and </w>
    <w c5="VVI" hw="wait" pos="VERB">wait </w>
    <w c5="CJC" hw="and" pos="CONJ">and </w>
    <w c5="VVI" hw="see" pos="VERB">see </w>
    <w c5="CJS" hw="if" pos="CONJ">if </w>
    <w c5="PNI" hw="everyone" pos="PRON">everyone</w>
    <w c5="VBZ" hw="be" pos="VERB">'s </w>
    <w c5="VVG-AJ0" hw="come" pos="VERB">coming</w>
    <c c5="PUN">, </c>
    <w c5="CJC" hw="but" pos="CONJ">but </w>
    <w c5="UNC" hw="erm" pos="UNC">erm </w>
    <w c5="PNP" hw="we" pos="PRON">we</w>
    <w c5="VM0" hw="will" pos="VERB">'ll </w>
    <w c5="VHI" hw="have" pos="VERB">have </w>
    <w c5="TO0" hw="to" pos="PREP">to </w>
    <w c5="VVI" hw="get" pos="VERB">get </w>
    <w c5="VVN" hw="start" pos="VERB">started </w>
    <w c5="AV0" hw="anyway" pos="ADV">anyway</w>
    <c c5="PUN">.</c>
</s>

<s n="5">
    <w c5="PNP" hw="we" pos="PRON">We</w>
    <w c5="VM0" hw="will" pos="VERB">'ll </w>
    <w c5="VVI" hw="welcome" pos="VERB">welcome</w>
    <c c5="PUN">, </c>
    <w c5="PNP" hw="we" pos="PRON">we </w>
    <w c5="VHB" hw="have" pos="VERB">have </w>
    <w c5="CRD" hw="two" pos="ADJ">two </w>
    <w c5="NN2" hw="speaker" pos="SUBST">speakers</w>
    <c c5="PUN">, </c>
    <w c5="NP0" hw="mr" pos="SUBST">Mr </w>
    <w c5="NP0" hw="bob" pos="SUBST">Bob </w>
    <w c5="NP0" hw="plumtree" pos="SUBST">Plumtree</w>
    <c c5="PUN">, </c>
    <w c5="CJC" hw="and" pos="CONJ">and </w>
    <w c5="NP0" hw="ms" pos="SUBST">Ms </w>
    <w c5="NP0" hw="erica" pos="SUBST">Erica </w>
    <w c5="NP0" hw="ison" pos="SUBST">Ison</w>
    <c c5="PUN">.</c>
</s>

<s n="6">
    <w c5="PNP" hw="we" pos="PRON">We </w>
    <w c5="VVD" hw="ask" pos="VERB">asked </w>
    <w c5="PNP" hw="they" pos="PRON">them </w>
    <w c5="PRP" hw="to" pos="PREP">to </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN1" hw="meeting" pos="SUBST">meeting </w>
    <w c5="CJC" hw="and" pos="CONJ">and </w>
    <w c5="PNP" hw="we" pos="PRON">we </w>
    <w c5="VVB" hw="look" pos="VERB">look </w>
    <w c5="AV0" hw="forward" pos="ADV">forward </w>
    <w c5="PRP" hw="to" pos="PREP">to </w>
    <w c5="VVG-NN1" hw="listen" pos="VERB">listening </w>
    <w c5="PRP" hw="to" pos="PREP">to </w>
    <w c5="PNP" hw="you" pos="PRON">you </w>
    <w c5="AV0" hw="later" pos="ADV">later </w>
    <w c5="AVP" hw="on" pos="ADV">on </w>
    <w c5="PRP" hw="in" pos="PREP">in </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN1" hw="agenda" pos="SUBST">agenda</w>
    <c c5="PUN">.</c>
</s>

<s n="7">
    <w c5="AT0" hw="the" pos="ART">The </w>
    <w c5="NN2" hw="minute" pos="SUBST">minutes </w>
    <w c5="PRF" hw="of" pos="PREP">of </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN1" hw="meeting" pos="SUBST">meeting </w>
    <w c5="VVD-VVN" hw="hold" pos="VERB">held </w>
    <w c5="PRP" hw="in" pos="PREP">in </w>
    <w c5="NP0" hw="january" pos="SUBST">January</w>
    <c c5="PUN">.</c>
</s>

<s n="8">
    <w c5="DT0" hw="any" pos="ADJ">Any </w>
    <w c5="NN2" hw="correction" pos="SUBST">corrections </w>
    <w c5="PRP" hw="to" pos="PREP">to </w>
    <w c5="AT0" hw="the" pos="ART">the </w>
    <w c5="NN2" hw="minute" pos="SUBST">minutes </w>
    <w c5="ORD" hw="first" pos="ADJ">first</w>
    <c c5="PUN">?</c>
</s>

</u> 

This is my XQuery expression. It returns all the occurrences of the target word 'we', together with the word that comes after it. I am also able to find the frequency (the number of times the successor word occurs after the target word), but I cannot calculate the probability ratio. The formula for the probability is the number of times the successor word appears after the target word 'we', divided by the number of times the successor word appears overall.

As a result, I want an HTML table that shows the target word 'we' in the 1st column, the word that occurs after 'we' in the 2nd column, the frequency (the number of times the combination occurred) in the 3rd column, and the probability in the 4th column.

<html>
<body>
<table border='1'>
<tr><td>Target</td><td>Successor</td><td>Frequency</td><td>Probability</td></tr>

{

let $target := "we"

let $x := doc("KS0.xml")//u//s//w[lower-case(normalize-space()) = $target]

for $successor in distinct-values($x/following-sibling::w[1])

let $probability := count(doc("KS0.xml")//u//s//w)

let $frequency := $x/following-sibling::w[1][. = $successor]

order by count($frequency) descending

return <tr>
           <td>{$target}</td>
           <td>{$successor}</td>
           <td>{count($frequency)}</td>
           <td>{$probability}</td>
       </tr>
}

</table>
</body>
</html>

This is the output I get. The probability shown in the 4th column is not correct.

<?xml version="1.0" encoding="UTF-8"?>
<html>
   <body>
      <table border="1">
         <tr>
            <td>Target</td>
            <td>Successor</td>
            <td>Frequency</td>
            <td>Probability</td>
         </tr>
         <tr>
            <td>we</td>
            <td>'re </td>
            <td>44</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>'ve </td>
            <td>38</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>'ll </td>
            <td>11</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>have </td>
            <td>8</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>could </td>
            <td>7</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>have</td>
            <td>6</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>do </td>
            <td>6</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>are </td>
            <td>6</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>'d </td>
            <td>5</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>do</td>
            <td>5</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>were </td>
            <td>4</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>should </td>
            <td>4</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>see </td>
            <td>3</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>will </td>
            <td>3</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>going </td>
            <td>3</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>had </td>
            <td>3</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>shall </td>
            <td>3</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>can </td>
            <td>3</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>look </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>did</td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>know </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>need </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>make </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>would </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>want </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>hope </td>
            <td>2</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>looked </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>asked </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>erm </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>talking </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>Chris</td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>aiming </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>on</td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>come </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>occasionally </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>should</td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>ought </td>
            <td>1</td>
            <td>11674</td>
         </tr>
         <tr>
            <td>we</td>
            <td>said</td>
            <td>1</td>
            <td>11674</td>
         </tr>

      </table>
   </body>
</html>

Recommended answer

You're counting all words rather than only the occurrences of the successor word, and you're missing the division needed to actually get the probability. I renamed $x to $occurrences along the way, because I had to stop and think about what it actually holds; it's better to always use descriptive variable names (a few additional bytes aren't expensive these days). Finally, I replaced the descendant-or-self steps // with child steps /, which carry a much lower performance penalty (and nothing in the document suggests you really need //).

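(: snip :)
let $target := "we"
(: Bind the word nodes once so that both counts read from the same document;
   a bare leading / would need a context item, which this query never sets :)
let $words := doc("KS0.xml")/u/s/w
let $occurrences := $words[lower-case(normalize-space()) = $target]
for $successor in distinct-values($occurrences/following-sibling::w[1])
(: Successor nodes whose text equals the current distinct value :)
let $frequency := $occurrences/following-sibling::w[1][. = $successor]
(: Times the successor follows the target, divided by its overall count :)
let $probability := count($frequency) div count($words[lower-case(normalize-space()) = lower-case(normalize-space($successor))])
order by count($frequency) descending
return <tr>
           <td>{$target}</td>
           <td>{$successor}</td>
           <td>{count($frequency)}</td>
           <td>{$probability}</td>
       </tr>
(: snip :)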
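
The output in the question lists "have " and "have" as separate rows because the successor text is compared with its trailing whitespace intact. A minimal, self-contained sketch of one way to merge such variants is shown here. It assumes that normalizing case and whitespace on both the numerator and the denominator is what is wanted. The inline $doc element and the local:norm helper are hypothetical stand-ins introduced for illustration; they are not part of the original answer or of KS0.xml.

```xquery
(: Hypothetical inline sample standing in for KS0.xml :)
declare variable $doc :=
  <u>
    <s><w>We </w><w>have </w><w>two </w><w>speakers</w></s>
    <s><w>I </w><w>think </w><w>we</w><w>'ll </w><w>have </w><w>to </w><w>start</w></s>
    <s><w>We</w><w>'ll </w><w>welcome </w><w>them</w></s>
  </u>;

(: Hypothetical helper: lower-case a word and trim surrounding whitespace :)
declare function local:norm($w as item()) as xs:string {
  lower-case(normalize-space(string($w)))
};

let $target := "we"
let $words := $doc/s/w
(: The word immediately following each occurrence of the target :)
let $successors := $words[local:norm(.) = $target]/following-sibling::w[1]
(: Iterate over normalized successor strings so "have " and "have" merge :)
for $word in distinct-values(for $s in $successors return local:norm($s))
let $frequency := count($successors[local:norm(.) = $word])
let $overall := count($words[local:norm(.) = $word])
order by $frequency descending, $word
return <tr>
           <td>{$target}</td>
           <td>{$word}</td>
           <td>{$frequency}</td>
           <td>{$frequency div $overall}</td>
       </tr>
```

On this inline sample the query yields two rows: 'll with frequency 2 and probability 1, and have with frequency 1 and probability 0.5. It deliberately uses distinct-values rather than an XQuery 3.0 group by clause, because group by rebinds every variable bound earlier in the same FLWOR (including $words and $target) to a per-group sequence, which would inflate the overall count.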
