您如何确定用户在汇编语言 X86 中输入的字符串中的单词频率? [英] How do you determine frequency of words in a user entered string in Assembly Language X86?

查看:30
本文介绍了您如何确定用户在汇编语言 X86 中输入的字符串中的单词频率?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是汇编语言编程的完全初学者.我需要帮助编写一个汇编语言程序来从用户那里获取一个字符串,计算并显示用户输入的字符串中每个单词出现的次数.

I am a complete beginner to assembly language programming. I need help in writing an assembly language program to get a string from the user, count and display the number of times each word occur in the user entered string.

例如,如果用户输入:

Hello Hello what is new Hello what is not new

输出应该是:

Hello   3
what    2
is      2
not     1
new     2

我在下面有以下代码,它将为我提供字符串中字符的频率.但是,我不知道如何编辑以便我可以跟踪单词而不仅仅是字符,然后能够以相应的频率显示这些单词.

I have the following code below that will give me the frequency of characters within a string. However, I do not how to edit so that I can keep track of words instead of just characters, then be able to display those words with their corresponding frequency.

INCLUDE Irvine32.inc

Get_frequencies PROTO,
    pString:PTR BYTE,   ; points to string
    pTable:PTR DWORD    ; points to frequency table

.data
freqTable DWORD 256 DUP(0)
;aString BYTE 1,2,"This is extremely difficult for the experienced",0

aString BYTE 80 DUP(0),0

str1 BYTE "*** Constructing a Frequency Table *** (DEMO)",
    0dh,0ah,0dh,0ah,
    "Enter between 1 and 80 characters: ",0

.code
main PROC

    call Clrscr
    mov  edx,OFFSET str1
    call WriteString

    mov  ecx,SIZEOF aString - 1
    mov  edx,OFFSET aString
    call ReadString

    INVOKE Get_frequencies, ADDR aString, ADDR freqTable
    call DisplayTable

   exit
   main ENDP

;-------------------------------------------------------------

  Get_frequencies PROC,
    pString:PTR BYTE,   ; points to string
    pTable:PTR DWORD    ; points to frequencey table

;
; Constructs a character frequency table. Each array position
; is indexed by its corresponding ASCII code.
;
; Returns: Each entry in the table contains a count of how
; many times that character occurred in the string.
;-------------------------------------------------------------

mov esi,pString
mov edi,pTable
cld     ; clear Direction flag (forward)

L1: mov eax,0       ; clear upper bits of EAX
   lodsb        ; AL = [ESI], inc ESI
   cmp al,0     ; end of string?
   je  Exit_proc        ; yes: exit
   shl eax,2        ; multiply by 4
   inc DWORD PTR [edi + eax]    ; inc table[AL]
   jmp L1       ; repeat loop

 Exit_proc:
    ret
 Get_frequencies ENDP

  ;-------------------------------------------------------------

 DisplayTable PROC

  ;
  ; Display the non-empty entries of the frequency table.
  ; This procedure was not required, but it makes it easier
  ; to demonstrate that Get_frequencies works.
  ;-------------------------------------------------------------

  .data
  colonStr BYTE ": ",0
  .code
  call Crlf
  mov ecx,LENGTHOF freqTable    ; entries to show
  mov esi,OFFSET freqTable
  mov ebx,0 ; index counter

 L1:    mov eax,[esi]   ; get frequency count
        cmp eax,0   ; count = 0?
        jna L2  ; if so, skip to next entry

         mov eax,ebx    ; display the index
         call WriteChar
         mov edx,OFFSET colonStr    ; display ": "
         call WriteString
         mov eax,[esi]  ; show frequency count
         call WriteDec
         call Crlf

  L2:   add esi,TYPE freqTable  ; point to next table entry
        inc ebx ; increment index
        loop L1

        call Crlf
        ret
      DisplayTable ENDP

      END main

<小时>

到目前为止,这是我尝试实施彼得回答中建议的蛮力搜索的方法:


This is what I have so far in my attempt to implement the brute-force search suggested in Peter's answer:

 .data
 str2 BYTE "one two three",0

 .code
  main proc
   mov  edi,OFFSET str2
    Mov esi,edi
    Mov Ecx, 0  ;reset ecx to 0
    Not Ecx     ;set Ecx to -1 or highest possible integer
    Mov Al, ' ' ;Initialize a1 to delimiter of (space) ' '
    Cld         ;Clear Direction Pointer
    Repne Scasb ;scan edi one byte at a time until delimiter found
    Not Ecx
    Lea Eax, [ecx-1] ;Set Eax to index of found delimiter
    Xchg Esi, Edi  ;Take Edi which is now equal to string after found       delimiter and put in esi

    mov edx, esi
    call WriteString    

 main endp
 end main

这会打印二三",但我希望它打印一".

This prints "two three", but I want it to print "one".

推荐答案

与使用 C 或您已经知道的任何其他语言相比,asm 没有什么特别之处会让您选择不同的算法.

There's nothing special about asm that would make you choose a different algorithm than if you were doing it in C or any other language you already know.

这是作业吗?它似乎比你通常想要在 asm 中做的事情更复杂.但是,如果它只适用于短字符串,那么不使用除字符缓冲区以外的任何数据结构的非常糟糕的蛮力算法会更容易实现(见下文).

Is this an assignment? It seems more complex than something you'd usually want to do in asm. However, if it only has to work for short strings, a really bad brute-force algorithm that doesn't use any data structures other than a char buffer would be easier to implement (see below).

可打印的 ASCII 字符少于 128 个,因此计数表既小又容易.

There are less than 128 printable ASCII characters, so a table of counts is small and easy.

字符和单词的区别在于可能的单词数量是无限的(除非内存大小限制),因此您不能直接使用单词作为计数表的索引.即使使用 4 个字符的单词作为计数表的 4 字节索引也会使计数表大小 = 2^32 个条目.

The difference between characters and words is that the number of possible words is infinite (barring memory-size limitations), so you can't just use words directly as an index into a table of counts. Even using 4-character words as a 4-byte index into a table of counts would make the count table size = 2^32 entries.

将计数表算法从字符调整为单词的最直接方法是使用某种字典数据结构,如哈希表tree,甚至是 基数树/trie.每个条目都将一个词映射到它的计数器.

The most direct way to adapt the count-table algorithm from character to words is with some kind of dictionary data structure, like a hash table, tree, or even a radix tree / trie. Each entry maps a word to its counter.

最简单可能的字典"word:count 将是一个未排序的
数组struct {char *word;int count;} histogram[]; 你线性搜索.这效率不高,但在 asm 中很容易实现.可以通过存储像 int wordlen; 这样的显式长度字符串来加速单词比较.char *wordstart 这样你就可以直接指向你的字符串,而不必存储 0 终止符.然后你可以在查看它们的字节之前检查单词的长度是否相同,并且memcmp可以比strcmp更有效率.

The simplest possible "dictionary" of word:count would be an unsorted array of
struct {char *word; int count;} histogram[]; that you linear-search. This is not efficient, but would be easy to implement in asm. Word comparison could be sped up by storing explicit-length strings like int wordlen; char *wordstart so you can just point into your string without having to store 0 terminators. Then you can check if words are even the same length before looking at their bytes, and memcmp can be more efficient than strcmp.

这很糟糕,因为它是对每个输入词的这个计数数组的线性搜索,但它仍然可能比下面的暴力方法更好,扫描(和复制)每个输入词的整个字符串的其余部分.保持这个数组排序(插入排序风格)可能更好也可能不会更好,当找到单词时允许 O(log(n)) 访问与二进制搜索.可能有一些巧妙的权衡,比如间隙数组(用一种方法来通知未使用的插槽)可以让特定位置的插入只需要复制有限的数量.或者当然是经典的字典数据结构,如哈希表或树.

This sucks because it's a linear search of this array of counts for every input word, but it's still probably better than the brute-force way below that scans (and copies) the whole rest of the string for every input word. Keeping this array sorted (insertion-sort style) might or might not be better, allowing O(log(n)) access with a binary search when the word is found. There are probably clever tradeoffs like gapped arrays (with a way to signal unused slots) that can let an insert at a specific place only need to copy a limited amount. Or of course a classic dictionary data structure like a hash table or tree.

另一种选择是对您的单词列表进行排序,然后循环遍历已排序的列表并计算重复项.或者在排序时累积重复计数,这是一个有趣的问题当您输入的单词列表太大而无法一次性全部放入内存时.

Another option is to sort your list of words, and then loop through the sorted list counting duplicates. Or accumulate duplicate-counts while you sort, which is an interesting problem when your input word list is too big to fit in memory all at once.

我可能会这样做:

;; tighter version of your char-count loop
   xor   eax, eax                    ; clear upper bits of EAX
L1:
   lodsb                          ; AL = [ESI], inc ESI
   inc   DWORD PTR [edi + eax*4]  ; inc table[AL]
   test  al,al                    ; Set flags based on AL
   jz    L1                       ; loop if it's not the end of the string
   ;; fall through when done
   ;;dec   DWORD PTR [edi]          ; undo the count of the zero if you care

或者更可能是一个movzx eax, byte [esi]/inc esi,比lodsb更有效率,避免了成本合并到旧的 EAX.

Or more likely a movzx eax, byte [esi] / inc esi, which is more efficient than lodsb, avoiding the cost of merging into the old EAX.

这确实计算了终止的 0,但在循环中保存了指令.在循环外使用更多代码,您可以在保持循环较小的同时避免这种情况.

This does count the terminating 0, but saves instructions in the loop. With more code outside the loop, you could avoid this while still keeping the loop small.

由于我们左移作为寻址模式的一部分,我们不需要在循环内将 eax 归零.使用 xor-zeroing 还可以避免在只写入 al 后读取 eax 时部分寄存器速度变慢,如果您关心性能.

Since we left-shift as part of the addressing mode, we don't need to zero eax inside the loop. Using xor-zeroing also avoids a partial-register slowdown when reading eax after writing just al, if you care about performance.

这对于长字符串的表现会很糟糕,但对于非常短的字符串可能会很好.主要优点是易于实现,但对于非常短的字符串(如 128 字节),它在性能上可能与树或散列相比具有竞争力.

This will perform terribly for long strings, but might be pretty good for very short strings. The main advantage is ease of implementation, but it may be competitive in performance to a tree or hash for very short strings, like 128 bytes.

  • 找出第一个单词的结尾
  • 在字符串的 rest 中搜索该词的出现次数.(记住只匹配完整的单词,而不是像 strstr(3))
  • 在每次匹配时,将字符串的其余部分复制到左侧,从字符串中删除单词.(也删除第一个词.)
  • 重复直到字符串为空
  • Find the end of the first word
  • Search the rest of the string for occurrences of that word. (Remember to match full words only, not like strstr(3))
  • At every match, copy the rest of the string to the left, removing the word from the string. (Remove the first word, too.)
  • Repeat until the string is empty

(您不必复制第一次出现,只需将指针向前移动,以便下一次搜索从它的末尾开始.如果未找到重复项,则无需进行任何复制.)

(You don't have to copy the first occurrence out, just advance your pointer so the next search starts at the end of it. If no duplicates are found, you don't need to do any copying.)

这会花费大量时间来复制字符,但我们需要避免重复(当我们稍后再使用时,对单词的最后 n-1 次出现再次计数).通过搜索已计数单词的另一种数据结构进行重复检查可以避免复制,但是如果您打算这样做,您最好只将字符串和直方图传递到找到的单词字典"中,见上文.

This spends a lot of time copying characters around, but we need to avoid duplicates (doing another count for the last n-1 occurrences of a word when we come to it later). Duplicate checking by searching another data structure of already-counted words would let you avoid copying, but if you're going to do that you might as well just do one pass over the string and histogram into that found-words "dictionary", see above.

但这本身需要额外的周期和代码,并且所有未来的搜索都将搜索这些已经计算过的词.如果你打算使用一组已经找到的词,你不妨

but that itself costs extra cycles and code, and all future searches would be searching these already-counted words. If you going to use an array of already-found words, you might as well

您可以使用 rep movsb 进行复制,我认为这仍然适用于重叠的 dst 和 src.(但如果它们紧密重叠,则速度不快).

You can do the copying with rep movsb, which I think still works with overlapping dst and src. (But is not fast if they overlap closely).

另一种选择是使用第二个相同大小的缓冲区,这样您只需复制整个字符串的一个副本,同时删除您正在计算的当前单词的所有出现次数:

Another option is to have a 2nd equal-size buffer, so you only do one copy of the whole string while removing all occurrences of the current word you're counting:

从字符串 A 中的第二个单词开始:

Starting from the 2nd word in string A:

  • 将字符复制到字符串 B 中.
  • 当您到达一个词的末尾时,检查它是否与您要查找的当前词匹配.
    • 如果是:count++dst-=length_of_current_countword.当您检测到您要查找的单词时,您将目标指针 (edi) 倒回到开头,因此将来的复制将覆盖它.IE.您现在基本上可以免费取消复制它.
    • copy characters into string B.
    • when you reach the end of a word, check if it's a match for the current word you're looking for.
      • if yes: count++ and dst-=length_of_current_countword. When you detect the word you're looking for, you rewind your destination pointer (edi) to the beginning, so future copying will overwrite it. I.e. you can un-copy it basically for free at this point.

      这篇关于您如何确定用户在汇编语言 X86 中输入的字符串中的单词频率?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆