Tuesday, May 28, 2013

String Matching Algorithms

Some of String Matching Algorithms are :
Naïve string search algorithm
Knuth–Morris–Pratt algorithm
Boyer–Moore string search algorithm

String matching algorithm is finding all occurrences of a pattern in a text . 
Assume that the text is an array T[1 . . n] of length n and that the pattern is an array P[1 . . m] of length m.We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s + 1 in text T) if 0 <= s <=  n - m and T[s + 1 . . s + m] = P[1 . . m]
It can be useful in
   Text Editing
   Search Engine 
   Multimedia
   Network Intrusion Detection System
   Bioinformatics  and Cheminformatics and many more

Naïve string search algorithm Steps :















Knuth–Morris–Pratt algorithm :

Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem. 
A matching time of O(n) is achieved by avoiding comparisons with elements of ‘T’ that have previously been involved in comparison with some element of the pattern ‘P’ to be matched. i.e., backtracking on the text/string ‘T’ never occurs

Components of KMP algorithm

The prefix function, Π
The prefix function,Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘P’. In other words, this enables avoiding backtracking on the string ‘T’.

The KMP Matcher
With string ‘T’, pattern ‘P’ and prefix function ‘Π’ as inputs, finds the occurrence of ‘P’ in ‘T’ and returns the number of shifts of ‘P’ after which occurrence is found. 


Following pseudocode computes the prefix fucnction, Π:

Compute-Prefix-Function (p)
    m := length[P]               //’P’ pattern to be matched
 Π[1] := 0 
  k := 0
 for q := 2 to m
         do while k > 0 and P[k+1] != P[q]
                       do k := Π[k]
              If P[k+1] = P[q]
                 then k := k +1
               Π[q] := k
     return  Π



















Boyer-Moore Algorithm :
Developed in 1977, the B-M string search algorithm is a particularly efficient algorithm, and has served as a standard benchmark for string search algorithm ever since.
This algorithm’s execution time can be sub-linear, as not every character of the string to be searched needs to be checked.
Generally speaking, the algorithm gets faster as the target string becomes larger.

How does it work :
   Right to left scan
   Bad character rule
   Good suffix rule














Bad character rule :
Definition :
For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of character x in P.  
R(x) is defined to be 0 if x is not in P

Usage
Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k].
Shift P to the right by max(1, i - R(T[k])) places
Hopefully more than 1















Observation of Bad character rules :
Work well in practice with large alphabets like the English alphabet
Work less well with small alphabets like DNA

Strong good suffix rule :
P[i..n] matches text T[j..j+n-i] but T[(j-1) does not match P(i-1)
The rightmost non-suffix substring t’ of P that matches the suffix P[i..n] AND the character to the 
        left of t’ in P is different than P(i-1)
Shift P so that t’ matches up with T[j..j+n-i]




No comments:

Post a Comment