Some of String Matching Algorithms are :
Naïve string search algorithm
Knuth–Morris–Pratt algorithm
Boyer–Moore string search algorithm
Naïve string search algorithm
Knuth–Morris–Pratt algorithm
Boyer–Moore string search algorithm
String matching algorithm is finding all occurrences of a pattern in a text .
Assume that the text is an array T[1 . . n] of length n and that the pattern is an array P[1 . . m] of length m.We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s + 1 in text T) if 0 <= s <= n - m and T[s + 1 . . s + m] = P[1 . . m]
It can be useful in
Text Editing
Search Engine
Multimedia
Network Intrusion Detection System
Bioinformatics and Cheminformatics and many more
Naïve string search algorithm Steps :
Knuth–Morris–Pratt algorithm :
Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with elements of ‘T’ that have previously been involved in comparison with some element of the pattern ‘P’ to be matched. i.e., backtracking on the text/string ‘T’ never occurs
Components of KMP algorithm
The prefix function, Π
The prefix function,Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘P’. In other words, this enables avoiding backtracking on the string ‘T’.
The KMP Matcher
With string ‘T’, pattern ‘P’ and prefix function ‘Π’ as inputs, finds the occurrence of ‘P’ in ‘T’ and returns the number of shifts of ‘P’ after which occurrence is found.
Following pseudocode computes the prefix fucnction, Π:
Compute-Prefix-Function (p)
m := length[P] //’P’ pattern to be matched
Π[1] := 0
k := 0
for q := 2 to m
do while k > 0 and P[k+1] != P[q]
do k := Π[k]
If P[k+1] = P[q]
then k := k +1
Π[q] := k
return Π
Boyer-Moore Algorithm :
Developed in 1977, the B-M string search algorithm is a particularly efficient algorithm, and has served as a standard benchmark for string search algorithm ever since.
This algorithm’s execution time can be sub-linear, as not every character of the string to be searched needs to be checked.
Generally speaking, the algorithm gets faster as the target string becomes larger.
How does it work :
Right to left scan
Bad character rule
Good suffix rule
Bad character rule :
Definition :
For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of character x in P.
R(x) is defined to be 0 if x is not in P
Usage
Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k].
Shift P to the right by max(1, i - R(T[k])) places
Hopefully more than 1
Observation of Bad character rules :
Work well in practice with large alphabets like the English alphabet
Work less well with small alphabets like DNA
Strong good suffix rule :
P[i..n] matches text T[j..j+n-i] but T[(j-1) does not match P(i-1)
The rightmost non-suffix substring t’ of P that matches the suffix P[i..n] AND the character to the
left of t’ in P is different than P(i-1)
Shift P so that t’ matches up with T[j..j+n-i]
No comments:
Post a Comment