How should overlapping patterns be handled in a Boyer-Moore-Horspool string search?

  algorithm, c++, search

I’m coding a Boyer-Moore-Horspool string search function as homework. It’s going okay, but I realized a slight issue.

unsigned int bmh_search(const string& pattern, const string& text) {

    int M = pattern.size();
    int N = text.size();
    int c = 0;
    int skip = 0;

    map<char, unsigned int> skip_table = create_skip_table(pattern, 1);

    for (int i = M-1; i <= N; i+=skip) {
        int j;

        if (text[i] == pattern[M-1]) {
            /* For current index i, check for pattern match */
            for (j = 0; j < M; j++)
                if (text[i - j] != pattern[M-1-j])
                    break;

            if (j == M) // if pattern[0...M-1] = text[i, i+1, ...i+M-1]
                c++;
        } 
        
        skip = (skip_table.find(text[i]) != skip_table.end()) ? skip_table[text[i]] : skip_table['*'];
        }

        if (skip <= 0)
            break;
    }

    return c;
}

This fills my imposed unit tests, but one case that comes to mind is overlapping patterns. For example, should bmh_search("baba","bababababa") return 3 or 4? Should I skip by the pattern’s length after every pattern found ? What’s the standard approach?

Source: Windows Questions C++

LEAVE A COMMENT