Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

  agrep, c++, matching, string

Take two character strings in C or C++, s1 and s2. It is rather trivial to check if one contains the other exactly.

The following will return true in case s2 is a substring of s1.

In C:

strstr(s1, s2)

In C++:

#include <string>
str.find(str2) != string::npos

With boost:

#include <boost/algorithm/string.hpp>
boost::algorithm::contains(s1, s2)

What I am looking for is an analogous way of efficiently (in terms of speed, not memory) finding whether a string is approximately contained in / is approximately a substring of another up to a given difference threshold. Pretty much like the agrep bash command does in Unix systems for finding files.

For example, suppose the function was called approx_contains. Then approx_contains(s1, s2, 4) could return true/false in case s2 is contained within s1 up to a Levenshtein distance of 4.

Searching online, I only found numerous references and questions on how to calculate Levenshtein distance between two strings, or theoretical algorithms for approximate string pattern matching that go beyond simply checking whether or not one string contains another – which here becomes wasteful.

Trying very hard to avoid reinventing the wheel, how could someone do such a checking in C or C++?

Source: Windows Questions C++

LEAVE A COMMENT