Indel#

Functions#

distance#

rapidfuzz.distance.Indel.distance(s1, s2, *, processor=None, score_cutoff=None)#

Calculates the minimum number of insertions and deletions required to change one sequence into the other. This is equivalent to the Levenshtein distance with a substitution weight of 2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (int, optional) – Maximum distance between s1 and s2, that is considered as a result. If the distance is bigger than score_cutoff, score_cutoff + 1 is returned instead. Default is None, which deactivates this behaviour.

Returns:

distance – distance between s1 and s2

Return type:

int

Examples

Find the Indel distance between two strings:

>>> from rapidfuzz.distance import Indel
>>> Indel.distance("lewenstein", "levenshtein")
3

Setting a maximum distance allows the implementation to select a more efficient implementation:

>>> Indel.distance("lewenstein", "levenshtein", score_cutoff=1)
2

normalized_distance#

rapidfuzz.distance.Indel.normalized_distance(s1, s2, *, processor=None, score_cutoff=None)#

Calculates a normalized levenshtein similarity in the range [1, 0].

This is calculated as distance / (len1 + len2).

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_dist > score_cutoff 1.0 is returned instead. Default is 1.0, which deactivates this behaviour.

Returns:

norm_dist – normalized distance between s1 and s2 as a float between 0 and 1.0

Return type:

float

similarity#

rapidfuzz.distance.Indel.similarity(s1, s2, *, processor=None, score_cutoff=None)#

Calculates the Indel similarity in the range [max, 0].

This is calculated as (len1 + len2) - distance.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (int, optional) – Maximum distance between s1 and s2, that is considered as a result. If the similarity is smaller than score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2

Return type:

int

normalized_similarity#

rapidfuzz.distance.Indel.normalized_similarity(s1, s2, *, processor=None, score_cutoff=None)#

Calculates a normalized indel similarity in the range [0, 1].

This is calculated as 1 - normalized_distance

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_sim < score_cutoff 0 is returned instead. Default is 0, which deactivates this behaviour.

Returns:

norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0

Return type:

float

Examples

Find the normalized Indel similarity between two strings:

>>> from rapidfuzz.distance import Indel
>>> Indel.normalized_similarity("lewenstein", "levenshtein")
0.85714285714285

Setting a score_cutoff allows the implementation to select a more efficient implementation:

>>> Indel.normalized_similarity("lewenstein", "levenshtein", score_cutoff=0.9)
0.0

When a different processor is used s1 and s2 do not have to be strings

>>> Indel.normalized_similarity(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])
0.8571428571428572

editops#

rapidfuzz.distance.Indel.editops(s1, s2, *, processor=None)#

Return Editops describing how to turn s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

editops – edit operations required to turn s1 into s2

Return type:

Editops

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described [6]. It has a time complexity and memory usage of O([N/64] * M).

References

Examples

>>> from rapidfuzz.distance import Indel
>>> for tag, src_pos, dest_pos in Indel.editops("qabxcd", "abycdf"):
...    print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
 delete s1[0] s2[0]
 delete s1[3] s2[2]
 insert s1[4] s2[2]
 insert s1[6] s2[5]

opcodes#

rapidfuzz.distance.Indel.opcodes(s1, s2, *, processor=None)#

Return Opcodes describing how to turn s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

opcodes – edit operations required to turn s1 into s2

Return type:

Opcodes

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described [7]. It has a time complexity and memory usage of O([N/64] * M).

References

Examples

>>> from rapidfuzz.distance import Indel
>>> a = "qabxcd"
>>> b = "abycdf"
>>> for tag, i1, i2, j1, j2 in Indel.opcodes(a, b):
...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
 delete a[3:4] (x) b[2:2] ()
 insert a[4:4] () b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)

Performance#

Since the Levenshtein module uses different implementations based on the weights used, this leads to different performance characteristics. The following sections show the performance for the different possible weights.

Indel#

The following image shows a benchmark of the Indel distance in RapidFuzz and python-Levenshtein. Similar to the normal Levenshtein distance python-Levenshtein uses an implementation with a time complexity of O(NM), while RapidFuzz has a time complexity of O([N/64]M).

../../_images/indel_levenshtein.svg

Implementation Notes#

The following implementation is used with a worst-case performance of O([N/64]M).

  • if max is 0 the similarity can be calculated using a direct comparision, since no difference between the strings is allowed. The time complexity of this algorithm is O(N).

  • if max is 1 and the two strings have a similar length, the similarity can be calculated using a direct comparision aswell, since a substitution would cause a edit distance higher than max. The time complexity of this algorithm is O(N).

  • A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity.

  • If max is ≤ 4 the mbleven algorithm is used. This algorithm checks all possible edit operations that are possible under the threshold max. As a difference to the normal Levenshtein distance this algorithm can even be used up to a threshold of 4 here, since the higher weight of substitutions decreases the amount of possible edit operations. The time complexity of this algorithm is O(N).

  • If the length of the shorter string is ≤ 64 after removing the common affix Hyyrös’ lcs algorithm is used, which calculates the Indel distance in parallel. The algorithm is described by Hyyro [Hyy04] and is extended with support for UTF32 in this implementation. The time complexity of this algorithm is O(N).

  • If the length of the shorter string is ≥ 64 after removing the common affix a blockwise implementation of the Hyyrös’ lcs algorithm is used, which calculates the Levenshtein distance in parallel (64 characters at a time). The algorithm is described by Hyyro [Hyy04]. The time complexity of this algorithm is O([N/64]M).