Levenshtein#

This implementation supports the usage of different weights for Insertion/Deletion/Substitution. The uniform Levenshtein distance refers to weights=(1,1,1) and the Indel distance refers to weights=(1,1,2). All other weights are refered to as generic Levenshtein distance.

Functions#

distance#

rapidfuzz.distance.Levenshtein.distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)#

Calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into the other according to Levenshtein with custom costs for insertion, deletion and substitution

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (Tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (int, optional) – Maximum distance between s1 and s2, that is considered as a result. If the distance is bigger than score_cutoff, score_cutoff + 1 is returned instead. Default is None, which deactivates this behaviour.

Returns:

distance – distance between s1 and s2

Return type:

int

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

Examples

Find the Levenshtein distance between two strings:

>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.distance("lewenstein", "levenshtein")
2

Setting a maximum distance allows the implementation to select a more efficient implementation:

>>> Levenshtein.distance("lewenstein", "levenshtein", score_cutoff=1)
2

It is possible to select different weights by passing a weight tuple.

>>> Levenshtein.distance("lewenstein", "levenshtein", weights=(1,1,2))
3

normalized_distance#

rapidfuzz.distance.Levenshtein.normalized_distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)#

Calculates a normalized levenshtein distance in the range [1, 0] using custom costs for insertion, deletion and substitution.

This is calculated as distance / max, where max is the maximal possible Levenshtein distance given the lengths of the sequences s1/s2 and the weights.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (Tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_dist > score_cutoff 1.0 is returned instead. Default is 1.0, which deactivates this behaviour.

Returns:

norm_dist – normalized distance between s1 and s2 as a float between 1.0 and 0.0

Return type:

float

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

similarity#

rapidfuzz.distance.Levenshtein.similarity(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)#

Calculates the levenshtein similarity in the range [max, 0] using custom costs for insertion, deletion and substitution.

This is calculated as max - distance, where max is the maximal possible Levenshtein distance given the lengths of the sequences s1/s2 and the weights.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (Tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (int, optional) – Maximum distance between s1 and s2, that is considered as a result. If the similarity is smaller than score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2

Return type:

int

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

normalized_similarity#

rapidfuzz.distance.Levenshtein.normalized_similarity(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)#

Calculates a normalized levenshtein similarity in the range [0, 1] using custom costs for insertion, deletion and substitution.

This is calculated as 1 - normalized_distance

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (Tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. For norm_sim < score_cutoff 0 is returned instead. Default is 0, which deactivates this behaviour.

Returns:

norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0

Return type:

float

Raises:

ValueError – If unsupported weights are provided a ValueError is thrown

Examples

Find the normalized Levenshtein similarity between two strings:

>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.normalized_similarity("lewenstein", "levenshtein")
0.81818181818181

Setting a score_cutoff allows the implementation to select a more efficient implementation:

>>> Levenshtein.normalized_similarity("lewenstein", "levenshtein", score_cutoff=0.85)
0.0

It is possible to select different weights by passing a weight tuple.

>>> Levenshtein.normalized_similarity("lewenstein", "levenshtein", weights=(1,1,2))
0.85714285714285

When a different processor is used s1 and s2 do not have to be strings

>>> Levenshtein.normalized_similarity(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])
0.81818181818181

editops#

rapidfuzz.distance.Levenshtein.editops(s1, s2, *, processor=None)#

Return Editops describing how to turn s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

editops – edit operations required to turn s1 into s2

Return type:

Editops

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described [8]. It has a time complexity and memory usage of O([N/64] * M).

References

Examples

>>> from rapidfuzz.distance import Levenshtein
>>> for tag, src_pos, dest_pos in Levenshtein.editops("qabxcd", "abycdf"):
...    print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
 delete s1[1] s2[0]
replace s1[3] s2[2]
 insert s1[6] s2[5]

opcodes#

rapidfuzz.distance.Levenshtein.opcodes(s1, s2, *, processor=None)#

Return Opcodes describing how to turn s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

opcodes – edit operations required to turn s1 into s2

Return type:

Opcodes

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described [9]. It has a time complexity and memory usage of O([N/64] * M).

References

Examples

>>> from rapidfuzz.distance import Levenshtein
>>> a = "qabxcd"
>>> b = "abycdf"
>>> for tag, i1, i2, j1, j2 in Levenshtein.opcodes("qabxcd", "abycdf"):
...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
replace a[3:4] (x) b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)

Performance#

Since the Levenshtein module uses different implementations based on the weights used, this leads to different performance characteristics. The following sections show the performance for the different possible weights.

Uniform#

The following image shows a benchmark of the uniform Levenshtein distance in multiple Python libraries. All of them are implemented either in C/C++ or Cython. The graph shows, that python-Levenshtein is the only library with a time complexity of O(NM), while all other libraries have a time complexity of O([N/64]M). Especially for long strings RapidFuzz is a lot faster than all the other tested libraries.

../../_images/uniform_levenshtein.svg

Indel#

The following image shows a benchmark of the Indel distance in RapidFuzz and python-Levenshtein. Similar to the normal Levenshtein distance python-Levenshtein uses an implementation with a time complexity of O(NM), while RapidFuzz has a time complexity of O([N/64]M).

../../_images/indel_levenshtein.svg

Implementation Notes#

Depending on the used input parameters, different optimized implementation are used to improve the performance. These implementations are described in the following sections.

Uniform#

The implementation for the uniform Levenshtein distance has a worst-case performance of O([N/64]M). It uses the following optimized implementations:

  • if score_cutoff is 0 the similarity can be calculated using a direct comparision, since no difference between the strings is allowed. The time complexity of this algorithm is O(N).

  • A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity.

  • If score_cutoff is ≤ 3 the mbleven algorithm is used. This algorithm checks all possible edit operations that are possible under the threshold score_cutoff. The time complexity of this algorithm is O(N).

  • If the length of the shorter string is ≤ 64 after removing the common affix Hyyrös’ algorithm is used, which calculates the Levenshtein distance in parallel. The algorithm is described by Hyyrö [Hyyro03]. The time complexity of this algorithm is O(N).

  • If the length of the shorter string is ≥ 64 after removing the common affix a blockwise implementation of Hyyrös’ algorithm is used, which calculates the Levenshtein distance in parallel (64 characters at a time). The algorithm is described by Hyyrö [Hyyro03]. The time complexity of this algorithm is O([N/64]M).

Indel#

The Indel distance is available as a stand alone implementation. Further details can be found in here.

Generic#

The implementation for other weights is based on Wagner-Fischer. It has a performance of O(N * M) and has a memory usage of O(N). Further details can be found in Wagner and Fischer [WF74].