# Indel

## Functions

### distance

rapidfuzz.distance.Indel.distance(s1, s2, *, processor=None, score_cutoff=None)

Calculates the minimum number of insertions and deletions required to change one sequence into the other. This is equivalent to the Levenshtein distance with a substitution weight of 2.
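As a quick illustration of this equivalence, here is a minimal pure-Python sketch (not the rapidfuzz implementation): a classic weighted Levenshtein DP where a substitution cost of 2 makes every substitution as expensive as a deletion plus an insertion, so the result equals the Indel distance. The function name `weighted_levenshtein` is made up for this example.

```python
def weighted_levenshtein(s1, s2, sub_cost=2):
    # Classic DP over prefixes; insertion/deletion cost 1,
    # substitution cost `sub_cost`. With sub_cost=2 a substitution
    # is never cheaper than delete + insert, giving the Indel distance.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else sub_cost
            cur.append(min(prev[j] + 1,          # delete c1
                           cur[j - 1] + 1,       # insert c2
                           prev[j - 1] + cost))  # match / substitute
        prev = cur
    return prev[-1]

print(weighted_levenshtein("lewenstein", "levenshtein"))              # 3 (Indel)
print(weighted_levenshtein("lewenstein", "levenshtein", sub_cost=1))  # 2 (plain Levenshtein)
```

With `sub_cost=1` the same DP yields the ordinary Levenshtein distance (2 here: one substitution, one insertion), which is why the two metrics differ.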

Parameters:
• s1 (Sequence[Hashable]) – First string to compare.

• s2 (Sequence[Hashable]) – Second string to compare.

• processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

• score_cutoff (int, optional) – Maximum distance between s1 and s2 that is still considered a result. If the distance is larger than score_cutoff, score_cutoff + 1 is returned instead. Default is None, which deactivates this behaviour.

Returns:

distance – distance between s1 and s2

Return type:

int

Examples

Find the Indel distance between two strings:

```
>>> from rapidfuzz.distance import Indel
>>> Indel.distance("lewenstein", "levenshtein")
3
```

Setting a maximum distance allows the library to choose a more efficient implementation:

```
>>> Indel.distance("lewenstein", "levenshtein", score_cutoff=1)
2
```

### normalized_distance

rapidfuzz.distance.Indel.normalized_distance(s1, s2, *, processor=None, score_cutoff=None)

Calculates a normalized Indel distance in the range [0, 1].

This is calculated as `distance / (len1 + len2)`.

Parameters:
• s1 (Sequence[Hashable]) – First string to compare.

• s2 (Sequence[Hashable]) – Second string to compare.

• processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

• score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. If norm_dist > score_cutoff, 1.0 is returned instead. Default is 1.0, which deactivates this behaviour.

Returns:

norm_dist – normalized distance between s1 and s2 as a float between 0 and 1.0

Return type:

float
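The formula above can be sketched in pure Python (a hedged illustration, not the rapidfuzz implementation; both helper names are made up). It uses the fact that the Indel distance equals `len1 + len2 - 2 * LCS(s1, s2)`:

```python
def indel_distance(s1, s2):
    # Indel distance via the longest common subsequence:
    # distance = len1 + len2 - 2 * LCS(s1, s2)
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[j - 1]))
        prev = cur
    return len(s1) + len(s2) - 2 * prev[-1]

def normalized_distance(s1, s2):
    # distance / (len1 + len2), with two empty strings defined as 0.0
    total = len(s1) + len(s2)
    return indel_distance(s1, s2) / total if total else 0.0

print(round(normalized_distance("lewenstein", "levenshtein"), 4))  # 0.1429
```

Here the distance is 3 and the combined length is 21, so the normalized distance is 3/21.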

### similarity

rapidfuzz.distance.Indel.similarity(s1, s2, *, processor=None, score_cutoff=None)

Calculates the Indel similarity in the range [0, max].

This is calculated as `(len1 + len2) - distance`.

Parameters:
• s1 (Sequence[Hashable]) – First string to compare.

• s2 (Sequence[Hashable]) – Second string to compare.

• processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

• score_cutoff (int, optional) – Optional argument for a score threshold as an int. If the similarity is smaller than score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2

Return type:

int
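Since the similarity is `(len1 + len2) - distance` and the distance is `len1 + len2 - 2 * LCS`, the similarity simplifies to twice the length of the longest common subsequence. A small pure-Python sketch of this relationship (an illustration with made-up helper names, not the rapidfuzz implementation):

```python
def lcs_length(s1, s2):
    # Length of the longest common subsequence via DP.
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_similarity(s1, s2):
    # (len1 + len2) - distance == 2 * LCS(s1, s2)
    return 2 * lcs_length(s1, s2)

print(indel_similarity("lewenstein", "levenshtein"))  # 18
```

For the example strings the combined length is 21 and the distance is 3, so the similarity is 18.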

### normalized_similarity

rapidfuzz.distance.Indel.normalized_similarity(s1, s2, *, processor=None, score_cutoff=None)

Calculates a normalized Indel similarity in the range [0, 1].

This is calculated as `1 - normalized_distance`.

Parameters:
• s1 (Sequence[Hashable]) – First string to compare.

• s2 (Sequence[Hashable]) – Second string to compare.

• processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

• score_cutoff (float, optional) – Optional argument for a score threshold as a float between 0 and 1.0. If norm_sim < score_cutoff, 0 is returned instead. Default is 0, which deactivates this behaviour.

Returns:

norm_sim – normalized similarity between s1 and s2 as a float between 0 and 1.0

Return type:

float

Examples

Find the normalized Indel similarity between two strings:

```
>>> from rapidfuzz.distance import Indel
>>> Indel.normalized_similarity("lewenstein", "levenshtein")
0.8571428571428572
```

Setting a score_cutoff allows the library to choose a more efficient implementation:

```
>>> Indel.normalized_similarity("lewenstein", "levenshtein", score_cutoff=0.9)
0.0
```

When a different processor is used, s1 and s2 do not have to be strings:

```
>>> Indel.normalized_similarity(["lewenstein"], ["levenshtein"], processor=lambda s: s)
0.8571428571428572
```

### editops

rapidfuzz.distance.Indel.editops(s1, s2, *, processor=None)

Return Editops describing how to turn s1 into s2.

Parameters:
• s1 (Sequence[Hashable]) – First string to compare.

• s2 (Sequence[Hashable]) – Second string to compare.

• processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

editops – edit operations required to turn s1 into s2

Return type:

Editops

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described in [Hyy04]. It has a time complexity and memory usage of `O([N/64] * M)`.

References

[Hyy04] Hyyrö, Heikki. "Bit-parallel LCS-length computation revisited". Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA 2004).

Examples

```
>>> from rapidfuzz.distance import Indel
>>> for tag, src_pos, dest_pos in Indel.editops("qabxcd", "abycdf"):
...    print(("%7s s1[%d] s2[%d]" % (tag, src_pos, dest_pos)))
 delete s1[0] s2[0]
 delete s1[3] s2[2]
 insert s1[4] s2[2]
 insert s1[6] s2[5]
```
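To make the `(tag, src_pos, dest_pos)` convention concrete, here is a pure-Python sketch that replays such a list of triples to turn s1 into s2. The helper `apply_editops` is made up for illustration and is not part of rapidfuzz; the hard-coded `ops` list assumes the editops of `"qabxcd"` → `"abycdf"` shown above.

```python
def apply_editops(ops, s1, s2):
    # Replay (tag, src_pos, dest_pos) triples (sorted by src_pos) on s1.
    out = []
    src = 0
    for tag, src_pos, dest_pos in ops:
        out.append(s1[src:src_pos])   # copy the unchanged stretch of s1
        src = src_pos
        if tag == "delete":
            src += 1                  # skip the deleted element of s1
        elif tag == "insert":
            out.append(s2[dest_pos])  # splice in one element of s2
    out.append(s1[src:])              # copy the trailing stretch
    return "".join(out)

# Assumed editops for "qabxcd" -> "abycdf" (see the example above).
ops = [("delete", 0, 0), ("delete", 3, 2), ("insert", 4, 2), ("insert", 6, 5)]
print(apply_editops(ops, "qabxcd", "abycdf"))  # abycdf
```

Note that the number of triples (4 here) equals the Indel distance, since each insertion or deletion is one edit.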

### opcodes

rapidfuzz.distance.Indel.opcodes(s1, s2, *, processor=None)

Return Opcodes describing how to turn s1 into s2.

Parameters:
• s1 (Sequence[Hashable]) – First string to compare.

• s2 (Sequence[Hashable]) – Second string to compare.

• processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

Returns:

opcodes – edit operations required to turn s1 into s2

Return type:

Opcodes

Notes

The alignment is calculated using an algorithm of Heikki Hyyrö, which is described in [Hyy04]. It has a time complexity and memory usage of `O([N/64] * M)`.

References

[Hyy04] Hyyrö, Heikki. "Bit-parallel LCS-length computation revisited". Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA 2004).

Examples

```
>>> from rapidfuzz.distance import Indel
```

```
>>> a = "qabxcd"
>>> b = "abycdf"
>>> for tag, i1, i2, j1, j2 in Indel.opcodes(a, b):
...    print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])))
 delete a[0:1] (q) b[0:0] ()
  equal a[1:3] (ab) b[0:2] (ab)
 delete a[3:4] (x) b[2:2] ()
 insert a[4:4] () b[2:3] (y)
  equal a[4:6] (cd) b[3:5] (cd)
 insert a[6:6] () b[5:6] (f)
```
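Opcodes describe contiguous slices of both strings, so replaying them is a matter of keeping `equal` slices of a and `insert` slices of b. The helper `apply_opcodes` below is made up for illustration (not part of rapidfuzz), and the hard-coded tuples assume the opcodes of `"qabxcd"` → `"abycdf"` shown above:

```python
def apply_opcodes(opcodes, a, b):
    # Replay (tag, i1, i2, j1, j2) tuples to rebuild b from a.
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out.append(a[i1:i2])  # unchanged stretch, present in both strings
        elif tag == "insert":
            out.append(b[j1:j2])  # material that only exists in b
        # "delete" slices of a contribute nothing to the output
    return "".join(out)

# Assumed opcodes for "qabxcd" -> "abycdf" (see the example above).
opcodes = [
    ("delete", 0, 1, 0, 0),
    ("equal", 1, 3, 0, 2),
    ("delete", 3, 4, 2, 2),
    ("insert", 4, 4, 2, 3),
    ("equal", 4, 6, 3, 5),
    ("insert", 6, 6, 5, 6),
]
print(apply_opcodes(opcodes, "qabxcd", "abycdf"))  # abycdf
```

This mirrors the tuple convention of Python's `difflib.SequenceMatcher.get_opcodes()`, except that Indel opcodes never contain a `replace` tag.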

## Performance

Since the Levenshtein module uses different implementations depending on the weights, the performance characteristics differ as well. The following sections show the performance for the different possible weights.

### Indel

Benchmarks of the Indel distance in RapidFuzz and python-Levenshtein show that, similar to the normal Levenshtein distance, python-Levenshtein uses an implementation with a time complexity of `O(NM)`, while RapidFuzz has a time complexity of `O([N/64]M)`.

## Implementation Notes

The following implementations are used, with a worst-case performance of `O([N/64]M)`:

• if max is 0, the similarity can be calculated using a direct comparison, since no difference between the strings is allowed. The time complexity of this algorithm is `O(N)`.

• if max is 1 and the two strings have a similar length, the similarity can be calculated using a direct comparison as well, since a substitution would cause an edit distance higher than max. The time complexity of this algorithm is `O(N)`.

• A common prefix/suffix of the two compared strings does not affect the Indel distance, so the affix is removed before calculating the similarity.

• If max is ≤ 4, the mbleven algorithm is used. This algorithm checks all edit operations that are possible under the threshold max. Unlike for the normal Levenshtein distance, this algorithm can be used up to a threshold of 4 here, since the higher weight of substitutions reduces the number of possible edit operations. The time complexity of this algorithm is `O(N)`.

• If the length of the shorter string is ≤ 64 after removing the common affix, Hyyrö's LCS algorithm is used, which calculates the Indel distance in parallel. The algorithm is described by Hyyrö [Hyy04] and is extended with support for UTF32 in this implementation. The time complexity of this algorithm is `O(N)`.

• If the length of the shorter string is greater than 64 after removing the common affix, a blockwise implementation of Hyyrö's LCS algorithm is used, which calculates the Indel distance in parallel (64 characters at a time). The algorithm is described by Hyyrö [Hyy04]. The time complexity of this algorithm is `O([N/64]M)`.
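Two of these steps can be sketched in pure Python (a hedged illustration, not RapidFuzz's actual implementation; all function names are made up): common-affix trimming and Hyyrö's bit-parallel LCS for patterns of at most 64 symbols.

```python
def trim_common_affix(s1, s2):
    # A shared prefix/suffix never changes the Indel distance, so strip it first.
    pre = 0
    while pre < len(s1) and pre < len(s2) and s1[pre] == s2[pre]:
        pre += 1
    s1, s2 = s1[pre:], s2[pre:]
    suf = 0
    while suf < len(s1) and suf < len(s2) and s1[-1 - suf] == s2[-1 - suf]:
        suf += 1
    return s1[: len(s1) - suf], s2[: len(s2) - suf]

def bitparallel_lcs(pattern, text):
    # Hyyro-style bit-parallel LCS length; the pattern must fit in one
    # 64-bit machine word (emulated here by masking Python's big ints).
    assert len(pattern) <= 64
    if not pattern:
        return 0
    match = {}
    for i, ch in enumerate(pattern):           # per-character match masks
        match[ch] = match.get(ch, 0) | (1 << i)
    mask = (1 << len(pattern)) - 1
    S = mask                                   # all ones: no LCS cells set yet
    for ch in text:
        u = S & match.get(ch, 0)
        S = ((S + u) | (S - u)) & mask         # one column update per text char
    return bin(~S & mask).count("1")           # zero bits of S = LCS length

def indel_distance(s1, s2):
    s1, s2 = trim_common_affix(s1, s2)
    if len(s1) > len(s2):
        s1, s2 = s2, s1                        # use the shorter string as pattern
    return len(s1) + len(s2) - 2 * bitparallel_lcs(s1, s2)

print(indel_distance("lewenstein", "levenshtein"))  # 3
```

The blockwise variant used for longer strings applies the same column update across an array of 64-bit words, carrying bits between words.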