rapidfuzz.string_metric

levenshtein

rapidfuzz.string_metric.levenshtein(s1, s2, *, weights=(1, 1, 1), processor=None, max=None)

Calculates the minimum number of insertions, deletions, and substitutions required to change one sequence into the other according to Levenshtein, with custom costs for insertion, deletion, and substitution.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (Tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • max (int or None, optional) – Maximum distance between s1 and s2 that is considered as a result. If the distance is greater than max, max + 1 is returned instead. Default is None, which deactivates this behaviour.

Returns:

distance – distance between s1 and s2

Return type:

int

Raises:

Examples

Find the Levenshtein distance between two strings:

>>> from rapidfuzz.string_metric import levenshtein
>>> levenshtein("lewenstein", "levenshtein")
2

Setting a maximum distance lets the function choose a more efficient implementation. Here the actual distance (2) exceeds max, so max + 1 is returned:

>>> levenshtein("lewenstein", "levenshtein", max=1)
2

Different weights can be selected by passing a weight tuple. Here a substitution costs 2, the same as a deletion followed by an insertion:

>>> levenshtein("lewenstein", "levenshtein", weights=(1,1,2))
3
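The three doctest results above can be reproduced with a small pure-Python reference. This is only a sketch of the textbook dynamic program, not the library's optimized implementation; the names levenshtein_sketch and max_dist are hypothetical and chosen for illustration.

```python
def levenshtein_sketch(s1, s2, weights=(1, 1, 1), max_dist=None):
    """Weighted Levenshtein distance via the classic two-row dynamic program."""
    insert_cost, delete_cost, replace_cost = weights
    # previous[j] is the cost of turning the empty prefix of s1 into s2[:j].
    previous = [j * insert_cost for j in range(len(s2) + 1)]
    for i, ch1 in enumerate(s1, start=1):
        current = [i * delete_cost]
        for j, ch2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + delete_cost,       # drop ch1
                current[j - 1] + insert_cost,    # add ch2
                previous[j - 1] + (0 if ch1 == ch2 else replace_cost),
            ))
        previous = current
    distance = previous[-1]
    # Mirror the documented cutoff: anything above max is reported as max + 1.
    if max_dist is not None and distance > max_dist:
        return max_dist + 1
    return distance


print(levenshtein_sketch("lewenstein", "levenshtein"))                     # 2
print(levenshtein_sketch("lewenstein", "levenshtein", max_dist=1))         # 2
print(levenshtein_sketch("lewenstein", "levenshtein", weights=(1, 1, 2)))  # 3
```

The sketch always fills the whole table, so it illustrates only the return value of the cutoff, not the speed-up a real implementation can gain from it.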

normalized_levenshtein

rapidfuzz.string_metric.normalized_levenshtein(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None)

Calculates a normalized Levenshtein similarity, scaled to the range 0 to 100, using custom costs for insertion, deletion, and substitution.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • weights (Tuple[int, int, int] or None, optional) – The weights for the three operations in the form (insertion, deletion, substitution). Default is (1, 1, 1), which gives all three operations a weight of 1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional score threshold as a float between 0 and 100. If the similarity is below score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2 as a float between 0 and 100

Return type:

float

Raises:

See also

levenshtein – Levenshtein distance

Examples

Find the normalized Levenshtein similarity between two strings:

>>> from rapidfuzz.string_metric import normalized_levenshtein
>>> normalized_levenshtein("lewenstein", "levenshtein")
81.81818181818181

Setting a score_cutoff lets the function choose a more efficient implementation. Here the similarity (about 81.8) is below the cutoff, so 0 is returned:

>>> normalized_levenshtein("lewenstein", "levenshtein", score_cutoff=85)
0.0

Different weights can be selected by passing a weight tuple:

>>> normalized_levenshtein("lewenstein", "levenshtein", weights=(1,1,2))
85.71428571428571

When a custom processor is used, s1 and s2 do not have to be strings:

>>> normalized_levenshtein(["lewenstein"], ["levenshtein"], processor=lambda s: s[0])
81.81818181818181
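The scores in the examples above are consistent with dividing the weighted distance by the largest distance the two lengths allow and scaling to 0 to 100. The sketch below works under that assumption; the normalization rule is inferred from the example values rather than stated in this reference, and normalized_similarity_sketch is a hypothetical helper that takes an already computed distance.

```python
def normalized_similarity_sketch(distance, len1, len2,
                                 weights=(1, 1, 1), score_cutoff=None):
    """Scale a weighted Levenshtein distance to a 0-100 similarity (assumed rule)."""
    insert_cost, delete_cost, replace_cost = weights
    # Worst case A: delete all of s1, then insert all of s2.
    max_dist = len1 * delete_cost + len2 * insert_cost
    # Worst case B: substitute along the shorter length, then fix the remainder.
    if len1 >= len2:
        max_dist = min(max_dist, len2 * replace_cost + (len1 - len2) * delete_cost)
    else:
        max_dist = min(max_dist, len1 * replace_cost + (len2 - len1) * insert_cost)
    # Assumption: two empty sequences count as identical.
    similarity = 100.0 if max_dist == 0 else 100.0 * (1 - distance / max_dist)
    if score_cutoff is not None and similarity < score_cutoff:
        return 0.0
    return similarity


# "lewenstein" has 10 characters, "levenshtein" has 11.
print(round(normalized_similarity_sketch(2, 10, 11), 2))                     # 81.82
print(round(normalized_similarity_sketch(3, 10, 11, weights=(1, 1, 2)), 2))  # 85.71
print(normalized_similarity_sketch(2, 10, 11, score_cutoff=85))              # 0.0
```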

hamming

rapidfuzz.string_metric.hamming(s1, s2, *, processor=None, max=None)

Calculates the Hamming distance between two strings. The Hamming distance is defined as the number of positions where the two strings differ. It describes the minimum number of substitutions required to transform s1 into s2.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • max (int or None, optional) – Maximum distance between s1 and s2 that is considered as a result. If the distance is greater than max, max + 1 is returned instead. Default is None, which deactivates this behaviour.
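As a rough illustration of the definition and the max cutoff described above, a pure-Python sketch might look as follows. The name hamming_sketch is hypothetical, and rejecting inputs of unequal length is an assumption borrowed from the ValueError documented for normalized_hamming.

```python
def hamming_sketch(s1, s2, max_dist=None):
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        # Assumption: unequal lengths are rejected, as for normalized_hamming.
        raise ValueError("s1 and s2 have a different length")
    distance = sum(a != b for a, b in zip(s1, s2))
    # Mirror the documented cutoff: anything above max is reported as max + 1.
    if max_dist is not None and distance > max_dist:
        return max_dist + 1
    return distance


print(hamming_sketch("karolin", "kathrin"))              # 3
print(hamming_sketch("karolin", "kathrin", max_dist=1))  # 2
```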

Returns:

distance – distance between s1 and s2

Return type:

int

Raises:

ValueError – If s1 and s2 have a different length

normalized_hamming

rapidfuzz.string_metric.normalized_hamming(s1, s2, *, processor=None, score_cutoff=None)

Calculates a normalized Hamming similarity, scaled to the range 0 to 100.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional score threshold as a float between 0 and 100. If the similarity is below score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2 as a float between 0 and 100

Return type:

float

Raises:

ValueError – If s1 and s2 have a different length

See also

hamming – Hamming distance

Deprecated: Use rapidfuzz.distance.Hamming.normalized_similarity() instead. This function will be removed in v3.0.0.
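A minimal sketch of the normalized score, assuming it is the share of matching positions scaled to 0 to 100. That formula is an assumption, not something this reference states, and normalized_hamming_sketch is a hypothetical name.

```python
def normalized_hamming_sketch(s1, s2, score_cutoff=None):
    """Share of matching positions, scaled to 0-100 (assumed normalization)."""
    if len(s1) != len(s2):
        raise ValueError("s1 and s2 have a different length")
    if len(s1) == 0:
        return 100.0  # assumption: two empty sequences count as identical
    mismatches = sum(a != b for a, b in zip(s1, s2))
    similarity = 100.0 * (1 - mismatches / len(s1))
    if score_cutoff is not None and similarity < score_cutoff:
        return 0.0
    return similarity


print(round(normalized_hamming_sketch("karolin", "kathrin"), 2))   # 57.14
print(normalized_hamming_sketch("karolin", "kathrin", score_cutoff=60))  # 0.0
```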

jaro_similarity

rapidfuzz.string_metric.jaro_similarity(s1, s2, *, processor=None, score_cutoff=None)

Calculates the Jaro similarity.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional score threshold as a float between 0 and 100. If the similarity is below score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.

Returns:

similarity – similarity between s1 and s2 as a float between 0 and 100

Return type:

float

Deprecated since version 2.0.0: Use rapidfuzz.distance.Jaro.similarity() instead. This function will be removed in v3.0.0.
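For readers unfamiliar with the metric, the textbook Jaro definition can be sketched in pure Python as below. This is not the library's implementation; jaro_sketch is a hypothetical name, and treating two empty sequences as a perfect match is an assumption.

```python
def jaro_sketch(s1, s2):
    """Textbook Jaro similarity, scaled to 0-100."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 100.0  # assumption: two empty sequences count as identical
    # Characters match only if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both match sequences in order; out-of-order pairs are transpositions.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    similarity = (matches / len1 + matches / len2
                  + (matches - transpositions) / matches) / 3
    return 100.0 * similarity


print(round(jaro_sketch("MARTHA", "MARHTA"), 2))  # 94.44
```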

jaro_winkler_similarity

rapidfuzz.string_metric.jaro_winkler_similarity(s1, s2, *, prefix_weight=0.1, processor=None, score_cutoff=None)

Calculates the Jaro-Winkler similarity.

Parameters:
  • s1 (Sequence[Hashable]) – First string to compare.

  • s2 (Sequence[Hashable]) – Second string to compare.

  • prefix_weight (float, optional) – Weight used for the common prefix of the two strings. Has to be between 0 and 0.25. Default is 0.1.

  • processor (callable, optional) – Optional callable that is used to preprocess the strings before comparing them. Default is None, which deactivates this behaviour.

  • score_cutoff (float, optional) – Optional score threshold as a float between 0 and 100. If the similarity is below score_cutoff, 0 is returned instead. Default is None, which deactivates this behaviour.
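The role of prefix_weight can be illustrated with the standard Winkler adjustment, sketched below on top of an already computed Jaro score. Several details are assumptions rather than facts from this reference: the four-character prefix limit, the 0.7 boost threshold, and raising ValueError for an out-of-range prefix_weight. The helper name winkler_adjust_sketch is hypothetical.

```python
def winkler_adjust_sketch(jaro_score, s1, s2, prefix_weight=0.1):
    """Apply the standard Winkler prefix boost to a 0-100 Jaro score."""
    if not 0.0 <= prefix_weight <= 0.25:
        # Assumption: an out-of-range weight is rejected with ValueError.
        raise ValueError("prefix_weight has to be between 0 and 0.25")
    jaro = jaro_score / 100.0
    # Length of the common prefix, capped at four characters (assumed cap).
    prefix_len = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix_len += 1
    # Assumption: only already similar pairs (Jaro above 0.7) receive the boost.
    if jaro > 0.7:
        jaro += prefix_len * prefix_weight * (1.0 - jaro)
    return 100.0 * jaro


# The Jaro score for "MARTHA" / "MARHTA" is 100 * 17 / 18; the shared prefix is "MAR".
print(round(winkler_adjust_sketch(100.0 * 17 / 18, "MARTHA", "MARHTA"), 2))  # 96.11
```

With a prefix of at most four characters, a weight of at most 0.25 keeps the boosted score from exceeding 100, which is one plausible reason for the documented upper bound.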

Returns:

similarity – similarity between s1 and s2 as a float between 0 and 100

Return type:

float

Raises: