Jaro-Winkler & other string similarity measures in Ruby
Jaro-Winkler distance
The Jaro-Winkler distance is a measure of similarity between two strings. The Jaro measure is the weighted sum of percentage of matched characters. Winkler increased this measure for matching initial characters, then rescaled it.
Application
The distance is mainly used in the area of record linkage or duplicate detection, where record linkage is the methodology of bringing together corresponding records from two or more sources with finding duplicates within files. It originates from the public health area when files of individual patients were brought together using name, date-of-birth and other information, but spread to other domains including statistics, computer science and data mining.
Budzinsky concluded in a review of twenty string comparators that the methods of Jaro and Winkler worked second best. This review is unfortunately unpublished but gets cited in domain-related literature.
Characteristics
In many matching situations, it is not possible to compare two strings exactly (character-by-character) because of misspellings and typographical errors. Dealing with this via approximate string comparison has been a major research areas. In record linkage, one needs to have a function that represents approximate agreement, with agreement being represented by $1$ and degrees of partial agreement being represented by numbers between $0$ and $1$.
The Algorithm
Jaro introduced a string comparator measure that gives values of partial agreement between two strings. The string comparator accounts for length of strings and partially accounts for the types of errors typically made in alphanumeric strings by human beings like insertions, deletions and transpositions. It is used in adjusting exact agreement weights when two strings do not agree on a character-by-character basis.
The basic Jaro algorithm has three components:
- compute the string lengths
- find the number of common characters in the two strings
- find the number of transpositions.
The definition of common is that the agreeing character must be within half the length of the shorter string:
$${{max(|s_1|,|s_2|)}/2}-1$$
The definition of a transposition is that the character from one string is out of order with the corresponding common character from the other string. More specifically the first assigned character on one string is compared to the first assigned character on the other string. If the characters are not the same, half of a transposition has occurred. Then the second assigned character on one string is compared to the second assigned character on the other string, etc. The number of mismatched characters is divided by two to yield the number of transpositions. However in some implementations a different approach is taken, taking insertions or deletion into consideration.
$$\text" if "c=0\text" then "m=0$$ $$\text" if "c>0\text" then "m=w_1{c/{|s_1|}}+w_2{c/{|s_2|}}+w_t{{c-t}/c}$$
$$\table w_1, =, \text"weight associated with characters in the first of two files"; w_2, =, \text"weight associated with characters in the second of two files"; w_t, =, \text"weight associated with transpositions"; , , ; s_1, =, \text"first string"; s_2, =, \text"second string"; t, =, \text"number of 'half' transpositions of characters"; c, =, \text"number of characters in common in pair of strings"$$
If two strings agree on a character-by-character basis, then the Jaro string comparator $m$ is set to $w_1+w_2+ w_t$, which is the maximum value that $m$ can assume. The minimum value is $0$, which occurs when the two strings have no characters in common. For present matching applications $w_1$, $w_2$ and $w_t$ are arbitrarily set to $1/3$.
The Winkler adjustment to the Jaro distance basically modifies the basic string comparator according to whether the first few characters in the strings being compared agree. $m=m+(n_i*0.1*{(1-m)})$, if the first i characters agree. The weight is standardized by $0.1$. However Winkler introduced a more fine-grained function, in which the weights depend on the type of string (first name, last name, street, etc.).
Implementation
# extension for array class
class Array
# select array items with index
# give a block both the item with index of array
# filtered by a select statement
def select_with_index
index = -1
select { |x| index += 1; yield(x, index) }
end
# return indices array of array item
# example all indices of a in string "aaabaaabba"
def aindices(o)
out = Array.new
select_with_index { |x, i|
out << i if x == o }
out
end
end
# jaro winkler distance, input 2 strings, state whether
# winkler adjustment should be used or not
def jaro_winkler(str1, str2, winkleradjust=false)
# if strings (without trailing & leadning spaces) are equal - return 1
#return 1 if str1.strip==str2.strip
# either string blank - return 0
#return 0 if str1.size==0 or str2.size==0
m = 0 # number of matching chars
tr = 0 # number of transpositions
# remove trailing & leading spaces and split strings to character arrays
s1 = str1.strip.split(//)
s2 = str2.strip.split(//)
# get character array length
s1l = s1.length
s2l = s2.length
# str2 should be the longer string
if s1l > s2l
tmp = s2
s2 = s1
s1 = tmp
end
# hash from all unique str2 chars + occurances
# example 'aba': hash={ a => 0, b => 0 } a: first occurance, b first occurance
# if the first a was visited: { a => 1, b => 0} a: second occuance, b second occurance
found = Hash[*s2.uniq.sort.collect { |v| [v,0]}.flatten]
# matching distance definition
md = (([s1l,s2l].max / 2) - 1).to_i
s1.each_with_index do |c,i|
# find number of matching chars
if !found[c].nil? # character exists in str2
# calculates distance between 2 matching characters compare with md
if !s2.aindices(c)[found[c]].nil?
x = (s2.aindices(c)[found[c]] - i).abs
if x <= md
found[c] += 1 # increase occurance of character
m += 1 # increase number of matching characters
# transpositions?
if (x != 0)
tr += 1
end
end
end
end
end
tr = (tr/2).to_i
# calc jaro-distance
third = 1.0/3
jd = (third * m / s1l) + (third * m / s2l) + (third * (m - tr) / m)
out = jd
# winkleradjust? if first l characters are the same
if winkleradjust
l = 0
(0..s1l-1).each { |i| s1[i]==s2[i] ? l+=1 : break }
out = jd + (l * 0.1 * (1 - jd))
end
end
more implementations: C, Java, PHP, Python
Literature
- Budzinsky, C.D. (1991), Automated spelling correction,
Unpublished Report, Statistics Canada, Ottawa - Jaro, M.A. (1989), Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association 84(406):414--420.
- Jaro, M. A. (1995), Probabilistic linkage of large public health data file, Statistics in Medicine 14:(5-7)
- Winkler, W.E. (1999), The state of record linkage and current research problems, Statistics of Income Division, Internal Revenue Service Publication R99/04
- Winkler, W.E. (2006), Overview of Record Linkage and Current Research Directions, Statistical Research Division, U.S. Census Bureau
Other string similarity measures
Levenshtein distance
def levenshtein(str1, str2)
return str1.length if 0 == str2.length
return str2.length if 0 == str1.length
c = Array.new
c << (str1[0] == str2[0] ? 0 : 1) + (levenshtein str1[1..-1], str2[1..-1])
c << 1 + levenshtein(str1[1..-1], str2)
c << 1 + levenshtein(str1, str2[1..-1])
return c.min
end
more info & implementations: 1 | 2 | 3 | 4
recursive algorithm, therefore a little bit non-performant ;), for the non recursive version see damerau extension...
Damerau–Levenshtein distance
# extension for string class
class String
# return character array of string with indices
def each_char_with_index
i = 0
split(//).each do |c|
yield i, c
i += 1
end
end
end
def damerau_levenshtein(str1, str2)
d = Array.new(str1.size+1){Array.new(str2.size+1)}
for i in (0..str1.size)
d[i][0] = i
end
for j in (0..str2.size)
d[0][j] = j
end
str1.each_char_with_index do |i,c1|
str2.each_char_with_index do |j,c2|
c = (c1 == c2 ? 0 : 1)
d[i+1][j+1] = [
d[i][j+1] + 1, #deletion
d[i+1][j] + 1, #insertion
d[i][j] + c].min #substitution
if (i>0) and (j>0) and (str1[i]==str2[j-1]) and (str1[i-1]==str2[j])
d[i+1][j+1] = [
d[i+1][j+1],
d[i-1][j-1] + c].min #transposition
end
end
end
d[str1.size][str2.size]
end
more info & implementations: 1 | 3 | 4
Dice's coefficient
# extension for string class
class String
# get ngrams of string
def ngrams(len = 1)
ngrams = []
len = size if len > size
(0..size - len).each do |n|
ng = self[n...(n + len)]
ngrams.push(ng)
end
ngrams
end
end
def dice_coefficient(str1, str2)
str1_2grams = str1.ngrams(2)
str2_2grams = str2.ngrams(2)
intersection = (str1_2grams & str2_2grams).length
total = str1_2grams.length + str2_2grams.length
dice = 2.0 * intersection / total
end
more info & implementations: 1
Hamming Distance
def hamming_distance(str1, str2)
str1.split(//).zip(str2.split(//)).inject(0) { |h, e| e[0]==e[1] ? h+0 : h+1 }
end