Issue
I have to normalize the Levenshtein distance to a value between 0 and 1. I see different variations floating around on SO.
I am thinking of adopting the following approach:
- given two strings, s1 and s2
- len = max(s1.length(), s2.length());
- normalized_distance = float(len - levenshteinDistance(s1, s2)) / float(len);
Then the highest score 1.0 means an exact match and 0.0 means no match.
But I see variations here:
- two whole texts similarity using levenshtein distance, where the score is 1 - distance(a, b) / max(a.length, b.length)
- Difference in normalization of Levenshtein (edit) distance?
- Explanation of normalized edit distance formula
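For comparison, the formula proposed above and the 1 - distance(a, b) / max(a.length, b.length) variant appear to be the same expression written two ways. A short derivation, where d stands for levenshteinDistance(s1, s2) and m for max(s1.length(), s2.length()) (both symbols are shorthand introduced here, and m > 0 is assumed):

```latex
\[
\frac{m - d}{m} \;=\; \frac{m}{m} - \frac{d}{m} \;=\; 1 - \frac{d}{m}
\]
% Worked example: s1 = "kitten", s2 = "sitting", so d = 3 and m = 7.
\[
\frac{7 - 3}{7} \;=\; 1 - \frac{3}{7} \;=\; \frac{4}{7} \;\approx\; 0.571
\]
```

Other variants discussed in the linked questions may divide by a different quantity, so they need not agree with this one.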
I am wondering whether there is a canonical implementation in Java. I know that org.apache.commons.text only implements LevenshteinDistance and not a normalized Levenshtein distance.
Solution
The first answer you link to begins with "The effects of both variants should be nearly the same". The reason a normalized LevenshteinDistance doesn't exist is that nobody (you included) has seen fit to implement it. Besides, it seems rather trivial once you have the Levenshtein distance:
private static double normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    // The longer string's length is the largest distance the two strings can have.
    int maxLength = Math.max(s1.length(), s2.length());
    if (maxLength == 0) {
        return 0.0; // two empty strings are identical; also avoids dividing by zero
    }
    return levenshtein / maxLength;
}
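To try this end to end, here is a minimal self-contained sketch, not a canonical implementation. The class name NormalizedLevenshtein and its method names are made up for illustration. The distance itself is a plain two-row dynamic-programming version so the example has no dependencies; with commons-text on the classpath, LevenshteinDistance.getDefaultInstance().apply(s1, s2) could be used in its place. Treating two empty strings as distance 0.0 is an assumption about the desired edge-case behaviour.

```java
final class NormalizedLevenshtein {

    private NormalizedLevenshtein() {
    }

    // Plain Levenshtein distance (insert, delete, substitute, each costing 1),
    // computed with two rolling rows of the dynamic-programming table.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance scaled to [0, 1]: 0.0 means identical, 1.0 means maximally different.
    static double normalizedDistance(CharSequence a, CharSequence b) {
        int maxLength = Math.max(a.length(), b.length());
        if (maxLength == 0) {
            return 0.0; // assumption: two empty strings count as identical
        }
        return (double) distance(a, b) / maxLength;
    }

    // Similarity as in the question: 1.0 means exact match, 0.0 means no match.
    static double similarity(CharSequence a, CharSequence b) {
        return 1.0 - normalizedDistance(a, b);
    }
}
```

For example, NormalizedLevenshtein.distance("kitten", "sitting") is 3, so the normalized distance is 3/7 and the similarity is 4/7. Dividing by the longer length keeps the result within [0, 1] because the Levenshtein distance can never exceed that length.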
After 3 days, once this has been thoroughly ripped to shreds, I'll add it as a GitHub issue on commons-text.
Answered By - hd1