Similarity Metrics Guide
If you have already explored the /compare endpoint of our Compare API in the interactive API documentation, you will know that it returns a Metric object containing several similarity and distance metrics:
class Metric {
cosineSimilarity: 0.34935777200667767
euclideanDistance: 0.6797804208600183
jaccardDistance: 0.809368191721133
overlappingAll: 175
overlappingLeftRight: 0.5335365853658537
overlappingRightLeft: 0.22875816993464052
sizeLeft: 328
sizeRight: 765
weightedScoring: 56.1925405151255
}
Each of these metrics offers a different perspective on similarity. It is up to you, the user, to decide which one (or which combination of metrics) is most appropriate for your use case.
There is also a /compare/bulk endpoint, which performs multiple pairwise comparisons in one call and returns a list of Metric objects corresponding to the input list of pairs.
On this page you can find a brief explanation of each of the metrics along with some links to further reading.
Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space: it is the cosine of the angle between them. Generally speaking, the closer this value is to 1.0, the more similar the inputs are to each other. As a guide when using Cortical.io’s Retina, we find that a value of around 0.3 indicates a level of similarity that is sufficient for general purposes.
You can find more detailed information about cosine similarity on Wikipedia.
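For binary vectors, where each position is either active or inactive, the dot product equals the number of shared positions and each norm is the square root of the number of active positions. The sketch below assumes the compared items can be treated this way (an assumption for illustration, not something this guide states); the function name is hypothetical. Under that assumption, the example values above (175, 328, 765) give a result that agrees with the cosineSimilarity field shown.

```python
import math


def cosine_from_overlap(overlap: int, size_left: int, size_right: int) -> float:
    """Cosine similarity of two binary vectors, given only their sizes.

    Assumption: each item is a set of active positions, so
    dot(a, b) = overlap and |a| = sqrt(size_left), |b| = sqrt(size_right).
    """
    if size_left == 0 or size_right == 0:
        # An empty item has no direction; treat the similarity as 0.
        return 0.0
    return overlap / math.sqrt(size_left * size_right)


# Values taken from the example Metric object above.
print(round(cosine_from_overlap(175, 328, 765), 4))  # 0.3494
```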
Euclidean Distance
In mathematics, the Euclidean distance (or Euclidean metric) is the ordinary straight-line distance between two points, the distance one would measure with a ruler. The closer this value is to 0.0 (zero), the more similar the two items are.
For more details see the Euclidean Distance Wikipedia entry.
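As a generic illustration of the definition, the following sketch computes the plain Euclidean distance between two equal-length numeric vectors. It does not attempt to reproduce any scaling or normalisation the API may apply to the euclideanDistance field; the function name is hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two points of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0, 0], [3, 4]))        # 5.0
print(euclidean_distance([1, 0, 1], [1, 0, 1]))  # 0.0 (identical items)
```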
Jaccard Distance
The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1.0. The closer the Jaccard distance is to 1.0 the more dissimilar the two items are.
More detailed information can be found on Wikipedia’s Jaccard Distance page.
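The definition can be sketched directly from set sizes: the Jaccard coefficient is the size of the intersection divided by the size of the union, and the distance is 1.0 minus that. The sketch assumes the compared items are sets of positions whose sizes and overlap are known (the function name is hypothetical). With the example values above, the result agrees with the jaccardDistance field shown.

```python
def jaccard_distance_from_overlap(overlap: int, size_left: int, size_right: int) -> float:
    """Jaccard distance = 1 - |A ∩ B| / |A ∪ B|, computed from set sizes."""
    union = size_left + size_right - overlap
    if union == 0:
        # Two empty sets: conventionally treated as identical.
        return 0.0
    return 1.0 - overlap / union


# Values taken from the example Metric object above.
print(round(jaccard_distance_from_overlap(175, 328, 765), 4))  # 0.8094
```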
Overlapping
Overlapping is a Cortical.io-defined similarity measure based on the number of overlapping points (positions) between the two items being compared, where "left" is the first item in the comparison and "right" is the second. overlappingLeftRight is the fraction of the left side's positions that are also included in the right side (using the example values above: 175/328), and overlappingRightLeft is the fraction of the right side's positions that are also included in the left side (using the example values above: 175/765).
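The two calculations quoted above can be reproduced in a few lines. The function name is hypothetical; the inputs correspond to the overlappingAll, sizeLeft, and sizeRight fields of the example Metric object.

```python
from typing import Tuple


def overlap_fractions(overlapping_all: int, size_left: int, size_right: int) -> Tuple[float, float]:
    """Return (left-in-right, right-in-left) overlap fractions."""
    left_right = overlapping_all / size_left if size_left else 0.0
    right_left = overlapping_all / size_right if size_right else 0.0
    return left_right, right_left


left_right, right_left = overlap_fractions(175, 328, 765)
print(round(left_right, 4))  # 0.5335
print(round(right_left, 4))  # 0.2288
```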
When combined with the image view of a Semantic Fingerprint the overlapping positions enable a qualitative measure of similarity, in that it is then possible to identify the overlapping regions and to isolate the semantics of a particular region.
Weighted Scoring
This is a Cortical.io-defined weighting for similarity measures. The higher the weighting, the more similar the terms. This measure can be used in one-to-many comparisons where one side of the comparison remains constant: for instance, an email filter, where the filter remains constant but is compared to many different emails.
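A one-to-many comparison of this kind might be post-processed as sketched below. The sketch assumes the Metric objects have already been parsed into dictionaries keyed by the field names shown in the example at the top of this page, and that they arrive in the same order as the candidates that were compared; the function name, labels, and scores are hypothetical illustration data, not API output.

```python
from typing import Dict, List, Tuple


def rank_by_weighted_scoring(
    labels: List[str], metrics: List[Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Pair each candidate with its weightedScoring and sort, best first."""
    if len(labels) != len(metrics):
        raise ValueError("expected one Metric per candidate")
    scored = [(label, m["weightedScoring"]) for label, m in zip(labels, metrics)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical results of comparing one fixed filter against three emails.
labels = ["email-1", "email-2", "email-3"]
metrics = [
    {"weightedScoring": 12.5},
    {"weightedScoring": 56.2},
    {"weightedScoring": 3.1},
]
ranked = rank_by_weighted_scoring(labels, metrics)
print(ranked[0][0])  # email-2
```

Because one side of the comparison is fixed, the scores are compared only against each other, which is why a simple sort is enough here.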
API Clients
The FullClient object available in the Java, Python, and JavaScript client libraries has the following methods for calling the compare endpoints:
- compare
- compareBulk