Similarity Metrics Guide

If you have already explored the /compare endpoint of our Compare API in our interactive API documentation, you will know that it returns a Metric object containing several similarity and distance metrics:

class Metric {
    cosineSimilarity: 0.34935777200667767
    euclideanDistance: 0.6797804208600183
    jaccardDistance: 0.809368191721133
    overlappingAll: 175
    overlappingLeftRight: 0.5335365853658537
    overlappingRightLeft: 0.22875816993464052
    sizeLeft: 328
    sizeRight: 765
    weightedScoring: 56.1925405151255
}

These metrics each offer a different perspective on similarity. It is up to you to decide which metric, or which combination of metrics, is most appropriate for your use case.

There is also a /compare/bulk endpoint, which performs multiple pairwise comparisons in a single call and returns a list of Metric objects corresponding to the input list of pairs.

On this page you can find a brief explanation of each of the metrics along with some links to further reading.

Cosine Similarity

Cosine similarity is a measure of similarity between two vectors of an inner product space: it is the cosine of the angle between them. Generally speaking, the closer this value is to 1.0, the more similar the input terms are to each other. As a guide to using Cortical.io’s Retina, we find that a value of around 0.3 indicates a level of similarity that is sufficient for general purposes. You may, of course, require a stricter (or looser) definition of similarity; this is entirely up to you.

Cortical.io’s Compare API (which can be found via the Compare button in our interactive API documentation) lets you decide where to place the threshold. You can find more detailed information about cosine similarity on Wikipedia.
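As a rough illustration of the definition and the threshold, the sketch below computes cosine similarity for two binary vectors represented as sets of active positions. This representation is an assumption made for the example; the guide does not state how the API computes the value internally. The synthetic sets simply reuse the sizes from the example Metric object (328 and 765 positions, 175 shared), and under that assumption the result is consistent with the cosineSimilarity shown there.

```python
import math

def cosine_similarity(left, right):
    """Cosine similarity of two binary vectors, each given as a set of active positions."""
    if not left or not right:
        return 0.0  # convention chosen here for empty input (an assumption)
    shared = len(left & right)
    # For binary vectors the dot product is the overlap count and each
    # vector's norm is the square root of its number of active positions.
    return shared / math.sqrt(len(left) * len(right))

# Synthetic position sets sized like the example: 328 and 765 positions, 175 shared.
left = set(range(328))
right = set(range(153, 153 + 765))

THRESHOLD = 0.3  # the general-purpose guide value from the text; tune as needed

score = cosine_similarity(left, right)
is_similar = score >= THRESHOLD
print(round(score, 4), is_similar)
```

Raising THRESHOLD gives a stricter definition of similarity; lowering it gives a looser one.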

Euclidean Distance

In mathematics, the Euclidean distance (or Euclidean metric) is the ordinary straight-line distance between two points, the distance one would measure with a ruler. The closer this value is to 0.0 (zero), the more similar the two items are.

For more details see the Euclidean Distance Wikipedia entry.
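The sketch below is a hedged illustration, not a description of the API's implementation. It treats each item as a binary vector given as a set of active positions (an assumption) and shows the textbook Euclidean distance for that case. It also shows a normalized variant, hypothetical and named here only for the example, that happens to reproduce the euclideanDistance value in the example Metric object; whether the API computes the value this way in general is not stated in this guide.

```python
import math

def euclidean_distance(left, right):
    """Textbook Euclidean distance between two binary vectors given as position sets."""
    # Each position active in exactly one item contributes 1 to the squared distance.
    return math.sqrt(len(left ^ right))

def normalized_difference(left, right):
    """Hypothetical normalization: differing positions divided by the sum of both sizes."""
    total = len(left) + len(right)
    if total == 0:
        return 0.0  # two empty items are treated as identical (an assumption)
    return len(left ^ right) / total

# Synthetic position sets sized like the example: 328 and 765 positions, 175 shared.
left = set(range(328))
right = set(range(153, 153 + 765))

raw = euclidean_distance(left, right)            # square root of 743 differing positions
normalized = normalized_difference(left, right)  # 743 / 1093
print(raw, normalized)
```

In both variants, identical items yield 0.0, which matches the reading that values closer to zero mean greater similarity.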

Jaccard Distance

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1.0. The closer the Jaccard distance is to 1.0 the more dissimilar the two items are.

More detailed information can be found on Wikipedia’s Jaccard Distance page.
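The definition can be sketched as follows, assuming each item is represented as a set of active positions (an assumption for this example). With synthetic sets sized like the example Metric object, the Jaccard coefficient is 175/918 and the resulting distance is consistent with the jaccardDistance shown there.

```python
def jaccard_distance(left, right):
    """Jaccard distance: 1.0 minus the Jaccard coefficient (intersection over union)."""
    union_size = len(left | right)
    if union_size == 0:
        return 0.0  # two empty sets are treated as identical (an assumption)
    coefficient = len(left & right) / union_size
    return 1.0 - coefficient

# Synthetic position sets sized like the example: 328 and 765 positions, 175 shared.
left = set(range(328))
right = set(range(153, 153 + 765))

distance = jaccard_distance(left, right)  # 1 - 175 / (328 + 765 - 175)
print(distance)
```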

Overlapping

Overlapping is a similarity measure defined by Cortical.io that is based on the number of overlapping points between the two items being compared, where left is the first item in the comparison and right is the second. overlappingLeftRight is the fraction of the left item's positions that are also present in the right item (using the example values above, overlappingAll divided by sizeLeft: 175/328), and overlappingRightLeft is the fraction of the right item's positions that are also present in the left item (using the example values above, overlappingAll divided by sizeRight: 175/765).

When combined with the image view of a Semantic Fingerprint the overlapping positions enable a qualitative measure of similarity, in that it is then possible to identify the overlapping regions and to isolate the semantics of a particular region.
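The overlap calculations can be sketched as below. The function name overlap_metrics is hypothetical, and representing each item as a set of active positions is an assumption made for the example; the returned keys mirror the field names of the Metric object.

```python
def overlap_metrics(left, right):
    """Overlap counts and ratios for two items given as sets of active positions."""
    shared = len(left & right)
    return {
        "overlappingAll": shared,
        # Share of the left item's positions that also occur in the right item.
        "overlappingLeftRight": shared / len(left) if left else 0.0,
        # Share of the right item's positions that also occur in the left item.
        "overlappingRightLeft": shared / len(right) if right else 0.0,
        "sizeLeft": len(left),
        "sizeRight": len(right),
    }

# Synthetic position sets sized like the example: 328 and 765 positions, 175 shared.
left = set(range(328))
right = set(range(153, 153 + 765))

result = overlap_metrics(left, right)
print(result)
```

Note that the two ratios are not symmetric: a small item can be largely contained in a large one while the reverse ratio stays low.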

Weighted Scoring

This is a Cortical.io defined weighting for similarity measures. The higher the weighting, the more similar the terms. This measure can be used in one-to-many comparisons where one side of the comparison remains constant - for instance, an email filter, where the filter remains constant but is compared to many different emails.
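A one-to-many comparison of this kind might be ranked as in the sketch below. It assumes the Metric objects have already been retrieved and are available as dictionaries with a weightedScoring key, and that they line up one-to-one with the compared items; both are assumptions for illustration. The function name and the scores are made up and are not real API output.

```python
def rank_by_weighted_scoring(labels, metrics):
    """Pair each label with its metric and sort from most to least similar."""
    if len(labels) != len(metrics):
        raise ValueError("labels and metrics must have the same length")
    paired = zip(labels, metrics)
    # A higher weightedScoring means more similar, so sort in descending order.
    return sorted(paired, key=lambda item: item[1]["weightedScoring"], reverse=True)

# Made-up values for illustration only: one constant filter compared to three emails.
email_ids = ["email-1", "email-2", "email-3"]
metrics = [
    {"weightedScoring": 12.4},
    {"weightedScoring": 56.2},
    {"weightedScoring": 3.1},
]

ranked = rank_by_weighted_scoring(email_ids, metrics)
best_id = ranked[0][0]
print(best_id)
```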

API Clients

The FullClient object available in the Java, Python, and JavaScript client libraries has the following methods for calling the compare endpoints:

  • compare
  • compareBulk