Welcome to Retina Library, a Java/Scala library for Semantic Text Processing with some functionality specifically targeted at execution on an Apache Spark cluster!

Between versions 2.4.0 and 2.4.1 the marketing name of this product changed from Retina Spark to Retina Library. From version 2.4.1 onwards this documentation reflects the new product name, but the technical artefacts, most importantly the Retina Library distribution jar file and Retina Library license jar file, still reflect the old product name Retina Spark.

This document helps you get started with Retina Library. It shows you how to write programs using Retina Library, how to execute them on a single Java Virtual Machine (JVM) as well as in an Apache Spark cluster, and gives you an overview of the core functionality and APIs implemented in Retina Library.

1. Introducing Retina Library

1.1. A library for the Java Virtual Machine

First and foremost, Retina Library is a library - it is not a stand-alone software product that you install and use through a GUI. Rather, if you write programs for the JVM that have a need for Semantic Text Processing functionality, then you can add Retina Library as a dependency to the class path of those programs and call into the public API of Retina Library to perform Semantic Text Processing operations. These calls are local (intra-process) Java method calls. However, some of the algorithms and classes provided by Retina Library assume that the program in question is executed on an Apache Spark cluster, and will therefore fail if you have launched your program outside Apache Spark.

In the following, the term Retina Library Program is used to denote a program that uses Retina Library as a library.

1.2. Scala Examples

Retina Library has been implemented in a mixture of Java and Scala, where some basic functionality is provided both as a Java API as well as a Scala-friendly wrapper around that API. Other, more advanced features, in particular those depending on Apache Spark, have only been implemented in Scala. It is likely that the entire Retina Library API, including the parts implemented in Java and those implemented in Scala, can also be called from other JVM-based languages, in particular if they have good interoperability with Java, such as Groovy. This has not been tested by Cortical.io, however, and this document consequently shows only Scala code calling the Retina Library API.

1.3. Public API vs internal implementation

As a library, Retina Library has parts that are intended to be called by users of Retina Library and therefore form its public API, and other parts that are considered an internal detail of how Retina Library is implemented. These latter parts of Retina Library can technically be called by Retina Library users but this is strongly discouraged and not supported by Cortical.io. The public API of Retina Library, on the other hand, exists specifically to insulate Retina Library users from the faster-changing parts of the library. It is this public API that is documented in this guide, in particular in section The public API of Retina Library.

Only develop against the public API of Retina Library as documented in this guide.

1.4. Prerequisites for this document

This document assumes that you are familiar with

2. Supported configurations and versions

Retina Library 2.5.0 supports the following configurations for developing and executing Retina Library Programs:

  • At development (build) time:

    • Maven 3.3.6 or later, using:

    • JDK 1.7.0_80 or later, including 1.8

    • Scala 2.10.x compiler and library

  • At runtime:

    • JRE 1.7.0_80 or later, including 1.8

    • Scala 2.10.x library

    • optional: Apache Spark 1.5.2 or later, but not 2.x

      • Apache Spark distributions by Databricks, Amazon (EMR), Cloudera and Hortonworks

The code in this document has been written for and verified on Scala 2.10.6, Apache Spark 1.6.2, JDK 1.7 and JRE 1.8.

3. Installation

This section helps you understand the installation of Retina Library and create a new Scala project/program using Retina Library as a library dependency. We will then execute this program both stand-alone and on an Apache Spark cluster.

3.1. The Retina Library distribution jar file

Retina Library is distributed as a single, partially obfuscated jar file containing the Java bytecode of Retina Library. This document assumes version 2.5.0 of Retina Library and therefore the Retina Library distribution jar file is called retina-spark-2.5.0-obfuscated.jar.

Although the Retina Library distribution jar file is an ordinary jar file, it is not available in any public Maven repository. Rather, it is delivered by other means, such as email, from Cortical.io to Retina Library licensees.

See section Supported configurations and versions for supported Scala and Java versions.

3.2. The Retina Library license jar file

Retina Library is commercial software and must be licensed. The license terms determine several aspects of the execution of Retina Library, such as

  • the expiration date of the license,

  • the maximum size of the Apache Spark cluster on which a Retina Library Program may execute,

  • whether the Retina Library Program must, may or must not execute on AWS (the Amazon cloud).

These characteristics of the license granted by Cortical.io to the licensee are encoded in the Retina Library license jar file and are enforced at runtime by Retina Library.

The Retina Library license jar file always has the name retina-spark-license.jar. This file must always reside in the same directory as the Retina Library distribution jar file.

The Retina Library license jar file and Retina Library distribution jar file must always be placed into the same file system directory.

The Retina Library license jar file, for obvious reasons, is not available in any public Maven repository but distributed by other means, such as email, to Retina Library licensees.
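
For example, after delivery both jar files might be placed side by side in a single directory like this (the path ~/local/opt/retina-spark is purely illustrative):

$ ls ~/local/opt/retina-spark
retina-spark-2.5.0-obfuscated.jar
retina-spark-license.jar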

3.3. Create a new Retina Library Program using Scala, Maven and the Scala IDE for Eclipse

In this section we start a new Maven Scala project for a Retina Library Program. The program is a simple "Hello World"-type Scala application performing basic Semantic Text Processing. This application will be presented in two variants: one that does not require Apache Spark, and one that uses Apache Spark features.

All of the code shown here is available in the projects retina-spark-template-app-no-spark and retina-spark-template-app, respectively. The source code for these projects is part of any Retina Library distribution.

3.3.1. A simple Scala Retina Library Program not requiring Apache Spark

Create an empty base directory for this project. This will be termed the project root in the following. All activities described in this section must be performed below the project root.

3.3.1.1. Scala code for a simple Apache Spark-independent Retina Library Program

A very simple Scala program using Retina Library without any Apache Spark features is shown in [HelloRetinaWithoutSpark.scala]:

HelloRetinaWithoutSpark.scala: A Scala Retina Library Program performing very basic Semantic Text Processing without the use of any Apache Spark features.
package example
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getFingerprint}
object HelloRetinaWithoutSpark {
  val rdir = "./retinas"
  val rname = "english_subset"
  def main(args: Array[String]): Unit = {
    implicit val engine = getCorticalEngine(new FileRetinaLoader(rdir), rname)

    val size = engine.getRetinaSize
    val fp = getFingerprint("Hello Retina World!")
    println(s"The Semantic Fingerprint has ${fp.length} of $size possible positions set.")
  }
}

Section Perform Semantic Text Processing with Retina Library explains in more detail what happens in the Retina Library Program [HelloRetinaWithoutSpark.scala], but the main points are:

  • This is a Scala application, i.e. it is a Scala object with a main method of the required signature.

  • A Retina with the name english_subset is loaded from the file system directory ./retinas. This directory is relative to the current directory from which the Retina Library Program is launched. During development this is assumed to be the project root. See section Load a Retina for more details.

  • That Retina is used to create/retrieve a CorticalEngine which is assigned to an implicit variable so that it is implicitly available in the remainder of the program.

  • The size of the Retina, i.e. the maximum number of positions in its Semantic Fingerprints, is retrieved from the CorticalEngine (see section Prerequisites for this document for background material on Semantic Fingerprints).

  • The Semantic Fingerprint of a trivial piece of text is calculated.

  • The number of positions set in that Semantic Fingerprint versus the maximum number of positions is printed to the console.

Create a directory under your project root to hold your Scala source code, following the usual conventions: src/main/scala

Paste the code for [HelloRetinaWithoutSpark.scala] into a file of that name underneath that Scala source code directory. In Scala, that file may, but need not, reside in a directory that mirrors the package of the Scala object, i.e. example.
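
For example, from within the project root (these shell commands are just one way to create the file on a Unix-like system):

$ mkdir -p src/main/scala
$ vi src/main/scala/HelloRetinaWithoutSpark.scala    # paste the code shown above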

3.3.1.2. The Retina

Every Retina Library Program requires access to a Retina, which is loaded from some form of persistent storage at runtime (see sections Prerequisites for this document and Load a Retina for more about the concept of a Retina.) In the case of [HelloRetinaWithoutSpark.scala], a Retina named english_subset is loaded from the file system directory ./retinas. At development time this directory path is relative to the project root. The content of this directory must be similar to this:

$ find retinas
retinas
retinas/english_subset
retinas/english_subset/retina.line
retinas/english_subset/retina.properties

In other words, english_subset must be a directory directly below ./retinas and must contain at least the two files retina.line and retina.properties.

Your distribution of Retina Library must have included one or more Retinas. Copy them into the ./retinas directory as shown above. If the english_subset Retina is not part of your Retina Library distribution then choose a different Retina for [HelloRetinaWithoutSpark.scala] by changing the Retina name in the Scala code accordingly.
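
For example, assuming the Retinas shipped with your Retina Library distribution reside in /path/to/distribution/retinas (an illustrative path):

$ mkdir -p retinas
$ cp -r /path/to/distribution/retinas/english_subset retinas/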

3.3.1.3. Maven build file for Apache Spark-independent Retina Library Programs

The most popular tools for building Scala programs are SBT and Maven. We will use Maven, because it is currently (still) better known and more widely supported than SBT.

Unfortunately, the Maven pom.xml build file is verbose.

pom.xml: Maven build file for a Scala Retina Library Program that does not depend on Apache Spark features and can execute outside an Apache Spark runtime.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>retina-spark-template-app-no-spark</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <properties>
    <retina.spark.version>2.5.0</retina.spark.version>
    <!-- path to the Retina Spark distribution jar file -->
    <retina.spark.distrib.jar>${project.basedir}/lib/retina-spark-${retina.spark.version}-obfuscated.jar</retina.spark.distrib.jar>
    <!-- path to the Retina Spark license jar file retina-spark-license.jar; typically in the same directory as the Retina Spark distribution jar file -->
    <retina.spark.license.jar>${project.basedir}/lib/retina-spark-license.jar</retina.spark.license.jar>
    <java.version>1.7</java.version>
    <scala.version>2.10.6</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <slf4j.version>1.7.10</slf4j.version>
    <sleepycat.version>3.3.75</sleepycat.version>
    <junit.version>4.12</junit.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <repositories>
    <repository>
      <id>oracleReleases</id>
      <name>Oracle Released Java Packages</name>
      <url>http://download.oracle.com/maven</url>
      <layout>default</layout>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>io.cortical</groupId>
      <artifactId>retina-spark</artifactId>
      <version>unused-because-loaded-with-system-scope</version>
      <scope>system</scope>
      <systemPath>${retina.spark.distrib.jar}</systemPath>
    </dependency>
    <dependency>
      <groupId>io.cortical</groupId>
      <artifactId>retina-spark-license</artifactId>
      <version>unused-because-loaded-with-system-scope</version>
      <!-- should be test scope but systemPath requires system scope -->
      <scope>system</scope>
      <systemPath>${retina.spark.license.jar}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-reflect</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>${slf4j.version}</version>
    </dependency>
    <dependency>
      <!-- note license restrictions -->
      <groupId>com.sleepycat</groupId>
      <artifactId>je</artifactId>
      <version>${sleepycat.version}</version>
    </dependency>
    <dependency>
      <groupId>org.reflections</groupId>
      <artifactId>reflections</artifactId>
      <version>0.9.10</version>
      <exclusions>
        <exclusion>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.2.2</version>
          <executions>
            <execution>
              <id>scala-compile-first</id>
              <phase>process-resources</phase>
              <goals>
                <goal>add-source</goal>
                <goal>compile</goal>
              </goals>
            </execution>
            <execution>
              <id>scala-test-compile</id>
              <phase>process-test-resources</phase>
              <goals>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
          <configuration>
            <scalaVersion>${scala.version}</scalaVersion>
            <javacArgs>
              <javacArg>-source</javacArg>
              <javacArg>${java.version}</javacArg>
              <javacArg>-target</javacArg>
              <javacArg>${java.version}</javacArg>
              <javacArg>-Xlint:all,-serial,-path</javacArg>
            </javacArgs>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.5.1</version>
          <configuration>
            <source>${java.version}</source>
            <target>${java.version}</target>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.19.1</version>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>2.4.3</version>
          <configuration>
            <artifactSet>
              <excludes>
                <exclude>io.cortical:*</exclude>
              </excludes>
            </artifactSet>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/**/pom.*</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <versionRange>[3.2.2,)</versionRange>
                    <goals>
                      <goal>compile</goal>
                      <goal>testCompile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <ignore/>
                  </action>
                </pluginExecution>
              </pluginExecutions>
            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-source-plugin</artifactId>
          <version>3.0.1</version>
          <executions>
            <execution>
              <id>attach-sources</id>
              <goals>
                <goal>jar</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

A detailed discussion of the Maven pom.xml is outside the scope of this document. Its most important aspects are:

  • It should be located in the project root and have the file name pom.xml.

  • The property retina.spark.version must be set to the version of Retina Library to be used in the Retina Library Program. This is the version in the Retina Library distribution jar file name.

  • The properties retina.spark.distrib.jar and retina.spark.license.jar must be set to the paths to the Retina Library distribution jar file and Retina Library license jar file, respectively. Dependencies are then defined to these two jar files (as system-scoped dependencies, such that the jar files are loaded from the file system rather than from a Maven repository: see sections The Retina Library distribution jar file and The Retina Library license jar file).

  • The Java and Scala versions are set to 1.7 and 2.10.6, respectively (see section Supported configurations and versions).

  • Further dependencies on the Scala library (and scala-reflect), a logging framework (slf4j) and the Reflections library are defined. The slf4j dependency is required by Retina Library even if your own code does not make use of logging. The Scala library is required by Retina Library as well as by the Retina Library Program itself. A JUnit dependency (for which the junit.version property is already prepared) is only needed if JUnit tests are included in the Retina Library Program project (which they are not, so far).

  • An explicit dependency on Oracle Berkeley DB Java Edition and the Oracle Maven repository for Oracle Berkeley DB Java Edition is defined. This is only needed if DiskSerializingSemanticSearchWrapper from Retina Library is used, and requires separate licensing of Oracle Berkeley DB Java Edition from Oracle Corporation.

  • The remainder of the pom.xml configures the compilation and packaging process.

  • Packaging produces an assembly jar (also known as an über jar, or shaded jar) with the help of the maven-shade-plugin. The assembly jar contains the Java bytecode of the Retina Library Program and all dependencies, excluding the Retina Library distribution jar file and Retina Library license jar file.

The need to define an explicit dependency on commons-codec and slf4j is considered outdated and will likely be removed in a future release of Retina Library.

3.3.1.4. Maven build from the command-line

Now that the newly created Retina Library Program project contains a Scala source file and a Maven pom.xml, it can be built from the command-line from within the project root:

$ mvn clean install
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building retina-spark-template-app-no-spark 1.0.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.958 s
[INFO] Finished at: 2016-07-20T15:26:42+02:00
[INFO] Final Memory: 19M/323M
[INFO] ------------------------------------------------------------------------

A reasonably up-to-date Maven installation and JDK are highly recommended for building the Retina Library Program (see section Supported configurations and versions), e.g.:

$ mvn -version
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T17:41:47+01:00)
Maven home: /usr/local/Cellar/maven/3.3.9/libexec
Java version: 1.8.0_66, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: "mac"

The Maven build generates an assembly jar file in the target directory underneath the project root, e.g.

$ ls target/*.jar
target/original-retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar target/retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar

Here, the jar file original-*.jar is the original, non-assembly jar file, without dependencies, and can therefore be ignored. The jar file retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar, by contrast, is the assembly jar containing everything needed to execute the Retina Library Program, except the Retina Library distribution jar file, Retina Library license jar file and any Retinas.

3.3.1.5. Execute the Retina Library Program from the command-line

After a successful Maven build, and given that a Retina to load at runtime has been provided, the Retina Library Program can be executed as follows from the command-line from within the project root:

$ java -cp \
~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar:target/retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar \
example.HelloRetinaWithoutSpark

In other words, this command-line executes a standard Java application with the fully-qualified class name example.HelloRetinaWithoutSpark, using a Java classpath consisting of the assembly jar of this Retina Library Program and the Retina Library distribution jar file. The Retina Library distribution jar file and the Retina Library license jar file, in this case, are both located in the directory ~/local/opt/retina-spark.

A reasonably up-to-date JRE is highly recommended for executing the Retina Library Program (see section Supported configurations and versions), e.g.:

$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

After some initial output describing the Retina Library licensee and license conditions, the Retina Library Program should print the result:

The Semantic Fingerprint has 638 of 16384 possible positions set.

Congratulations, you have just created your first Retina Library Program from scratch and built and executed it from the command-line. The program has loaded a Retina, calculated a Semantic Fingerprint, and compared the number of positions in that Semantic Fingerprint to the maximum number of positions possible with that Retina.

3.3.1.6. Import into the Scala IDE for Eclipse

Now that the simple Retina Library Program compiles and executes from the command-line, it is convenient to import it into an IDE for further development. In this document the Scala IDE for Eclipse is used to demonstrate IDE usage, but there are many perfectly viable alternatives, such as IntelliJ IDEA.

Proceed as follows to import the Retina Library Program into Scala IDE for Eclipse:

  1. Download and install the latest release of Scala IDE for Eclipse from http://scala-ide.org.

  2. Open Scala IDE for Eclipse and select a Workspace.

  3. Select "File" > "Import…​" > "Maven" > "Existing Maven Projects" > "Next >"

  4. Browse to the project root of the Retina Library Program created previously and press "Open": Scala IDE for Eclipse should detect the Maven project and list it under "Projects:".

  5. Press "Finish": The project should now appear in the "Package Explorer".

  6. Right-click on the newly imported project and select "Configure" > "Add Scala Nature": The "Scala Library container" should appear under the project in the "Package Explorer".

  7. Correct the Scala library version by right-clicking on the "Scala Library container", selecting "Properties" and choosing, ideally, the exact same Scala version that was configured in the Maven pom.xml, e.g. "Fixed Scala Library container : 2.10.6".

  8. Add the Scala source directory src/main/scala to the project’s source folders by navigating to it in the "Package Explorer", right-clicking on it, and selecting "Build Path" > "Use as Source Folder".

Scala IDE for Eclipse has now been configured to correctly deal with the Retina Library Program as a Scala Maven project.

All future changes to the Scala source code or the Maven pom.xml of the Retina Library Program can from now on be done in Scala IDE for Eclipse. Furthermore, unit tests and the simple Scala application [HelloRetinaWithoutSpark.scala] can now be executed from within Scala IDE for Eclipse rather than from the command-line.
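
For illustration, a minimal JUnit test of the same core logic might look like [HelloRetinaTest.scala] below. This is only a sketch: it assumes that a JUnit dependency has been added to the pom.xml (for which the junit.version property is already prepared) and that the ./retinas directory is available relative to the test launch directory.

HelloRetinaTest.scala: A hypothetical minimal JUnit test for the Retina Library Program (sketch only).
package example
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getFingerprint}
import org.junit.Assert.assertTrue
import org.junit.Test
class HelloRetinaTest {
  @Test def fingerprintHasAtLeastOnePositionSet(): Unit = {
    // Load the Retina and obtain the CorticalEngine, exactly as in the main program.
    implicit val engine = getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset")
    // The Semantic Fingerprint of a non-empty text should have at least one position set.
    assertTrue(getFingerprint("Hello Retina World!").length > 0)
  }
}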

3.3.1.7. Execute the Retina Library Program from within the Scala IDE for Eclipse

After the successful import of the project into the Scala IDE for Eclipse, the Retina Library Program can be executed as follows:

  1. Navigate to the [HelloRetinaWithoutSpark.scala] file in the "Package Explorer".

  2. Right-click, select "Run As" > "Scala Application"

The Retina Library Program now executes within Scala IDE for Eclipse and all output previously seen on the command-line now appears in the "Console" of the Scala IDE for Eclipse.

3.3.2. A simple Scala Retina Library Program using Apache Spark features

In this section we create a second variant of the basic Semantic Text Processing functionality implemented in section A simple Scala Retina Library Program not requiring Apache Spark by adding very simple usage of Apache Spark features to it. The resulting Retina Library Program therefore depends on Apache Spark and will only run in an Apache Spark cluster.

Create a new, empty project root for this project. All activities described in this section must be performed below it.

3.3.2.1. Scala code for a simple Apache Spark-dependent Retina Library Program

Building on the simple Apache Spark-independent Retina Library Program discussed before, the task of fingerprinting a larger number of texts is distributed over an Apache Spark cluster as shown in [HelloRetinaSpark.scala]:

HelloRetinaSpark.scala: A Scala Retina Library Program performing very basic Semantic Text Processing using Apache Spark features.
package example
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getFingerprint}
import io.cortical.scala.spark.util.{sparkContext, withSparkContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object HelloRetinaSpark {
  import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
  val rdir = "./retinas"
  val rname = "english_subset"
  def main(args: Array[String]): Unit = {
    withSparkContext(sparkContext(appName = "HelloRetinaSpark")) {
      work
    }
  }
  private def work(sc: SparkContext, sqlContext: SQLContext): Unit = {
    implicit val engine = sc.broadcast(getCorticalEngine(new FileRetinaLoader(rdir), rname))

    val size = engine.value.getRetinaSize

    val ns = sc.parallelize(1 to 1000)
    val texts = ns.map(n => s"Hello Retina World with num $n !")
    val fps = texts.map(text => getFingerprint(text))
    val lens = fps.map(_.length)
    val dLens = lens.distinct.collect.toSeq

    println(s"All Semantic Fingerprints have ${dLens mkString ","} of $size possible positions set.")
  }
}

Section Perform Semantic Text Processing with Retina Library explains in more detail the Semantic Text Processing features used in the Retina Library Program [HelloRetinaSpark.scala]. Briefly, the most important commonalities and differences to [HelloRetinaWithoutSpark.scala] are:

  • As before, this is a Scala application that loads the english_subset Retina from the ./retinas directory.

  • The Retina Library utility-functions sparkContext and withSparkContext are used to create an Apache Spark SparkContext in all execution situations (see section Scala and Spark utilities), perform work in the scope of that SparkContext, and then close it. The real work of this Retina Library Program is done in the function called work.

  • As before, the Retina, once it has been loaded, is used to create/retrieve a CorticalEngine. However, in an Apache Spark environment, it is crucial that the CorticalEngine is distributed to all Spark cluster nodes as a Spark Broadcast variable. It is this Spark Broadcast variable that is then assigned to an implicit variable.

  • The size of the Retina is retrieved from the CorticalEngine in the same way as before, taking into account that variable engine is now a Spark Broadcast variable containing a CorticalEngine.

  • The import of io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine brings into scope an implicit conversion from a Spark Broadcast variable containing a CorticalEngine to a CorticalEngine, which is used transparently in the call to getFingerprint.

  • Using straightforward Apache Spark features, the Semantic Fingerprints of 1000 trivial pieces of text are calculated in parallel on the Apache Spark cluster.

  • The number of positions set in each of these 1000 Semantic Fingerprints is determined in parallel and the distinct (unique) counts are collected into the Spark driver. Since all 1000 pieces of text are very similar, the lengths of all Semantic Fingerprints are expected to be the same, and hence only one distinct value is expected to be returned to the Spark driver.

  • The distinct numbers of positions set in all Semantic Fingerprints versus the maximum number of positions are printed to the console.

Paste the code for [HelloRetinaSpark.scala] into a Scala source file of that name.

Note that we will also need a Retina available at runtime as discussed previously in section The Retina.

3.3.2.2. Maven build for an Apache Spark-enabled Scala Retina Library Program

A Maven pom.xml build file for an Apache Spark-enabled Scala Retina Library Program is shown in the following:

pom.xml: Maven build file for a Scala Retina Library Program that depends on Apache Spark at build and execution time.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>retina-spark-template-app</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <properties>
    <retina.spark.version>2.5.0</retina.spark.version>
    <!-- path to the Retina Spark distribution jar file -->
    <retina.spark.distrib.jar>${project.basedir}/lib/retina-spark-${retina.spark.version}-obfuscated.jar</retina.spark.distrib.jar>
    <!-- path to the Retina Spark license jar file retina-spark-license.jar; typically in the same directory as the Retina Spark distribution jar file -->
    <retina.spark.license.jar>${project.basedir}/lib/retina-spark-license.jar</retina.spark.license.jar>
    <java.version>1.7</java.version>
    <scala.version>2.10.6</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <spark.version>1.6.2</spark.version>
    <slf4j.version>1.7.10</slf4j.version>
    <sleepycat.version>3.3.75</sleepycat.version>
    <junit.version>4.12</junit.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <repositories>
    <repository>
      <id>oracleReleases</id>
      <name>Oracle Released Java Packages</name>
      <url>http://download.oracle.com/maven</url>
      <layout>default</layout>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>io.cortical</groupId>
      <artifactId>retina-spark</artifactId>
      <version>unused-because-loaded-with-system-scope</version>
      <scope>system</scope>
      <systemPath>${retina.spark.distrib.jar}</systemPath>
    </dependency>
    <dependency>
      <groupId>io.cortical</groupId>
      <artifactId>retina-spark-license</artifactId>
      <version>unused-because-loaded-with-system-scope</version>
      <!-- should be test scope but systemPath requires system scope -->
      <scope>system</scope>
      <systemPath>${retina.spark.license.jar}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>${slf4j.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <!-- note license restrictions -->
      <groupId>com.sleepycat</groupId>
      <artifactId>je</artifactId>
      <version>${sleepycat.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.reflections</groupId>
      <artifactId>reflections</artifactId>
      <version>0.9.10</version>
      <exclusions>
        <exclusion>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
      </plugin>
    </plugins>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.2.2</version>
          <executions>
            <execution>
              <id>scala-compile-first</id>
              <phase>process-resources</phase>
              <goals>
                <goal>add-source</goal>
                <goal>compile</goal>
              </goals>
            </execution>
            <execution>
              <id>scala-test-compile</id>
              <phase>process-test-resources</phase>
              <goals>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
          <configuration>
            <scalaVersion>${scala.version}</scalaVersion>
            <javacArgs>
              <javacArg>-source</javacArg>
              <javacArg>${java.version}</javacArg>
              <javacArg>-target</javacArg>
              <javacArg>${java.version}</javacArg>
              <javacArg>-Xlint:all,-serial,-path</javacArg>
            </javacArgs>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.5.1</version>
          <configuration>
            <source>${java.version}</source>
            <target>${java.version}</target>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.19.1</version>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>2.4.3</version>
          <configuration>
            <artifactSet>
              <excludes>
                <exclude>io.cortical:*</exclude>
              </excludes>
            </artifactSet>
            <relocations>
              <relocation>
                <pattern>com.fasterxml.jackson.databind</pattern>
                <shadedPattern>io.cortical.ext.fasterxml.jackson.databind</shadedPattern>
              </relocation>
              <relocation>
                <pattern>com.fasterxml.jackson.annotation</pattern>
                <shadedPattern>io.cortical.ext.fasterxml.jackson.annotation</shadedPattern>
              </relocation>
              <relocation>
                <pattern>com.fasterxml.jackson.core</pattern>
                <shadedPattern>io.cortical.ext.fasterxml.jackson.core</shadedPattern>
              </relocation>
              <relocation>
                <pattern>com.google.common</pattern>
                <shadedPattern>io.cortical.ext.google.common</shadedPattern>
              </relocation>
            </relocations>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/**/pom.*</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
        <plugin>
          <groupId>org.eclipse.m2e</groupId>
          <artifactId>lifecycle-mapping</artifactId>
          <version>1.0.0</version>
          <configuration>
            <lifecycleMappingMetadata>
              <pluginExecutions>
                <pluginExecution>
                  <pluginExecutionFilter>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <versionRange>[3.2.2,)</versionRange>
                    <goals>
                      <goal>compile</goal>
                      <goal>testCompile</goal>
                    </goals>
                  </pluginExecutionFilter>
                  <action>
                    <ignore/>
                  </action>
                </pluginExecution>
              </pluginExecutions>
            </lifecycleMappingMetadata>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-source-plugin</artifactId>
          <version>3.0.1</version>
          <executions>
            <execution>
              <id>attach-sources</id>
              <goals>
                <goal>jar</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

A detailed discussion of this Maven [pom.xml] is outside the scope of this document. Its most important similarities and differences compared to the pom.xml for an Apache Spark-independent Retina Library Program are:

  • The property spark.version must be set to the version of Apache Spark to be used in the Retina Library Program (here 1.6.2; see section Supported configurations and versions).

  • All dependencies that are available on the classpath of the Apache Spark runtime must be marked with provided scope.

  • Dependencies on Apache Spark core, SQL and MLlib modules are defined (with provided scope).

  • An explicit dependency on Oracle Berkeley DB Java Edition and the Oracle Maven repository for Oracle Berkeley DB Java Edition is defined. This is only needed if DiskSerializingSemanticSearchWrapper from Retina Library is used, and requires separate licensing of Oracle Berkeley DB Java Edition from Oracle Corporation.

  • The remainder of the pom.xml configures the compilation and packaging process. All dependencies that are known to clash with those provided on the Apache Spark classpath are excluded from the assembly jar or relocated to alternative package names (shaded) by the maven-shade-plugin.

The Maven build using this [pom.xml] now works as expected from the command-line within the project root, compiling [HelloRetinaSpark.scala] and producing an assembly jar:

$ mvn clean install
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building retina-spark-template-app 1.0.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.751 s
[INFO] Finished at: 2016-07-21T17:21:14+02:00
[INFO] Final Memory: 29M/638M
[INFO] ------------------------------------------------------------------------

This Maven build produces the assembly jar file retina-spark-template-app-1.0.0-SNAPSHOT.jar, which contains everything needed to execute the Retina Library Program, except what is provided by the Apache Spark runtime, the Retina Library distribution jar file, the Retina Library license jar file, and any Retinas.

3.3.2.3. Use the Scala IDE for Eclipse with an Apache Spark-enabled Scala Retina Library Program

The import of a Scala Retina Library Program that depends on Apache Spark into the Scala IDE for Eclipse works in the same way as described in section Import into the Scala IDE for Eclipse for an Apache Spark-independent Retina Library Program.

The same is true for the execution of an Apache Spark-dependent Retina Library Program: section Execute the Retina Library Program from within the Scala IDE for Eclipse is applicable without change.

The execution of a program like [HelloRetinaSpark.scala] from within Scala IDE for Eclipse works because the Retina Library utility function sparkContext, when used under these circumstances, creates an Apache Spark SparkContext in Spark local mode. This is described in more detail in section Scala and Spark utilities.

3.3.2.4. Execute the Retina Library Program from the command-line in a single JVM using Apache Spark local mode

Using Spark local mode, a Retina Library Program can be executed in a single JVM with the full Apache Spark runtime environment. In this execution mode, the Java classpath is as it would be in an Apache Spark cluster, but there is only one Apache Spark node and all communication is JVM-local.

Spark local mode can be used to execute a Retina Library Program from within the Scala IDE for Eclipse (see section Use the Scala IDE for Eclipse with an Apache Spark-enabled Scala Retina Library Program) or from the command-line as shown in the following:

$ spark-submit --master local[*] \
  --jars ~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar \
  --class example.HelloRetinaSpark  \
  target/retina-spark-template-app-1.0.0-SNAPSHOT.jar

This command executes, as an Apache Spark job, a Java application with the fully-qualified class name example.HelloRetinaSpark, using a Java classpath consisting of the assembly jar of this Retina Library Program, the Retina Library distribution jar file and the Apache Spark runtime. The Retina Library distribution jar file and the Retina Library license jar file, in this case, are both located in the directory ~/local/opt/retina-spark. Apache Spark will use all available cores on the current machine to execute the Retina Library Program.
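
Instead of local[*], which uses all available cores, a fixed degree of parallelism can be requested with local[N], e.g. to run on exactly two cores:

$ spark-submit --master local[2] \
  --jars ~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar \
  --class example.HelloRetinaSpark \
  target/retina-spark-template-app-1.0.0-SNAPSHOT.jar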

3.3.2.5. Execute the Retina Library Program from the command-line on a distributed Apache Spark cluster

Spark local mode is useful during development and for workloads that can be handled by a single machine. More realistically, though, Retina Library Programs will be executed on a distributed Apache Spark cluster, i.e. a cluster of several Spark cluster nodes.

A discussion of the different variants of launching Apache Spark clusters is beyond the scope of this document. Retina Library can be used with all cluster modes available in Apache Spark. The following interaction shows launching the Retina Library Program in an Apache Spark cluster in so-called standalone mode.

First start the Apache Spark cluster with one master and as many slaves as desired. From the Apache Spark base directory on what will become the Apache Spark master node, execute

$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to ...

And from the Apache Spark base directory on as many (typically other) nodes as desired, which will become worker nodes, execute

$ sbin/start-slave.sh spark://127.0.0.1:7077
starting org.apache.spark.deploy.worker.Worker, logging to ...

choosing a URL that points at the correct master node (127.0.0.1 is only suitable when master and worker run on the same machine).

Then, on any machine with an Apache Spark installation and access to the master node, use spark-submit to launch the Retina Library Program, again pointing it at the master node:

$ spark-submit --master spark://127.0.0.1:7077 \
  --jars ~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar \
  --class example.HelloRetinaSpark  \
  target/retina-spark-template-app-1.0.0-SNAPSHOT.jar

This command executes, as an Apache Spark job, the same Java application as in section Execute the Retina Library Program from the command-line in a single JVM using Apache Spark local mode, but this time the Apache Spark job is distributed over the JVMs and physical machines that comprise the Spark cluster. The Retina Library distribution jar file and the Retina Library license jar file must both be located in the directory ~/local/opt/retina-spark.

The Retina is loaded from the file system on the Spark driver node - in this case the machine on which spark-submit was executed - and is then distributed from there to all Spark cluster nodes. Output is also performed on the Spark driver.

Congratulations, you have now implemented and executed a fully distributed Maven Scala Retina Library Program! The program has loaded a Retina, calculated a large number of Semantic Fingerprints, and compared the number of positions in all those Semantic Fingerprints to the maximum number of positions possible with that Retina.

4. The public API of Retina Library

The public API of Retina Library consists of all types and functions that are intended to be called by users of Retina Library when implementing a Retina Library Program. It consists fundamentally of the packages, types and functions enumerated in section Enumeration of the Retina Library public API.

Most example code in this section makes use of base classes that aim to reduce boilerplate and allow us to focus on the Retina Library feature in question. For completeness, these base classes are shown in Base classes used in example code. When studying example code it suffices to know that the following symbols are defined in these base classes and are available in example code:

  • For a Retina Library Program executing in an Apache Spark cluster,

    • sc and sqlContext denote an instance of SparkContext and SQLContext, respectively,

    • engine denotes an instance of a Spark Broadcast variable containing a CorticalEngine for a default Retina (i.e., Broadcast[CorticalEngine]).

  • For a Retina Library Program executing outside an Apache Spark cluster,

    • engine denotes an instance of CorticalEngine.

In general, example code in this section that works outside of Apache Spark will be shown without the use of any Apache Spark features, i.e. in an Apache Spark-independent form.
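
As a purely illustrative sketch - the actual definitions are shown in section Base classes used in example code - the Apache Spark variant of such a base class might have the following shape:

// Hypothetical sketch only; the real base classes are documented in
// section Base classes used in example code.
import io.cortical.engine.api.CorticalEngine
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SQLContext
trait SparkExampleBase {
  def sc: SparkContext                           // the SparkContext used by example code
  def sqlContext: SQLContext                     // the SQLContext used by example code
  implicit def engine: Broadcast[CorticalEngine] // Broadcast CorticalEngine for a default Retina
}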

4.1. Enumeration of the Retina Library public API

The Retina Library public API comprises the following Java/Scala packages, types and functions:

  • com.neovisionaries.i18n.LanguageCode

  • io.cortical.document.api.DocumentFingerprintDb

  • io.cortical.document.impl.IndexedDocumentFingerprintDb

  • io.cortical.engine.api.CorticalEngine

  • io.cortical.engine.api.CorticalEngineFactory

  • io.cortical.fingerprint.compare.api.FingerprintComparisons

  • io.cortical.model.core.CoreTerm

  • io.cortical.model.languages.Languages

  • io.cortical.nlp.pos.CorePosTypes

  • io.cortical.retina.source.FileRetinaLoader

  • io.cortical.retina.source.ResourceRetinaLoader

  • io.cortical.retina.source.RetinaLoader

  • io.cortical.retina.source.RetinaProperties

  • io.cortical.retina.source.S3RetinaLoader

  • io.cortical.scala.api

  • io.cortical.scala.api.CorticalApi

  • io.cortical.scala.api.DocumentFingerprintDb

  • io.cortical.scala.api.Fingerprint

  • io.cortical.scala.api.Fingerprinted

  • io.cortical.scala.api.FullSemanticSearcher

  • io.cortical.scala.api.PartitionedDocumentDb

  • io.cortical.scala.api.PartitionedFileCachingDocumentDb

  • io.cortical.scala.api.PreservingDocumentDb

  • io.cortical.scala.api.ParentDocumentDb

  • io.cortical.scala.api.Scored

  • io.cortical.scala.api.SemanticClassifier

  • io.cortical.scala.api.SemanticSearcher

  • io.cortical.scala.api.SemanticTextClassifier

  • io.cortical.scala.api.StoringSemanticSearcher

  • io.cortical.scala.api.StringLabelSemanticClassifier

  • io.cortical.scala.api.Textual

  • io.cortical.scala.api.UpdateableDocumentFingerprintDb

  • io.cortical.scala.api.UpdateableSemanticSearcher

  • io.cortical.scala.api.document

  • io.cortical.scala.api.document.Doc

  • io.cortical.scala.api.document.DocID

  • io.cortical.scala.api.document.DocIDSemanticSearcher

  • io.cortical.scala.api.document.DocPreserving

  • io.cortical.scala.api.document.DocSemanticSearcher

  • io.cortical.scala.api.document.FingerprintedDoc

  • io.cortical.scala.api.document.FingerprintedTextDoc

  • io.cortical.scala.api.document.PreservingFingerprintedTextDoc

  • io.cortical.scala.api.document.PreservingScoredFingerprintedTextDoc

  • io.cortical.scala.api.document.PreservingFingerprintedParentTextDoc

  • io.cortical.scala.api.document.ScoredFingerprintedTextDoc

  • io.cortical.scala.api.document.TextDoc

  • io.cortical.scala.api.document.persistence

  • io.cortical.scala.api.document.persistence.DiskSerializingSemanticSearchWrapper

  • io.cortical.scala.api.metadata

  • io.cortical.scala.api.metadata.Metadata

  • io.cortical.language.detection.api.LanguageDetection

  • io.cortical.language.detection.impl.LanguageDetectionImpl

  • io.cortical.scala.api.orderingForScored

  • io.cortical.scala.spark.util

  • io.cortical.scala.spark.util.numOfWorkerNodesInSparkCluster

  • io.cortical.scala.spark.util.sparkContext

  • io.cortical.scala.spark.util.valueOfBroadcastSemanticSearcher

  • io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine

  • io.cortical.scala.spark.util.withSparkContext

4.2. Perform Semantic Text Processing with Retina Library

This section explains the part of the Retina Library public API that relates to Semantic Text Processing. Knowledge of the fundamentals of Semantic Text Processing is assumed (see section Prerequisites for this document).

4.2.1. CorticalApi and CorticalEngine: Core algorithms for Semantic Text Processing

CorticalEngine and CorticalApi are two types - the former primarily for Java code, the latter for Scala code - that give access to the core Semantic Text Processing features in Retina Library. This section shows how to use the algorithms provided by CorticalEngine and CorticalApi, and explains those algorithms to the extent necessary to make sense of the code shown. For a deeper, more scientific explanation of these algorithms please consult the references listed in section Prerequisites for this document.

Both CorticalEngine and CorticalApi give access to the same algorithms using slightly different syntax. The remainder of this section will mostly show example code using CorticalApi because it is the simpler choice for Retina Library Programs written in Scala.
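
To illustrate the difference in flavor, the following sketch computes the Retina size and a Semantic Fingerprint using only calls already introduced in this guide; a CorticalEngine created as in section Create a CorticalEngine is assumed to be in scope:

// Assumed setup, as shown earlier:
// implicit val engine = getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset")

// CorticalEngine (Java interface): explicit method call on the engine instance.
val size = engine.getRetinaSize
// CorticalApi (Scala): getFingerprint picks up the implicit CorticalEngine from scope.
val fp = getFingerprint("Hello Retina World!")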

4.2.1.1. Load a Retina

A Retina is a fairly large - on the order of tens or hundreds of megabytes - data structure that captures a Semantic Space, i.e. the meaning of the terms used in a corpus (body) of documents. Almost all operations in Retina Library require a Retina. A Retina is trained by Cortical.io from a document corpus and delivered to users of Retina Library as a set of files. During the execution of a Retina Library Program one or more Retinas must be loaded into the Retina Library Program by reading these files.

A Retina is typically specific to one language. It is common to have several Retinas for the same language capturing different Semantic Spaces expressed in that language. For instance, a "general English" Retina and an "automotive English" Retina both contain English terms, but the former contains more terms that have nothing to do with cars, vehicles, etc., whereas the latter contains more terms in that domain, with better semantic resolution of those terms. It is also common to have several Retinas for the same Semantic Space expressed in different languages - so-called (cross-language) aligned Retinas. For instance, three Retinas - one Spanish, one German and one English - may all capture the same Semantic Space in their respective languages, in such a way that the representations of meaning (the Semantic Fingerprints) produced by these Retinas are transferable between Retinas and hence between languages. This is the basis for cross-language functionality in Retina Library.

Retina Library provides an abstraction for loading a Retina from persistent storage: the RetinaLoader. Retina Library ships with three implementations of the RetinaLoader: one that reads from the file system, one that reads from an Amazon S3 bucket, and one that reads from the Java classpath. The latter is intended for unit tests, as it is only practical when the Retina is sufficiently small, yet is very convenient as it eliminates any dependency on an external storage location (file system directory or S3 bucket).

Typically, a RetinaLoader instance is immediately used to create a CorticalEngine, as described in section Create a CorticalEngine. However, RetinaLoader also supports useful operations in its own right, which are shown in [LoadRetinas.scala]:

LoadRetinas.scala: Creating RetinaLoaders and using them to explore Retinas.
package example.feature
import io.cortical.retina.source.{FileRetinaLoader, ResourceRetinaLoader, RetinaLoader, S3RetinaLoader}

import scala.collection.JavaConverters._
object LoadRetinas extends S3Constants {
  def main(args: Array[String]): Unit = {
    val frl: RetinaLoader = new FileRetinaLoader("./retinas")
    val srl: RetinaLoader = new S3RetinaLoader(AwsAccessKey, AwsSecretKey, S3Endpoint, RetinasS3BucketName)
    val rrl: RetinaLoader = new ResourceRetinaLoader("/small-retinas")

    val fretinas = frl.getAvailableRetinaNames.asScala
    val sretinas = srl.getAvailableRetinaNames.asScala
    val rretina = rrl.getRetinaProperties("spanish_subset")
    assert(rretina.getLanguage == "es")
    println(
      s"""
         |Available Retinas
         |  in directory: ${fretinas mkString ","}
         |  in S3 bucket: ${sretinas mkString ","}
         |  on classpath: at least one in ${rretina.getLanguage}
      """.stripMargin)
  }
}

In the [LoadRetinas.scala] example, three RetinaLoaders are created: one that reads from the file system directory ./retinas (relative to the current directory), one that reads from the AWS S3 bucket identified by the given values, and a third one that reads from the directory small-retinas at the root of the Java classpath (hence /small-retinas). The first two RetinaLoaders support enquiring about all available Retinas at the location passed to the RetinaLoader, whereas the classpath-based one does not. Every RetinaLoader can load properties describing a given Retina, such as the language of that Retina.

All RetinaLoaders assume a layout like the following underneath the root directory used by the RetinaLoader to load Retinas:

.
./arabic
./arabic/retina.line
./arabic/retina.properties
./business_intelligence
./business_intelligence/retina.line
./business_intelligence/retina.properties
./chinese
./chinese/retina.line
./chinese/retina.properties
./danish
./danish/retina.line
./danish/retina.properties
./en_associative
./en_associative/retina.line
./en_associative/retina.properties
./english_retina
./english_retina/retina.line
./english_retina/retina.properties
./english_subset
./english_subset/retina.line
./english_subset/retina.properties
./eu_market_english
./eu_market_english/retina.line
./eu_market_english/retina.properties
...
./spanish
./spanish/retina.line
./spanish/retina.properties
./spanish_subset
./spanish_subset/retina.line
./spanish_subset/retina.properties

When loading Retinas from a file system directory or the Java classpath, it is the root directory of this tree that is passed to the respective RetinaLoader. When loading from an S3 bucket, the top-level directory in the bucket must itself be the root of this tree.

4.2.1.2. Create a CorticalEngine

CorticalEngine is the most fundamental entry point to the core Semantic Text Processing features of Retina Library. It is a Java interface - an object implementing that interface can be created in one of two ways:

  • in pure Java, the CorticalEngineFactory can be used to create/retrieve the CorticalEngine for the Retina with a given name. This is shown in [CreateCorticalEngines1.scala].

  • in Scala, the Scala-friendly CorticalApi can be used to achieve the same effect less verbosely, as shown in [CreateCorticalEngines2.scala].

CreateCorticalEngines1.scala: Loading a Retina and creating the CorticalEngine for that Retina using the Java CorticalEngineFactory.
package example.feature
import io.cortical.engine.api.{CorticalEngine, CorticalEngineFactory}
import io.cortical.retina.source.FileRetinaLoader
object CreateCorticalEngines1 {
  def main(args: Array[String]): Unit = {
    val loader = new FileRetinaLoader("./retinas")
    val factory: CorticalEngineFactory = CorticalEngineFactory.getInstance(loader)
    val ceEN: CorticalEngine = factory.getCorticalEngine("english_subset")
    val ceDE: CorticalEngine = factory.getCorticalEngine("german_subset")
    println(s"The Retinas support fingerprints with ${ceEN.getRetinaSize} and ${ceDE.getRetinaSize} positions.")
  }
}

CreateCorticalEngines2.scala: Loading a Retina and creating the CorticalEngine for that Retina using the Scala CorticalApi.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
object CreateCorticalEngines2 {
  def main(args: Array[String]): Unit = {
    val loader = new FileRetinaLoader("./retinas")
    val ceEN: CorticalEngine = getCorticalEngine(loader, "english_subset")
    val ceDE: CorticalEngine = getCorticalEngine(loader, "german_subset")
    println(s"The Retinas support fingerprints with ${ceEN.getRetinaSize} and ${ceDE.getRetinaSize} positions.")
  }
}

CorticalEngine essentially adds Semantic Text Processing operations on top of a Retina and is therefore the most important type in Retina Library: whenever a Retina is used in Retina Library, it is used from a CorticalEngine that has been created for that Retina. CorticalApi is a Scala adaptation on top of CorticalEngine: Scala code uses both CorticalEngine and CorticalApi, where operations are invoked through CorticalApi and CorticalEngine mainly plays the role of a handle for a Retina. The CorticalApi has no references to a CorticalEngine or Retina - it is stateless. See section Pass the CorticalEngine to CorticalApi for a discussion of this.

The code shown in examples [CreateCorticalEngines1.scala] and [CreateCorticalEngines2.scala] works fine, but when the Retina Library Program executes in an Apache Spark cluster there is one more important aspect to consider: since a Retina is large, and a CorticalEngine directly references exactly one Retina, the distribution of a CorticalEngine over the Spark cluster nodes must be optimised through the use of a Spark Broadcast variable. The idiom for this, which is strongly recommended for all Retina Library Programs executing in Apache Spark clusters, is shown in [CreateCorticalEngines3.scala].

CreateCorticalEngines3.scala: Loading a Retina and creating the CorticalEngine for that Retina using the Scala CorticalApi in an Apache Spark cluster.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
import org.apache.spark.broadcast.Broadcast
object CreateCorticalEngines3 extends SparkApp {
  override protected def work(): Unit = {
    val loader = new FileRetinaLoader("./retinas")
    val ceEN: Broadcast[CorticalEngine] = sc.broadcast(getCorticalEngine(loader, "english_subset"))
    val ceDE: Broadcast[CorticalEngine] = sc.broadcast(getCorticalEngine(loader, "german_subset"))
    println(s"The Retinas support fingerprints with ${ceEN.value.getRetinaSize} and ${ceDE.value.getRetinaSize} positions.")
  }
}

The salient feature of [CreateCorticalEngines3.scala] is the fact that no reference to any CorticalEngine is kept - every CorticalEngine is immediately broadcast over the Apache Spark cluster, and the only reference that is retained is that to a Spark Broadcast variable containing the CorticalEngine. In that way, accidental (inefficient) serialization of the CorticalEngine to the Spark cluster nodes is prevented. Note the use of .value to retrieve the CorticalEngine from its Spark Broadcast variable.

It should also be noted that the RetinaLoader instances are only usable on the Spark cluster node on which they were created - which should always be the Spark driver.

4.2.1.3. Pass the CorticalEngine to CorticalApi

CorticalApi is a Scala object whose functions largely mirror the methods of CorticalEngine, with one addition: a curried implicit parameter for the CorticalEngine to use. For instance, the signature of getTerm in CorticalEngine is

CorticalEngine getTerm (Java)
CoreTerm getTerm(String term);

whereas the signature and implementation of that same method in CorticalApi is

CorticalApi getTerm (Scala)
def getTerm(term: String)(implicit engine: CorticalEngine): CoreTerm = engine.getTerm(term)

Other functions in CorticalApi perform more work to bridge Java and Scala, but the general approach of passing the CorticalEngine instance to the CorticalApi function as an implicit parameter stays the same.
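
To make the correspondence concrete, the following sketch (a hypothetical example, not part of the shipped example set) resolves the same term once through CorticalEngine directly and once through CorticalApi with the engine passed explicitly:

GetTermBothWays.scala: Hypothetical sketch retrieving the same term through CorticalEngine and through CorticalApi.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getTerm}
object GetTermBothWays {
  def main(args: Array[String]): Unit = {
    val engine: CorticalEngine = getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset")
    // Java-style: invoke the method directly on the CorticalEngine.
    val t1: CoreTerm = engine.getTerm("car")
    // Scala-style: invoke the CorticalApi function, passing the engine explicitly
    // (had engine been declared as an implicit val, it could be omitted here).
    val t2: CoreTerm = getTerm("car")(engine)
    assert(t1.getTerm == t2.getTerm)
    println(s"Both calls resolved the term '${t1.getTerm}'.")
  }
}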

This means that if a Retina Library Program uses just one Retina and hence just one CorticalEngine - an important special case of Retina Library Programs - then the idiomatic use of Retina Library is to define that CorticalEngine as an implicit val, which will then be passed transparently, through the magic of Scala `implicit`s, to every invocation of a CorticalApi function. This is shown in example [ImplicitCorticalEngine1.scala]:

ImplicitCorticalEngine1.scala: Passing a CorticalEngine and its Retina explicitly and implicitly to CorticalApi functions.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getRetinaSize}
object ImplicitCorticalEngine1 {
  def main(args: Array[String]): Unit = {
    implicit val engine: CorticalEngine = getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset")
    val size1 = getRetinaSize(engine) // explicit (unnecessary)
    val size2 = getRetinaSize // implicit
    assert(size1 == size2)
    println(s"The Retina supports fingerprints with $size1 positions.")
  }
}

In example [ImplicitCorticalEngine1.scala], the same CorticalEngine instance is passed to two invocations of the CorticalApi function getRetinaSize, first explicitly and then implicitly. The latter is preferred for Retina Library Programs that work with only one CorticalEngine.

In the case of a Retina Library Program executing in an Apache Spark cluster, the CorticalEngine will always be wrapped in a Spark Broadcast variable, as discussed previously. This leads us to the following extended use of `implicit`s, shown in [ImplicitCorticalEngine2.scala]:

ImplicitCorticalEngine2.scala: Passing a Spark Broadcast variable containing a CorticalEngine explicitly and implicitly to CorticalApi functions.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getRetinaSize}
import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
import org.apache.spark.broadcast.Broadcast
object ImplicitCorticalEngine2 extends SparkApp {
  override protected def work(): Unit = {
    implicit val engine: Broadcast[CorticalEngine] = sc.broadcast(getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset"))
    val size1 = getRetinaSize(engine.value) // explicit (unnecessary)
    val size2 = getRetinaSize // implicit
    assert(size1 == size2)
    println(s"The Retina supports fingerprints with $size1 positions.")
  }
}

In example [ImplicitCorticalEngine2.scala] a Spark Broadcast variable containing a CorticalEngine instance is assigned to a Scala implicit val. Thus, when a CorticalApi function is called and the CorticalEngine is to be passed explicitly to that function, the CorticalEngine must first be retrieved from the Spark Broadcast variable using .value. The implicit passing of the CorticalEngine parameter to the CorticalApi function, however, remains as convenient and transparent as before, thanks to the implicit conversion function valueOfBroadcastCorticalEngine, which transparently unwraps an `implicit`ly available Spark Broadcast variable containing a CorticalEngine.

Example code shown in this section uses implicit argument-passing for the CorticalEngine, and for Spark Broadcast variables containing a CorticalEngine, whenever possible, to minimize clutter.

4.2.1.4. Simple text operations

CorticalEngine and CorticalApi provide basic text operations that are useful when working with text, although they do not constitute Semantic Text Processing by themselves.

The fact that these operations are provided through CorticalEngine and CorticalApi has been identified as needing improvement, because these operations are not inherently tied to a Retina. These features will likely be provided by other means in a future release of Retina Library.
4.2.1.4.1. Tokenize text

Tokenizing text, in the first instance, means splitting text into words. However, because the information contained in a Retina is available during tokenization, the tokenization algorithm in Retina Library performs several additional functions, as shown in [Tokenize.scala]:

Tokenize.scala: Tokenizing text into CoreTerm objects, including information from the Retina, if available.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.nlp.pos.CorePosTypes.NOUN
import io.cortical.scala.api.CorticalApi.tokenize
object Tokenize extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val ts: Seq[CoreTerm] = tokenize("I LOVE New York: It has 7 flXWys!")
    assert(ts.length == 6)
    assert(ts(2).getTerm == "new york" && ts(2).getDf > 0 && ts(2).getPosTypes.contains(NOUN))
    assert(ts(5).getTerm == "flxwys" && ts(5).getDf.isNaN && null == ts(5).getPosTypes)
    s"The text was split into ${ts.length} tokens: ${ts map (_.getTerm) mkString ","}"
  }
}

Example [Tokenize.scala] shows that text tokenization in Retina Library

  • returns all tokens as CoreTerm objects,

  • only returns words and not, for instance, punctuation characters or numbers and digits,

  • converts the text of all tokens to lower-case,

  • detects compound terms (if they are in the Retina), i.e. returns each compound term as a single CoreTerm object with the (lower-cased) text of the compound term,

  • includes nonsensical tokens, as well as terms simply not present in the Retina (e.g. terms from a foreign language),

  • includes additional information about each token if that token term is found in the Retina, e.g. possible POS types or the DF value of that term in the training corpus that gave rise to the Retina. If the term is not in the Retina then that information is missing.

4.2.1.4.2. Split text into sentences

A piece of text, in the form of a String, can be split into sentences as shown in [SplitIntoSentences.scala]. The algorithm currently used by Retina Library is simple and intended for western scripts. In particular, sentence boundaries are currently identified by full-stops, exclamation marks and question marks, although common abbreviations (like "Dr." in the example) are correctly disregarded as sentence boundaries.

SplitIntoSentences.scala: Splitting a piece of text into sentences.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.splitIntoSentences
object SplitIntoSentences extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val sents: Seq[String] = splitIntoSentences("This is text. It has 3 sentences. Dr. Freud agrees.")
    assert(sents.length == 3)
    s"The text was split into ${sents.length} sentences: ${sents mkString "\n"}"
  }
}

Sentence splitting as currently implemented by Retina Library is intended for simple use-cases where the convenience of being able to do sentence splitting without any external library dependencies trumps the sophistication of the algorithm.

Cortical.io will enhance the sentence splitting algorithm to be more versatile and sophisticated as and when the need arises. It is, however, not intended as a replacement for dedicated NLP libraries, which can and should be used in conjunction with Retina Library whenever state-of-the-art sentence splitting is required.
4.2.1.4.3. Slice text

Often, a larger piece of text needs to be split, conceptually, into paragraphs, but the text contains no clues (such as blank lines) as to the beginning and end of individual paragraphs. For instance, all formatting may have been lost, or the text may never have contained any formatting in the first place. In any case, splitting the text into sentences is not what is required, as consecutive sentences may cover the same topic and hence be considered part of the same logical paragraph.

Retina Library provides an operation called slicing that first splits text into sentences (as described in section Split text into sentences) and subsequently merges consecutive sentences into slices such that the meaning of sentences within the same slice changes little whereas the meaning between slices changes more. In other words, this algorithm aims to detect what would normally be considered well-formed paragraphs. However, the algorithm does not require clues in the form of formatting to detect boundaries between slices, as it works on the basis of Semantic Text Processing.

Slicing is shown in example [Slice.scala]:

Slice.scala: Slicing text into consecutive stretches of sentences that preserve the same meaning.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.slice
object Slice extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val text =
      """According to Dr. Hawking, after the initial expansion, the
        |Universe cooled sufficiently to allow the formation first
        |of subatomic particles and later of simple atoms. Giant clouds
        |of these primordial elements later coalesced through gravity
        |to form stars. Assuming that the prevailing model is correct,
        |the age of the Universe is measured to be 13.799±0.021
        |billion years. After the initial expansion, the universe cooled
        |sufficiently to allow the formation of subatomic particles,
        |and later simple atoms.
        |The Kingdom of England is usually considered to begin with
        |Alfred the Great, King of Wessex. While Alfred was not the
        |first king to lay claim to rule all of the English, his rule
        |represents the first unbroken line of Kings to rule the whole
        |of England, the House of Wessex. The last English monarch
        |was Queen Anne, who became Queen of Great Britain when England
        |merged with Scotland to form a union in 1707.""".stripMargin
    val slices: Seq[String] = slice(text)
    assert(slices.length == 2)
    val startOfFirstSlice: String = "According to Dr. Hawking"
    assert(slices(0) startsWith startOfFirstSlice, s"first slice doesn't start with '$startOfFirstSlice' but '${slices(0)}'")
    val startOfSecondSlice: String = "The Kingdom of England"
    assert(slices(1) startsWith startOfSecondSlice, s"second slice doesn't start with '$startOfSecondSlice' but '${slices(1)}'")
    s"The text was cut into ${slices.length} slices: ${slices mkString "\n\n"}"
  }
}

In [Slice.scala] the text clearly separates into two topics, which is detected by the slicing algorithm. Sentences are assigned to the first slice until the topic changes. All subsequent sentences are assigned to the second slice.

The definition of the slicing algorithm is intentionally left vague so that future improvements in the detection of slices and slice boundaries can be incorporated into Retina Library. The intent of slicing will, however, always remain the same: the assignment of consecutive sentences to semantically homogeneous groups.
4.2.1.5. Fundamental Semantic Text Processing algorithms
4.2.1.5.1. Get the size of the Retina

The size of a Retina is the number of positions in the Semantic Fingerprints contained in that Retina. Retina Library provides access to the Retina size as shown in [GetRetinaSize.scala]:

GetRetinaSize.scala: Retrieving the size of the Retina through CorticalApi.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getRetinaSize
object GetRetinaSize extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val size: Int = getRetinaSize
    s"The Retina supports fingerprints with $size positions."
  }
}
4.2.1.5.2. Retrieve terms from the Retina

Terms can be retrieved from the Retina as shown in [GetTerm.scala]:

GetTerm.scala: Retrieving a term from the Retina, returning a CoreTerm object with the lower-cased term String and additional information only if the term is in the Retina.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.nlp.pos.CorePosTypes.NOUN
import io.cortical.scala.api.CorticalApi.getTerm
object GetTerm extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val ts: Seq[CoreTerm] = Seq("LOVE", "New York", "flXWys") map getTerm
    assert(ts(0).getTerm == "love" && ts(0).getDf > 0 && ts(0).getPosTypes.contains(NOUN))
    assert(ts(1).getTerm == "new york" && ts(1).getDf > 0 && ts(1).getPosTypes.contains(NOUN))
    assert(ts(2).getTerm == "flxwys" && ts(2).getDf.isNaN && null == ts(2).getPosTypes)
    s"${ts(0).getTerm} and ${ts(1).getTerm} are in the Retina, ${ts(2).getTerm} is not."
  }
}

As was the case with the tokenization operation demonstrated in [Tokenize.scala], a CoreTerm object is always returned, regardless of whether the term is actually in the Retina. However, the CoreTerm includes additional information about the term if that term is found in the Retina, e.g. possible POS types or the DF value of that term in the training corpus that gave rise to the Retina. If the term is not in the Retina then that information is missing. Also, the returned term is always all-lower-case, regardless of the case combination passed to the function.

4.2.1.5.3. Fingerprint text

One of the core algorithms in Retina Library is calculating the Semantic Fingerprint of a single term or a piece of text. Several variants of this algorithm are provided, as can be seen in example [GetFingerprint.scala]:

GetFingerprint.scala: Calculating the Semantic Fingerprint from Strings and lists of CoreTerms, optionally restricted to certain POS types.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.nlp.pos.CorePosTypes.NOUN
import io.cortical.scala.api.CorticalApi.{getFingerprint, getTerm}
import io.cortical.scala.api.Fingerprint
object GetFingerprint extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val fp1: Fingerprint = getFingerprint("car")
    val fp2: Fingerprint = getFingerprint("My car is a bicycle")
    val fp3: Fingerprint = getFingerprint("My car is a bicycle", NOUN)
    val fp4: Fingerprint = getFingerprint(Seq(getTerm("car"), getTerm("bicycle")))
    assert(fp2.length > fp1.length)
    assert(fp3.toList == fp4.toList)
    s"Number of positions in fingerprints: ${fp1.length}, ${fp2.length}, ${fp3.length}, ${fp4.length}"
  }
}

Example [GetFingerprint.scala] shows that:

  • In Retina Library, a Semantic Fingerprint is represented as type Fingerprint, which is just an alias for Array[Int], listing in ascending order the positions in the binary Semantic Fingerprint which are set, while all other positions are unset (a Semantic Fingerprint is a sparse binary data structure).

  • Semantic Fingerprints can be derived from Strings or lists of CoreTerm objects.

  • If a Semantic Fingerprint is calculated from a String, that String may be a single term or a longer piece of text. Furthermore, a POS type may optionally be specified so that only terms of that POS type are considered when calculating the Semantic Fingerprint.

Future releases of Retina Library may provide a richer abstraction of a Semantic Fingerprint than the current type alias Fingerprint, while keeping the fundamental storage format and runtime representation unchanged as Array[Int].

In general, passing a sequence of CoreTerm objects to the fingerprint calculation algorithm is an indication that the user wants the given terms to be used as-is, with as little manipulation as possible. In contrast, passing a String to fingerprint calculation gives that algorithm more freedom in choosing tokens from that String that it considers optimal for the quality of the resulting Semantic Fingerprint.
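
Because a Fingerprint is just an Array[Int] of set positions, it can be inspected with ordinary Scala collection operations. The following hypothetical sketch (reusing the FeatureApp scaffolding of the other examples in this section) computes the raw overlap of two fingerprints directly from this representation:

InspectFingerprint.scala: Hypothetical sketch inspecting the sparse Array[Int] representation of Fingerprints.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.Fingerprint
object InspectFingerprint extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val fp1: Fingerprint = getFingerprint("car")
    val fp2: Fingerprint = getFingerprint("bicycle")
    // A Fingerprint is a sorted Array[Int] of set positions, so the raw overlap
    // of two fingerprints is simply the size of the intersection of those arrays.
    val overlap: Int = (fp1.toSet intersect fp2.toSet).size
    assert(overlap <= math.min(fp1.length, fp2.length))
    s"The fingerprints share $overlap of their ${fp1.length} and ${fp2.length} set positions."
  }
}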

4.2.1.5.4. Compare Semantic Fingerprints

The second core algorithm in Retina Library, after mapping text to Semantic Fingerprints, is the measurement of the similarity or, conversely, distance of two Semantic Fingerprints. Similarity (or distance) is a floating-point number: the higher the similarity (the smaller the distance) of two Semantic Fingerprints, the closer the meaning of the two pieces of text that gave rise to these Semantic Fingerprints.

For two Semantic Fingerprints to be meaningfully compared they need not necessarily be derived from the same Retina: it is sufficient if they were calculated using aligned Retinas (see section Load a Retina).
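
As an illustration, the following hypothetical sketch compares fingerprints across two Retinas. It assumes that english_subset and spanish_subset (both present in the directory layout shown in section Load a Retina) are aligned Retinas, and that getFingerprint and getComparisons take the engine as a curried implicit parameter in the same way as getTerm. If the Retinas are not aligned, the resulting similarity is meaningless.

CompareAcrossRetinas.scala: Hypothetical sketch comparing fingerprints calculated with two aligned Retinas.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getComparisons, getCorticalEngine, getFingerprint}
object CompareAcrossRetinas {
  def main(args: Array[String]): Unit = {
    val loader = new FileRetinaLoader("./retinas")
    // Assumption: english_subset and spanish_subset are aligned Retinas.
    val ceEN: CorticalEngine = getCorticalEngine(loader, "english_subset")
    val ceES: CorticalEngine = getCorticalEngine(loader, "spanish_subset")
    // Each fingerprint is calculated with its own engine, passed explicitly.
    val fpEN = getFingerprint("car")(ceEN)
    val fpES = getFingerprint("coche")(ceES)
    // Fingerprints from aligned Retinas live in the same Semantic Space,
    // so any of the comparison measures can be applied to the pair.
    val sim = getComparisons(ceEN).cosineSimilarity(fpEN, fpES)
    println(s"Cross-language cosine similarity of 'car' and 'coche': $sim")
  }
}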

Retina Library used to provide the function named compare for the calculation of the cosine similarity between two Fingerprint objects. Later versions of Retina Library added other Fingerprint comparison algorithms, both similarity measures as well as distance measures. As a result, the compare function is now deprecated in favour of the API demonstrated in example [Compare.scala].

Example [Compare.scala] shows how to compare Semantic Fingerprints using some of the comparison algorithms provided in Retina Library:

Compare.scala: Calculating distance and similarity of two Fingerprints through some of the comparison algorithms provided in Retina Library.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.fingerprint.compare.api.FingerprintComparisons
import io.cortical.scala.api.CorticalApi.{getComparisons, getFingerprint}
object Compare extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val fp1 = getFingerprint("This is a car.")
    val fp2 = getFingerprint("My car is a bicycle.")
    val fp3 = getFingerprint("My car is a bike.")

    val comp: FingerprintComparisons = getComparisons

    val s12: Double = comp.cosineSimilarity(fp1, fp2)
    val s23: Double = comp.cosineSimilarity(fp2, fp3)
    val d12: Double = comp.euclideanDistance(fp1, fp2)
    val d23: Double = comp.euclideanDistance(fp2, fp3)
    val s13: Double = comp.normalisedOverlapAllSimilarity(fp1, fp3)
    assert(0 < s12 && s12 <= 1)
    assert(0 < s23 && s23 <= 1)
    assert(0 < d12 && d12 <= 1)
    assert(0 < d23 && d23 <= 1)
    assert(0 < s13 && s13 <= 1)
    assert(s12 < s23)
    assert(d12 > d23)
    s"Cosine similarities are $s12 and $s23 while euclidean distances are $d12 and $d23."
  }
}

Example [Compare.scala] shows that

  • most comparison measures, including cosine similarity, Euclidean distance and normalised overlap similarity, are always in the interval [0,1],

  • two pieces of text that are more similar in meaning give rise to two Fingerprints that score as more similar under the comparison measures,

  • if similarity is high then distance is low and vice versa,

  • further distance and similarity measures are available when needed.

4.2.1.5.5. Retrieve similar terms from the Retina

In the context of Semantic Text Processing with Retina Library, similar terms denotes terms contained in a given Retina which have a strong semantic relationship with a given Semantic Fingerprint. That Semantic Fingerprint may have been derived from

  • a single term in the same Retina,

  • a single term in a different - but aligned - Retina,

  • a piece of text using the same or a different, aligned Retina.

If the input Semantic Fingerprint for the retrieval of similar terms derives from a single term, then it is tempting to think of the similar terms as the synonyms of that input term. This is, however, misguided: similar terms are all terms from the Retina with a strong semantic association (expressed as a high semantic similarity) with the input term - including, but not limited to, synonyms of that term.

Retrieving similar terms from the Retina is shown in example [GetSimilarTerms.scala]:

GetSimilarTerms.scala: Retrieving a number of terms similar to a given Fingerprint from the Retina, optionally restricted to certain POS types.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.nlp.pos.CorePosTypes.ADJECTIVE
import io.cortical.scala.api.CorticalApi.{getFingerprint, getSimilarTerms}
object GetSimilarTerms extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val fp = getFingerprint("My car is a bicycle")
    val ts: Seq[CoreTerm] = getSimilarTerms(fp, 10)
    val as: Seq[CoreTerm] = getSimilarTerms(fp, 10, ADJECTIVE)
    assert(ts exists (_.getTerm == "bike"))
    assert(as exists (_.getTerm == "four-wheel"))
    s"All similar terms: ${ts map (_.getTerm) mkString ","}; adjectives: ${as map (_.getTerm) mkString ","}"
  }
}
4.2.1.5.6. Determine the context terms of a Semantic Fingerprint

Retina Library provides an experimental algorithm to determine the context terms of a Semantic Fingerprint and, therefore, any piece of text. The context terms are those terms from the Retina that capture the essential semantic aspects of a given Semantic Fingerprint. This is different from the similar terms for that Semantic Fingerprint, as discussed in section Retrieve similar terms from the Retina: both algorithms start with an input Semantic Fingerprint, and both algorithms return terms contained in a given Retina. But while similar terms are simply the terms whose Semantic Fingerprints have the highest similarity to the input Semantic Fingerprint, context terms are algorithmically selected to best describe the different semantic dimensions of the input Semantic Fingerprint.

The algorithm which determines context terms is under active development and will change without notice in future versions of Retina Library.

Example [GetContext.scala] demonstrates the simplest possible way of determining the context of a given piece of text. More elaborate ways of doing this, such as by specifying the number of desired context terms, or by starting from a Semantic Fingerprint rather than a piece of text, are available in CorticalApi and, in particular, in CorticalEngine.

GetContext.scala: Determining the context (terms) of a given piece of text using CorticalApi. CorticalEngine defines a more general method that takes a Semantic Fingerprint rather than a piece of text as the input (not shown).
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getContext
object GetContext extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val ctxt: Seq[String] = getContext("Many teams play in the FA Cup.")
    assert(ctxt.length >= 2)
    assert(ctxt contains "game")
    assert(ctxt contains "club")
    s"The context of that sentence is defined by the terms ${ctxt mkString ","}."
  }
}
4.2.1.6. Higher-level Semantic Text Processing algorithms

The algorithms discussed in section Fundamental Semantic Text Processing algorithms are the algorithmic core of Semantic Text Processing in Retina Library and form the basis for higher-level functionality. Some of that functionality is implemented in the form of additional algorithms in CorticalEngine and CorticalApi and will be discussed in this section. Other higher-level functionality goes beyond simple algorithms and is the topic of later sections (Semantic Text Classification using SemanticClassifier and Semantic Search using SemanticSearcher).

4.2.1.6.1. Create category filters

A category filter in the terminology of Retina Library is a Semantic Fingerprint that combines and subsumes several input Semantic Fingerprints. The term stems from one typical application of category filters, namely the representation of a single category of texts, such that pieces of text that fall into that category can be filtered out of a larger set by matching against the category filter. This is, however, just one application of category filters. Furthermore, the facilities discussed in section Semantic Text Classification using SemanticClassifier provide a more rigorous and flexible approach to categorizing pieces of text.

CreateCategoryFilter.scala: Creating a category filter from a list of Fingerprints and using it to distinguish between text that belongs and text that doesn’t belong in that category.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.{compare, createCategoryFilter, getFingerprint}
import io.cortical.scala.api.Fingerprint
object CreateCategoryFilter extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val fps = Seq("A text about cars", "Another car text", "Cars are us!", "Cars will be cars.") map (getFingerprint(_))
    val cft: Fingerprint = createCategoryFilter(fps)
    val pos = getFingerprint("Let me boast about my car.")
    val neg = getFingerprint("I'm only interested in bicycles.")
    val simPos = compare(pos, cft)
    val simNeg = compare(neg, cft)
    assert(simPos > simNeg)
    s"Similarities of positive and negative cases are $simPos and $simNeg, respectively."
  }
}

The createCategoryFilter function also supports passing a "noise fingerprint", which represents a Semantic Fingerprint signal that should be disregarded when calculating the category fingerprint. This concept is experimental and should be used with caution. It may also be removed in future versions of Retina Library.
4.2.1.6.2. Extract keywords from text

In Retina Library, keywords are tokens (terms) selected from a piece of text that are semantically similar to the entire piece of text. When extracting keywords from text, the user must decide how many keywords shall be returned; the algorithm then selects that number of keywords from the tokens of the text.

Example [ExtractKeywords.scala] shows the extraction of keywords from a paragraph of text:

ExtractKeywords.scala: Extracting a given number of keywords that are semantically representative of the given piece of text.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.scala.api.CorticalApi.extractKeywords
object ExtractKeywords extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val text =
      """According to Dr. Hawking, after the initial expansion, the
        |Universe cooled sufficiently to allow the formation first
        |of subatomic particles and later of simple atoms. Giant clouds
        |of these primordial elements later coalesced through gravity
        |to form stars. Assuming that the prevailing model is correct,
        |the age of the Universe is measured to be 13.799±0.021
        |billion years.""".stripMargin
    val kws: Seq[CoreTerm] = extractKeywords(text, 5)
    assert(kws exists (_.getTerm == "gravity"))
    assert(kws forall (_.getDf >= 0))
    s"Extracted keywords ${kws map (_.getTerm) mkString ","}"
  }
}

Keywords are always contained in the Retina used by the algorithm, because by definition they must have a Semantic Fingerprint. Hence the CoreTerm objects returned by the keyword extraction algorithm always contain additional information about the term, such as its DF in the Retina training corpus, or its possible POS types.

4.2.2. Semantic Text Classification using SemanticClassifier

Semantic Text Classification is a machine learning feature of Retina Library which is formalised in trait SemanticClassifier: Given a previously unseen piece of text, a SemanticClassifier is able to assign a label to that text, where the label uniquely identifies the class. A SemanticClassifier hence classifies text into one of a number of classes. For this to work, the SemanticClassifier must previously have been trained on a training set of pairs of text and labels, i.e. examples of pieces of text that have been (by definition) correctly assigned to one class each.

SemanticClassifier is generic in the type of label it uses to identify classes. The type alias StringLabelSemanticClassifier uses Strings as class labels.

Version 2.5.0 of Retina Library ships with just one implementation of SemanticClassifier, which is in fact an implementation of StringLabelSemanticClassifier: SemanticTextClassifier.

A general design decision in Retina Library that applies to SemanticClassifier (as well as to SemanticSearcher, discussed in section Semantic Search using SemanticSearcher) is that Scala companion objects to classes that implement a Retina Library feature trait like SemanticClassifier (or SemanticSearcher) have Scala-idiomatic apply factory methods that are declared with a return type of that trait rather than of the implementation class. Concretely, the apply factory method in the companion object to SemanticTextClassifier is declared to return a StringLabelSemanticClassifier instead of a SemanticTextClassifier. Irrespective of the declared type, the runtime type of the object returned by that factory method is of course SemanticTextClassifier.

All implementation classes (but not their companion objects) of the SemanticClassifier trait are to be treated as private. For reasons of backwards compatibility, this is currently not the case for SemanticTextClassifier, but a future version of Retina Library will reduce the visibility of that class to private.

Example SemanticTextClassification.scala shows correct and idiomatic usage of Semantic Text Classification with Retina Library.

SemanticTextClassification.scala: Training a SemanticTextClassifier, a concrete implementation of SemanticClassifier that uses String labels to identify classes, and using it to predict the classes of two previously unseen pieces of text.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.{SemanticTextClassifier, StringLabelSemanticClassifier}
object SemanticTextClassification extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val cars = "cars"
    val bikes = "bikes"
    val trainSet: Seq[(String, String)] = Seq(
      cars -> "A text about cars", cars -> "Another car text",
      cars -> "Cars are us!", cars -> "Cars will be cars.",
      bikes -> "Text about bicycles", bikes -> "Another bike text",
      bikes -> "Bikes are us!", bikes -> "Bicycles will be bikes.")
    val stc: StringLabelSemanticClassifier = SemanticTextClassifier(trainSet)
    val l1: String = stc.classify("Let me boast about my four-wheel drive.")
    val (l2: String, _, _, conf2: Double) = stc.classifyWithDetail("I'm only interested in bicycles.")
    assert(l1 == cars)
    assert(l2 == bikes)
    assert(0 < conf2 && conf2 <= 1)
    s"The unseen texts were classified as $l1 and $l2, the latter with confidence $conf2."
  }
}

As SemanticTextClassification.scala shows, training is done on a set of pre-labelled pieces of text, after which an (immutable) StringLabelSemanticClassifier instance is created using the factory method for SemanticTextClassifier. The SemanticClassifier supplies two main methods for classifying a (typically previously unseen) piece of text:

  • one that just returns the label of the class to which this text is predicted to belong, and

  • one that also returns various metrics about the quality of the prediction to that class. The most important of these metrics is the last one, which is a confidence score from the interval [0,1].

4.2.3. Semantic Search using SemanticSearcher

Semantic Search is an important aspect of Semantic Text Processing. It means searching a set of texts by some query text to find those texts that are most similar in meaning to the query text. Semantic Search thus differs from keyword-based search approaches: the exact words in the pieces of text involved matter only insofar as they convey meaning, and the words themselves, as strings of characters, are not matched between the query and the set of texts searched over.

Typically, the set of texts to be searched is seen as a "database" of text "documents", although both the terms "database" and "document" are used informally here. In particular, "database" does not imply persistence, ACID transactions or SQL-like capabilities, but rather simply a collection (set) of pieces of text available for Semantic Search. Similarly, "document" does not imply structure, formatting or file-formats typically associated with the word "document", but rather simply an identifiable piece of text.

Unsurprisingly, in the world of Retina Library, Semantic Search is based on comparing the Semantic Fingerprint of a query text to the Semantic Fingerprints of the pieces of text in the document database.

Cortical.io acknowledges that this particular usage of the term "document" is confusing. In future versions of Retina Library the term "document" could therefore be replaced with a term that carries less pre-conceived associations - such as "(text) snippet".

There are two important abstractions for Semantic Search in Retina Library:

  • SemanticSearcher captures the notion of a document database that can be semantically searched by some query text,

  • Doc and related traits formalize the various aspects of documents (in the sense introduced above, i.e. snippets of text).

We will discuss both abstractions in the remainder of this section. First, though, we will present the Java foundations to Semantic Search in Retina Library upon which SemanticSearcher builds.

4.2.3.1. The Java DocumentFingerprintDb as the basis for SemanticSearcher

The most basic support for performing Semantic Search in Retina Library is through the Java interface DocumentFingerprintDb and its (currently Java-only) implementation IndexedDocumentFingerprintDb. The interface DocumentFingerprintDb describes a simple mutable database of Semantic Fingerprints and a search operation over those Semantic Fingerprints, where the query is also expressed as a Semantic Fingerprint. The class IndexedDocumentFingerprintDb is a straightforward implementation of that interface using an inverted index data structure that is maintained in its entirety on the JVM heap.

The pieces of text that give rise to the Semantic Fingerprints exchanged with DocumentFingerprintDb occur neither in the interface nor its implementation. In other words, there are no documents in this basic implementation of Semantic Search.

Example [SearchingWithJavaDocumentFingerprintDb.scala] shows the features available in this kind of Semantic Search.

SearchingWithJavaDocumentFingerprintDb.scala: Basic Semantic Search using Java interface DocumentFingerprintDb and its implementation IndexedDocumentFingerprintDb.
package example.feature
import java.util

import io.cortical.document.api.DocumentFingerprintDb
import io.cortical.document.impl.IndexedDocumentFingerprintDb
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.{getFingerprint, getRetinaSize}
object SearchingWithJavaDocumentFingerprintDb extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val db: DocumentFingerprintDb = new IndexedDocumentFingerprintDb(getRetinaSize)
    db.addDocument("id1", getFingerprint("Cars are us!"))
    db.addDocument("id2", getFingerprint("Cars will be cars."))
    db.addDocument("id3", getFingerprint("A text about cars."))
    db.addDocument("id4", getFingerprint("Another car text"))
    assert(db.containsDocument("id3"))
    assert(db.containsDocument("id2"))
    db.removeDocument("id2")
    assert(!db.containsDocument("id2"))
    val ids: util.List[String] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(ids.size == 3)
    assert(ids.get(0) == "id3")
    assert(ids.get(1) == "id4")
    s"Searching the fingerprint DB returned the doc IDs $ids"
  }
}

Please note that the Java interface DocumentFingerprintDb discussed here is distinct from the DocumentFingerprintDb implementation of the Scala trait SemanticSearcher discussed in section DocumentFingerprintDb.

4.2.3.2. Document abstractions in Retina Library

Retina Library tries to decouple algorithms that work on and with documents (again, using "document" to mean "snippet of text") from the actual classes used to represent those documents. The mechanism by which Retina Library realizes this decoupling is a combination of three factors:

  • The 1st factor is a hierarchy of document traits that capture the attributes a document has/must have. For instance, Doc is the root of this hierarchy and has a document identifier and metadata, TextDoc inherits from Doc and adds a String text attribute, whereas FingerprintedDoc inherits from Doc and adds a Fingerprint attribute.

  • The 2nd factor is a collection of Scala "companion" objects to these document traits with factory apply methods that create instances of concrete but private implementations of these traits. These companion objects and concrete implementations comprise a complete but optional set of document classes that can be used in Retina Library Programs that do not have their own, separate, classes to represent documents.

  • The 3rd factor is the definition of algorithms and abstractions such as SemanticSearcher exclusively in terms of these document traits. This allows any implementation of these document traits to be supplied, including, but not limited to those returned by the factory methods in the document trait companion objects.

[Documents.scala] shows examples of some of the important document traits and companion objects in Retina Library.

Documents.scala: Important document abstraction traits and their companion objects in Retina Library.
package example.feature
import com.neovisionaries.i18n.LanguageCode.en
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.{compare, getFingerprint}
import io.cortical.scala.api.document.{FingerprintedDoc, FingerprintedTextDoc, ScoredFingerprintedTextDoc, TextDoc}
import io.cortical.scala.api.metadata.Metadata
import io.cortical.scala.api.metadata.Metadata.MetadataWrapper
object Documents extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val td: TextDoc = TextDoc("id1", Metadata(en), "English text")
    val fptd: FingerprintedTextDoc = FingerprintedTextDoc(td, getFingerprint(td.text))
    assert(fptd.docId == "id1")
    assert(fptd.text == td.text)
    assert(fptd.metadata.lang == en)
    val sfptd: ScoredFingerprintedTextDoc = ScoredFingerprintedTextDoc(fptd, 1.0)
    assert(sfptd.docId == "id1")
    assert(sfptd.score == 1.0)
    val fpd: FingerprintedDoc = new FingerprintedDoc {
      override val docId = "id2"
      override val fp = getFingerprint("English text")
    }
    val sim = compare(fptd.fp, fpd.fp)
    assert(0.999 <= sim && sim <= 1.0)
    s"Created a text doc $td, fingerprinted it to $fptd and scored it to $sfptd, then created a custom fingerprinted doc $fpd"
  }
}

All document traits whose names include the word Preserving are considered experimental and may be changed in a future release of Retina Library. In particular, their functionality may be merged into the standard document traits discussed in this section.
4.2.3.3. SemanticSearcher and its implementations

The Scala trait SemanticSearcher is the main abstraction for performing Semantic Search over a set of texts. The pieces of text to be searched-over are represented as document traits - see section Document abstractions in Retina Library.

SemanticSearcher is generic and covariant in the type of result returned from a Semantic Search. Most implementations of SemanticSearcher return document traits, but DocumentFingerprintDb, for instance, just returns the identifiers of the documents (DocIDs) that matched the query.

Retina Library ships with several implementations of SemanticSearcher, which have different runtime characteristics. Also, some SemanticSearchers are immutable, whereas others are backed by a document database that can be manipulated. These distinctions between the capabilities of SemanticSearchers are captured through various sub-traits of SemanticSearcher, as illustrated by the sketch following this list:

  • The base-trait SemanticSearcher just defines methods for semantically searching the document database.

  • The sub-trait StoringSemanticSearcher of SemanticSearcher adds methods for retrieving the documents, and their IDs, that form part of the document database. It is generic and covariant in the type of document stored in the document database. Not all SemanticSearchers retain their documents in their original form (they might just store the Fingerprints of the documents), but those that do retain them implement this sub-trait to announce that capability to users.

  • The sub-trait UpdateableSemanticSearcher of SemanticSearcher adds methods to add, remove and update documents in the document database used by the SemanticSearcher, as well as to enquire whether a document is in that database. It is generic and contravariant in the type of document stored in the document database, which must be a sub-type of FingerprintedDoc. SemanticSearchers that are mutable implement this trait.

  • Finally, the trait FullSemanticSearcher is extended by SemanticSearcher implementations that are both mutable and give access to their documents in their original form. It extends StoringSemanticSearcher and UpdateableSemanticSearcher and is generic and invariant in the type of document stored in the document database, which must be a sub-type of FingerprintedDoc.
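
One practical consequence of this hierarchy is that utility code can be written once against the base trait and reused with every implementation. The following hypothetical sketch assumes that SemanticSearcher lives in package io.cortical.scala.api (like its sub-traits shown in the examples below) and that it declares search with the signature suggested by those examples; treat both as assumptions:

SearchUtil.scala: Hypothetical helper written solely against the base trait SemanticSearcher.
package example.feature
import io.cortical.scala.api.{Fingerprint, SemanticSearcher}
object SearchUtil {
  // Works with any SemanticSearcher, whatever its result type R.
  // Assumption: SemanticSearcher[R] declares
  //   def search(query: Fingerprint, maxResults: Int): Seq[R]
  // as suggested by the concrete examples in the following sections.
  def topHit[R](db: SemanticSearcher[R], query: Fingerprint): Option[R] =
    db.search(query, 1).headOption
}

Because SemanticSearcher is covariant in its result type, such a helper accepts a DocIDSemanticSearcher just as readily as a FullSemanticSearcher over rich document types.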

The most important concrete implementations of SemanticSearcher are briefly listed in the remainder of this section. As always in Retina Library, the concrete implementations created by the factory methods of their companion objects are private (see section Semantic Text Classification using SemanticClassifier).

All implementation classes (but not their companion objects) of the SemanticSearcher trait and its sub-traits are to be treated as private. For reasons of backwards compatibility, this is currently not the case for all of these classes, but a future version of Retina Library will reduce the visibility of those classes which are not currently private to private.
4.2.3.3.1. DocumentFingerprintDb

The simplest supported SemanticSearcher implementation is DocumentFingerprintDb. It represents an immutable document database that only supports semantic search, doesn’t store documents in their original form, and returns search results as DocIDs. Hence search results contain no score and are ordered by descending fingerprint overlap with the query, which is in general not the same order as descending cosine similarity! The fingerprints of all documents in the database are stored on the heap of the current JVM. Please note that the DocumentFingerprintDb implementation of the Scala trait SemanticSearcher discussed here is distinct from the Java interface DocumentFingerprintDb discussed in section The Java DocumentFingerprintDb as the basis for SemanticSearcher.
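
To see why overlap order can differ from cosine-similarity order, recall that cosine similarity normalises the overlap by the sizes of both fingerprints. The following hypothetical sketch demonstrates the effect with hand-crafted sparse binary fingerprints (for binary vectors, cosine similarity is overlap / sqrt(|a| * |b|)):

OverlapVsCosine.scala: Hypothetical sketch showing that ranking by raw overlap differs from ranking by cosine similarity.
package example.feature
object OverlapVsCosine {
  def main(args: Array[String]): Unit = {
    // Sparse binary fingerprints, represented as sorted arrays of set positions.
    val query = (0 until 50).toArray
    val docA = (0 until 30).toArray ++ (100 until 170).toArray // overlap 30, 100 set positions
    val docB = (0 until 25).toArray ++ (100 until 115).toArray // overlap 25, 40 set positions
    def overlap(a: Array[Int], b: Array[Int]): Int = (a.toSet intersect b.toSet).size
    // For binary vectors, cosine similarity is overlap / sqrt(|a| * |b|).
    def cosine(a: Array[Int], b: Array[Int]): Double =
      overlap(a, b) / math.sqrt(a.length.toDouble * b.length.toDouble)
    assert(overlap(query, docA) > overlap(query, docB)) // docA ranks first by overlap...
    assert(cosine(query, docB) > cosine(query, docA))   // ...but docB ranks first by cosine
  }
}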

SearchingWithDocumentFingerprintDb.scala: Semantic Search using DocumentFingerprintDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.DocumentFingerprintDb
import io.cortical.scala.api.document.{DocID, DocIDSemanticSearcher, FingerprintedDoc}
object SearchingWithDocumentFingerprintDb extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val docs = Seq(
      FingerprintedDoc("id1", getFingerprint("Cars are us!")),
      FingerprintedDoc("id2", getFingerprint("Cars will be cars.")),
      FingerprintedDoc("id3", getFingerprint("A text about cars.")),
      FingerprintedDoc("id4", getFingerprint("Another car text")))
    val db: DocIDSemanticSearcher = DocumentFingerprintDb(docs)
    val ids: Seq[DocID] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(ids.size == 4)
    assert(ids(0) == "id3")
    assert(ids(1) == "id4")
    s"Searching the fingerprint DB returned the doc IDs $ids"
  }
}
4.2.3.3.2. UpdateableDocumentFingerprintDb

UpdateableDocumentFingerprintDb has the same basic features as DocumentFingerprintDb but is mutable and hence implements UpdateableSemanticSearcher. It is therefore also generic in the type of document that can be added, updated or removed (even though it doesn’t retain those documents in their original form).

SearchingWithUpdateableDocumentFingerprintDb.scala: Semantic Search using UpdateableDocumentFingerprintDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.document.{DocID, FingerprintedDoc}
import io.cortical.scala.api.{UpdateableDocumentFingerprintDb, UpdateableSemanticSearcher}
object SearchingWithUpdateableDocumentFingerprintDb extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val docs = Seq(
      FingerprintedDoc("id1", getFingerprint("Cars are us!")),
      FingerprintedDoc("id2", getFingerprint("Cars will be cars.")),
      FingerprintedDoc("id3", getFingerprint("A text about cars.")),
      FingerprintedDoc("id4", getFingerprint("Another car text")))
    val db: UpdateableSemanticSearcher[FingerprintedDoc, DocID] = UpdateableDocumentFingerprintDb(docs)
    assert(db.contains("id3"))
    assert(db.contains("id2"))
    db.remove("id2")
    assert(!db.contains("id2"))
    val ids: Seq[DocID] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(ids.size == 3)
    assert(ids(0) == "id3")
    assert(ids(1) == "id4")
    s"Searching the fingerprint DB returned the doc IDs $ids"
  }
}
4.2.3.3.3. PreservingDocumentDb

PreservingDocumentDb is a mutable database of documents, their text and fingerprints, which preserves the original documents so that search results can refer to them. It implements FullSemanticSearcher. The score contained in the search results is cosine similarity, and search results are ordered by descending cosine similarity. The fingerprints of all documents in the database are stored on the heap of the current JVM.

SearchingWithPreservingDocumentDb.scala: Semantic Search using PreservingDocumentDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.PreservingDocumentDb
import io.cortical.scala.api.document._
object SearchingWithPreservingDocumentDb extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val tdocs: Seq[TextDoc] = Seq(
      TextDoc("id1", "Cars are us!"),
      TextDoc("id2", "Cars will be cars."),
      TextDoc("id3", "A text about cars."),
      TextDoc("id4", "Another car text"))
    val docs: Seq[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
    val db = PreservingDocumentDb(docs)
    assert(db.contains("id3"))
    val doc3: Option[FingerprintedTextDoc] = db.get("id3")
    assert(doc3.get.docId == "id3")
    assert(db.contains("id2"))
    db.remove("id2")
    assert(!db.contains("id2"))
    val results: Seq[ScoredFingerprintedTextDoc] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(results.size == 3)
    assert(results(0).docId == "id3")
    assert(results(1).docId == "id4")
    s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
  }
}
4.2.3.3.4. ParentDocumentDb

ParentDocumentDb is a mutable database of documents, their text and fingerprints, which preserves the original documents passed in at construction - but search results do not refer to them directly. Crucially, documents have to inherit from ParentDoc, so that they contain child documents; it is these child documents that are subject to search, while the result of a search is consolidated to the parent documents.

In other respects this behaves like a PreservingDocumentDb.

The document IDs of child documents must be globally unique - not just unique within their parent document!
SearchingWithParentDocumentDb.scala: Semantic Search using ParentDocumentDb.
package example.feature

import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.{FullSemanticSearcher, ParentDocumentDb}
import io.cortical.scala.api.document._

object SearchingWithParentDocumentDb extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val tdocs: Seq[TextDoc] = Seq(
      TextDoc("id1", "Cars are us!"),
      TextDoc("id2", "Cars will be cars."),
      TextDoc("id3", "A text about cars."),
      TextDoc("id4", "Another car text"))
    val docs: Seq[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
    val db: FullSemanticSearcher[FingerprintedParentTextDoc, PreservingScoredFingerprintedTextDoc[FingerprintedParentTextDoc]] = ParentDocumentDb(docs map (doc => PreservingFingerprintedParentTextDoc(doc, Seq(doc))))
    assert(db.contains("id3"))
    val doc3: Option[FingerprintedTextDoc] = db.get("id3")
    assert(doc3.get.docId == "id3")
    assert(db.contains("id2"))
    db.remove("id2")
    assert(!db.contains("id2"))
    val results: Seq[PreservingScoredFingerprintedTextDoc[FingerprintedParentTextDoc]] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(results(0).docId == "id3")
    assert(results(1).docId == "id4")
    s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
  }
}
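The example above uses the degenerate case in which every document is its own parent and only child. The following is a minimal, hypothetical sketch of a genuine parent/child split; the object name is invented, and the constructor argument order of PreservingFingerprintedParentTextDoc (parent first, children second) is an assumption inferred from the example above:
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.ParentDocumentDb
import io.cortical.scala.api.document._

object SearchingWithRealParentChildSplit extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    def fingerprinted(id: String, text: String) =
      FingerprintedTextDoc(TextDoc(id, text), getFingerprint(text))
    // One parent with two children; child IDs are globally unique.
    val children = Seq(
      fingerprinted("c1", "A text about cars."),
      fingerprinted("c2", "A text about boats."))
    val parent = fingerprinted("p1", "Texts about vehicles")
    val db = ParentDocumentDb(Seq(PreservingFingerprintedParentTextDoc(parent, children)))
    // The query matches child "c1", but the search result is consolidated
    // to the parent document, so the returned doc ID should be "p1".
    val results = db.search(getFingerprint("Find the text about cars."), 10)
    assert(results(0).docId == "p1")
    s"Parent-level search returned the doc IDs ${results map (_.docId) mkString ","}"
  }
}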
4.2.3.3.5. PartitionedDocumentDb

An immutable document DB that is partitioned over an Apache Spark cluster, with one instance per Apache Spark executor. Currently only implements SemanticSearcher but could also extend StoringSemanticSearcher.

The major advantage of this implementation of SemanticSearcher is that it does not store all documents and their fingerprints in a single JVM but rather distributes (partitions) them over the executors in an Apache Spark cluster.

It uses the Apache Spark StorageLevel MEMORY_AND_DISK, i.e. it allows elements of the partitioned data structure to be swapped out to disk through native Apache Spark mechanisms.
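For reference, this is what that StorageLevel means in plain Apache Spark terms; the snippet below only illustrates the native mechanism (Retina Library applies it internally), and the RDD shown is hypothetical:
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object StorageLevelIllustration {
  // Persisting an RDD with MEMORY_AND_DISK keeps partitions in executor memory
  // and spills them to local disk when memory is exhausted, instead of
  // recomputing them.
  def cacheWithSpill(sc: SparkContext): Long = {
    val rdd = sc.parallelize(1 to 1000)
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // materializes the RDD so that partitions are actually cached
  }
}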

SearchingWithPartitionedDocumentDb.scala: Semantic Search over an Apache Spark cluster using PartitionedDocumentDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.PartitionedDocumentDb
import io.cortical.scala.api.document._
import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
object SearchingWithPartitionedDocumentDb extends FeatureApp {
  override protected def feature()(implicit ce: Broadcast[CorticalEngine]) = {
    val tdocs: RDD[TextDoc] = sc.parallelize(Seq(
      TextDoc("id1", "Cars are us!"),
      TextDoc("id2", "Cars will be cars."),
      TextDoc("id3", "A text about cars."),
      TextDoc("id4", "Another car text")))
    val docs: RDD[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
    val db: DocSemanticSearcher = PartitionedDocumentDb(sc, docs)
    val results: Seq[ScoredFingerprintedTextDoc] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(results.size == 4)
    assert(results(0).docId == "id3")
    assert(results(1).docId == "id4")
    s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
  }
}
4.2.3.3.6. PartitionedFileCachingDocumentDb

Similar in spirit to PartitionedDocumentDb but uses Parquet as an on-disk storage format for the partitioned data structure.

SearchingWithPartitionedFileCachingDocumentDb.scala: Semantic Search over an Apache Spark cluster using PartitionedFileCachingDocumentDb.
package example.feature
import java.util.UUID

import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.PartitionedFileCachingDocumentDb
import io.cortical.scala.api.document._
import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
object SearchingWithPartitionedFileCachingDocumentDb extends FeatureApp {
  override protected def feature()(implicit ce: Broadcast[CorticalEngine]) = {
    val tdocs: RDD[TextDoc] = sc.parallelize(Seq(
      TextDoc("id1", "Cars are us!"),
      TextDoc("id2", "Cars will be cars."),
      TextDoc("id3", "A text about cars."),
      TextDoc("id4", "Another car text")))
    val docs: RDD[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
    val db: DocSemanticSearcher = PartitionedFileCachingDocumentDb(sqlContext, docs, 2, 4, s"/tmp/${UUID.randomUUID.toString}/")
    val results: Seq[ScoredFingerprintedTextDoc] = db.search(getFingerprint("Find the text about cars."), 10)
    assert(results.size == 4)
    assert(results(0).docId == "id3")
    assert(results(1).docId == "id4")
    s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
  }
}
4.2.3.3.7. DiskSerializingSemanticSearchWrapper

DiskSerializingSemanticSearchWrapper is a decorator around an UpdateableSemanticSearcher that persists all documents added to or updated in the underlying SemanticSearcher to a directory in the local filesystem. This is shown in example [SearchingWithDiskSerialization.scala]:

SearchingWithDiskSerialization.scala: DiskSerializingSemanticSearchWrapper wraps around an instance of UpdateableSemanticSearcher to add persistence to the local filesystem.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.UpdateableDocumentFingerprintDb
import io.cortical.scala.api.document.persistence.DiskSerializingSemanticSearchWrapper.persisted
import io.cortical.scala.api.document.{DocID, FingerprintedDoc}
object SearchingWithDiskSerialization extends FeatureApp {
  override protected def feature()(implicit ce: CorticalEngine) = {
    val docs = Seq(
      FingerprintedDoc("id1", getFingerprint("Cars are us!")),
      FingerprintedDoc("id2", getFingerprint("Cars will be cars.")),
      FingerprintedDoc("id3", getFingerprint("A text about cars.")),
      FingerprintedDoc("id4", getFingerprint("Another car text")))
    val db1 = persisted(UpdateableDocumentFingerprintDb(), "docDB1")
    for (d <- docs) db1.add(d)
    val ids1: Seq[DocID] = db1.search(getFingerprint("Find the text about cars."), 10)
    assert(ids1.size == 4)
    assert(ids1(0) == "id3")
    assert(ids1(1) == "id4")
    val db2 = persisted(UpdateableDocumentFingerprintDb(), "docDB1") // load from disk
    val ids2: Seq[DocID] = db2.search(getFingerprint("Find the text about cars."), 10)
    assert(ids2.size == ids1.size)
    assert(ids2(0) == ids1(0))
    assert(ids2(1) == ids1(1))
    s"Searching the persisted fingerprint DB returned the doc IDs $ids2 after loading from disk"
  }
}

The implementation of the DiskSerializingSemanticSearchWrapper shipped with Retina Library uses Oracle Berkeley DB Java Edition, a product that requires separate licensing from Oracle Corporation. Authors of Retina Library Programs who wish to use the DiskSerializingSemanticSearchWrapper must ensure that they comply with the license terms of Oracle Berkeley DB Java Edition and must declare an explicit dependency on Oracle Berkeley DB Java Edition in their Maven pom.xml files, as has been shown previously for usage with and without Apache Spark.
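A minimal sketch of such a dependency declaration, assuming the commonly used Maven coordinates com.sleepycat:je of Oracle Berkeley DB Java Edition; the version is illustrative only, and the repository hosting the artifact may have to be declared separately:
<!-- Illustrative only: check the required Berkeley DB JE version against the Retina Library release notes. -->
<dependency>
    <groupId>com.sleepycat</groupId>
    <artifactId>je</artifactId>
    <version>5.0.73</version>
</dependency>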

4.3. General utilities provided in Retina Library

4.3.1. Language detection

Retina Library includes an implementation of automated language detection, i.e. the analysis of text given as a String to determine the language in which that text is written. Language detection works best for text that comprises one or more sentences; for very short text it often gives unreliable results.

Language detection is demonstrated in example [DetectLanguage.scala].

DetectLanguage.scala: Automated language detection in Retina Library.
package example.feature
import io.cortical.language.detection.api.LanguageDetection
import io.cortical.model.languages.Languages
object DetectLanguage {
  def main(args: Array[String]): Unit = {
    val detector: LanguageDetection = LanguageDetection.DEFAULT
    val en: Languages = detector.detectLanguage("This is obviously the Queen's language.")
    val es: Languages = detector.detectLanguage("Este es un pais soleado.")
    assert(Languages.EN.equals(en))
    assert(Languages.ES.equals(es))
    println(s"Detected language of text in $en and $es")
  }
}

As can be seen in example [DetectLanguage.scala], language detection does not rely on the Retina, and is therefore not discussed in section Perform Semantic Text Processing with Retina Library.

4.3.2. Scala and Spark utilities

Retina Library also ships a number of Scala and Apache Spark helper utilities in package io.cortical.scala.spark.util, several of which appear in the examples in this guide: sparkContext and withSparkContext, used in [SparkApp.scala] below to create a SparkContext and SQLContext and to manage their lifecycle, and valueOfBroadcastCorticalEngine, which allows a Broadcast[CorticalEngine] to satisfy an implicit CorticalEngine parameter, as in the Spark-based search examples above.
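A minimal, hypothetical sketch of how these utilities combine; the structure mirrors [SparkApp.scala] and the Spark search examples above, and the exact behavior of valueOfBroadcastCorticalEngine is an assumption based on its usage there:
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getFingerprint}
import io.cortical.scala.spark.util.{sparkContext, withSparkContext, valueOfBroadcastCorticalEngine}

object UtilitiesSketch {
  def main(args: Array[String]): Unit = {
    // withSparkContext creates a SparkContext/SQLContext pair and tears it down afterwards.
    withSparkContext(sparkContext("UtilitiesSketch")) { (sc, sqlContext) =>
      // Broadcast the CorticalEngine once to all executors...
      implicit val ce = sc.broadcast(getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset"))
      // ...and let valueOfBroadcastCorticalEngine satisfy the implicit
      // CorticalEngine parameter of getFingerprint inside the RDD map.
      val fingerprints = sc.parallelize(Seq("A text about cars.")) map (t => getFingerprint(t))
      println(s"Computed ${fingerprints.count()} fingerprint(s)")
    }
  }
}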

Appendix A: Base classes used in example code

Retina Library features demonstrated outside of an Apache Spark cluster typically inherit from base class [FeatureApp.scala].

FeatureApp.scala: Base class for example code demonstrating a given Retina Library feature available outside of Apache Spark.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
import org.slf4j.LoggerFactory
abstract class FeatureApp {
  protected val LOG = LoggerFactory.getLogger(getClass)
  private val rdir = "./retinas"
  private val rname = "english_subset"
  def main(args: Array[String]): Unit = {
    implicit val ce = getCorticalEngine(new FileRetinaLoader(rdir), rname)
    val result = feature()
    LOG info result
    System.exit(0)
  }
  protected def feature()(implicit ce: CorticalEngine): String
}

Retina Library features demonstrated in an Apache Spark runtime environment typically inherit from a Spark-specific base class, also named [FeatureApp.scala], which itself inherits from [SparkApp.scala].

FeatureApp.scala: Base class for example code demonstrating a given Retina Library feature within an Apache Spark cluster.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
import org.apache.spark.broadcast.Broadcast
import org.slf4j.LoggerFactory
abstract class FeatureApp extends SparkApp {
  protected val LOG = LoggerFactory.getLogger(getClass)
  private val rdir = "./retinas"
  private val rname = "english_subset"
  override protected def work(): Unit = {
    implicit val ce = sc.broadcast(getCorticalEngine(new FileRetinaLoader(rdir), rname))
    val result = feature()
    LOG info result
  }
  protected def feature()(implicit ce: Broadcast[CorticalEngine]): String
}

Base class [SparkApp.scala] is a common base class for code that must execute within an Apache Spark cluster.

SparkApp.scala: Base class for example code running within an Apache Spark cluster.
package example.feature
import io.cortical.scala.spark.util.{sparkContext, withSparkContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
abstract class SparkApp {
  private var scVar: SparkContext = _
  private var sqlContextVar: SQLContext = _
  protected def sc = scVar
  protected def sqlContext = sqlContextVar
  def main(args: Array[String]): Unit = {
    val appName = getClass.getSimpleName
    withSparkContext(sparkContext(appName)) { (sc, sqlContext) =>
      scVar = sc
      sqlContextVar = sqlContext

      work()
    }
    System.exit(0)
  }
  protected def work(): Unit
}

Bibliography