Welcome to Retina Library, a Java/Scala library for Semantic Text Processing with some functionality specifically targeted at execution on an Apache Spark cluster!
Between versions 2.4.0 and 2.4.1, the marketing name of this product changed from Retina Spark to Retina Library. From version 2.4.1 onwards this documentation reflects the new product name, but the technical artefacts, most importantly the Retina Library distribution jar file and the Retina Library license jar file, still reflect the old product name, Retina Spark.
This document helps you get started with Retina Library. It shows you how to write programs using Retina Library, execute them on a single Java Virtual Machine (JVM) as well as on an Apache Spark cluster, and gives you an overview of the core functionality and APIs implemented in Retina Library.
1. Introducing Retina Library
1.1. A library for the Java Virtual Machine
First and foremost, Retina Library is a library - it is not a stand-alone software product that you install and use through a GUI. Rather, if you write programs for the JVM that have a need for Semantic Text Processing functionality, then you can add Retina Library as a dependency to the class path of those programs and call into the public API of Retina Library to perform Semantic Text Processing operations. These calls are local (intra-process) Java method calls. However, some of the algorithms and classes provided by Retina Library assume that the program in question is executed on an Apache Spark cluster, and will therefore fail if you have launched your program outside Apache Spark.
In the following, the term Retina Library Program is used to denote a program that uses Retina Library as a library.
1.2. Scala Examples
Retina Library has been implemented in a mixture of Java and Scala, where some basic functionality is provided both as a Java API as well as a Scala-friendly wrapper around that API. Other, more advanced features, in particular those depending on Apache Spark, have only been implemented in Scala. It is likely that the entire Retina Library API, including the parts implemented in Java and those implemented in Scala, can also be called from other JVM-based languages, in particular if they have good interoperability with Java, such as Groovy. This has not been tested by Cortical.io, however, and this document consequently shows only Scala code calling the Retina Library API.
1.3. Public API vs internal implementation
As a library, Retina Library has parts that are intended to be called by users of Retina Library and therefore form its public API, and other parts that are considered an internal detail of how Retina Library is implemented. These latter parts of Retina Library can technically be called by Retina Library users but this is strongly discouraged and not supported by Cortical.io. The public API of Retina Library, on the other hand, exists specifically to insulate Retina Library users from the faster-changing parts of the library. It is this public API that is documented in this guide, in particular in section The public API of Retina Library.
Only develop against the public API of Retina Library as documented in this guide.
1.4. Prerequisites for this document
This document assumes that you are familiar with:
- the Scala programming language - for an introduction see the Scala Documentation Page [scala].
- Apache Spark - for an introduction see the Spark Overview [spark].
- Apache Maven - for an introduction see the Maven Getting Started Guide [mvn].
- Semantic Text Processing with the Retina technology - for an introduction see the Cortical.io Articles [cioarts].
2. Supported configurations and versions
Retina Library 2.5.0 supports the following configurations for developing and executing Retina Library Programs:
- At development (build) time:
  - Maven 3.3.6 or later using
    - JDK 1.7.0_80 or later, including 1.8
    - Scala 2.10.x compiler and library
- At runtime:
  - JRE 1.7.0_80 or later, including 1.8
  - Scala 2.10.x library
  - optional: Apache Spark 1.5.2 or later, but not 2.x
    - Apache Spark distributions by Databricks, Amazon (EMR), Cloudera and Hortonworks
The code in this document has been written for and verified on Scala 2.10.6, Apache Spark 1.6.2, JDK 1.7 and JRE 1.8.
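To quickly check which versions your environment actually provides at runtime, the following minimal sketch uses only standard JVM and Scala properties (the object name VersionCheck is illustrative, not part of Retina Library):
VersionCheck.scala: Printing the Java and Scala versions of the current runtime.
package example

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // Standard system/library properties; compare against the versions listed above.
    println(s"Java:  ${System.getProperty("java.version")}")
    println(s"Scala: ${scala.util.Properties.versionString}")
  }
}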
3. Installation
This section helps you understand the installation of Retina Library and create a new Scala project/program using Retina Library as a library dependency. We will then execute this program both stand-alone and on an Apache Spark cluster.
3.1. The Retina Library distribution jar file
Retina Library is distributed as a single, partially obfuscated jar file containing the Java bytecode of Retina Library. This document assumes version 2.5.0 of Retina Library and therefore the Retina Library distribution jar file is called retina-spark-2.5.0-obfuscated.jar.
Although the Retina Library distribution is a regular jar file, it is not available in any public Maven repository. Rather, it is delivered by other means, such as email, from Cortical.io to Retina Library licensees.
See section Supported configurations and versions for supported Scala and Java versions.
3.2. The Retina Library license jar file
Retina Library is commercial software and must be licensed. The license terms determine several aspects of the execution of Retina Library, such as:
- the expiration date of the license,
- the maximum size of the Apache Spark cluster on which a Retina Library Program may execute,
- whether the Retina Library Program must, may or must not execute on AWS (the Amazon cloud).
These characteristics of the license granted by Cortical.io to the licensee are encoded in the Retina Library license jar file and are enforced at runtime by Retina Library.
The Retina Library license jar file always has the name retina-spark-license.jar. This file must always reside in the same file system directory as the Retina Library distribution jar file.
The Retina Library license jar file, for obvious reasons, is not available in any public Maven repository either, but is distributed by other means, such as email, to Retina Library licensees.
3.3. Create a new Retina Library Program using Scala, Maven and the Scala IDE for Eclipse
In this section we start a new Maven Scala project for a Retina Library Program. The program is a simple "Hello World"-type Scala application performing basic Semantic Text Processing. This application will be presented in two variants:
- The first variant, implemented in section A simple Scala Retina Library Program not requiring Apache Spark, uses only Retina Library features that do not require Apache Spark and hence also runs outside an Apache Spark cluster.
- The second variant, implemented in section A simple Scala Retina Library Program using Apache Spark features, builds on top of the first variant, adds Apache Spark as a dependency, uses Retina Library features that require Apache Spark, and therefore requires an Apache Spark runtime environment.
All of the code shown here is available in the projects retina-spark-template-app-no-spark and retina-spark-template-app, respectively. The source code for these projects is part of any Retina Library distribution.
3.3.1. A simple Scala Retina Library Program not requiring Apache Spark
Create an empty base directory for this project. This will be termed the project root in the following. All activities described in this section must be performed below the project root.
3.3.1.1. Scala code for a simple Apache Spark-independent Retina Library Program
A very simple Scala program using Retina Library without any Apache Spark features is shown in [HelloRetinaWithoutSpark.scala]:
HelloRetinaWithoutSpark.scala: A Scala Retina Library Program performing very basic Semantic Text Processing without the use of any Apache Spark features.
package example

import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getFingerprint}

object HelloRetinaWithoutSpark {
  val rdir = "./retinas"
  val rname = "english_subset"

  def main(args: Array[String]): Unit = {
    implicit val engine = getCorticalEngine(new FileRetinaLoader(rdir), rname)
    val size = engine.getRetinaSize
    val fp = getFingerprint("Hello Retina World!")
    println(s"The Semantic Fingerprint has ${fp.length} of $size possible positions set.")
  }
}
Section Perform Semantic Text Processing with Retina Library explains in more detail what happens in the Retina Library Program [HelloRetinaWithoutSpark.scala], but the main points are:
- This is a Scala application, i.e. it is a Scala object with a main method of the required signature.
- A Retina with the name english_subset is loaded from the file system directory ./retinas. This directory is relative to the current directory from which the Retina Library Program is launched. During development this is assumed to be the project root. See section Load a Retina for more details.
- That Retina is used to create/retrieve a CorticalEngine, which is assigned to an implicit variable so that it is implicitly available in the remainder of the program.
- The size of the Retina, i.e. the maximum number of positions in its Semantic Fingerprints, is retrieved from the CorticalEngine (see section Prerequisites for this document for background material on Semantic Fingerprints).
- The Semantic Fingerprint of a trivial piece of text is calculated.
- The number of positions set in that Semantic Fingerprint versus the maximum number of positions is printed to the console.
Create a directory under your project root to hold your Scala source code, following the usual conventions: src/main/scala
Paste the code for [HelloRetinaWithoutSpark.scala] into a file of that name underneath that Scala source code directory. In Scala, that file may, but need not, reside in a directory that mirrors the package of the Scala object, i.e. example.
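For example, if you do mirror the package, the file ends up at:
src/main/scala/example/HelloRetinaWithoutSpark.scala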
3.3.1.2. The Retina
Every Retina Library Program requires access to a Retina, which is loaded from some form of persistent storage at runtime (see sections Prerequisites for this document and Load a Retina for more about the concept of a Retina). In the case of [HelloRetinaWithoutSpark.scala], a Retina named english_subset is loaded from the file system directory ./retinas. At development time this directory path is relative to the project root. The content of this directory must be similar to this:
$ find retinas
retinas
retinas/english_subset
retinas/english_subset/retina.line
retinas/english_subset/retina.properties
In other words, english_subset must be a directory directly below ./retinas and must contain at least the two files retina.line and retina.properties.
Your distribution of Retina Library must have included one or more Retinas. Copy them into the ./retinas directory as shown above. If the english_subset Retina is not part of your Retina Library distribution, choose a different Retina for [HelloRetinaWithoutSpark.scala] by changing the Retina name in the Scala code accordingly.
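If you are unsure which Retinas are part of your distribution, you can also enumerate them programmatically. The following is a minimal sketch based on the FileRetinaLoader and getAvailableRetinaNames calls documented in section Load a Retina (the object name ListAvailableRetinas is illustrative):
ListAvailableRetinas.scala: Listing the Retinas available below ./retinas.
package example

import io.cortical.retina.source.FileRetinaLoader
import scala.collection.JavaConverters._

object ListAvailableRetinas {
  def main(args: Array[String]): Unit = {
    // Enumerate the Retina directories found below ./retinas.
    val loader = new FileRetinaLoader("./retinas")
    val names = loader.getAvailableRetinaNames.asScala
    println(s"Available Retinas: ${names mkString ", "}")
  }
}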
3.3.1.3. Maven build file for Apache Spark-independent Retina Library Programs
The most popular tools to build Scala programs are SBT and Maven. We will use Maven, because it is currently (still) better known and more widely supported than SBT. Unfortunately, the Maven pom.xml build file is verbose.
pom.xml: Maven build file for a Scala Retina Library Program that does not depend on Apache Spark features and can execute outside an Apache Spark runtime.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>retina-spark-template-app-no-spark</artifactId>
<version>1.0.0-SNAPSHOT</version>
<properties>
<retina.spark.version>2.5.0</retina.spark.version>
<!-- path to the Retina Spark distribution jar file -->
<retina.spark.distrib.jar>${project.basedir}/lib/retina-spark-${retina.spark.version}-obfuscated.jar</retina.spark.distrib.jar>
<!-- path to the Retina Spark license jar file retina-spark-license.jar; typically in the same directory as the Retina Spark distribution jar file -->
<retina.spark.license.jar>${project.basedir}/lib/retina-spark-license.jar</retina.spark.license.jar>
<java.version>1.7</java.version>
<scala.version>2.10.6</scala.version>
<scala.binary.version>2.10</scala.binary.version>
<slf4j.version>1.7.10</slf4j.version>
<sleepycat.version>3.3.75</sleepycat.version>
<junit.version>4.12</junit.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<repositories>
<repository>
<id>oracleReleases</id>
<name>Oracle Released Java Packages</name>
<url>http://download.oracle.com/maven</url>
<layout>default</layout>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>io.cortical</groupId>
<artifactId>retina-spark</artifactId>
<version>unused-because-loaded-with-system-scope</version>
<scope>system</scope>
<systemPath>${retina.spark.distrib.jar}</systemPath>
</dependency>
<dependency>
<groupId>io.cortical</groupId>
<artifactId>retina-spark-license</artifactId>
<version>unused-because-loaded-with-system-scope</version>
<!-- should be test scope but systemPath requires system scope -->
<scope>system</scope>
<systemPath>${retina.spark.license.jar}</systemPath>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<!-- note license restrictions -->
<groupId>com.sleepycat</groupId>
<artifactId>je</artifactId>
<version>${sleepycat.version}</version>
</dependency>
<dependency>
<groupId>org.reflections</groupId>
<artifactId>reflections</artifactId>
<version>0.9.10</version>
<exclusions>
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
</plugin>
</plugins>
<pluginManagement>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<javacArgs>
<javacArg>-source</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-target</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-Xlint:all,-serial,-path</javacArg>
</javacArgs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.19.1</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<configuration>
<artifactSet>
<excludes>
<exclude>io.cortical:*</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/**/pom.*</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.eclipse.m2e</groupId>
<artifactId>lifecycle-mapping</artifactId>
<version>1.0.0</version>
<configuration>
<lifecycleMappingMetadata>
<pluginExecutions>
<pluginExecution>
<pluginExecutionFilter>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<versionRange>[3.2.2,)</versionRange>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</pluginExecutionFilter>
<action>
<ignore/>
</action>
</pluginExecution>
</pluginExecutions>
</lifecycleMappingMetadata>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
<version>3.0.1</version>
<executions>
<execution>
<id>attach-sources</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
A detailed discussion of the Maven pom.xml is outside the scope of this document. Its most important aspects are:
- It should be located in the project root and have the file name pom.xml.
- The property retina.spark.version must be set to the version of Retina Library to be used in the Retina Library Program. This is the version in the Retina Library distribution jar file name.
- The properties retina.spark.distrib.jar and retina.spark.license.jar must be set to the paths of the Retina Library distribution jar file and Retina Library license jar file, respectively. Dependencies are then defined on these two jar files (as system-scoped dependencies, so that the jar files are loaded from the file system rather than from a Maven repository: see sections The Retina Library distribution jar file and The Retina Library license jar file).
- The Java and Scala versions are set to 1.7 and 2.10.6, respectively (see section Supported configurations and versions).
- Further dependencies on the Scala library, commons-codec, a logging framework (slf4j) and JUnit are defined. The dependencies on commons-codec and slf4j are required by Retina Library even if your code does not make use of logging or commons-codec. The Scala library is required by Retina Library as well as the Retina Library Program. The dependency on JUnit is only needed if JUnit tests are included in the Retina Library Program project (which they are not, so far).
- An explicit dependency on Oracle Berkeley DB Java Edition, and the Oracle Maven repository providing it, is defined. This is only needed if DiskSerializingSemanticSearchWrapper from Retina Library is used, and requires separate licensing of Oracle Berkeley DB Java Edition from Oracle Corporation.
- The remainder of the pom.xml configures the compilation and packaging process.
- Packaging produces an assembly jar (also known as an über jar, or shaded jar) with the help of the maven-shade-plugin. The assembly jar contains the Java bytecode of the Retina Library Program and all dependencies, excluding the Retina Library distribution jar file and Retina Library license jar file.
Note that the need to define an explicit dependency on commons-codec and slf4j is considered outdated and will likely be removed in a future release of Retina Library.
3.3.1.4. Maven build from the command-line
Now that the newly created Retina Library Program project contains a Scala source file and a Maven pom.xml, it can be built from the command-line from within the project root:
$ mvn clean install
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building retina-spark-template-app-no-spark 1.0.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.958 s
[INFO] Finished at: 2016-07-20T15:26:42+02:00
[INFO] Final Memory: 19M/323M
[INFO] ------------------------------------------------------------------------
A reasonably up-to-date Maven installation and a JDK are highly recommended for building the Retina Library Program (see section Supported configurations and versions), e.g.:
$ mvn -version
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T17:41:47+01:00)
Maven home: /usr/local/Cellar/maven/3.3.9/libexec
Java version: 1.8.0_66, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: "mac"
The Maven build generates an assembly jar file in the target directory underneath the project root, e.g.
$ ls target/*.jar
target/original-retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar target/retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar
Here, the jar file original-*.jar is the original, non-assembly jar file without dependencies, and can therefore be ignored. The jar file retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar, by contrast, is the assembly jar containing everything needed to execute the Retina Library Program, except the Retina Library distribution jar file, the Retina Library license jar file and any Retinas.
3.3.1.5. Execute the Retina Library Program from the command-line
After a successful Maven build, and given that a Retina to load at runtime has been provided, the Retina Library Program can be executed as follows from the command-line from within the project root:
$ java -cp \
~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar:target/retina-spark-template-app-no-spark-1.0.0-SNAPSHOT.jar \
example.HelloRetinaWithoutSpark
In other words, this command-line executes a standard Java application with the fully-qualified class name example.HelloRetinaWithoutSpark, using a Java classpath consisting of the assembly jar of this Retina Library Program and the Retina Library distribution jar file. The Retina Library distribution jar file and the Retina Library license jar file, in this case, are both located in the directory ~/local/opt/retina-spark.
A reasonably up-to-date JRE is highly recommended for executing the Retina Library Program (see section Supported configurations and versions), e.g.:
$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
After some initial output describing the Retina Library licensee and license conditions, the Retina Library Program should print the result:
The Semantic Fingerprint has 638 of 16384 possible positions set.
Congratulations, you have just created your first Retina Library Program from scratch and built and executed it from the command-line. The program has loaded a Retina, calculated a Semantic Fingerprint, and compared the number of positions in that Semantic Fingerprint to the maximum number of positions possible with that Retina.
3.3.1.6. Import into the Scala IDE for Eclipse
Now that the simple Retina Library Program compiles and executes from the command-line, it is convenient to import it into an IDE for further development. In this document the Scala IDE for Eclipse is used to demonstrate IDE usage, but there are many perfectly viable alternatives, such as IntelliJ IDEA.
Proceed as follows to import the Retina Library Program into Scala IDE for Eclipse:
- Download and install the latest release of Scala IDE for Eclipse from http://scala-ide.org.
- Open Scala IDE for Eclipse and select a Workspace.
- Select "File" > "Import…" > "Maven" > "Existing Maven Projects" > "Next >".
- Browse to the project root of the Retina Library Program created previously and press "Open": Scala IDE for Eclipse should detect the Maven project and list it under "Projects:".
- Press "Finish": The project should now appear in the "Package Explorer".
- Right-click on the newly imported project and select "Configure" > "Add Scala Nature": The "Scala Library container" should appear under the project in the "Package Explorer".
- Correct the Scala library version by right-clicking on the "Scala Library container", selecting "Properties" and choosing, ideally, the exact same Scala version that was configured in the Maven pom.xml, e.g. "Fixed Scala Library container: 2.10.6".
- Add the Scala source directory src/main/scala to the project’s source folders by navigating to it in the "Package Explorer", right-clicking on it, and selecting "Build Path" > "Use as Source Folder".
Scala IDE for Eclipse has now been configured to correctly deal with the Retina Library Program as a Scala Maven project.
All future changes to the Scala source code or the Maven pom.xml of the Retina Library Program can from now on be done in Scala IDE for Eclipse. Furthermore, unit tests and the simple Scala application [HelloRetinaWithoutSpark.scala] can now be executed from within Scala IDE for Eclipse rather than from the command-line.
3.3.1.7. Execute the Retina Library Program from within the Scala IDE for Eclipse
After the successful import of the project into the Scala IDE for Eclipse, the Retina Library Program can be executed as follows:
- Navigate to the [HelloRetinaWithoutSpark.scala] file in the "Package Explorer".
- Right-click, select "Run As" > "Scala Application".
The Retina Library Program now executes within Scala IDE for Eclipse and all output previously seen on the command-line now appears in the "Console" of the Scala IDE for Eclipse.
3.3.2. A simple Scala Retina Library Program using Apache Spark features
In this section we create a second variant of the basic Semantic Text Processing functionality implemented in section A simple Scala Retina Library Program not requiring Apache Spark by adding very simple usage of Apache Spark features to it. The resulting Retina Library Program therefore depends on Apache Spark and will only run in an Apache Spark cluster.
Create a new empty project root for this project; all activities described in this section must be performed below it.
3.3.2.1. Scala code for a simple Apache Spark-dependent Retina Library Program
Building on the simple Apache Spark-independent Retina Library Program discussed before, the task of fingerprinting a larger number of texts is distributed over an Apache Spark cluster as shown in [HelloRetinaSpark.scala]:
HelloRetinaSpark.scala: A Scala Retina Library Program performing very basic Semantic Text Processing using Apache Spark features.
package example

import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getFingerprint}
import io.cortical.scala.spark.util.{sparkContext, withSparkContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object HelloRetinaSpark {
  import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine

  val rdir = "./retinas"
  val rname = "english_subset"

  def main(args: Array[String]): Unit = {
    withSparkContext(sparkContext(appName = "HelloRetinaSpark")) {
      work
    }
  }

  private def work(sc: SparkContext, sqlContext: SQLContext): Unit = {
    implicit val engine = sc.broadcast(getCorticalEngine(new FileRetinaLoader(rdir), rname))
    val size = engine.value.getRetinaSize
    val ns = sc.parallelize(1 to 1000)
    val texts = ns.map(n => s"Hello Retina World with num $n !")
    val fps = texts.map(text => getFingerprint(text))
    val lens = fps.map(_.length)
    val dLens = lens.distinct.collect.toSeq
    println(s"All Semantic Fingerprints have ${dLens mkString ","} of $size possible positions set.")
  }
}
Section Perform Semantic Text Processing with Retina Library explains in more detail the Semantic Text Processing features used in the Retina Library Program [HelloRetinaSpark.scala]. Briefly, the most important commonalities and differences to [HelloRetinaWithoutSpark.scala] are:
- As before, this is a Scala application that loads the english_subset Retina from the ./retinas directory.
- The Retina Library utility functions sparkContext and withSparkContext are used to create an Apache Spark SparkContext in all execution situations (see section Scala and Spark utilities), perform work in the scope of that SparkContext, and then close it. The real work of this Retina Library Program is done in the function called work.
- As before, the Retina, once it has been loaded, is used to create/retrieve a CorticalEngine. However, in an Apache Spark environment, it is crucial that the CorticalEngine is distributed to all Spark cluster nodes as a Spark Broadcast variable. It is this Spark Broadcast variable that is then assigned to an implicit variable.
- The size of the Retina is retrieved from the CorticalEngine in the same way as before, taking into account that the variable engine is now a Spark Broadcast variable containing a CorticalEngine.
- The import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine line is needed to import an implicit conversion from a Spark Broadcast variable containing a CorticalEngine to a CorticalEngine, which is used transparently in the call to getFingerprint.
- Using straightforward Apache Spark features, the Semantic Fingerprints of 1000 trivial pieces of text are calculated in parallel on the Apache Spark cluster.
- The number of positions set in each of these 1000 Semantic Fingerprints is determined in parallel and the distinct (unique) counts are collected into the Spark driver. Since all 1000 pieces of text are very similar, the lengths of all Semantic Fingerprints are expected to be the same, and hence only one distinct value is expected to be returned to the Spark driver.
- The distinct numbers of positions set in all Semantic Fingerprints versus the maximum number of positions are printed to the console.
Paste the code for [HelloRetinaSpark.scala] into a Scala source file of that name. Note that we will also need a Retina available at runtime, as discussed previously in section The Retina.
3.3.2.2. Maven build for an Apache Spark-enabled Scala Retina Library Program
A Maven pom.xml build file for an Apache Spark-enabled Scala Retina Library Program is shown in the following:
pom.xml: Maven build file for a Scala Retina Library Program that depends on Apache Spark at build and execution time.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>retina-spark-template-app</artifactId>
<version>1.0.0-SNAPSHOT</version>
<properties>
<retina.spark.version>2.5.0</retina.spark.version>
<!-- path to the Retina Spark distribution jar file -->
<retina.spark.distrib.jar>${project.basedir}/lib/retina-spark-${retina.spark.version}-obfuscated.jar</retina.spark.distrib.jar>
<!-- path to the Retina Spark license jar file retina-spark-license.jar; typically in the same directory as the Retina Spark distribution jar file -->
<retina.spark.license.jar>${project.basedir}/lib/retina-spark-license.jar</retina.spark.license.jar>
<java.version>1.7</java.version>
<scala.version>2.10.6</scala.version>
<scala.binary.version>2.10</scala.binary.version>
<spark.version>1.6.2</spark.version>
<slf4j.version>1.7.10</slf4j.version>
<sleepycat.version>3.3.75</sleepycat.version>
<junit.version>4.12</junit.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<repositories>
<repository>
<id>oracleReleases</id>
<name>Oracle Released Java Packages</name>
<url>http://download.oracle.com/maven</url>
<layout>default</layout>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>io.cortical</groupId>
<artifactId>retina-spark</artifactId>
<version>unused-because-loaded-with-system-scope</version>
<scope>system</scope>
<systemPath>${retina.spark.distrib.jar}</systemPath>
</dependency>
<dependency>
<groupId>io.cortical</groupId>
<artifactId>retina-spark-license</artifactId>
<version>unused-because-loaded-with-system-scope</version>
<!-- should be test scope but systemPath requires system scope -->
<scope>system</scope>
<systemPath>${retina.spark.license.jar}</systemPath>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<!-- note license restrictions -->
<groupId>com.sleepycat</groupId>
<artifactId>je</artifactId>
<version>${sleepycat.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.reflections</groupId>
<artifactId>reflections</artifactId>
<version>0.9.10</version>
<exclusions>
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
</plugin>
</plugins>
<pluginManagement>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<javacArgs>
<javacArg>-source</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-target</javacArg>
<javacArg>${java.version}</javacArg>
<javacArg>-Xlint:all,-serial,-path</javacArg>
</javacArgs>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.19.1</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<configuration>
<artifactSet>
<excludes>
<exclude>io.cortical:*</exclude>
</excludes>
</artifactSet>
<relocations>
<relocation>
<pattern>com.fasterxml.jackson.databind</pattern>
<shadedPattern>io.cortical.ext.fasterxml.jackson.databind</shadedPattern>
</relocation>
<relocation>
<pattern>com.fasterxml.jackson.annotation</pattern>
<shadedPattern>io.cortical.ext.fasterxml.jackson.annotation</shadedPattern>
</relocation>
<relocation>
<pattern>com.fasterxml.jackson.core</pattern>
<shadedPattern>io.cortical.ext.fasterxml.jackson.core</shadedPattern>
</relocation>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>io.cortical.ext.google.common</shadedPattern>
</relocation>
</relocations>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/**/pom.*</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.eclipse.m2e</groupId>
<artifactId>lifecycle-mapping</artifactId>
<version>1.0.0</version>
<configuration>
<lifecycleMappingMetadata>
<pluginExecutions>
<pluginExecution>
<pluginExecutionFilter>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<versionRange>[3.2.2,)</versionRange>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</pluginExecutionFilter>
<action>
<ignore/>
</action>
</pluginExecution>
</pluginExecutions>
</lifecycleMappingMetadata>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
<version>3.0.1</version>
<executions>
<execution>
<id>attach-sources</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
A detailed discussion of this Maven [pom.xml] is outside the scope of this document. Its most important similarities and differences compared to the pom.xml for an Apache Spark-independent Retina Library Program are:
- The property spark.version must be set to the version of Apache Spark to be used in the Retina Library Program (here 1.6.2; see section Supported configurations and versions).
- All dependencies that are available on the classpath of the Apache Spark runtime must be marked with provided scope.
- Dependencies on the Apache Spark core, SQL and MLlib modules are defined (with provided scope).
- An explicit dependency on Oracle Berkeley DB Java Edition, and the Oracle Maven repository providing it, is defined. This is only needed if DiskSerializingSemanticSearchWrapper from Retina Library is used, and requires separate licensing of Oracle Berkeley DB Java Edition from Oracle Corporation.
- The remainder of the pom.xml configures the compilation and packaging process. All dependencies that are known to clash with those provided on the Apache Spark classpath are excluded from the assembly jar.
The Maven build using this [pom.xml] now works as expected from the command-line within the project root, compiling [HelloRetinaSpark.scala] and producing an assembly jar:
$ mvn clean install
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building retina-spark-template-app 1.0.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.751 s
[INFO] Finished at: 2016-07-21T17:21:14+02:00
[INFO] Final Memory: 29M/638M
[INFO] ------------------------------------------------------------------------
This Maven build produces the assembly jar file retina-spark-template-app-1.0.0-SNAPSHOT.jar, which contains everything needed to execute the Retina Library Program, except what is provided by the Apache Spark runtime, the Retina Library distribution jar file, the Retina Library license jar file, and any Retinas.
3.3.2.3. Use the Scala IDE for Eclipse with an Apache Spark-enabled Scala Retina Library Program
The import of a Scala Retina Library Program that depends on Apache Spark into the Scala IDE for Eclipse works in the same way as described in section Import into the Scala IDE for Eclipse for an Apache Spark-independent Retina Library Program.
The same is true for the execution of an Apache Spark-dependent Retina Library Program: section Execute the Retina Library Program from within the Scala IDE for Eclipse is applicable without change.
The execution of a program like [HelloRetinaSpark.scala] from within Scala IDE for Eclipse works because the Retina Library utility function sparkContext, when used under these circumstances, creates an Apache Spark SparkContext in Spark local mode. This is described in more detail in section Scala and Spark utilities.
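As a minimal sketch of this behaviour (assuming, as [HelloRetinaSpark.scala] suggests, that withSparkContext accepts a function of a SparkContext and an SQLContext), the following program runs unchanged in the IDE, in Spark local mode and on a cluster; the object name LocalModeSketch is illustrative:
LocalModeSketch.scala: Relying on sparkContext to fall back to Spark local mode.
package example

import io.cortical.scala.spark.util.{sparkContext, withSparkContext}

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // Outside a cluster this creates a local-mode SparkContext in the current JVM.
    withSparkContext(sparkContext(appName = "LocalModeSketch")) { (sc, sqlContext) =>
      println(s"Default parallelism: ${sc.defaultParallelism}")
    }
  }
}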
3.3.2.4. Execute the Retina Library Program from the command-line in a single JVM using Apache Spark local mode
Using Spark local mode, a Retina Library Program can be executed in a single JVM with the full Apache Spark runtime environment. In this execution mode, the Java classpath is as it would be in an Apache Spark cluster, but there is only one Apache Spark node and all communication is JVM-local.
Spark local mode can be used to execute a Retina Library Program from within the Scala IDE for Eclipse (see section Use the Scala IDE for Eclipse with an Apache Spark-enabled Scala Retina Library Program) or from the command-line as shown in the following:
$ spark-submit --master local[*] \
--jars ~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar \
--class example.HelloRetinaSpark \
target/retina-spark-template-app-1.0.0-SNAPSHOT.jar
This statement executes, as an Apache Spark job, a Java application with the fully-qualified class name example.HelloRetinaSpark, using a Java classpath consisting of the assembly jar of this Retina Library Program, the Retina Library distribution jar file and the Apache Spark runtime. The Retina Library distribution jar file and the Retina Library license jar file, in this case, are both located in the directory ~/local/opt/retina-spark. Apache Spark will use all available cores on the current machine to execute the Retina Library Program.
3.3.2.5. Execute the Retina Library Program from the command-line on a distributed Apache Spark cluster
Spark local mode is useful during development and for workloads that can be handled by a single machine. More realistically, though, Retina Library Programs will be executed on a distributed Apache Spark cluster, i.e. a cluster of several Spark cluster nodes.
A discussion of the different variants of launching Apache Spark clusters is beyond the scope of this document. Retina Library can be used with all cluster modes available in Apache Spark. The following interaction shows launching the Retina Library Program in an Apache Spark cluster in so-called standalone mode.
First start the Apache Spark cluster with one master and as many slaves as desired. From the Apache Spark base directory on what will become the Apache Spark master node, execute
$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to ...
And from the Apache Spark base directory on as many (typically) other nodes, which will become worker nodes, execute
$ sbin/start-slave.sh spark://127.0.0.1:7077
starting org.apache.spark.deploy.worker.Worker, logging to ...
choosing a URL that points at the correct master node.
Then, on any machine with an Apache Spark installation and access to the master node, use spark-submit to launch the Retina Library Program, again passing a URL that points at the correct master node:
$ spark-submit --master spark://127.0.0.1:7077 \
--jars ~/local/opt/retina-spark/retina-spark-2.5.0-obfuscated.jar \
--class example.HelloRetinaSpark \
target/retina-spark-template-app-1.0.0-SNAPSHOT.jar
This statement executes as an Apache Spark job the same Java application as in section Execute the Retina Library Program from the command-line in a single JVM using Apache Spark local mode, but this time the Apache Spark job is distributed over the JVMs and physical machines that comprise the Spark cluster. The Retina Library distribution jar file and the Retina Library license jar file must both be located in the directory ~/local/opt/retina-spark.
The Retina is loaded from the file system on the Spark driver node - in this case the machine from which spark-submit was invoked - and is then distributed from there to all Spark cluster nodes. Output is also performed on the Spark driver.
Congratulations, you have now implemented and executed a fully distributed Maven Scala Retina Library Program! The program has loaded a Retina, calculated a large number of Semantic Fingerprints, and compared the number of positions in all those Semantic Fingerprints to the maximum number of positions possible with that Retina.
4. The public API of Retina Library
The public API of Retina Library consists of all types and functions that are intended to be called by users of Retina Library when implementing a Retina Library Program. It consists fundamentally of:
- features that support Semantic Text Processing, the heart and soul of Retina Library, to be discussed in section Perform Semantic Text Processing with Retina Library, and
- supporting utilities not directly related to Semantic Text Processing, to be discussed in section General utilities provided in Retina Library.
Most example code in this section makes use of base classes that aim to reduce boilerplate and allow us to focus on the Retina Library feature in question. For completeness, these base classes are shown in Base classes used in example code. When studying example code it suffices to know that the following symbols are defined in these base classes and are available in example code:
- For a Retina Library Program executing in an Apache Spark cluster,
  - sc and sqlContext denote an instance of SparkContext and SQLContext, respectively,
  - engine denotes an instance of a Spark Broadcast variable containing a CorticalEngine for a default Retina (i.e., Broadcast[CorticalEngine]).
- For a Retina Library Program executing outside an Apache Spark cluster,
  - engine denotes an instance of CorticalEngine.
In general, example code in this section that works outside of Apache Spark will be shown without the use of any Apache Spark features, i.e. in an Apache Spark-independent form.
4.1. Enumeration of the Retina Library public API
The Retina Library public API comprises the following Java/Scala packages, types and functions:
- com.neovisionaries.i18n.LanguageCode
- io.cortical.document.api.DocumentFingerprintDb
- io.cortical.document.impl.IndexedDocumentFingerprintDb
- io.cortical.engine.api.CorticalEngine
- io.cortical.engine.api.CorticalEngineFactory
- io.cortical.fingerprint.compare.api.FingerprintComparisons
- io.cortical.model.core.CoreTerm
- io.cortical.model.languages.Languages
- io.cortical.nlp.pos.CorePosTypes
- io.cortical.retina.source.FileRetinaLoader
- io.cortical.retina.source.ResourceRetinaLoader
- io.cortical.retina.source.RetinaLoader
- io.cortical.retina.source.RetinaProperties
- io.cortical.retina.source.S3RetinaLoader
- io.cortical.scala.api
- io.cortical.scala.api.CorticalApi
- io.cortical.scala.api.DocumentFingerprintDb
- io.cortical.scala.api.Fingerprint
- io.cortical.scala.api.Fingerprinted
- io.cortical.scala.api.FullSemanticSearcher
- io.cortical.scala.api.PartitionedDocumentDb
- io.cortical.scala.api.PartitionedFileCachingDocumentDb
- io.cortical.scala.api.PreservingDocumentDb
- io.cortical.scala.api.ParentDocumentDb
- io.cortical.scala.api.Scored
- io.cortical.scala.api.SemanticClassifier
- io.cortical.scala.api.SemanticSearcher
- io.cortical.scala.api.SemanticTextClassifier
- io.cortical.scala.api.StoringSemanticSearcher
- io.cortical.scala.api.StringLabelSemanticClassifier
- io.cortical.scala.api.Textual
- io.cortical.scala.api.UpdateableDocumentFingerprintDb
- io.cortical.scala.api.UpdateableSemanticSearcher
- io.cortical.scala.api.document
- io.cortical.scala.api.document.Doc
- io.cortical.scala.api.document.DocID
- io.cortical.scala.api.document.DocIDSemanticSearcher
- io.cortical.scala.api.document.DocPreserving
- io.cortical.scala.api.document.DocSemanticSearcher
- io.cortical.scala.api.document.FingerprintedDoc
- io.cortical.scala.api.document.FingerprintedTextDoc
- io.cortical.scala.api.document.PreservingFingerprintedTextDoc
- io.cortical.scala.api.document.PreservingScoredFingerprintedTextDoc
- io.cortical.scala.api.document.PreservingFingerprintedParentTextDoc
- io.cortical.scala.api.document.ScoredFingerprintedTextDoc
- io.cortical.scala.api.document.TextDoc
- io.cortical.scala.api.document.persistence
- io.cortical.scala.api.document.persistence.DiskSerializingSemanticSearchWrapper
- io.cortical.scala.api.metadata
- io.cortical.scala.api.metadata.Metadata
- io.cortical.language.detection.api.LanguageDetection
- io.cortical.language.detection.impl.LanguageDetectionImpl
- io.cortical.scala.api.orderingForScored
- io.cortical.scala.spark.util
- io.cortical.scala.spark.util.numOfWorkerNodesInSparkCluster
- io.cortical.scala.spark.util.sparkContext
- io.cortical.scala.spark.util.valueOfBroadcastSemanticSearcher
- io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
- io.cortical.scala.spark.util.withSparkContext
4.2. Perform Semantic Text Processing with Retina Library
This section explains the part of the Retina Library public API that relates to Semantic Text Processing. Knowledge of the fundamentals of Semantic Text Processing is assumed (see section Prerequisites for this document).
4.2.1. CorticalApi and CorticalEngine: Core algorithms for Semantic Text Processing
CorticalEngine and CorticalApi are two types - the former primarily for Java code, the latter for Scala code - that give access to the core Semantic Text Processing features in Retina Library. This section shows how to use the algorithms provided by CorticalEngine and CorticalApi, and explains those algorithms to the extent necessary to make sense of the code shown. For a deeper, more scientific explanation of these algorithms please consult the references listed in section Prerequisites for this document.
Both CorticalEngine and CorticalApi give access to the same algorithms using slightly different syntax. The remainder of this section will mostly show example code using CorticalApi, because it is the simpler choice for Retina Library Programs written in Scala.
4.2.1.1. Load a Retina
A Retina is a fairly large - on the order of tens or hundreds of megabytes - data structure that captures a Semantic Space, i.e. the meaning of the terms used in a corpus (body) of documents. Almost all operations in Retina Library require a Retina. A Retina is trained by Cortical.io from a document corpus and delivered to users of Retina Library as a set of files. During the execution of a Retina Library Program one or more Retinas must be loaded into the Retina Library Program by reading these files.
A Retina is typically specific to one language. It is common to have several Retinas for the same language capturing different Semantic Spaces expressed in that language. For instance, a "general English" Retina and an "automotive English" Retina both contain English terms, but the former contains more terms that have nothing to do with cars, vehicles, etc., whereas the latter contains more terms in that domain, with better semantic resolution of those terms. It is also common to have several Retinas for the same Semantic Space expressed in different languages - so-called (cross-language) aligned Retinas. For instance, three Retinas, one in Spanish, another in German and a third in English, might all capture the same Semantic Space in their respective languages, in such a way that the representations of meaning (the Semantic Fingerprints) captured by these Retinas are transferable between Retinas and hence between languages. This is the basis for cross-language functionality in Retina Library.
Retina Library provides an abstraction for loading a Retina from persistent storage: the RetinaLoader. Retina Library ships with three implementations of the RetinaLoader: one that reads from the file system, one that reads from an Amazon S3 bucket, and one that reads from the Java classpath. The latter is intended for unit tests, as it is only practical when the Retina is sufficiently small, yet is very convenient as it eliminates any dependency on an external storage location (file system directory or S3 bucket).
Typically, a RetinaLoader instance is immediately used to create a CorticalEngine, as described in section Create a CorticalEngine. However, RetinaLoader also supports useful operations in its own right, which are shown in [LoadRetinas.scala]:
LoadRetinas.scala: Creating RetinaLoaders and using them to explore Retinas.
package example.feature

import io.cortical.retina.source.{FileRetinaLoader, ResourceRetinaLoader, RetinaLoader, S3RetinaLoader}
import scala.collection.JavaConverters._

object LoadRetinas extends S3Constants {
  def main(args: Array[String]): Unit = {
    val frl: RetinaLoader = new FileRetinaLoader("./retinas")
    val srl: RetinaLoader = new S3RetinaLoader(AwsAccessKey, AwsSecretKey, S3Endpoint, RetinasS3BucketName)
    val rrl: RetinaLoader = new ResourceRetinaLoader("/small-retinas")
    val fretinas = frl.getAvailableRetinaNames.asScala
    val sretinas = srl.getAvailableRetinaNames.asScala
    val rretina = rrl.getRetinaProperties("spanish_subset")
    assert(rretina.getLanguage == "es")
    println(
      s"""
         |Available Retinas
         |  in directory: ${fretinas mkString ","}
         |  in S3 bucket: ${sretinas mkString ","}
         |  on classpath: at least one in ${rretina.getLanguage}
       """.stripMargin)
  }
}
In the [LoadRetinas.scala] example, three RetinaLoaders are created: one that reads from the file system directory ./retinas (relative to the current directory), one that reads from the AWS S3 bucket identified by the given values, and a third one that reads from the directory small-retinas at the root of the Java classpath (hence /small-retinas). The former two RetinaLoaders support enquiring about all available Retinas at the location passed to the RetinaLoader, whereas the latter does not. Every RetinaLoader can load properties describing a given Retina, such as the language of that Retina.
All RetinaLoaders assume a layout like the following underneath the root directory used by the RetinaLoader to load Retinas:
.
./arabic
./arabic/retina.line
./arabic/retina.properties
./business_intelligence
./business_intelligence/retina.line
./business_intelligence/retina.properties
./chinese
./chinese/retina.line
./chinese/retina.properties
./danish
./danish/retina.line
./danish/retina.properties
./en_associative
./en_associative/retina.line
./en_associative/retina.properties
./english_retina
./english_retina/retina.line
./english_retina/retina.properties
./english_subset
./english_subset/retina.line
./english_subset/retina.properties
./eu_market_english
./eu_market_english/retina.line
./eu_market_english/retina.properties
...
./spanish
./spanish/retina.line
./spanish/retina.properties
./spanish_subset
./spanish_subset/retina.line
./spanish_subset/retina.properties
When loading Retinas from a file system directory or the Java classpath, it is the root directory of this tree that is passed to the respective RetinaLoader. When loading from an S3 bucket, the top-level directory in the bucket must itself be the root of this tree.
4.2.1.2. Create a CorticalEngine
CorticalEngine is the most fundamental entry point to the core Semantic Text Processing features of Retina Library. It is a Java interface - an object implementing that interface can be created in one of two ways:
- In pure Java, the CorticalEngineFactory can be used to create/retrieve the CorticalEngine for the Retina with a given name. This is shown in [CreateCorticalEngines1.scala].
- In Scala, the Scala-friendly CorticalApi can be used to achieve the same effect less verbosely, as shown in [CreateCorticalEngines2.scala].
CreateCorticalEngines1.scala: Loading a Retina and creating the CorticalEngine for that Retina using the Java CorticalEngineFactory.
package example.feature

import io.cortical.engine.api.{CorticalEngine, CorticalEngineFactory}
import io.cortical.retina.source.FileRetinaLoader

object CreateCorticalEngines1 {
  def main(args: Array[String]): Unit = {
    val loader = new FileRetinaLoader("./retinas")
    val factory: CorticalEngineFactory = CorticalEngineFactory.getInstance(loader)
    val ceEN: CorticalEngine = factory.getCorticalEngine("english_subset")
    val ceDE: CorticalEngine = factory.getCorticalEngine("german_subset")
    println(s"The Retinas support fingerprints with ${ceEN.getRetinaSize} and ${ceDE.getRetinaSize} positions.")
  }
}
CreateCorticalEngines2.scala: Loading a Retina and creating the CorticalEngine for that Retina using the Scala CorticalApi.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
object CreateCorticalEngines2 {
def main(args: Array[String]): Unit = {
val loader = new FileRetinaLoader("./retinas")
val ceEN: CorticalEngine = getCorticalEngine(loader, "english_subset")
val ceDE: CorticalEngine = getCorticalEngine(loader, "german_subset")
println(s"The Retinas support fingerprints with ${ceEN.getRetinaSize} and ${ceDE.getRetinaSize} positions.")
}
}
CorticalEngine essentially adds Semantic Text Processing operations on top of a Retina and is therefore the most important type in Retina Library: whenever a Retina is used in Retina Library, it is used from a CorticalEngine that has been created for that Retina. CorticalApi is a Scala adaptation on top of CorticalEngine: Scala code uses both CorticalEngine and CorticalApi, where operations are invoked through CorticalApi and CorticalEngine mainly plays the role of a handle for a Retina. The CorticalApi has no references to a CorticalEngine or Retina - it is stateless. See section Pass the CorticalEngine to CorticalApi for a discussion of this.
The code shown in examples [CreateCorticalEngines1.scala] and [CreateCorticalEngines2.scala] works fine, but when the Retina Library Program executes in an Apache Spark cluster there is one more important aspect to consider: since a Retina is large, and the CorticalEngine directly references exactly one Retina, the distribution of a CorticalEngine over the Spark cluster nodes must be optimised through the use of a Spark Broadcast variable. The idiom for this, which is strongly recommended for all Retina Library Programs executing in Apache Spark clusters, is shown in [CreateCorticalEngines3.scala].
CreateCorticalEngines3.scala: Loading a Retina and creating the CorticalEngine for that Retina using the Scala CorticalApi in an Apache Spark cluster.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
import org.apache.spark.broadcast.Broadcast
object CreateCorticalEngines3 extends SparkApp {
override protected def work(): Unit = {
val loader = new FileRetinaLoader("./retinas")
val ceEN: Broadcast[CorticalEngine] = sc.broadcast(getCorticalEngine(loader, "english_subset"))
val ceDE: Broadcast[CorticalEngine] = sc.broadcast(getCorticalEngine(loader, "german_subset"))
println(s"The Retinas support fingerprints with ${ceEN.value.getRetinaSize} and ${ceDE.value.getRetinaSize} positions.")
}
}
The salient feature of [CreateCorticalEngines3.scala] is the fact that no reference to any CorticalEngine is kept - every CorticalEngine is immediately broadcast over the Apache Spark cluster, and the only reference that is retained is that to a Spark Broadcast variable containing the CorticalEngine. In that way, accidental (inefficient) serialization of the CorticalEngine to the Spark cluster nodes is prevented. Note the use of .value to retrieve the CorticalEngine from its Spark Broadcast variable.
It should also be noted that the RetinaLoader instances are only usable on the Spark cluster node on which they were created - which should always be the Spark driver.
4.2.1.3. Pass the CorticalEngine to CorticalApi
CorticalApi is a Scala object whose functions largely mirror the methods of CorticalEngine, with the addition of a curried implicit parameter for the CorticalEngine to use. For instance, the signature of getTerm in CorticalEngine is
CorticalEngine getTerm (Java)
CoreTerm getTerm(String term);
whereas the signature and implementation of that same method in CorticalApi is
CorticalApi getTerm (Scala)
def getTerm(term: String)(implicit engine: CorticalEngine): CoreTerm = engine.getTerm(term)
Other functions in CorticalApi perform more work to bridge Java and Scala, but the general approach of passing the CorticalEngine instance to use to the CorticalApi object as an implicit parameter stays the same.
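Application code can adopt the same convention for its own helpers. The following is a hedged sketch, not part of the Retina Library API: the object UserHelpers and the function keywordSummary are hypothetical, and only the CorticalApi function extractKeywords shown later in this guide is assumed.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.extractKeywords
object UserHelpers {
// Hypothetical user-defined helper: by taking the CorticalEngine as a
// curried implicit parameter it composes seamlessly with CorticalApi calls.
def keywordSummary(text: String)(implicit engine: CorticalEngine): String =
extractKeywords(text, 5) map (_.getTerm) mkString ","
}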
This means that if a Retina Library Program uses just one Retina and hence just one CorticalEngine - an important special case of Retina Library Programs - then the idiomatic use of Retina Library is to define that CorticalEngine as an implicit val which will then be passed transparently, through the magic of Scala implicits, to every invocation of a CorticalApi function. This is shown in example [ImplicitCorticalEngine1.scala]:
ImplicitCorticalEngine1.scala: Passing a CorticalEngine and its Retina explicitly and implicitly to CorticalApi functions.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getRetinaSize}
object ImplicitCorticalEngine1 {
def main(args: Array[String]): Unit = {
implicit val engine: CorticalEngine = getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset")
val size1 = getRetinaSize(engine) // explicit (unnecessary)
val size2 = getRetinaSize // implicit
assert(size1 == size2)
println(s"The Retina supports fingerprints with $size1 positions.")
}
}
In example [ImplicitCorticalEngine1.scala], the same CorticalEngine instance is passed to two invocations of the CorticalApi function getRetinaSize, first explicitly and then implicitly. The latter is preferred for Retina Library Programs that work with only one CorticalEngine.
In the case of a Retina Library Program executing in an Apache Spark cluster, the CorticalEngine will always be wrapped in a Spark Broadcast variable, as discussed previously. This leads us to the following extended use of implicits shown in [ImplicitCorticalEngine2.scala]:
ImplicitCorticalEngine2.scala: Passing a Spark Broadcast variable containing a CorticalEngine explicitly and implicitly to CorticalApi functions.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getCorticalEngine, getRetinaSize}
import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
import org.apache.spark.broadcast.Broadcast
object ImplicitCorticalEngine2 extends SparkApp {
override protected def work(): Unit = {
implicit val engine: Broadcast[CorticalEngine] = sc.broadcast(getCorticalEngine(new FileRetinaLoader("./retinas"), "english_subset"))
val size1 = getRetinaSize(engine.value) // explicit (unnecessary)
val size2 = getRetinaSize // implicit
assert(size1 == size2)
println(s"The Retina supports fingerprints with $size1 positions.")
}
}
In example [ImplicitCorticalEngine2.scala] a Spark Broadcast variable containing a CorticalEngine instance is assigned to a Scala implicit val. Thus when a CorticalApi function is called and the CorticalEngine should be passed explicitly to that function, the CorticalEngine must first be retrieved from the Spark Broadcast variable using .value. However, the implicit passing of the CorticalEngine parameter to the CorticalApi function is as convenient and transparent as before, thanks to the implicit conversion function valueOfBroadcastCorticalEngine which takes care of the implicit unwrapping of an implicitly available Spark Broadcast variable containing a CorticalEngine.
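Conceptually, valueOfBroadcastCorticalEngine does little more than lift an implicitly available Spark Broadcast variable into the plain CorticalEngine it contains. The following one-liner is a sketch of that concept only, not the actual Retina Library source:
// Conceptual sketch (not the actual Retina Library implementation): given an
// implicit Broadcast[CorticalEngine] in scope, make the unwrapped
// CorticalEngine implicitly available as well by calling .value on it.
implicit def valueOfBroadcastCorticalEngine(implicit bce: Broadcast[CorticalEngine]): CorticalEngine = bce.value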
Example code shown in this section uses implicit argument-passing for CorticalEngine and Spark Broadcast variables of CorticalEngine whenever possible, to minimize clutter.
4.2.1.4. Simple text operations
CorticalEngine and CorticalApi provide basic text operations that are useful when working with text, although they do not constitute Semantic Text Processing by themselves.
The fact that these operations are provided through CorticalEngine and CorticalApi has been identified as needing improvement, because these operations are not inherently tied to a Retina. These features will likely be provided by other means in a future release of Retina Library. |
4.2.1.4.1. Tokenize text
Tokenizing text, in the first instance, means splitting text into words. However, with the information contained in a Retina being available during tokenization, the tokenization algorithm in Retina Library performs several additional functions, as shown in [Tokenize.scala]:
Tokenize.scala: Tokenizing text into CoreTerm objects, including information from the Retina, if available.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.nlp.pos.CorePosTypes.NOUN
import io.cortical.scala.api.CorticalApi.tokenize
object Tokenize extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val ts: Seq[CoreTerm] = tokenize("I LOVE New York: It has 7 flXWys!")
assert(ts.length == 6)
assert(ts(2).getTerm == "new york" && ts(2).getDf > 0 && ts(2).getPosTypes.contains(NOUN))
assert(ts(5).getTerm == "flxwys" && ts(5).getDf.isNaN && null == ts(5).getPosTypes)
s"The text was split into ${ts.length} tokens: ${ts map (_.getTerm) mkString ","}"
}
}
Example [Tokenize.scala] shows that text tokenization in Retina Library
-
returns all tokens as CoreTerm objects,
-
only returns words and not, for instance, punctuation characters or numbers and digits,
-
converts the text of all tokens to lower-case,
-
detects compound terms (if they are in the Retina), i.e. returns each compound term as a single CoreTerm object with the (lower-cased) text of the compound term,
-
also includes nonsensical tokens, or terms simply not present in the Retina, e.g. terms from a foreign language,
-
includes additional information about each token if that token term is found in the Retina, e.g. possible POS types or the DF value of that term in the training corpus that gave rise to the Retina. If the term is not in the Retina then that information is missing.
4.2.1.4.2. Split text into sentences
A piece of text, in the form of a String, can be split into sentences as shown in [SplitIntoSentences.scala]. The algorithm currently used by Retina Library is simple and intended for western scripts. In particular, sentence boundaries are currently identified by full-stops, exclamation marks and question marks, although common abbreviations (like "Dr." in the example) are correctly disregarded as sentence boundaries.
SplitIntoSentences.scala: Splitting a piece of text into sentences.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.splitIntoSentences
object SplitIntoSentences extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val sents: Seq[String] = splitIntoSentences("This is text. It has 3 sentences. Dr. Freud agrees.")
assert(sents.length == 3)
s"The text was split into ${sents.length} sentences: ${sents mkString "\n"}"
}
}
Sentence splitting as currently implemented by Retina Library is intended for simple use-cases where the convenience of being able to do sentence splitting without any external library dependencies trumps the sophistication of the algorithm.
Cortical.io will enhance the sentence splitting algorithm to be more versatile and sophisticated as and when the need arises. But it is not intended as a replacement for sophisticated NLP libraries, which can and should always be used in conjunction with Retina Library when state-of-the-art sentence splitting is required. |
4.2.1.4.3. Slice text
Often, a larger piece of text needs to be, conceptually, split into paragraphs, but the text contains no clues (such as blank lines) as to the beginning and end of individual paragraphs. For instance, all formatting may have been lost, or the text never contained any formatting in the first place. In any case, splitting the text into sentences is not what is requested, as consecutive sentences may cover the same topic and hence be considered part of the same logical paragraph.
Retina Library provides an operation called slicing that first splits text into sentences (as described in section Split text into sentences) and subsequently merges consecutive sentences into slices such that the meaning of sentences within the same slice changes little whereas the meaning between slices changes more. In other words, this algorithm aims to detect what would normally be considered well-formed paragraphs. However, the algorithm does not require clues in the form of formatting to detect boundaries between slices, as it works on the basis of Semantic Text Processing.
Slicing is shown in example [Slice.scala]:
Slice.scala: Slicing text into consecutive stretches of sentences that preserve the same meaning.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.slice
object Slice extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val text =
"""According to Dr. Hawking, after the initial expansion, the
|Universe cooled sufficiently to allow the formation first
|of subatomic particles and later of simple atoms. Giant clouds
|of these primordial elements later coalesced through gravity
|to form stars. Assuming that the prevailing model is correct,
|the age of the Universe is measured to be 13.799±0.021
|billion years. After the initial expansion, the universe cooled
|sufficiently to allow the formation of subatomic particles,
|and later simple atoms.
|The Kingdom of England is usually considered to begin with
|Alfred the Great, King of Wessex. While Alfred was not the
|first king to lay claim to rule all of the English, his rule
|represents the first unbroken line of Kings to rule the whole
|of England, the House of Wessex. The last English monarch
|was Queen Anne, who became Queen of Great Britain when England
|merged with Scotland to form a union in 1707.""".stripMargin
val slices: Seq[String] = slice(text)
assert(slices.length == 2)
val startOfFirstSlice: String = "According to Dr. Hawking"
assert(slices(0) startsWith startOfFirstSlice, s"first slice doesn't start with '$startOfFirstSlice' but '${slices(0)}'")
val startOfSecondSlice: String = "The Kingdom of England"
assert(slices(1) startsWith startOfSecondSlice, s"second slice doesn't start with '$startOfSecondSlice' but '${slices(1)}'")
s"The text was cut into ${slices.length} slices: ${slices mkString "\n\n"}"
}
}
In [Slice.scala] the text clearly separates into two topics, which is detected by the slicing algorithm. Sentences are assigned to the first slice until the topic changes. All subsequent sentences are assigned to the second slice.
The definition of the slicing algorithm is intentionally left vague so that future improvements in the detection of slices and slice boundaries can be incorporated into Retina Library. The intent of slicing will, however, always remain the same, i.e. the assignment of consecutive sentences to semantically homogeneous groups. |
4.2.1.5. Fundamental Semantic Text Processing algorithms
4.2.1.5.1. Get the size of the Retina
The size of a Retina is the number of positions in the Semantic Fingerprints contained in that Retina. Retina Library provides access to the Retina size as shown in [GetRetinaSize.scala]:
GetRetinaSize.scala: Retrieving the size of the Retina through CorticalApi.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getRetinaSize
object GetRetinaSize extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val size: Int = getRetinaSize
s"The Retina supports fingerprints with $size positions."
}
}
4.2.1.5.2. Retrieve terms from the Retina
Terms can be retrieved from the Retina as shown in [GetTerm.scala]:
GetTerm.scala: Retrieving a term from the Retina, returning a CoreTerm object with the lower-cased term String and additional information only if the term is in the Retina.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.nlp.pos.CorePosTypes.NOUN
import io.cortical.scala.api.CorticalApi.getTerm
object GetTerm extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val ts: Seq[CoreTerm] = Seq("LOVE", "New York", "flXWys") map getTerm
assert(ts(0).getTerm == "love" && ts(0).getDf > 0 && ts(0).getPosTypes.contains(NOUN))
assert(ts(1).getTerm == "new york" && ts(1).getDf > 0 && ts(1).getPosTypes.contains(NOUN))
assert(ts(2).getTerm == "flxwys" && ts(2).getDf.isNaN && null == ts(2).getPosTypes)
s"${ts(0).getTerm} and ${ts(1).getTerm} are in the Retina, ${ts(2).getTerm} is not."
}
}
As was the case with the tokenization operation demonstrated in [Tokenize.scala], a CoreTerm object is always returned, regardless of whether the term is actually in the Retina or not. However, the CoreTerm includes additional information about the term if that term is found in the Retina, e.g. possible POS types or the DF value of that term in the training corpus that gave rise to the Retina. If the term is not in the Retina then that information is missing. Also, the returned term is always in all-lower-case, regardless of what case combination was passed in to the function.
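As the assertions in [GetTerm.scala] and [Tokenize.scala] suggest, the absence of that information can be used to test Retina membership. The following small helper is a hedged sketch; the name isInRetina is hypothetical and not part of the Retina Library API.
import io.cortical.model.core.CoreTerm
// Hypothetical helper: a term counts as present in the Retina when the
// Retina-derived information is populated - here the DF value, which is
// NaN for terms the Retina does not contain (see the assertions above).
def isInRetina(t: CoreTerm): Boolean = !t.getDf.isNaN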
4.2.1.5.3. Fingerprint text
One of the core algorithms in Retina Library is calculating the Semantic Fingerprint of a single term or a piece of text. Several variants of this algorithm are provided, as can be seen in example [GetFingerprint.scala]:
GetFingerprint.scala: Calculating the Semantic Fingerprint from Strings and lists of CoreTerms, optionally restricted to certain POS types.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.nlp.pos.CorePosTypes.NOUN
import io.cortical.scala.api.CorticalApi.{getFingerprint, getTerm}
import io.cortical.scala.api.Fingerprint
object GetFingerprint extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val fp1: Fingerprint = getFingerprint("car")
val fp2: Fingerprint = getFingerprint("My car is a bicycle")
val fp3: Fingerprint = getFingerprint("My car is a bicycle", NOUN)
val fp4: Fingerprint = getFingerprint(Seq(getTerm("car"), getTerm("bicycle")))
assert(fp2.length > fp1.length)
assert(fp3.toList == fp4.toList)
s"Number of positions in fingerprints: ${fp1.length}, ${fp2.length}, ${fp3.length}, ${fp4.length}"
}
}
Example [GetFingerprint.scala] shows that:
-
In Retina Library, a Semantic Fingerprint is represented as type Fingerprint, which is just an alias for Array[Int], listing in ascending order the positions in the binary Semantic Fingerprint which are set, while all other positions are un-set (a Semantic Fingerprint is a sparse binary data structure).
-
Semantic Fingerprints can be derived from Strings or lists of CoreTerm objects.
-
If a Semantic Fingerprint is calculated from a String, that String may be a single term or a longer piece of text. Furthermore, a POS type may optionally be specified so that only terms of that POS type are considered when calculating the Semantic Fingerprint.
Future releases of Retina Library may provide a richer abstraction of a Semantic Fingerprint than the current type alias Fingerprint, while keeping the fundamental storage format and runtime representation unchanged as Array[Int]. |
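Because a Fingerprint is a sorted Array[Int] of set positions, set operations on Semantic Fingerprints reduce to operations on sorted integer arrays. The following sketch, which relies only on this documented representation, counts the overlap (the number of shared set positions) of two fingerprints:
import io.cortical.scala.api.Fingerprint
// Counts the positions set in both fingerprints by walking the two
// sorted arrays in lock-step (classic sorted-list intersection).
def overlap(a: Fingerprint, b: Fingerprint): Int = {
var i = 0; var j = 0; var n = 0
while (i < a.length && j < b.length) {
if (a(i) == b(j)) { n += 1; i += 1; j += 1 }
else if (a(i) < b(j)) i += 1
else j += 1
}
n
}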
In general, passing a sequence of CoreTerm objects to the fingerprint calculation algorithm is an indication that the user wants the given terms to be used as-is, with as little manipulation as possible. In contrast, passing a String to fingerprint calculation gives that algorithm more freedom in choosing tokens from that String that it considers optimal for the quality of the resulting Semantic Fingerprint.
4.2.1.5.4. Compare Semantic Fingerprints
The second core algorithm in Retina Library, after mapping text to Semantic Fingerprints, is the measurement of the similarity or, conversely, distance of two Semantic Fingerprints. Similarity (or distance) is a floating-point number: the higher the similarity (the smaller the distance) of two Semantic Fingerprints, the closer the meaning of the two pieces of text that gave rise to these Semantic Fingerprints.
For two Semantic Fingerprints to be meaningfully compared they need not necessarily be derived from the same Retina: it is sufficient if they were calculated using aligned Retinas (see section Load a Retina).
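As a sketch of this, and assuming that the english_subset and german_subset Retinas from the earlier examples are aligned, a fingerprint derived from one Retina may be compared against a fingerprint derived from the other. The object name CrossRetinaCompare and the term pair chosen here are illustrative only.
package example.feature
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.{getComparisons, getCorticalEngine, getFingerprint}
object CrossRetinaCompare {
def main(args: Array[String]): Unit = {
val loader = new FileRetinaLoader("./retinas")
val ceEN = getCorticalEngine(loader, "english_subset")
val ceDE = getCorticalEngine(loader, "german_subset")
// Each fingerprint is calculated with its own Retina, passing the
// CorticalEngine explicitly rather than implicitly.
val fpEN = getFingerprint("car")(ceEN)
val fpDE = getFingerprint("auto")(ceDE)
// If the two Retinas were not aligned, this similarity would be meaningless.
val sim = getComparisons(ceEN).cosineSimilarity(fpEN, fpDE)
println(s"Cross-Retina cosine similarity is $sim")
}
}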
Retina Library used to provide the function named compare for the calculation of the cosine similarity between two Fingerprint objects. Later versions of Retina Library added other Fingerprint comparison algorithms, both similarity measures as well as distance measures. As a result, the compare function is now deprecated in favour of the API demonstrated in example [Compare.scala].
Example [Compare.scala] shows how to compare Semantic Fingerprints using some of the comparison algorithms provided in Retina Library:
Compare.scala: Calculating distance and similarity of two Fingerprints through some of the comparison algorithms provided in Retina Library.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.fingerprint.compare.api.FingerprintComparisons
import io.cortical.scala.api.CorticalApi.{getComparisons, getFingerprint}
object Compare extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val fp1 = getFingerprint("This is a car.")
val fp2 = getFingerprint("My car is a bicycle.")
val fp3 = getFingerprint("My car is a bike.")
val comp: FingerprintComparisons = getComparisons
val s12: Double = comp.cosineSimilarity(fp1, fp2)
val s23: Double = comp.cosineSimilarity(fp2, fp3)
val d12: Double = comp.euclideanDistance(fp1, fp2)
val d23: Double = comp.euclideanDistance(fp2, fp3)
val s13: Double = comp.normalisedOverlapAllSimilarity(fp1, fp3)
assert(0 < s12 && s12 <= 1)
assert(0 < s23 && s23 <= 1)
assert(0 < d12 && d12 <= 1)
assert(0 < d23 && d23 <= 1)
assert(0 < s13 && s13 <= 1)
assert(s12 < s23)
assert(d12 > d23)
s"Cosine similarities are $s12 and $s23 while euclidean distances are $d12 and $d23."
}
}
Example [Compare.scala] shows that
-
most comparison measures, including cosine similarity, euclidean distance and normalised overlap similarity, are always in the interval [0,1],
-
two pieces of text that are more similar in meaning give rise to two Fingerprints that are more similar when compared,
-
if similarity is high then distance is low and vice versa,
-
further distance and similarity measures are available when needed.
4.2.1.5.5. Retrieve similar terms from the Retina
In the context of Semantic Text Processing with Retina Library, similar terms denotes terms contained in a given Retina which have a strong semantic relationship with a given Semantic Fingerprint. That Semantic Fingerprint may have been derived from
-
a single term in the same Retina,
-
a single term in a different - but aligned - Retina,
-
a piece of text using the same or a different, aligned Retina.
If the input Semantic Fingerprint for the retrieval of similar terms derives from a single term, then it is tempting to think of the similar terms as the synonyms of that input term. This is however misguided: similar terms are all terms from the Retina with a strong semantic association (expressed as a high semantic similarity) with the input term - including, but not limited to synonyms of that term.
Retrieving similar terms from the Retina is shown in example [GetSimilarTerms.scala]:
GetSimilarTerms.scala: Retrieving a number of terms similar to a given Fingerprint from the Retina, optionally restricted to certain POS types.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.nlp.pos.CorePosTypes.ADJECTIVE
import io.cortical.scala.api.CorticalApi.{getFingerprint, getSimilarTerms}
object GetSimilarTerms extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val fp = getFingerprint("My car is a bicycle")
val ts: Seq[CoreTerm] = getSimilarTerms(fp, 10)
val as: Seq[CoreTerm] = getSimilarTerms(fp, 10, ADJECTIVE)
assert(ts exists (_.getTerm == "bike"))
assert(as exists (_.getTerm == "four-wheel"))
s"All similar terms: ${ts map (_.getTerm) mkString ","}; adjectives: ${as map (_.getTerm) mkString ","}"
}
}
4.2.1.5.6. Determine the context terms of a Semantic Fingerprint
Retina Library provides an experimental algorithm to determine the context terms of a Semantic Fingerprint and, therefore, any piece of text. The context terms are those terms from the Retina that capture the essential semantic aspects of a given Semantic Fingerprint. This is different from the similar terms for that Semantic Fingerprint, as discussed in section Retrieve similar terms from the Retina: both algorithms start with an input Semantic Fingerprint, and both algorithms return terms contained in a given Retina. But while similar terms are simply the terms whose Semantic Fingerprints have the highest similarity to the input Semantic Fingerprint, context terms are algorithmically selected to best describe the different semantic dimensions of the input Semantic Fingerprint.
The algorithm which determines context terms is under active development and will change without notice in future versions of Retina Library. |
Example [GetContext.scala] demonstrates the simplest possible way of determining the context of a given piece of text. More elaborate ways of doing this, such as by specifying the number of desired context terms, or by starting from a Semantic Fingerprint rather than a piece of text, are available in CorticalApi and, in particular, in CorticalEngine.
GetContext.scala: Determining the context (terms) of a given piece of text using CorticalApi. CorticalEngine defines a more general method that takes a Semantic Fingerprint rather than a piece of text as the input (not shown).
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getContext
object GetContext extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val ctxt: Seq[String] = getContext("Many teams play in the FA Cup.")
assert(ctxt.length >= 2)
assert(ctxt contains "game")
assert(ctxt contains "club")
s"The context of that sentence is defined by the terms ${ctxt mkString ","}."
}
}
4.2.1.6. Higher-level Semantic Text Processing algorithms
The algorithms discussed in section Fundamental Semantic Text Processing algorithms are the algorithmic core of Semantic Text Processing in Retina Library and form the basis for higher-level functionality. Some of that functionality is implemented in the form of additional algorithms in CorticalEngine and CorticalApi and will be discussed in this section. Other higher-level functionality goes beyond simple algorithms and is the topic of later sections (Semantic Text Classification using SemanticClassifier and Semantic Search using SemanticSearcher).
4.2.1.6.1. Create category filters
A category filter in the terminology of Retina Library is a Semantic Fingerprint that combines and subsumes several input Semantic Fingerprints. The term stems from one typical application of category filters, namely the representation of a single category of texts, such that pieces of text that fall into that category can be filtered out from a larger set by matching against the category filter. This is, however, just one application of category filters. Furthermore, the facilities discussed in section Semantic Text Classification using SemanticClassifier provide a more rigorous and flexible approach to categorizing pieces of text.
CreateCategoryFilter.scala: Creating a category filter from a list of Fingerprints and using that to distinguish between text that belongs and doesn’t belong into that category.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.{compare, createCategoryFilter, getFingerprint}
import io.cortical.scala.api.Fingerprint
object CreateCategoryFilter extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val fps = Seq("A text about cars", "Another car text", "Cars are us!", "Cars will be cars.") map (getFingerprint(_))
val cft: Fingerprint = createCategoryFilter(fps)
val pos = getFingerprint("Let me boast about my car.")
val neg = getFingerprint("I'm only interested in bicycles.")
val simPos = compare(pos, cft)
val simNeg = compare(neg, cft)
assert(simPos > simNeg)
s"Similarities of positive and negative cases are $simPos and $simNeg, respectively."
}
}
The createCategoryFilter function also supports the passing of a "noise fingerprint", which represents a Semantic Fingerprint signal that should be disregarded when calculating the category fingerprint. This concept is experimental and should be used with caution. It may also be removed in future versions of Retina Library. |
4.2.1.6.2. Extract keywords from text
In Retina Library, keywords are tokens (terms) selected from a piece of text that are semantically similar to the entire piece of text. When extracting keywords from text, the user must decide how many keywords shall be returned. The algorithm then selects that number from the tokens of that text.
Example [ExtractKeywords.scala] shows the extraction of keywords from a paragraph of text:
ExtractKeywords.scala: Extracting a given number of keywords that are semantically representative of the given piece of text.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.model.core.CoreTerm
import io.cortical.scala.api.CorticalApi.extractKeywords
object ExtractKeywords extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val text =
"""According to Dr. Hawking, after the initial expansion, the
|Universe cooled sufficiently to allow the formation first
|of subatomic particles and later of simple atoms. Giant clouds
|of these primordial elements later coalesced through gravity
|to form stars. Assuming that the prevailing model is correct,
|the age of the Universe is measured to be 13.799±0.021
|billion years.""".stripMargin
val kws: Seq[CoreTerm] = extractKeywords(text, 5)
assert(kws exists (_.getTerm == "gravity"))
assert(kws forall (_.getDf >= 0))
s"Extracted keywords ${kws map (_.getTerm) mkString ","}"
}
}
Keywords are always contained in the Retina used by the algorithm, because by definition they must have a Semantic Fingerprint. Hence the CoreTerm objects returned by the keyword extraction algorithm always contain additional information about the term, such as its DF in the Retina training corpus, or its possible POS tags.
4.2.2. Semantic Text Classification using SemanticClassifier
Semantic Text Classification is a machine learning feature of Retina Library which is formalised in trait SemanticClassifier: Given a previously unseen piece of text, a SemanticClassifier is able to assign a label to that text, where the label uniquely identifies the class. A SemanticClassifier hence classifies text into one of a number of classes. For this to work, the SemanticClassifier must previously have been trained on a training set of pairs of text and labels, i.e. examples of pieces of text that have been (by definition) correctly assigned to one class each.
SemanticClassifier is generic in the type of label it uses to identify classes. The type alias StringLabelSemanticClassifier uses Strings as class labels.
Version 2.5.0 of Retina Library ships with just one implementation of SemanticClassifier, which is in fact an implementation of StringLabelSemanticClassifier: SemanticTextClassifier.
A general design decision in Retina Library that applies to SemanticClassifier (as well as to SemanticSearcher discussed in section Semantic Search using SemanticSearcher) is that Scala companion objects to classes that implement a Retina Library feature trait like SemanticClassifier (or SemanticSearcher) have Scala-idiomatic apply factory methods that are declared with a return type of that trait rather than of the implementation class. Concretely, the apply factory method in the companion object to SemanticTextClassifier is declared to return a StringLabelSemanticClassifier instead of a SemanticTextClassifier. Irrespective of the declared type, the runtime type of the object returned by that factory method is of course SemanticTextClassifier.
All implementation classes (but not their companion objects) of the SemanticClassifier trait are to be treated as private. For reasons of backwards compatibility, this is currently not the case for SemanticTextClassifier, but a future version of Retina Library will reduce the visibility of that class to private. |
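The following is a conceptual sketch of this factory convention only; the trait Classifier, the class ClassifierImpl and its placeholder logic are hypothetical and do not appear in Retina Library.
package example.sketch
// The feature trait forms the public API ...
trait Classifier { def classify(text: String): String }
// ... the implementation class is an internal detail ...
private class ClassifierImpl(examples: Seq[(String, String)]) extends Classifier {
def classify(text: String): String = examples.head._1 // placeholder logic only
}
// ... and the companion's apply factory is declared to return the trait.
object ClassifierImpl {
def apply(examples: Seq[(String, String)]): Classifier = new ClassifierImpl(examples)
}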
Example SemanticTextClassification.scala shows correct and idiomatic usage of Semantic Text Classification with Retina Library.
SemanticTextClassification.scala: Training a SemanticTextClassifier, a concrete implementation of SemanticClassifier that uses String labels to identify classes, and using it to predict the classes of two previously unseen pieces of text.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.{SemanticTextClassifier, StringLabelSemanticClassifier}
object SemanticTextClassification extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val cars = "cars"
val bikes = "bikes"
val trainSet: Seq[(String, String)] = Seq(
cars -> "A text about cars", cars -> "Another car text",
cars -> "Cars are us!", cars -> "Cars will be cars.",
bikes -> "Text about bicycles", bikes -> "Another bike text",
bikes -> "Bikes are us!", bikes -> "Bicycles will be bikes.")
val stc: StringLabelSemanticClassifier = SemanticTextClassifier(trainSet)
val l1: String = stc.classify("Let me boast about my four-wheel drive.")
val (l2: String, _, _, conf2: Double) = stc.classifyWithDetail("I'm only interested in bicycles.")
assert(l1 == cars)
assert(l2 == bikes)
assert(0 < conf2 && conf2 <= 1)
s"The unseen texts were classified as $l1 and $l2, the latter with confidence $conf2."
}
}
As SemanticTextClassification.scala shows, training is done on a set of pre-labelled pieces of text, after which an (immutable) StringLabelSemanticClassifier instance is created using the factory method for SemanticTextClassifier. The SemanticClassifier supplies two main methods for classification given a (typically previously unseen) piece of text:
-
one that just returns the label of the class to which this text is predicted to belong, and
-
one that also returns various metrics about the quality of the prediction to that class. The most important of these metrics is the last one, which is a confidence score from the interval [0,1].
4.2.3. Semantic Search using SemanticSearcher
Semantic Search is an important aspect of Semantic Text Processing. It means searching a set of texts by some query text to find those texts that are most similar in meaning to the query text. Thus Semantic Search is different from keyword-based search approaches because the exact words in all pieces of text involved in the Semantic Search operation matter only insofar as they convey meaning: in Semantic Search, in contrast to keyword-based search, the words themselves, as strings of characters, are not matched between query and the set of texts searched over.
Typically, the set of texts to be searched is seen as a "database" of text "documents", although both the terms "database" and "document" are used informally here. In particular, "database" does not imply persistence, ACID transactions or SQL-like capabilities, but rather simply a collection (set) of pieces of text available for Semantic Search. Similarly, "document" does not imply structure, formatting or file-formats typically associated with the word "document", but rather simply an identifiable piece of text.
Unsurprisingly, in the world of Retina Library, Semantic Search is based on comparing the Semantic Fingerprint of a query text to the Semantic Fingerprints of the pieces of text in the document database.
Cortical.io acknowledges that this particular usage of the term "document" is confusing. In future versions of Retina Library the term "document" could therefore be replaced with a term that carries less pre-conceived associations - such as "(text) snippet". |
There are two important abstractions for Semantic Search in Retina Library:
-
SemanticSearcher captures the notion of a document database that can be semantically searched by some query text,
-
Doc and related traits formalize the various aspects of documents (in the sense introduced above, i.e. snippets of text).
We will discuss both abstractions in the remainder of this section. First, though, we will present the Java foundations to Semantic Search in Retina Library upon which SemanticSearcher builds.
4.2.3.1. The Java DocumentFingerprintDb as the basis for SemanticSearcher
The most basic support for performing Semantic Search in Retina Library is through the Java interface DocumentFingerprintDb and its (currently Java-only) implementation IndexedDocumentFingerprintDb. The interface DocumentFingerprintDb describes a simple mutable database of Semantic Fingerprints and a search operation over those Semantic Fingerprints, where the query is also expressed as a Semantic Fingerprint. The class IndexedDocumentFingerprintDb is a straightforward implementation of that interface using an inverted index data structure that is maintained in its entirety on the JVM heap.
The pieces of text that give rise to the Semantic Fingerprints exchanged with DocumentFingerprintDb occur neither in the interface nor its implementation. In other words, there are no documents in this basic implementation of Semantic Search.
Example [SearchingWithJavaDocumentFingerprintDb.scala] shows the features available in this kind of Semantic Search.
SearchingWithJavaDocumentFingerprintDb.scala: Basic Semantic Search using Java interface DocumentFingerprintDb and its implementation IndexedDocumentFingerprintDb.
package example.feature
import java.util
import io.cortical.document.api.DocumentFingerprintDb
import io.cortical.document.impl.IndexedDocumentFingerprintDb
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.{getFingerprint, getRetinaSize}
object SearchingWithJavaDocumentFingerprintDb extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val db: DocumentFingerprintDb = new IndexedDocumentFingerprintDb(getRetinaSize)
db.addDocument("id1", getFingerprint("Cars are us!"))
db.addDocument("id2", getFingerprint("Cars will be cars."))
db.addDocument("id3", getFingerprint("A text about cars."))
db.addDocument("id4", getFingerprint("Another car text"))
assert(db.containsDocument("id3"))
assert(db.containsDocument("id2"))
db.removeDocument("id2")
assert(!db.containsDocument("id2"))
val ids: util.List[String] = db.search(getFingerprint("Find the text about cars."), 10)
assert(ids.size == 3)
assert(ids.get(0) == "id3")
assert(ids.get(1) == "id4")
s"Searching the fingerprint DB returned the doc IDs $ids"
}
}
Please note that the Java interface DocumentFingerprintDb discussed here is distinct from the DocumentFingerprintDb implementation of the Scala trait SemanticSearcher discussed in section DocumentFingerprintDb.
4.2.3.2. Document abstractions in Retina Library
Retina Library tries to decouple algorithms that work on and with documents (again, using "document" to mean "snippet of text") from the actual classes used to represent those documents. The mechanism by which Retina Library realizes this decoupling is a combination of three factors:
-
The 1st factor is a hierarchy of document traits that capture the attributes a document has/must have. For instance, Doc is the root of this hierarchy and has a document identifier and metadata, TextDoc inherits from Doc and adds a String text attribute, whereas FingerprintedDoc inherits from Doc and adds a Fingerprint attribute.
-
The 2nd factor is a collection of Scala "companion" objects to these document traits with factory apply methods that create instances of concrete but private implementations of these traits. These companion objects and concrete implementations comprise a complete but optional set of document classes that can be used in Retina Library Programs that do not have their own, separate, classes to represent documents.
-
The 3rd factor is the definition of algorithms and abstractions such as SemanticSearcher exclusively in terms of these document traits. This allows any implementation of these document traits to be supplied, including, but not limited to, those returned by the factory methods in the document trait companion objects.
[Documents.scala] shows examples of some of the important document traits and companion objects in Retina Library.
Documents.scala: Important document abstraction traits and their companion objects in Retina Library.
package example.feature
import com.neovisionaries.i18n.LanguageCode.en
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.{compare, getFingerprint}
import io.cortical.scala.api.document.{FingerprintedDoc, FingerprintedTextDoc, ScoredFingerprintedTextDoc, TextDoc}
import io.cortical.scala.api.metadata.Metadata
import io.cortical.scala.api.metadata.Metadata.MetadataWrapper
object Documents extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val td: TextDoc = TextDoc("id1", Metadata(en), "English text")
val fptd: FingerprintedTextDoc = FingerprintedTextDoc(td, getFingerprint(td.text))
assert(fptd.docId == "id1")
assert(fptd.text == td.text)
assert(fptd.metadata.lang == en)
val sfptd: ScoredFingerprintedTextDoc = ScoredFingerprintedTextDoc(fptd, 1.0)
assert(sfptd.docId == "id1")
assert(sfptd.score == 1.0)
val fpd: FingerprintedDoc = new FingerprintedDoc {
override val docId = "id2"
override val fp = getFingerprint("English text")
}
val sim = compare(fptd.fp, fpd.fp)
assert(0.999 <= sim && sim <= 1.0)
s"Created a text doc $td, fingerprinted it to $fptd and scored it to $sfptd, then created a custom fingerprinted doc $fpd"
}
}
All document traits including the word Preserving are considered experimental and may be changed in a future release of Retina Library. In particular, their functionality may be merged into the standard document traits discussed in this section. |
4.2.3.3. SemanticSearcher and its implementations
The Scala trait SemanticSearcher is the main abstraction for performing Semantic Search over a set of texts. The pieces of text to be searched over are represented as document traits - see section Document abstractions in Retina Library.
SemanticSearcher is generic and covariant in the type of result returned from a Semantic Search. Most implementations of SemanticSearcher return document traits, but DocumentFingerprintDb, for instance, just returns the identifiers of the documents (DocIDs) that matched the query.
Retina Library ships with several implementations of SemanticSearcher, which have different runtime characteristics. Also, some SemanticSearchers are immutable, whereas others are backed by a document database that can be manipulated. These distinctions between the capabilities of SemanticSearchers are captured through various sub-traits of SemanticSearcher:
-
The base-trait SemanticSearcher just defines methods for semantically searching the document database.
-
The sub-trait StoringSemanticSearcher of SemanticSearcher adds methods for retrieving the documents and their IDs that form part of the document database. It is generic and covariant in the type of document stored in the document database. Not all SemanticSearchers retain their documents in their original form (they might just store the Fingerprints of the documents), but those that do implement this sub-trait to announce their capability to users.
-
The sub-trait UpdateableSemanticSearcher of SemanticSearcher adds methods to add, remove and update documents in the document database used by the SemanticSearcher, as well as to enquire whether a document is in that database. It is generic and contra-variant in the type of document stored in the document database, which must be a sub-type of FingerprintedDoc. SemanticSearchers that are mutable implement this trait.
-
Finally, the trait FullSemanticSearcher is extended by SemanticSearcher implementations that are both mutable and give access to their documents in their original form. It extends StoringSemanticSearcher and UpdateableSemanticSearcher and is generic and invariant in the type of document stored in the document database, which must be a sub-type of FingerprintedDoc.
The most important concrete implementations of SemanticSearcher are briefly listed in the remainder of this section. As always in Retina Library, the concrete implementations created by the factory methods of their companion objects are private (see section Semantic Text Classification using SemanticClassifier).
All implementation classes (but not their companion objects) of the SemanticSearcher trait and its sub-traits are to be treated as private. For reasons of backwards compatibility, this is currently not the case for all of these classes, but a future version of Retina Library will reduce the visibility of those classes which are not currently private to private. |
4.2.3.3.1. DocumentFingerprintDb
The simplest supported SemanticSearcher implementation is DocumentFingerprintDb. Please note that the DocumentFingerprintDb implementation of the Scala trait SemanticSearcher discussed here is distinct from the Java interface DocumentFingerprintDb discussed in section The Java DocumentFingerprintDb as the basis for SemanticSearcher. It represents an immutable document database that only supports semantic search, doesn’t store documents in their original form, and returns search results as DocIDs. Hence search results contain no score and are ordered by descending fingerprint overlap with the query, which is in general not the same order as descending cosine similarity! The fingerprints of all documents in the database are stored on the heap of the current JVM.
See example [SearchingWithDocumentFingerprintDb.scala]:
SearchingWithDocumentFingerprintDb.scala: Semantic Search using DocumentFingerprintDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.DocumentFingerprintDb
import io.cortical.scala.api.document.{DocID, DocIDSemanticSearcher, FingerprintedDoc}
object SearchingWithDocumentFingerprintDb extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val docs = Seq(
FingerprintedDoc("id1", getFingerprint("Cars are us!")),
FingerprintedDoc("id2", getFingerprint("Cars will be cars.")),
FingerprintedDoc("id3", getFingerprint("A text about cars.")),
FingerprintedDoc("id4", getFingerprint("Another car text")))
val db: DocIDSemanticSearcher = DocumentFingerprintDb(docs)
val ids: Seq[DocID] = db.search(getFingerprint("Find the text about cars."), 10)
assert(ids.size == 4)
assert(ids(0) == "id3")
assert(ids(1) == "id4")
s"Searching the fingerprint DB returned the doc IDs $ids"
}
}
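To see why overlap order can differ from cosine-similarity order, recall that cosine similarity normalises by the number of set positions in both fingerprints, whereas raw overlap does not, so cosine similarity favours documents with fewer set positions relative to overlap. The following self-contained sketch, with hand-made and purely illustrative fingerprints, demonstrates the effect:
import io.cortical.scala.api.Fingerprint
// Hand-made fingerprints; the positions are hypothetical and serve only
// to illustrate the ordering difference.
val query: Fingerprint = (1 to 10).toArray
val docA: Fingerprint = (1 to 6).toArray ++ Array(20, 21, 22, 23) // 10 positions, 6 shared with query
val docB: Fingerprint = (1 to 4).toArray // 4 positions, all 4 shared with query
def overlap(a: Fingerprint, b: Fingerprint): Int = a.intersect(b).length
def cosine(a: Fingerprint, b: Fingerprint): Double = overlap(a, b) / math.sqrt(a.length.toDouble * b.length)
assert(overlap(query, docA) > overlap(query, docB)) // overlap ranks docA first ...
assert(cosine(query, docA) < cosine(query, docB)) // ... while cosine ranks docB first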
4.2.3.3.2. UpdateableDocumentFingerprintDb
Has the same basic features as DocumentFingerprintDb but is mutable and hence implements UpdateableSemanticSearcher. It is therefore also generic in the type of document it allows to be added/updated/removed (even though it doesn’t retain those documents in their original form).
See example [SearchingWithUpdateableDocumentFingerprintDb.scala]:
SearchingWithUpdateableDocumentFingerprintDb.scala: Semantic Search using UpdateableDocumentFingerprintDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.document.{DocID, FingerprintedDoc}
import io.cortical.scala.api.{UpdateableDocumentFingerprintDb, UpdateableSemanticSearcher}
object SearchingWithUpdateableDocumentFingerprintDb extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val docs = Seq(
FingerprintedDoc("id1", getFingerprint("Cars are us!")),
FingerprintedDoc("id2", getFingerprint("Cars will be cars.")),
FingerprintedDoc("id3", getFingerprint("A text about cars.")),
FingerprintedDoc("id4", getFingerprint("Another car text")))
val db: UpdateableSemanticSearcher[FingerprintedDoc, DocID] = UpdateableDocumentFingerprintDb(docs)
assert(db.contains("id3"))
assert(db.contains("id2"))
db.remove("id2")
assert(!db.contains("id2"))
val ids: Seq[DocID] = db.search(getFingerprint("Find the text about cars."), 10)
assert(ids.size == 3)
assert(ids(0) == "id3")
assert(ids(1) == "id4")
s"Searching the fingerprint DB returned the doc IDs $ids"
}
}
4.2.3.3.3. PreservingDocumentDb
A mutable database of documents, their text and fingerprints, preserving the original documents such that the search results can refer to them. Implements FullSemanticSearcher. The score contained in the search results is cosine similarity and search results are ordered by descending cosine similarity. The fingerprints of all documents in the database are stored on the heap of the current JVM.
See example [SearchingWithPreservingDocumentDb.scala]:
SearchingWithPreservingDocumentDb.scala: Semantic Search using PreservingDocumentDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.PreservingDocumentDb
import io.cortical.scala.api.document._
object SearchingWithPreservingDocumentDb extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val tdocs: Seq[TextDoc] = Seq(
TextDoc("id1", "Cars are us!"),
TextDoc("id2", "Cars will be cars."),
TextDoc("id3", "A text about cars."),
TextDoc("id4", "Another car text"))
val docs: Seq[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
val db = PreservingDocumentDb(docs)
assert(db.contains("id3"))
val doc3: Option[FingerprintedTextDoc] = db.get("id3")
assert(doc3.get.docId == "id3")
assert(db.contains("id2"))
db.remove("id2")
assert(!db.contains("id2"))
val results: Seq[ScoredFingerprintedTextDoc] = db.search(getFingerprint("Find the text about cars."), 10)
assert(results.size == 3)
assert(results(0).docId == "id3")
assert(results(1).docId == "id4")
s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
}
}
4.2.3.3.4. ParentDocumentDb
A mutable database of documents, their text and fingerprints, preserving the original documents passed into the constructor - but search results do not refer to them. Crucially, documents have to inherit from ParentDoc, such that they contain child documents, and it is these child documents that are subject to search, while the result of a search is consolidated to be the parent documents.
In other respects this behaves like a PreservingDocumentDb.
The document IDs of child documents must be globally unique - not just unique within their parent document! |
SearchingWithParentDocumentDb.scala: Semantic Search using ParentDocumentDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.{FullSemanticSearcher, ParentDocumentDb}
import io.cortical.scala.api.document._
object SearchingWithParentDocumentDb extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val tdocs: Seq[TextDoc] = Seq(
TextDoc("id1", "Cars are us!"),
TextDoc("id2", "Cars will be cars."),
TextDoc("id3", "A text about cars."),
TextDoc("id4", "Another car text"))
val docs: Seq[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
val db: FullSemanticSearcher[FingerprintedParentTextDoc, PreservingScoredFingerprintedTextDoc[FingerprintedParentTextDoc]] = ParentDocumentDb(docs map (doc => PreservingFingerprintedParentTextDoc(doc, Seq(doc))))
assert(db.contains("id3"))
val doc3: Option[FingerprintedTextDoc] = db.get("id3")
assert(doc3.get.docId == "id3")
assert(db.contains("id2"))
db.remove("id2")
assert(!db.contains("id2"))
val results: Seq[PreservingScoredFingerprintedTextDoc[FingerprintedParentTextDoc]] = db.search(getFingerprint("Find the text about cars."), 10)
assert(results(0).docId == "id3")
assert(results(1).docId == "id4")
s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
}
}
4.2.3.3.5. PartitionedDocumentDb
An immutable document DB that is partitioned over an Apache Spark cluster, with one instance per Apache Spark executor. Currently only implements SemanticSearcher but could also extend StoringSemanticSearcher.
The major advantage of this implementation of SemanticSearcher is that it does not store all documents and their fingerprints in just one JVM but rather distributes (partitions) them over the nodes of an Apache Spark cluster.
It uses Apache Spark StorageLevel MEMORY_AND_DISK, i.e. it allows swapping-out of elements of the partitioned data structure to disk through native Apache Spark mechanisms.
See example [SearchingWithPartitionedDocumentDb.scala]:
SearchingWithPartitionedDocumentDb.scala: Semantic Search over an Apache Spark cluster using PartitionedDocumentDb.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.PartitionedDocumentDb
import io.cortical.scala.api.document._
import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
object SearchingWithPartitionedDocumentDb extends FeatureApp {
override protected def feature()(implicit ce: Broadcast[CorticalEngine]) = {
val tdocs: RDD[TextDoc] = sc.parallelize(Seq(
TextDoc("id1", "Cars are us!"),
TextDoc("id2", "Cars will be cars."),
TextDoc("id3", "A text about cars."),
TextDoc("id4", "Another car text")))
val docs: RDD[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
val db: DocSemanticSearcher = PartitionedDocumentDb(sc, docs)
val results: Seq[ScoredFingerprintedTextDoc] = db.search(getFingerprint("Find the text about cars."), 10)
assert(results.size == 4)
assert(results(0).docId == "id3")
assert(results(1).docId == "id4")
s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
}
}
4.2.3.3.6. PartitionedFileCachingDocumentDb
Similar in spirit to PartitionedDocumentDb but uses Parquet as an on-disk storage format for the partitioned data structure.
SearchingWithPartitionedFileCachingDocumentDb.scala: Semantic Search over an Apache Spark cluster using PartitionedFileCachingDocumentDb.
package example.feature
import java.util.UUID
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.PartitionedFileCachingDocumentDb
import io.cortical.scala.api.document._
import io.cortical.scala.spark.util.valueOfBroadcastCorticalEngine
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
object SearchingWithPartitionedFileCachingDocumentDb extends FeatureApp {
override protected def feature()(implicit ce: Broadcast[CorticalEngine]) = {
val tdocs: RDD[TextDoc] = sc.parallelize(Seq(
TextDoc("id1", "Cars are us!"),
TextDoc("id2", "Cars will be cars."),
TextDoc("id3", "A text about cars."),
TextDoc("id4", "Another car text")))
val docs: RDD[FingerprintedTextDoc] = tdocs map (d => FingerprintedTextDoc(d, getFingerprint(d.text)))
val db: DocSemanticSearcher = PartitionedFileCachingDocumentDb(sqlContext, docs, 2, 4, s"/tmp/${UUID.randomUUID.toString}/")
val results: Seq[ScoredFingerprintedTextDoc] = db.search(getFingerprint("Find the text about cars."), 10)
assert(results.size == 4)
assert(results(0).docId == "id3")
assert(results(1).docId == "id4")
s"Searching the document DB returned the doc IDs ${results map (_.docId) mkString ","}"
}
}
4.2.3.3.7. DiskSerializingSemanticSearchWrapper
DiskSerializingSemanticSearchWrapper is a decorator around UpdateableSemanticSearchers that persists all documents updated in the underlying SemanticSearcher to a directory in the local filesystem. This is shown in example [SearchingWithDiskSerialization.scala]:
SearchingWithDiskSerialization.scala: DiskSerializingSemanticSearchWrapper wraps around an instance of UpdateableSemanticSearcher to add persistence to the local filesystem.
package example.feature
import io.cortical.engine.api.CorticalEngine
import io.cortical.scala.api.CorticalApi.getFingerprint
import io.cortical.scala.api.UpdateableDocumentFingerprintDb
import io.cortical.scala.api.document.persistence.DiskSerializingSemanticSearchWrapper.persisted
import io.cortical.scala.api.document.{DocID, FingerprintedDoc}
object SearchingWithDiskSerialization extends FeatureApp {
override protected def feature()(implicit ce: CorticalEngine) = {
val docs = Seq(
FingerprintedDoc("id1", getFingerprint("Cars are us!")),
FingerprintedDoc("id2", getFingerprint("Cars will be cars.")),
FingerprintedDoc("id3", getFingerprint("A text about cars.")),
FingerprintedDoc("id4", getFingerprint("Another car text")))
val db1 = persisted(UpdateableDocumentFingerprintDb(), "docDB1")
for (d <- docs) db1.add(d)
val ids1: Seq[DocID] = db1.search(getFingerprint("Find the text about cars."), 10)
assert(ids1.size == 4)
assert(ids1(0) == "id3")
assert(ids1(1) == "id4")
val db2 = persisted(UpdateableDocumentFingerprintDb(), "docDB1") // load from disk
val ids2: Seq[DocID] = db2.search(getFingerprint("Find the text about cars."), 10)
assert(ids2.size == ids1.size)
assert(ids2(0) == ids1(0))
assert(ids2(1) == ids1(1))
s"Searching the persisted fingerprint DB returned the doc IDs $ids2 after loading from disk"
}
}
The implementation of DiskSerializingSemanticSearchWrapper shipped with Retina Library uses Oracle Berkeley DB Java Edition, a product that requires separate licensing from Oracle Corporation. Authors of Retina Library Programs who wish to use DiskSerializingSemanticSearchWrapper must ensure they comply with the license terms of Oracle Berkeley DB Java Edition, and must declare an explicit dependency on Oracle Berkeley DB Java Edition in their Maven pom.xml files, as has been shown previously for usage with and without Apache Spark.
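For illustration only, such a dependency declaration could look roughly as follows. The com.sleepycat group ID is the one under which Berkeley DB Java Edition is commonly published to Maven repositories, but the precise coordinates and version below are assumptions and must be verified against your Retina Library distribution and Oracle's documentation:
<!-- Hypothetical dependency declaration for Oracle Berkeley DB Java Edition;
     verify group ID, artifact ID and version before use. -->
<dependency>
    <groupId>com.sleepycat</groupId>
    <artifactId>je</artifactId>
    <version>5.0.73</version>
</dependency>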
4.3. General utilities provided in Retina Library
4.3.1. Language detection
Retina Library includes an implementation of automated language detection, i.e., the analysis of text given in a String to determine the language in which that text is written. Language detection works best for text comprising one or more full sentences; for very short texts it often gives unreliable results.
Language detection is demonstrated in example [DetectLanguage.scala].
DetectLanguage.scala
: Automated language detection in Retina Library
package example.feature

import io.cortical.language.detection.api.LanguageDetection
import io.cortical.model.languages.Languages

object DetectLanguage {
  def main(args: Array[String]): Unit = {
    // Obtain the default language detector
    val detector: LanguageDetection = LanguageDetection.DEFAULT
    val en: Languages = detector.detectLanguage("This is obviously the Queen's language.")
    val es: Languages = detector.detectLanguage("Este es un pais soleado.")
    assert(Languages.EN.equals(en))
    assert(Languages.ES.equals(es))
    println(s"Detected language of text in $en and $es")
  }
}
As can be seen in example [DetectLanguage.scala], language detection does not rely on the Retina, and is therefore not discussed in section Perform Semantic Text Processing with Retina Library.
Appendix A: Base classes used in example code
Retina Library features demonstrated outside of an Apache Spark cluster typically inherit from base class [FeatureApp.scala].
FeatureApp.scala
: Base class for example code demonstrating a given Retina Library feature available outside of Apache Spark
package example.feature

import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
import org.slf4j.LoggerFactory

abstract class FeatureApp {
  protected val LOG = LoggerFactory.getLogger(getClass)
  private val rdir = "./retinas"
  private val rname = "english_subset"

  def main(args: Array[String]): Unit = {
    // Load the Retina from the local filesystem and make the engine implicitly available
    implicit val ce = getCorticalEngine(new FileRetinaLoader(rdir), rname)
    val result = feature()
    LOG info result
    System.exit(0)
  }

  // Implemented by each example to exercise one Retina Library feature
  protected def feature()(implicit ce: CorticalEngine): String
}
Retina Library features demonstrated in an Apache Spark runtime environment typically inherit from base class [FeatureApp.scala], which itself inherits from [SparkApp.scala].
FeatureApp.scala
: Base class for example code demonstrating a given Retina Library feature within an Apache Spark cluster
package example.feature

import io.cortical.engine.api.CorticalEngine
import io.cortical.retina.source.FileRetinaLoader
import io.cortical.scala.api.CorticalApi.getCorticalEngine
import org.apache.spark.broadcast.Broadcast
import org.slf4j.LoggerFactory

abstract class FeatureApp extends SparkApp {
  protected val LOG = LoggerFactory.getLogger(getClass)
  private val rdir = "./retinas"
  private val rname = "english_subset"

  override protected def work(): Unit = {
    // Broadcast the CorticalEngine so that Spark executors can use it
    implicit val ce = sc.broadcast(getCorticalEngine(new FileRetinaLoader(rdir), rname))
    val result = feature()
    LOG info result
  }

  // Implemented by each example to exercise one Retina Library feature on the cluster
  protected def feature()(implicit ce: Broadcast[CorticalEngine]): String
}
[SparkApp.scala] is a common base class for code that must execute within an Apache Spark cluster.
SparkApp.scala
: Base class for example code running within an Apache Spark cluster
package example.feature

import io.cortical.scala.spark.util.{sparkContext, withSparkContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

abstract class SparkApp {
  private var scVar: SparkContext = _
  private var sqlContextVar: SQLContext = _

  protected def sc = scVar
  protected def sqlContext = sqlContextVar

  def main(args: Array[String]): Unit = {
    val appName = getClass.getSimpleName
    // Obtain a SparkContext/SQLContext pair, run the work, and clean up afterwards
    withSparkContext(sparkContext(appName)) { (sc, sqlContext) =>
      scVar = sc
      sqlContextVar = sqlContext
      work()
    }
    System.exit(0)
  }

  // Implemented by subclasses with the actual Spark work to perform
  protected def work(): Unit
}
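The helpers sparkContext and withSparkContext imported above ship with Retina Library and are not reproduced in this document. Purely as a rough mental model, assuming Spark 1.6-era APIs, they can be pictured along the following lines; this is a hypothetical sketch, not the actual Retina Library implementation:
// Hypothetical sketch of the two utility helpers used by SparkApp above;
// the real implementations ship with Retina Library and may differ.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkUtilSketch {
  // Presumably constructs a SparkContext configured with the given application name.
  def sparkContext(appName: String): SparkContext =
    new SparkContext(new SparkConf().setAppName(appName))

  // Presumably runs the given body with a matching SQLContext and ensures the
  // SparkContext is stopped afterwards, even if the body throws.
  def withSparkContext(sc: SparkContext)(body: (SparkContext, SQLContext) => Unit): Unit =
    try body(sc, new SQLContext(sc)) finally sc.stop()
}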
Bibliography
- [mvn] Maven Getting Started Guide, https://maven.apache.org/guides/getting-started/
- [scala] Scala Documentation Page, http://www.scala-lang.org/documentation/
- [spark] Spark Overview, https://spark.apache.org/docs/1.6.2/
- [cioarts] Cortical.io Articles, http://www.cortical.io/resources_media.html#articles-Cortical.io