Working With Text

The functionality we offer for text is a little more elaborate than for terms, given the more complex nature of texts. Besides getting a semantic fingerprint (semantic representation) for a given text (the /text endpoint), one can also get a list of keywords extracted from the text, or have the text split up into smaller consecutive chunks based on information content. We also provide functionality for extracting terms from a text based on part-of-speech tags. There is also a bulk endpoint for merging several /text requests into a single HTTP request. Finally, there is a detect_language endpoint capable of detecting more than 50 languages. The endpoints are as follows:

  • /text
  • /text/keywords
  • /text/tokenize
  • /text/slices
  • /text/bulk
  • /text/detect_language

Text input should be in UTF-8 format (and, in the case of the /text/bulk endpoint, it should also be JSON encoded/JSON safe). Optimal results are obtained when the input text is provided in clear natural language form, including punctuation, uppercase and lowercase (minimally 10 to 15 words), but stripped of all artifacts (e.g. XML or HTML tags). For instance, the following text (taken from Wikipedia) can be safely input to all /text endpoints except the /text/bulk endpoint:

He recalled in 2008, "I was changed – permanently changed – by that experience.
   It was like a miracle to me".

For the /text/bulk endpoint, occurrences of the quotation mark need to be escaped in order to be compatible with JSON encoding:

He recalled in 2008, \"I was changed – permanently changed – by that experience.
   It was like a miracle to me\".
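If the bulk request body is built programmatically, a JSON library will perform this escaping automatically. A minimal sketch using Python's standard json module:

import json

raw_text = ('He recalled in 2008, "I was changed – permanently changed – '
            'by that experience. It was like a miracle to me".')

# json.dumps escapes the embedded quotation marks, yielding a JSON-safe string.
print(json.dumps(raw_text))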

All information about the inputs and outputs of the methods can be found directly by going to our interactive API documentation and clicking the Text button.

REST API

The text functions are all implemented as POST calls, with the input document sent as the body of the request (the API key is passed in an HTTP header). For example, using curl, a list of keywords for a text can be retrieved like this (all on one line):

curl -k -X POST -H "api-key: yourApiKey" -H "Content-Type: application/json"
   "http://api.cortical.io/rest/text/keywords?retina_name=en_associative" -d @yourInputText.txt

Internally, input documents are stripped of all formatting in order to extract the individual terms, which are the basic unit of operation. The /text/tokenize endpoint lets the user check what is actually extracted in this process. It returns a list of strings, each representing one sentence of the input text as a comma-separated list of the terms found in that sentence. If you specify valid POS tags in the POStags field, only terms of these POS types are returned. Tokenization thus allows the user to retrieve the individual terms of a text, for example in order to build custom fingerprints via expressions.
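As a sketch (again using requests; the API key is a placeholder), the per-sentence terms can be recovered by splitting each returned string on commas:

import requests

text = 'He recalled in 2008, "I was changed – permanently changed – by that experience. It was like a miracle to me".'

response = requests.post(
    "http://api.cortical.io/rest/text/tokenize",
    params={"retina_name": "en_associative"},
    headers={"api-key": "yourApiKey", "Content-Type": "application/json"},
    data=text.encode("utf-8"),
)
for sentence in response.json():    # one comma-separated string per sentence
    print(sentence.split(","))      # the individual terms of that sentence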

The /text/slices endpoint splits a long text into smaller snippets, based on changes in topic; the number of slices returned is computed dynamically. The response contains a list of Text objects, each holding a text snippet (as well as a Fingerprint object, if the parameter get_fingerprint was set to true). Try it out directly in the interactive API documentation.
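As a sketch along the same lines (the "text" field name on the returned Text objects is an assumption; check the interactive API documentation for the exact schema):

import requests

with open("yourInputText.txt", encoding="utf-8") as f:
    long_text = f.read()

response = requests.post(
    "http://api.cortical.io/rest/text/slices",
    params={"retina_name": "en_associative", "get_fingerprint": "false"},
    headers={"api-key": "yourApiKey", "Content-Type": "application/json"},
    data=long_text.encode("utf-8"),
)
for slice_obj in response.json():
    print(slice_obj["text"])  # one topic-coherent snippet per Text object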

The /text/bulk endpoint lets you pack a number of text-to-fingerprint conversions into a single API call. The input in this case must be a list of JSON-formatted text objects. A list of Fingerprints corresponding to the input texts is returned.
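A minimal sketch of a bulk call, assuming each input text object has the form {"text": "..."} (consult the interactive API documentation for the exact schema):

import json
import requests

texts = ["First input text, at least 10 to 15 words long ...",
         "Second input text, also in clear natural language ..."]
body = json.dumps([{"text": t} for t in texts])  # json.dumps also handles quote escaping

response = requests.post(
    "http://api.cortical.io/rest/text/bulk",
    params={"retina_name": "en_associative"},
    headers={"api-key": "yourApiKey", "Content-Type": "application/json"},
    data=body.encode("utf-8"),
)
fingerprints = response.json()  # one Fingerprint per input text, in the same order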

API Clients

The FullClient object available in the Java, Python, and JavaScript client libraries has the following methods for calling the text endpoints (a usage sketch follows the list):

  • getFingerprintForText
  • getKeywordsForText
  • getTokensForText
  • getSlicesForText
  • getFingerprintsForTexts
  • getLanguageForText
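For instance, with the Python client the calls look roughly like this. This is a sketch assuming the package is cortical.io's retinasdk and that FullClient is constructed with the API key; only the method names above are taken from this document:

import retinasdk

client = retinasdk.FullClient("yourApiKey")

text = "Vienna is the capital and largest city of Austria."
print(client.getKeywordsForText(text))   # /text/keywords
print(client.getTokensForText(text))     # /text/tokenize
print(client.getLanguageForText(text))   # /text/detect_language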

POS Tags

The /text/tokenize endpoint accepts POS tags from a universal POS tag set. As an aid to determining which POS tag(s) you need, the following table shows the mapping from the universal POS tags to the Penn Treebank (GATE-style) POS tag set; an example using the POStags parameter follows the table.

Universal POS tag   Penn Treebank POS tags
JJ                  JJ, JJR, JJS, JJSS
CD                  CD
FW                  FW
CW                  IN, DT, EX, POS, CC, PDT, WDT, WRB, TO, UH, RP, LS
P                   PP, PRP, PRP$, PRPR$, WP, WP$
PUNCT
RB                  RB, RBR, RBS
NN                  NN
NNS                 NNS
NNP                 NNP, NP
NNPS                NNPS, NPS
SYM                 SYM
MD                  MD
VB                  VB, VBD, VBG, VBN, VBP, VBZ
LRB                 LRB
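For example, to restrict tokenization to nouns, the universal tags NN, NNS, NNP and NNPS from the table can be passed in the POStags parameter. A sketch following the same call pattern as above:

import requests

text = "The quick brown fox jumps over the lazy dog."

response = requests.post(
    "http://api.cortical.io/rest/text/tokenize",
    params={"retina_name": "en_associative", "POStags": "NN,NNS,NNP,NNPS"},
    headers={"api-key": "yourApiKey", "Content-Type": "application/json"},
    data=text.encode("utf-8"),
)
print(response.json())  # only terms tagged as nouns are returned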

Languages

The following are the languages which the /text/detect_language endpoint is currently able to detect (an example call follows the list):

  • AF Afrikaans
  • AR Arabic
  • BG Bulgarian
  • BN Bengali
  • CS Czech
  • DA Danish
  • DE German
  • EL Greek
  • EN English
  • ES Spanish
  • FA Persian
  • FI Finnish
  • FR French
  • GU Gujarati
  • HE Hebrew
  • HI Hindi
  • HR Croatian
  • HU Hungarian
  • ID Indonesian
  • IT Italian
  • JA Japanese
  • KN Kannada
  • KO Korean
  • LT Lithuanian
  • LV Latvian
  • MK Macedonian
  • ML Malayalam
  • MR Marathi
  • NE Nepali
  • NL Dutch
  • NO Norwegian (Bokmål)
  • PA Punjabi
  • PL Polish
  • PT Portuguese
  • RO Romanian
  • RU Russian
  • SK Slovak
  • SL Slovene
  • SO Somali
  • SQ Albanian
  • SV Swedish
  • SW Swahili
  • TA Tamil
  • TE Telugu
  • TH Thai
  • TL Tagalog
  • TR Turkish
  • UK Ukrainian
  • UR Urdu
  • VI Vietnamese
  • ZH_CN Simplified Chinese
  • ZH_TW Traditional Chinese
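A detection call follows the same pattern as the other text endpoints. In this sketch no retina_name is passed, on the assumption that language detection is retina-independent; see the interactive API documentation for the exact parameters and the structure of the returned language object:

import requests

text = "Wien ist die Hauptstadt und größte Stadt Österreichs."

response = requests.post(
    "http://api.cortical.io/rest/text/detect_language",
    headers={"api-key": "yourApiKey", "Content-Type": "application/json"},
    data=text.encode("utf-8"),
)
print(response.json())  # expected to identify the input as German (DE)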