NLP

The Infermedica API features custom Natural Language Processing technology, allowing your applications to understand clinical concepts (symptoms and risk factors) mentioned by users as natural language text.

The service is easy to use: you can send the user’s original message and our endpoint will process it and do its best to spot mentions of symptoms or risk factors.

Our technology may help you build language-aware health-related systems such as chatbots or intelligent patient intake forms. Check out our chatbot, Symptomate, for a live example.

This service is currently available for English. We also have experimental support for German; please contact us if you’re interested. More languages are coming soon!

Standard usage

The service is accessible via the /parse endpoint. It returns a list of mentions that have been recognized in the message. Our language technology is also able to spot some negated mentions (as in “I don't have headache”) and to deal with spelling errors (a common problem in chat language).

The endpoint expects a simple JSON object with one required attribute, named text. Here's an example:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "i feel smoach pain but no couoghing today"}'

And the response:

{
  "mentions": [
    {
      "id": "s_13",
      "orth": "stomach pain",
      "choice_id": "present",
      "name": "Abdominal pain",
      "common_name": "Abdominal pain",
      "type": "symptom"
    },
    {
      "id": "s_102",
      "orth": "coughing",
      "choice_id": "absent",
      "name": "Cough",
      "common_name": "Cough",
      "type": "symptom"
    }
  ],
  "obvious": false
}

Each mention is associated with a concept ID (id attribute) and a modality (present or absent, the attribute named choice_id). These attributes are directly compatible with the /diagnosis endpoint.
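As a rough illustration of that compatibility, the sketch below turns a /parse response into a list of evidence items. It assumes that /diagnosis accepts evidence as a list of objects carrying id and choice_id; the function name mentions_to_evidence is made up for this example and is not part of the API.

```python
def mentions_to_evidence(parse_response):
    """Map /parse mentions to a list of {"id", "choice_id"} items.

    Assumption: the /diagnosis endpoint accepts evidence items of this shape.
    """
    evidence = []
    for mention in parse_response.get("mentions", []):
        evidence.append({"id": mention["id"], "choice_id": mention["choice_id"]})
    return evidence


# The response from the example above, trimmed to the relevant attributes.
response = {
    "mentions": [
        {"id": "s_13", "choice_id": "present"},
        {"id": "s_102", "choice_id": "absent"},
    ],
    "obvious": False,
}
evidence = mentions_to_evidence(response)
```

Dropping the name-bearing attributes keeps the evidence list small; they are only needed for display.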

The name attribute contains the main name of a symptom or risk factor, as used by medical professionals. The common_name field bears an equivalent of the main name but written in plain English, making it more suitable for the general audience (e.g. sore throat vs. pharyngeal pain). In many cases both fields have the same value.

The third name-bearing attribute, orth, contains the orthographic form of the mention, that is to say, the words used in the text (after spelling correction).

The text analyzed cannot be longer than 1,024 characters per /parse call. An error message (400) is returned for texts that are too long.
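To avoid the 400 error, a client can split long messages before sending them. The following is a minimal sketch, not an official client: split_for_parse is a hypothetical helper that breaks text on whitespace into chunks of at most 1,024 characters, each of which could be sent in its own /parse call. Note that it collapses runs of whitespace and that a chunk boundary may fall in the middle of a sentence, so a mention spanning the boundary could be missed.

```python
MAX_TEXT_LENGTH = 1024  # limit stated above for a single /parse call


def split_for_parse(text, limit=MAX_TEXT_LENGTH):
    """Split text on whitespace into chunks no longer than `limit` characters."""
    chunks = []
    current = ""
    for word in text.split():
        # Hard-cut any single word that on its own exceeds the limit.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this word would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_text = "word " * 300
chunks = split_for_parse(long_text)
```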

Input text may contain multiple mentions of the same underlying concept (e.g. stomach ache and belly pain). By default, only one mention of each unique concept will be returned. If you need each mention captured, please use the include_tokens option described below.

Obvious matches

The response also features a Boolean flag obvious that will be set to true if the engine considered the whole input text simple to analyze and unambiguous. In other words, the whole response is marked as obvious if it is safe to assume that all the relevant information conveyed in the input text has been successfully and unambiguously mapped onto the mentions returned in the response.

If this field has a value of false, it doesn’t necessarily mean that the output is wrong or empty (although this is also possible). It just means that we recommend confirming the result where possible. For instance, a chatbot may ask the user whether this is what they meant.
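One possible way to act on this flag in a chatbot is sketched below. The function confirmation_prompt and the wording of its questions are illustrative assumptions only; it returns None when the response is marked as obvious and a confirmation question otherwise.

```python
def confirmation_prompt(parse_response):
    """Return a question for the user, or None if no confirmation is needed."""
    if parse_response.get("obvious", False):
        return None
    mentions = parse_response.get("mentions", [])
    if not mentions:
        return "Sorry, I didn't catch any symptoms. Could you rephrase that?"
    parts = []
    for mention in mentions:
        # Prefix negated mentions so the user can verify the modality too.
        prefix = "no " if mention["choice_id"] == "absent" else ""
        parts.append(prefix + mention["common_name"].lower())
    return "Just to confirm, you reported: " + ", ".join(parts) + ". Is that right?"


response = {
    "mentions": [
        {"id": "s_13", "common_name": "Abdominal pain", "choice_id": "present"},
        {"id": "s_102", "common_name": "Cough", "choice_id": "absent"},
    ],
    "obvious": False,
}
prompt = confirmation_prompt(response)
```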

Adjusting service behavior

Spelling correction

By default the service performs spelling correction of the submitted text before attempting to discover mentions there. If you don’t expect the text to contain misspellings, it is better to turn this off. You can achieve this by adding "correct_spelling": false to your input JSON.

Why does it matter? In some rare cases using the spelling corrector might result in accidental mismatches. Some uncommon words might be absent from the corrector dictionary, in which case the corrector may replace them with similar known words (false friends), possibly leading to unwanted matches. Note that this is rare, and we’re actively maintaining the correction dictionary to avoid such situations.

Concept types

By default the service will attempt to capture mentions of symptoms and some basic risk factors (such as smoking tobacco or being diagnosed with diabetes). For instance:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "I have diabetes"}'

Response:

{
  "mentions": [
    {
      "id": "p_8",
      "orth": "diabetes",
      "choice_id": "present",
      "name": "Diabetes",
      "common_name": "Diabetes",
      "type": "risk_factor"
    }
  ],
  "obvious": true
}

You can select which concept types should be captured by setting the concept_types attribute. Passing "concept_types": ["symptom"] will make the service ignore risk factor mentions and capture symptoms only. The attribute value defaults to ["symptom", "risk_factor"].
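The options described in this section can be combined in a single request body. The sketch below is one way to assemble it; build_parse_payload is a hypothetical helper, and it only adds an option when it differs from the documented default so that the body stays minimal.

```python
import json


def build_parse_payload(text, correct_spelling=True, concept_types=None):
    """Build the JSON body for a /parse call from the options described above."""
    payload = {"text": text}
    if not correct_spelling:
        # Spelling correction is on by default, so only send the flag to disable it.
        payload["correct_spelling"] = False
    if concept_types is not None:
        payload["concept_types"] = list(concept_types)
    return payload


# Symptoms only, with spelling correction turned off.
body = json.dumps(
    build_parse_payload(
        "I have diabetes", correct_spelling=False, concept_types=["symptom"]
    )
)
```

The resulting string can be passed as the request body in place of the -d argument shown in the curl examples.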

Contextual clues

When building a chatbot, you may see some users trying to refer to previously mentioned symptoms to provide more details. For instance, a user may first report having headaches, followed by the message “they are severe”. In some situations our engine is able to derive useful insight from the context of previously gathered symptoms and you can help it by passing a list of present observations gathered so far in this interview. For instance:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "it gets worse with exercise", "context": ["s_50"]}'

Response:

{
  "mentions": [
    {
      "id": "s_35",
      "name": "Chest pain, during exertion",
      "common_name": "Chest pain on exertion",
      "orth": "with exercise",
      "type": "symptom",
      "choice_id": "present"
    }
  ],
  "obvious": false
}

This example assumes that the user has previously reported chest pain (the endpoint has returned a mention of a symptom with id s_50 and choice_id present) and that the history of captured observations is now passed in the context field.

Please note that properly understanding context is a hard task for machines, and this feature will not always recover the intended meaning. It is therefore safer to encourage users to keep their descriptions as simple and self-contained as possible, while still being prepared to handle messages that rely on context.
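A chatbot needs to carry the context from one message to the next. The sketch below shows one way to do that, assuming the application keeps a plain list of concept IDs per interview; update_context is a hypothetical helper that appends the IDs of present mentions and skips absent ones and duplicates.

```python
def update_context(context, parse_response):
    """Return a new context list extended with ids of present mentions."""
    updated = list(context)
    for mention in parse_response.get("mentions", []):
        if mention["choice_id"] == "present" and mention["id"] not in updated:
            updated.append(mention["id"])
    return updated


# First message: the user reports chest pain.
context = update_context([], {"mentions": [{"id": "s_50", "choice_id": "present"}]})

# Second message: the gathered context is sent along with the new text.
payload = {"text": "it gets worse with exercise", "context": context}

# After the second response, the context grows again.
context = update_context(context, {"mentions": [{"id": "s_35", "choice_id": "present"}]})
```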

Obtaining detailed output (advanced)

Optionally you can pass include_tokens: true to obtain additional information on tokenization in the output. This may be helpful if you plan to perform additional stages of text processing; it makes it easier to align the output of our service with the output of other NLP tools you might wish to employ. For instance:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "I often feel sad.", "include_tokens": true}' 

will yield the following structure:

{
  "mentions": [
    {
      "id": "s_169",
      "positions": [0, 2, 3],
      "head_position": 2,
      "name": "Depressed mood",
      "common_name": "Depressed mood",
      "choice_id": "present",
      "orth": "I feel sad",
      "type": "symptom"
    }
  ],
  "tokens": ["I", "often", "feel", "sad", "."],
  "obvious": false
}

The extended structure contains a list named tokens. Tokens are words, numbers, symbols and punctuation captured in the input text. Words in the list are given as orthographic forms (that is, forms encountered in the input text), but after spelling correction.

Also, the representation of each mention is enriched with references to token positions. The positions attribute contains a list of token indices that make up the mention (corresponding to the tokens list, counting from 0). Mentions are not always continuous, as you can see in the above example. The mention’s syntactic head is designated by the head_position attribute. Syntactic heads are tokens that determine the syntactic type of the whole phrase; in other words, if the parse tree underlying the entire mention were to be collapsed into one word, it would be the head.

When operating in the include_tokens: true mode, the service will not limit the number of mentions of the same concept. If the input text mentions the same concept multiple times (possibly using different words), each mention will be captured in the output. For instance, in the input text “lack of energy and feeling tired” the concept of lethargy will be captured twice.
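To make the token references concrete, the sketch below resolves each mention's positions and head_position against the tokens list from the response shown earlier. The function mention_tokens and the keys of the dictionaries it returns are assumptions made for this example.

```python
def mention_tokens(parse_response):
    """Resolve token indices of each mention into the actual tokens."""
    tokens = parse_response["tokens"]
    resolved = []
    for mention in parse_response["mentions"]:
        positions = mention["positions"]
        resolved.append({
            "id": mention["id"],
            "words": [tokens[i] for i in positions],
            "head": tokens[mention["head_position"]],
            # A mention is continuous when its indices form an unbroken run.
            "continuous": positions == list(range(positions[0], positions[-1] + 1)),
        })
    return resolved


response = {
    "mentions": [
        {"id": "s_169", "positions": [0, 2, 3], "head_position": 2},
    ],
    "tokens": ["I", "often", "feel", "sad", "."],
    "obvious": False,
}
resolved = mention_tokens(response)
```

Here the mention skips the token "often", so it is reported as not continuous, and its head resolves to "feel".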

Limitations

The service attempts to capture mentions of symptoms that are present in our knowledge base. If a symptom is not there, its mention will not be recognized. However, if a more general symptom is present, chances are it will be captured instead (for instance, currently there is no separate entry for “rash on legs” in our knowledge base, but the service will understand “rash” if this phrase is sent).

Also note that due to the ambiguity of natural languages and the endless spectrum of possible language expressions that may be used to convey an idea, we cannot guarantee that the recognition will be 100% accurate. Nevertheless, we believe it is already performing well, and it is continually being improved.