NLP

The Infermedica API features custom Natural Language Processing technology, allowing your applications to understand clinical concepts (symptoms and risk factors) mentioned by users as natural language text.

The service is easy to use: you can send the user’s original message and our endpoint will process it and do its best to spot mentions of symptoms or risk factors.

Our technology may help you build language-aware health-related systems such as chatbots or intelligent patient intake forms. Check out our chatbot, Symptomate, for a live example.

This service is currently available for English. We also have experimental support for German; please contact us if you’re interested. More languages are coming soon!

Standard usage

The service is accessible via the /parse endpoint. It returns a list of mentions that have been recognized in the message. Our language technology is also able to spot some negated mentions (as in “I don't have headache”) and to deal with spelling errors (a common problem in chat language).

The endpoint expects a simple JSON object with one required attribute, named text. Here's an example:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "i feel smoach pain but no couoghing today"}'

And the response:

{
  "mentions": [
    {
      "id": "s_13",
      "orth": "stomach pain",
      "choice_id": "present",
      "name": "Abdominal pain",
      "common_name": "Abdominal pain",
      "type": "symptom"
    },
    {
      "id": "s_102",
      "orth": "coughing",
      "choice_id": "absent",
      "name": "Cough",
      "common_name": "Cough",
      "type": "symptom"
    }
  ]
}

Each mention is associated with a concept ID (id attribute) and a modality (present or absent, the attribute named choice_id). These attributes are directly compatible with the /diagnosis endpoint.
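As a rough illustration of that compatibility, the sketch below reduces a /parse response to a list of id and choice_id pairs. The function name mentions_to_evidence and the idea of collecting the pairs into a list called evidence are assumptions made for this example; the exact request body expected by /diagnosis is not described in this section.

```python
import json


def mentions_to_evidence(parse_response):
    """Keep only the attributes that identify a concept and its modality."""
    return [
        {"id": mention["id"], "choice_id": mention["choice_id"]}
        for mention in parse_response.get("mentions", [])
    ]


# Response copied from the example above (shortened to two mentions).
parse_response = {
    "mentions": [
        {"id": "s_13", "orth": "stomach pain", "choice_id": "present",
         "name": "Abdominal pain", "common_name": "Abdominal pain",
         "type": "symptom"},
        {"id": "s_102", "orth": "coughing", "choice_id": "absent",
         "name": "Cough", "common_name": "Cough", "type": "symptom"},
    ]
}

# Hypothetical variable name; how this list is embedded in a /diagnosis
# request is an assumption and should be checked against that endpoint's docs.
evidence = mentions_to_evidence(parse_response)
print(json.dumps(evidence, indent=2))
```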

The name attribute contains the main name of a symptom or risk factor, as used by medical professionals. The common_name field bears an equivalent of the main name but written in plain English, making it more suitable for the general audience (e.g. sore throat vs. pharyngeal pain). In many cases both fields have the same value.

The third name-bearing attribute, orth, contains the orthographic form of the mention, that is to say, the words used in the text (after spelling correction).

The text analyzed cannot be longer than 1,024 characters per /parse call. An error message (400) is returned for texts that are too long.
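One possible way to stay under this limit on the client side is to split long input before calling /parse. The helper below is only a sketch: split_for_parse is a hypothetical name, the sentence-boundary heuristic is our own choice, and it assumes the limit is counted the same way as Python's len() counts characters.

```python
import re

MAX_PARSE_CHARS = 1024  # limit stated in the text above


def split_for_parse(text, limit=MAX_PARSE_CHARS):
    """Split text into chunks of at most `limit` characters.

    Chunks are built from whole sentences where possible; a single
    sentence longer than `limit` is cut into limit-sized pieces.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in one chunk by itself.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_text = "a" * 1000 + ". " + "b" * 1000 + "."
for chunk in split_for_parse(long_text):
    print(len(chunk))
```

Splitting on sentence boundaries keeps a negation and the symptom it refers to in the same chunk in most cases; a hard cut in the middle of a sentence could separate them, so results for very long sentences should be treated with care.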

Input text may contain multiple mentions of the same underlying concept (e.g. stomach ache and belly pain). By default, only one mention of each unique concept will be returned. If you need each mention captured, please use the include_tokens option described below.

Adjusting service behavior

Spelling correction

By default, the service performs spelling correction on the submitted text before attempting to discover mentions in it. If you don’t expect the text to contain misspellings, it is better to turn this off. You can achieve this by adding "correct_spelling": false to your input JSON.

Why does it matter? In rare cases the spelling corrector may cause accidental mismatches. Uncommon words that are absent from the corrector’s dictionary may be changed into similar known words (false friends), possibly leading to unwanted matches. This is rare, and we’re actively maintaining the correction dictionary to avoid such situations.

Concept types

By default the service will attempt to capture mentions of symptoms and some basic risk factors (such as smoking tobacco or being diagnosed with diabetes). For instance:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "I have diabetes"}'

Response:

{
  "mentions": [
    {
      "id": "p_8",
      "orth": "diabetes",
      "choice_id": "present",
      "name": "Diabetes",
      "common_name": "Diabetes",
      "type": "risk_factor"
    }
  ]
}

You can select which concept types should be captured by setting the concept_types attribute. Passing "concept_types": ["symptom"] will make the service ignore risk factor mentions and capture symptoms only. The attribute value defaults to ["symptom", "risk_factor"].
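For readers calling the endpoint from code rather than curl, the following sketch assembles the same kind of request with Python's standard library. The URL, headers and the text, correct_spelling and concept_types attributes come from the examples above; the function name build_parse_request and its keyword arguments are hypothetical. Optional attributes are left out of the JSON body unless set, so the service defaults apply.

```python
import json
import urllib.request

PARSE_URL = "https://api.infermedica.com/v2/parse"


def build_parse_request(text, app_id, app_key,
                        correct_spelling=None, concept_types=None):
    """Build (but do not send) a POST request for the /parse endpoint."""
    payload = {"text": text}
    # Only include optional attributes that the caller set explicitly.
    if correct_spelling is not None:
        payload["correct_spelling"] = correct_spelling
    if concept_types is not None:
        payload["concept_types"] = list(concept_types)
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "App-Id": app_id,
            "App-Key": app_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_parse_request(
    "I have diabetes",
    app_id="XXXXXXXX",
    app_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    correct_spelling=False,
    concept_types=["symptom"],
)
print(request.data.decode("utf-8"))
# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```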

Obtaining detailed output (advanced)

Optionally, you can pass include_tokens: true to obtain additional information on tokenization in the output. This may be helpful if you plan to perform additional stages of text processing, as it makes it easier to align the output of our service with the output of other NLP tools you might wish to employ. For instance:

curl "https://api.infermedica.com/v2/parse" \
  -X "POST" \
  -H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
  -H "Content-Type: application/json" \
  -d '{"text": "I often feel sad.", "include_tokens": true}' 

will yield the following structure:

{
  "mentions": [
    {
      "id": "s_169",
      "positions": [0, 2, 3],
      "head_position": 2,
      "name": "Depressed mood",
      "common_name": "Depressed mood",
      "choice_id": "present",
      "orth": "I feel sad",
      "type": "symptom"
    }
  ],
  "tokens": ["I", "often", "feel", "sad", "."]
}

The extended structure contains a list named tokens. Tokens are words, numbers, symbols and punctuation captured in the input text. Words in the list are given as orthographic forms (that is, forms encountered in the input text), but after spelling correction.

Also, the representation of each mention is enriched with references to token positions. The positions attribute contains a list of token indices that make up the mention (corresponding to the tokens list, counting from 0). Mentions are not always continuous, as you can see in the above example. The mention’s syntactic head is designated by the head_position attribute. Syntactic heads are tokens that determine the syntactic type of the whole phrase; in other words, if the parse tree underlying the entire mention were to be collapsed into one word, it would be the head.
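To make the relationship between mentions and tokens concrete, here is a small sketch that resolves the token indices of a mention back to words, using the response from the example above. The helper names mention_tokens and is_contiguous are made up for this illustration.

```python
def mention_tokens(response, mention):
    """Return the words of a mention and its head word."""
    tokens = response["tokens"]
    words = [tokens[i] for i in mention["positions"]]
    head = tokens[mention["head_position"]]
    return words, head


def is_contiguous(mention):
    """True if the mention covers an unbroken run of tokens."""
    positions = sorted(mention["positions"])
    return positions == list(range(positions[0], positions[-1] + 1))


# Response copied from the include_tokens example above.
response = {
    "mentions": [
        {"id": "s_169", "positions": [0, 2, 3], "head_position": 2,
         "name": "Depressed mood", "common_name": "Depressed mood",
         "choice_id": "present", "orth": "I feel sad", "type": "symptom"},
    ],
    "tokens": ["I", "often", "feel", "sad", "."],
}

for mention in response["mentions"]:
    words, head = mention_tokens(response, mention)
    print(mention["id"], words, "head:", head,
          "contiguous:", is_contiguous(mention))
```

Here the token "often" (index 1) is skipped, which is why the mention is reported as non-contiguous.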

When operating in the include_tokens: true mode, the service will not limit the number of mentions of the same concept. If the input text mentions the same concept multiple times (possibly using different words), each mention will be captured in the output. For instance, in the input text “lack of energy and feeling tired”, the concept of lethargy will be captured twice.
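If your application needs one entry per concept while still keeping every wording the user chose, repeated mentions can be grouped on the client side. The sketch below is one way to do that; group_by_concept is a hypothetical helper, and the sample mentions, including the id "s_example", are invented for illustration and are not real service output.

```python
from collections import defaultdict


def group_by_concept(mentions):
    """Map each concept ID to the list of wordings (orth) found for it."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention["id"]].append(mention["orth"])
    return dict(grouped)


# Invented sample data: "s_example" is a placeholder, not a real concept ID.
sample_mentions = [
    {"id": "s_example", "orth": "lack of energy", "choice_id": "present"},
    {"id": "s_example", "orth": "feeling tired", "choice_id": "present"},
]

grouped = group_by_concept(sample_mentions)
print(grouped)
```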

Limitations

The service attempts to capture mentions of symptoms that are present in our knowledge base. If a symptom is not there, its mention will not be recognized. However, if a more general symptom is present, chances are it will be captured instead (for instance, currently there is no separate entry for “rash on legs” in our knowledge base, but the service will understand “rash” if this phrase is sent).

Also note that due to the ambiguity of natural languages and the endless spectrum of possible language expressions that may be used to convey an idea, we cannot guarantee that the recognition will be 100% accurate. Nevertheless, we believe it is already performing well, and it is continually being improved.