NLP

The Infermedica API features custom Natural Language Processing technology, allowing your applications to understand clinical concepts (symptoms and risk factors) mentioned by users as natural language text.

The service is easy to use: you can send the user’s original message and our endpoint will process it, attempting to spot mentions of any symptoms or risk factors.

Our technology can help you build language-aware health-related systems such as chatbots or intelligent patient intake forms.

This service is currently available in English. We also have experimental support for German. If interested, please contact us.

Standard usage

This service is accessible via the /parse endpoint. It returns a list of mentions that have been recognized in the message. Our language technology is also able to spot some negated mentions (as in, “I don't have a headache”) and to deal with spelling errors (a common problem in chat language).

The endpoint expects a simple JSON object. The text attribute is obligatory; in API version 3 the user’s age must be specified as well. Here's an example:

curl "https://api.infermedica.com/v3/parse" \
-X "POST" \
-H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
-H "Content-Type: application/json" \
-d '{"text": "i feel smoach pain but no couoghing today", "age": {"value": 30}}'

And the response:

{
  "mentions": [
    {
      "id": "s_13",
      "orth": "stomach pain",
      "choice_id": "present",
      "name": "Abdominal pain",
      "common_name": "Abdominal pain",
      "type": "symptom"
    },
    {
      "id": "s_102",
      "orth": "coughing",
      "choice_id": "absent",
      "name": "Cough",
      "common_name": "Cough",
      "type": "symptom"
    }
  ],
  "obvious": false
}

Each mention is associated with a concept ID (id attribute) and a modality (present or absent, the attribute named choice_id). These attributes are directly compatible with the /diagnosis endpoint.
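As a rough illustration of that compatibility, the sketch below reduces each mention to its id and choice_id pair. The helper name mentions_to_evidence is hypothetical, and the assumption that /diagnosis accepts a list of such pairs as evidence is inferred from the statement above; please check the /diagnosis reference for the exact request shape.

```python
def mentions_to_evidence(parse_response):
    """Reduce /parse mentions to id/choice_id pairs (hypothetical helper)."""
    return [
        {"id": mention["id"], "choice_id": mention["choice_id"]}
        for mention in parse_response.get("mentions", [])
    ]


# Trimmed copy of the example response shown above.
parse_response = {
    "mentions": [
        {"id": "s_13", "orth": "stomach pain", "choice_id": "present"},
        {"id": "s_102", "orth": "coughing", "choice_id": "absent"},
    ],
    "obvious": False,
}

evidence = mentions_to_evidence(parse_response)
print(evidence)
```

Fields such as orth and name are dropped here because only the concept ID and modality are described as shared between the two endpoints.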

The name attribute contains the main name of a symptom or risk factor, as used by medical professionals. The common_name field holds a plain-English equivalent of the main name, which makes it more suitable for a general audience (e.g. sore throat rather than pharyngeal pain). In some cases, both fields have the same value.

The third name-bearing attribute, orth, contains the orthographic form of the mention, that is to say, the words used in the text (after spelling correction).

The text analyzed cannot be longer than 2,048 characters per /parse call. An error message (400) is returned for texts that are too long.

Input text may contain multiple mentions of the same underlying concept (e.g. stomach ache and belly pain). By default, only one mention of each unique concept will be returned. If you need each mention captured, please use the include_tokens option described below.

Sex filter

Specifying the user’s age is obligatory in API version 3. Specifying sex is optional, but highly recommended. This filter considers only those clinical concepts that are relevant for the given biological sex.

To use the filter, include the field sex with the value male or female (as a string). If the key is not provided, no sex filtering will take place. For example:

curl "https://api.infermedica.com/v3/parse" \
-X "POST" \
-H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
-H "Content-Type: application/json" \
-d '{"text": "my period is late", "age": {"value": 28}, "sex": "female"}'
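If you assemble the request body in code, it may help to check the documented constraints before sending anything. The following is a minimal sketch; the function build_parse_payload and the constant names are hypothetical, and measuring the 2,048-character limit with Python's len (which counts Unicode code points) is an assumption about how the service counts characters.

```python
import json

MAX_TEXT_LENGTH = 2048  # per-call limit stated in this document
ALLOWED_SEX = ("male", "female")


def build_parse_payload(text, age, sex=None):
    """Build a /parse request body, checking the documented constraints."""
    # Assumption: the service counts characters the same way len() does.
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text exceeds 2,048 characters")
    payload = {"text": text, "age": {"value": age}}
    if sex is not None:
        if sex not in ALLOWED_SEX:
            raise ValueError("sex must be 'male' or 'female'")
        payload["sex"] = sex
    return payload


payload = build_parse_payload("my period is late", 28, sex="female")
print(json.dumps(payload))
```

Leaving sex as None omits the key entirely, which matches the documented behavior of no sex filtering taking place.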

Obvious matches

The response also features a Boolean flag, obvious, that is set to true if the engine considers the whole input text simple to analyze and unambiguous. In other words, the whole response is marked as obvious if it is safe to assume that all the relevant information conveyed in the input text has been successfully and unambiguously mapped onto the mentions returned in the response.

If this field has a value of false, it doesn’t necessarily mean that the output is wrong or empty (although this is also possible). It does mean that we recommend verifying the result. For instance, a chatbot may double-check with the user whether what it understood is correct.
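One possible way for a chatbot to act on the obvious flag is sketched below. The function confirmation_prompt and the wording of the questions are purely illustrative assumptions and are not part of the API; only the obvious, mentions, choice_id, and common_name fields come from the response format described here.

```python
def confirmation_prompt(parse_response):
    """Return a follow-up question, or None when no confirmation is needed."""
    if parse_response.get("obvious", False):
        return None
    mentions = parse_response.get("mentions", [])
    if not mentions:
        return "Sorry, I did not catch any symptoms. Could you rephrase?"
    parts = []
    for mention in mentions:
        # Prefix negated (absent) mentions so the user can verify them too.
        prefix = "" if mention["choice_id"] == "present" else "no "
        parts.append(prefix + mention["common_name"].lower())
    return "Just to confirm, you reported: " + ", ".join(parts) + ". Is that right?"


parse_response = {
    "mentions": [
        {"id": "s_13", "choice_id": "present", "common_name": "Abdominal pain"},
        {"id": "s_102", "choice_id": "absent", "common_name": "Cough"},
    ],
    "obvious": False,
}

prompt = confirmation_prompt(parse_response)
print(prompt)
```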

Adjusting service behavior

Spelling correction

By default, the service performs spelling correction on the submitted text before attempting to discover any mentions. If you don’t expect the text to contain misspellings, it is better to turn this off. You can do this by adding "correct_spelling": false to your input JSON.

Why does it matter? In some rare cases, using the spelling corrector might result in accidental mismatches. Some uncommon words might be absent from the corrector dictionary, in which case the corrector may change them into similar known words (false friends), possibly leading to unwanted matches. Please note that this is rare and that we’re actively maintaining the correction dictionary to avoid exactly these types of situations.

Concept types

By default the service will attempt to capture any mentions of symptoms and some basic risk factors (such as smoking tobacco or being diagnosed with diabetes). For instance:

curl "https://api.infermedica.com/v3/parse" \
-X "POST" \
-H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
-H "Content-Type: application/json" \
-d '{"text": "I have diabetes", "age": {"value": 30}}'

Response:

{
  "mentions": [
    {
      "id": "p_8",
      "orth": "diabetes",
      "choice_id": "present",
      "name": "Diabetes",
      "common_name": "Diabetes",
      "type": "risk_factor"
    }
  ],
  "obvious": true
}

You can select which concept types should be captured by sending a value of the concept_types attribute. Passing "concept_types": ["symptom"] will make the service ignore risk factor mentions and capture symptoms only. The attribute value defaults to ["symptom", "risk_factor"].
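The two settings described in this section (correct_spelling and concept_types) can be layered onto a request body as in the sketch below. The helper with_options is hypothetical, and restricting concept_types to the two values named in this document is an assumption; extend the set if the API accepts other types.

```python
# Assumption: only the two concept types named in this document are valid.
KNOWN_CONCEPT_TYPES = {"symptom", "risk_factor"}


def with_options(payload, correct_spelling=True, concept_types=None):
    """Return a copy of a /parse payload with optional settings applied."""
    result = dict(payload)
    if not correct_spelling:
        # The service corrects spelling by default, so only send the opt-out.
        result["correct_spelling"] = False
    if concept_types is not None:
        unknown = set(concept_types) - KNOWN_CONCEPT_TYPES
        if unknown:
            raise ValueError(f"unknown concept types: {sorted(unknown)}")
        result["concept_types"] = list(concept_types)
    return result


base = {"text": "I have diabetes", "age": {"value": 30}}
symptoms_only = with_options(base, correct_spelling=False, concept_types=["symptom"])
print(symptoms_only)
```

With this payload, the diabetes mention from the example above would be expected to be ignored, since only symptoms are requested.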

Contextual clues

When building a chatbot, you may see some users trying to refer to previously mentioned symptoms to provide more details. For instance, a user may first report having headaches, followed by the message “they are severe”. In some situations our engine is able to derive useful insight from the context of previously gathered symptoms. This function can be improved by passing a list of present observations gathered so far in the interview. For instance:

curl "https://api.infermedica.com/v3/parse" \
-X "POST" \
-H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
-H "Content-Type: application/json" \
-d '{"text": "it gets worse with exercise", 
  "age": {"value": 30}, 
  "context": ["s_50"]
}'

Response:

{
  "mentions": [
    {
      "id": "s_35",
      "name": "Chest pain, during exertion",
      "common_name": "Chest pain on exertion",
      "orth": "with exercise",
      "type": "symptom",
      "choice_id": "present"
    }
  ],
  "obvious": false
}

This example assumes that the user has previously reported chest pain (the endpoint returned a mention of the symptom with id s_50 and choice_id present) and that this history of captured observations is now passed in the context field.

Please note that proper understanding of context is a hard task for machines and that this feature will not always recover all the intended meanings. It is safer to encourage users to use descriptions that are as simple and self-contained as possible.
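A chatbot could maintain the context list between messages roughly as follows. This is a minimal sketch: the function update_context is hypothetical, and keeping the IDs in first-seen order without duplicates is a design choice of the example, not a documented requirement.

```python
def update_context(context, parse_response):
    """Append IDs of newly captured present mentions to the context list."""
    updated = list(context)
    for mention in parse_response.get("mentions", []):
        # Only present observations are passed as context.
        if mention["choice_id"] == "present" and mention["id"] not in updated:
            updated.append(mention["id"])
    return updated


context = []

# First message: the user reports chest pain.
first = {"mentions": [{"id": "s_50", "choice_id": "present"}]}
context = update_context(context, first)

# Second message: exertion-related chest pain, plus a negated cough.
second = {
    "mentions": [
        {"id": "s_35", "choice_id": "present"},
        {"id": "s_102", "choice_id": "absent"},
    ]
}
context = update_context(context, second)
print(context)
```

The resulting list can be sent as the context field of the next /parse request.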

Obtaining detailed output (advanced)

You can also pass "include_tokens": true to obtain additional information on tokenization in the output. This may help if you plan to perform additional stages of text processing, as it makes it easier to align the output of our service with the output of other NLP tools you might wish to employ. For instance:

curl "https://api.infermedica.com/v3/parse" \
-X "POST" \
-H "App-Id: XXXXXXXX" -H "App-Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
-H "Content-Type: application/json" \
-d '{"text": "I often feel sad.", "age": {"value":30}, "include_tokens": true}' 

will yield the following structure:

{
  "mentions": [
    {
      "id": "s_169",
      "positions": [0, 2, 3],
      "head_position": 2,
      "name": "Depressed mood",
      "common_name": "Depressed mood",
      "choice_id": "present",
      "orth": "I feel sad",
      "type": "symptom"
    }
  ],
  "tokens": ["I", "often", "feel", "sad", "."],
  "obvious": false
}

The extended structure contains a list named tokens. Tokens are words, numbers, symbols, and punctuation captured in the input text. Words in the list are given as orthographic forms (that is, forms encountered in the input text), but after spelling correction.

The representation of each mention is also enriched with references to token positions. The positions attribute contains a list of token indices that make up the mention (corresponding to the tokens list, counting from 0). Mentions are not always continuous, as you can see in the above example. The mention’s syntactic head is designated by the head_position attribute. Syntactic heads are tokens that determine the syntactic type of the whole phrase; in other words, if the parse tree underlying the entire mention were to be collapsed into one word, it would be the head.

When operating in include_tokens: true mode, the service will not limit the number of mentions of the same concept. If the input text mentions the same concept multiple times (possibly using different words), each mention will be captured in output. For instance, if there's both lack of energy and feeling tired in the input text, the concept of lethargy will be captured twice.
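To show how positions and head_position relate to the tokens list, the sketch below resolves each mention back to its words and groups the results by concept ID, so that repeated mentions of one concept end up together. The function mention_words is a hypothetical helper, and the input is a trimmed copy of the example response above.

```python
from collections import defaultdict


def mention_words(parse_response):
    """Map each concept ID to the words and head word of its mentions."""
    tokens = parse_response["tokens"]
    grouped = defaultdict(list)
    for mention in parse_response["mentions"]:
        # positions are 0-based indices into the tokens list.
        words = [tokens[i] for i in mention["positions"]]
        head = tokens[mention["head_position"]]
        grouped[mention["id"]].append({"words": words, "head": head})
    return dict(grouped)


parse_response = {
    "mentions": [
        {"id": "s_169", "positions": [0, 2, 3], "head_position": 2},
    ],
    "tokens": ["I", "often", "feel", "sad", "."],
    "obvious": False,
}

by_concept = mention_words(parse_response)
print(by_concept)
```

Because the token at index 1 ("often") is not part of the mention, the recovered words are non-contiguous in the input, which illustrates why positions is a list of indices and not a start/end range.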

Limitations

This service attempts to capture mentions of symptoms that are present in our knowledge base. If a symptom is not there, its mention will not be recognized. However, if a more general symptom is present, chances are it will be captured instead (for instance, currently there is no separate entry for “bruised leg” in our knowledge base, but the service will understand “bruise” if this phrase is sent).

Note that, due to the ambiguity of natural languages and the endless spectrum of possible language expressions that may be used to convey an idea, we cannot guarantee that the recognition will be 100% accurate. Nevertheless, we believe it is already performing well and it will be continually improved.