
In a recent article in this space, Optum discussed how real-world data (RWD) can be used to provide a deeper, more precise understanding of cancer progression and treatment. Now, the company takes a look at a dynamic new field of research that is critical to bringing that RWD to life: natural language processing (NLP).

Creating a clearer picture

An NLP system enables a computer to read, interpret, and organize important health data that is buried in unstructured free text found in a patient’s medical records (such as provider notes, pathology reports, radiology reports, and treatment summaries). This can be especially useful to oncology researchers for two key reasons:

  • Specific oncology concepts important to understanding cancer progression often are not available in structured formats, particularly the tumor, node, and metastasis (TNM) values, stage information, and biomarkers.
  • Manually reviewing, extracting, and interpreting unstructured data is both labor-intensive and expensive.

By extracting relevant information from an oncology patient’s medical records, an NLP engine can convert complex clinical narratives into actionable data points and provide researchers with key oncology-related insights in an easy-to-use format — all in a fraction of the time that it would take expert staff to complete the same job.

The magic of NLP

Oncology-focused NLP systems can be designed to identify positive occurrences of desired oncology concepts, such as cancer type, TNM, stage, and biomarkers. At the same time, they can exclude mentions that appear in other semantic contexts. For example, if the goal is to identify patients with prostate cancer, an NLP system can distinguish the different contexts in which the term appears and extract only the desired ones into a structured format. Researchers can then easily search the extracted concepts.

Let’s take a look at a few examples of the contexts that occur within medical notes:

Sample text | Concept
“Patient has stage II prostate cancer” | Patient positive for prostate cancer
“Prostate cancer free” | Patient negative for prostate cancer
“If prostate cancer is found, patient may require additional imaging” | Hypothetical prostate cancer situation
“Might be prostate cancer” | Hedged prostate cancer statement
“Prostate cancer is a common cancer among males” | Prostate cancer not relevant to patient
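
To make the five contexts concrete, here is a minimal rule-based sketch in Python. It is an illustration only: the cue patterns, the label names, and the classify_mention function are assumptions made for this example, not part of any Optum system, and a hand-written cue list like this is exactly the kind of term-matching that a trained model is meant to improve on.

```python
import re
from typing import Optional

# Hypothetical cue patterns, checked in priority order. A production system
# would learn these contexts from annotated data instead of hard-coding them.
CUES = [
    ("not_relevant", re.compile(r"\bcommon cancer\b|\bamong (males|females|men|women)\b")),
    ("hypothetical", re.compile(r"\bif\b.*\bis found\b")),
    ("hedged", re.compile(r"\b(might|may|possibly)\b")),
    ("negative", re.compile(r"\b(no|free|without|negative for)\b")),
]


def classify_mention(sentence: str, concept: str = "prostate cancer") -> Optional[str]:
    """Return the context label for a concept mention, or None if it is absent."""
    text = sentence.lower()
    if concept not in text:
        return None
    for label, pattern in CUES:
        if pattern.search(text):
            return label
    # No excluding cue was found, so treat the mention as a positive occurrence.
    return "positive"


samples = [
    "Patient has stage II prostate cancer",
    "Prostate cancer free",
    "If prostate cancer is found, patient may require additional imaging",
    "Might be prostate cancer",
    "Prostate cancer is a common cancer among males",
]
labels = [classify_mention(s) for s in samples]
```

The order of the cue list matters: the hypothetical pattern is checked before the hedged one so that the "may" in the third sentence does not mask the "if ... is found" construction.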

How NLP works: The information extraction process

An NLP system can be designed to process medical records data and extract relevant entities in the text — and the relationships between them — using three approaches:

  1. Entity extraction: The extraction of a concept or entity represented by lexical units or phrases in the free text
  2. Relation extraction: The extraction of the relationships between entities
  3. Frame extraction: The extraction of the logical semantic group of lexical units and the collection of any relevant relations

Example A shows entity, relation, and frame extraction. Individual entities are tagged, or labeled, and linked to one another via relations: each relation links one tag to another. In this example, the “cancer” tag links to the “direction” and “stage_tnm” tags. Frame extraction groups the relations that originate from the same parent concept into a structure that is more easily consumed as table-like data. The frame is a logical set of semantic units, and Example A shows the frame for the cancer stage context extracted into table format.

Example A. Entity, relation, frame-tagging and extraction
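
As a rough companion to the description above, the sketch below represents entities, relations, and frames as plain Python data structures. The tag names "cancer", "direction", and "stage_tnm" come from the text; the sample note, the class names, and the build_frames helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    tag: str    # label assigned to the span, e.g. "cancer"
    text: str   # the lexical unit found in the note
    start: int  # character offsets of the span in the note
    end: int


@dataclass(frozen=True)
class Relation:
    parent: Entity  # the concept the relation originates from
    child: Entity   # the attribute linked to that concept


def build_frames(relations):
    """Group relations that share a parent entity into one table-like row."""
    frames = {}
    for rel in relations:
        row = frames.setdefault(rel.parent, {rel.parent.tag: rel.parent.text})
        row[rel.child.tag] = rel.child.text
    return list(frames.values())


note = "left breast cancer, stage II"  # made-up sample text


def tag(label, text):
    start = note.index(text)
    return Entity(label, text, start, start + len(text))


cancer = tag("cancer", "breast cancer")
direction = tag("direction", "left")
stage = tag("stage_tnm", "stage II")

relations = [Relation(cancer, direction), Relation(cancer, stage)]
frames = build_frames(relations)
# frames == [{'cancer': 'breast cancer', 'direction': 'left', 'stage_tnm': 'stage II'}]
```

Each frame is one row whose columns are the tags, which is what makes the output easy to load into a table and search.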

Building a better NLP system

Although you’re probably not going to design your own NLP system, you may want to engage someone to customize one for you. That’s why it’s helpful to know what’s necessary to build an effective and efficient system.

Modeling approach

The most robust oncology NLP systems leverage the best practices in data science and automation. They go beyond term-matching and rules-based approaches by incorporating machine learning and deep learning to ensure the correct identification of the desired oncology context.

The advantage of using this approach is that these “supervised” machine-learning models can be trained to identify broader patterns that are not explicitly and manually created by a human. Instead, the machine learns from a sample of labeled data that enables it to generalize to relevant contexts. This allows the machine to accurately identify the appropriate contexts in an automated fashion over highly variable text.

It’s important that the model is evaluated against a held-out annotated test data set that the model has not seen before. This test helps ensure that the model is not overfitting to the training data and that it will remain reliably accurate with new data.
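
A minimal sketch of that check, assuming each mention carries one gold label from the annotators and one predicted label from the model. The labels below are invented for the example, and precision_recall_f1 is a hypothetical helper rather than a library function.

```python
def precision_recall_f1(gold, predicted, positive="positive"):
    """Score predictions for one target label against held-out gold annotations."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented labels for five held-out mentions that the model never saw in training.
gold_test = ["positive", "negative", "positive", "hedged", "positive"]
predicted_test = ["positive", "positive", "positive", "hedged", "negative"]

precision, recall, f1 = precision_recall_f1(gold_test, predicted_test)
```

Running the same function on the training set and comparing the two scores is one simple way to spot overfitting: a model that scores far higher on training data than on the held-out set has likely memorized its examples.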

Annotation design and data development

In order to create a high-performing NLP model, it’s essential to:

  • Thoughtfully design an annotation specification document that instructs annotators on how to accurately label or tag the data, so that the models receive the information they need to learn how to perform the extraction task.
  • Select a representative sample of notes to annotate.

It’s also vital that the annotation design and sampling methodology are systematically developed by data scientists who specialize in clinical NLP, working in close consultation with clinical experts (oncologists, oncology clinicians, pharmacists, molecular biologists, medical informaticists, and other physicians). During the annotation design stage, the design team outlines the entities and relations to annotate and extract. This design step should focus on both the clinical context and the generalizability of the concept space to ensure scalability and extensibility of the NLP approach for overall data enrichment.

Additional best practices employed during the design of an effective annotation process include:

  • Annotation guides are iteratively improved over time.
  • Changes are tracked and reviewed in version control to ensure consistency and reliability of the process.
  • A diverse team of clinical and data science subject-matter experts carefully and iteratively reviews the annotation design, both for clinical content and for data science design structure.
  • Once the annotation design and the random sampling methodology are refined, a random sample of data is drawn, and additional refinements are made to the specifications during the annotation process.
  • Every note in the sample is annotated independently by two annotators, and any conflicts are resolved in a third review by a curator. The sample is then divided into training, validation, and test subsets.
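
The double-annotation and splitting steps can be sketched as follows. This is an illustration under stated assumptions: it treats each item as having a single label per annotator, it uses Cohen's kappa as the agreement measure (the text does not name one), and the function names, seed, and 80/10/10 proportions are hypothetical choices.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in all_labels) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


def find_conflicts(labels_a, labels_b):
    """Indices where the two annotators disagree and a curator must decide."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


def split_sample(doc_ids, seed=0, train=0.8, validation=0.1):
    """Shuffle reproducibly, then cut into train, validation, and test subsets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * validation)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


annotator_a = ["positive", "negative", "positive", "positive"]
annotator_b = ["positive", "negative", "negative", "positive"]

kappa = cohen_kappa(annotator_a, annotator_b)
to_adjudicate = find_conflicts(annotator_a, annotator_b)
train_ids, validation_ids, test_ids = split_sample(range(10))
```

Splitting by document ID rather than by individual mention keeps all text from one note in the same subset, so the test set stays genuinely unseen.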

Once the models are finalized, they should be run at scale in a distributed manner on the collection of notes. Extracted entities should be:

  • Normalized in order to reduce the variability of the output and to facilitate analysis
  • Linked to controlled vocabularies and ontologies, whenever possible

A rigorous approach to NLP system design that draws on a variety of techniques yields extraction that is scalable, comprehensive, methodically consistent, and reliable. Overall, combining rules, traditional machine learning, and deep learning techniques can lead to effective and highly accurate results.

Putting NLP-generated knowledge to work

Because medical notes are such a rich source of crucial information, the importance of the clinical NLP field will continue to grow throughout the health care continuum. The astute use of NLP-augmented data can help:

  • Researchers obtain a clearer view of a cancer patient’s end-to-end journey, from pre-cancer diagnosis to post-cancer diagnosis to post-treatment
  • Health care organizations respond more quickly to their internal needs and the needs of other health care stakeholders so they can make a material impact on care quality and cost
  • Pharmaceutical manufacturers design better clinical trials, generate evidence and launch products more quickly
  • Commercial teams identify the right patients for their brand
  • Hospitals enjoy a better return on their electronic medical record and analytic investments by increasing the amount and quality of usable data they have access to

In this way, an NLP system can be used to power the discoveries of life sciences companies and accelerate the development of life-extending and life-enhancing treatments, which can dramatically change the lives of oncology patients.

To see how Optum is bridging the gap in understanding cancer progression, visit optum.com/lifesciences.