Quanticate Blog

Machine Learning in the Pharmaceutical Industry

Written by Clinical Programming Team | Tue, May 21, 2019

This blog explores what Machine Learning (ML) is and its different variations. We will cover the three types of ML and present real-life examples of each from the pharmaceutical industry. We will also cover the SAS Data Mapper Tool, which applies ML algorithms. In addition, we will touch upon the challenges of data science and the regulatory processes for approval of AI/ML products.

What is Data Science?

Before we dive into ML, let's first define data science. Data science is a broad umbrella covering every aspect of data processing, not only the statistical or algorithmic aspects. Data science includes:

  1. Data visualization: It is a general term that describes any effort to help people understand the significance of data by placing it in a visual context.
  2. Data integration: It is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping, and transformation.
  3. Dashboards and BI: A business intelligence dashboard (BI dashboard) is a data visualization tool that displays on a single screen the status of business analytics metrics, key performance indicators (KPIs) and important data points for an organization, department, team or process.
  4. Distributed architecture: Data architecture is composed of the models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and organizations.
  5. Data-driven decisions: It is an approach to business governance that values decisions that can be backed up with verifiable data.
  6. Automation using ML: It represents a fundamental shift in the way organizations of all sizes approach machine learning and data science.
  7. Data engineering: It is the aspect of data science that focuses on practical applications of data collection and analysis.

What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that essentially teaches a computer program or algorithm to automatically learn a task and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves. Programmers need to examine and code accordingly so that a system can independently perform iterative improvements. Most commonly there are three types of ML: Unsupervised Learning, Supervised Learning and Reinforcement Learning.

Typically, the ML process consists of:

  1. Gathering data from various sources
  2. Cleaning data to have homogeneity
  3. Selection of the right ML algorithm and model building
  4. Gaining insights from the model’s results
  5. Transforming results into visual graphs
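As a concrete toy version of the five steps above, the following Python sketch gathers synthetic lab values from two sources, cleans them, builds a trivial model, extracts insights and prints a summary. Every value, name and the threshold "model" here are invented for illustration only:

```python
# Toy walk-through of the five-step ML process; all data and the
# trivial threshold "model" below are invented for illustration.

# Step 1: gather data from various sources
raw_sources = [
    [("hb", 10.2), ("hb", None), ("hb", 14.1)],  # source 1 (has a missing value)
    [("hb", 9.8), ("hb", 13.5)],                 # source 2
]
records = [rec for source in raw_sources for rec in source]

# Step 2: clean data to have homogeneity (drop missing values)
clean = [value for name, value in records if value is not None]

# Step 3: select an "algorithm" and build the model -- here simply a
# threshold placed at the mean of the cleaned values
threshold = sum(clean) / len(clean)
def model(hb):
    return "low" if hb < threshold else "normal"

# Step 4: gain insights from the model's results
insights = {hb: model(hb) for hb in clean}

# Step 5: transform results into a (textual) visual summary
for hb, label in sorted(insights.items()):
    print(f"{hb:5.1f} -> {label}")
```

In a real pipeline each step would of course be far richer (databases, imputation, proper model selection, charting), but the shape of the process is the same.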

Types of Machine Learning:

 

Unsupervised Learning

Unsupervised learning is the opposite of supervised learning in that the algorithm learns by itself and does not have pre-programmed labels. The algorithm has to make sense of the data on its own, and then learns to group, cluster or organize the input data. This type of algorithm detects patterns and restructures the data into something else, which could be a value. It is a useful type of ML in that it provides insights into data that human analysis may miss, or that has not been preassigned in a supervised learning algorithm.
The algorithm works in a similar way to how humans learn: we identify objects or events of the same type or category and determine a degree of similarity between them. It is commonly used in marketing automation; one of the first successful use cases was Amazon suggesting products after analyzing previous purchasing history, followed by Netflix and YouTube suggesting which piece of content to watch next.

An area where this is useful to medicine and medical research is the analysis of research papers. For example, given a large database of all the papers on a given subject, an unsupervised learning algorithm could group different papers in such a way that it was always aware of progress being made in different fields of medicine. If your paper were connected to the network, once you started to write the ML could suggest certain references you may want to cite, or even other papers you may wish to review to help prove your own hypothesis. Think how powerful this type of ML could be in a clinical trial setting, and how important clinical data transparency would become: data shared by other drug companies could enable future drugs to become more successful if it were transparent, in the public domain, and hooked into an unsupervised learning environment. This type of ML has potential not just from a clinical trial perspective but also from a drug discovery perspective, and is being used by companies such as BenevolentAI, which recently formed a partnership with AstraZeneca.

However, this type of ML does not support predictions of future outcomes.

Illustration of Unsupervised Learning:

Spread of Zika Virus

Step 1
Input data is entered for patients suffering from Zika virus from various locations in India.

Step 2
The machine learning algorithm analyses the data and clusters it into coastal region patients and inland region patients.

Step 3
Based on the clustering density, we can identify where the Zika virus has spread the most, and awareness campaigns can be launched in the regions concerned.

This example illustrates that in unsupervised learning only clusters are formed; we cannot use this type to predict a future outcome.
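The three steps above can be sketched in a few lines of Python. The "locations" here are synthetic 1-D distances from the coast (in km, invented values), and the clustering is a hand-rolled two-means loop; no labels are supplied, so the coastal/inland split is discovered by the algorithm itself:

```python
# Toy unsupervised clustering for the Zika illustration: patient
# locations as synthetic 1-D "distance from the coast" values (km).

patients = [2, 5, 8, 3, 410, 395, 420, 6, 405]

# start the two centroids at the minimum and maximum observations
c0, c1 = min(patients), max(patients)
for _ in range(10):  # alternate assignment and centroid update
    coastal = [p for p in patients if abs(p - c0) <= abs(p - c1)]
    inland = [p for p in patients if abs(p - c0) > abs(p - c1)]
    c0 = sum(coastal) / len(coastal)
    c1 = sum(inland) / len(inland)

print("coastal cluster:", sorted(coastal))
print("inland cluster:", sorted(inland))
```

The algorithm was never told which patients are coastal; the dense groupings in the data alone drive the split, which is exactly why clustering reveals spread patterns without predicting future cases.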

Real life example of Unsupervised Learning:

  • We have Test A, which is 95% accurate but 10 times costlier than normal blood tests.
  • The aim: to find alternative lab tests that will help reduce the number of patients going directly for the expensive Test A.
  • Process:
    • Step 1: We will feed the past data (Test A and laboratory results) into the platform.
    • Step 2: We run the algorithms, to form the clusters.
      • The clusters formed will help identify the abnormal results associated with a high chance of a positive Test A. This helps segregate and select the patients with a set of abnormal results (based on the cluster result) who will go for Test A, directly reducing diagnostic cost.

 

Supervised Learning

Supervised learning is the easier type of machine learning algorithm to understand and implement, and it proves to be very popular. It has been likened to a teacher educating a small child with learning cards.

The algorithm learns from example data, where each example is given a numeric value or string label, such as a class or tag. Large amounts of data can be loaded into the algorithm, which will later predict the correct response for new examples based on its historical learning: each original example was given a label, and the algorithm learnt the correct label for that input data.


Supervised learning is known as being task-oriented because it requires multiple simulations to further increase its ability to correctly predict a never-seen example and align it to the correct label. It continually learns from each new task performed. This type of ML resolves classification problems, where the desired output is a qualitative variable: think of face recognition on Facebook, which suggests tagging a friend when a photo is uploaded because it has many historical tags of that face against a Facebook account. Then there are regression problems, whose target output is a numerical value. This could be an algorithm that estimates average house prices in certain areas: as more and more houses enter the market in a geographic location, the algorithm has more input data labelled with particular geographic coordinates.
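The regression case can be illustrated with a one-variable least-squares fit. The data below is synthetic (an invented relationship between distance from a city centre and price in $1000s), but the mechanics are the same as the house-price example:

```python
# Minimal least-squares regression: learn a numeric target (price)
# from one labelled input (distance). All values are invented.

xs = [1.0, 2.0, 3.0, 4.0]          # km from the city centre
ys = [300.0, 250.0, 200.0, 150.0]  # sale price in $1000s

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# closed-form slope and intercept of the best-fit line
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    return intercept + slope * x

print(predict(5.0))  # -> 100.0, the estimate for a new 5 km listing
```

Each new labelled sale added to `xs`/`ys` refines the fit, which is the sense in which the model "learns" from more market data.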

Illustration of Supervised Learning:

Lab tests for Anaemia

Step 1
An algorithm is trained on Hb levels and the corresponding output of either anaemic or non-anaemic, based on labelled data.

Step 2
Input data for patients with their Hb levels is fed into the algorithm.

Step 3
The algorithm analyses the patient’s data with Step 1 inputs.

Step 4
When new data is entered, the machine recognizes the Hb level and generates a report on whether the patient is suffering from anaemia or not.
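A minimal sketch of these four steps, with made-up training Hb values rather than clinical cutoffs: the "model" simply learns a threshold midway between the means of the two labelled classes and then diagnoses new patients:

```python
# Step 1: train on labelled Hb examples (values invented, g/dL --
# these are NOT clinical reference ranges)
train = [(7.5, "anaemic"), (8.9, "anaemic"), (10.1, "anaemic"),
         (13.2, "normal"), (14.5, "normal"), (15.1, "normal")]

anaemic = [hb for hb, label in train if label == "anaemic"]
normal = [hb for hb, label in train if label == "normal"]

# place the decision boundary midway between the two class means
threshold = (sum(anaemic) / len(anaemic) + sum(normal) / len(normal)) / 2

# Steps 2-4: feed in new patients' Hb levels and generate a report
def diagnose(hb):
    return "anaemic" if hb < threshold else "normal"

for hb in (9.2, 13.8):
    print(hb, "->", diagnose(hb))
```

The key supervised-learning ingredient is that the boundary is derived from labelled history, so every correctly labelled example sharpens future predictions.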

Real life example of Supervised Learning:

  • A company's clinical trials AI claims to leverage Natural Language Processing (NLP) to help researchers manage clinical trial workflows more efficiently.
  • The objective: To identify risk factors and provide recommendations for clinical trial optimization.
  • The study optimizer platform is trained on “billions of data points from past clinical trials, medical journals, and real-world sources”.
  • The Process:
    • Step 1: A user uploads their research protocol documents to the platform.
    • Step 2: Risk factors are identified by analyzing the protocol text, and any potential barriers are reported.
    • Step 3: The platform provides recommendations for mitigating risk to optimize the research protocol.

When a new protocol is uploaded, the AI uses its training data to flag potential barriers and help mitigate risk.

Reinforcement Learning

Reinforcement learning is when an algorithm learns from its mistakes: reward-based learning. It is similar to unsupervised learning in that the input data examples lack labels and it is up to the algorithm to assign or generate its own output value. The difference is that the algorithm has to make an output decision, which is then graded as either positive or negative and has consequences; this makes the end result a prescriptive response, not just a descriptive response as in supervised learning. When an outcome is positive, the algorithm learns from this reward and attempts to recreate the approach; similarly, a negative signal lets the algorithm learn that a certain approach was incorrect, so it will try to continually improve. From a human perspective, it is the process of trial and error.


Reinforcement learning has been trialled in algorithms taught to play video games. Google's DeepMind project created algorithms able to play old video games; taking Mario as an example, you could see how the AI would be set to play a certain level and learn from its mistakes. There would be reward signals for points collected, and the negatives would be losing lives by hitting enemies or falling down pits. Once the algorithm was shown the buttons to explore and interact with its environment, through repetition it would slowly improve and seek behaviours that generate rewards. In the Google DeepMind example, the AI started off slowly and clumsily, losing lives and receiving game overs, until it became better and better at the game, mastered it and rivalled the best human players. [1]
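The reward loop described above can be reduced to a tiny tabular Q-learning sketch: a five-cell "level" where only reaching the rightmost cell pays a reward. The environment, constants and reward scheme are all invented for illustration, but the trial-and-error dynamic is the same: early episodes wander, and the reward signal gradually shapes a policy that always moves right:

```python
import random

# Toy tabular Q-learning: a 5-cell corridor where only entering the
# rightmost cell yields a reward. All constants are illustrative.
random.seed(0)
N, GOAL = 5, 4
Q = {(s, a): 0.0 for s in range(N) for a in ("left", "right")}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit, occasionally explore
        if random.random() < eps:
            a = random.choice(("left", "right"))
        else:
            a = max(("left", "right"), key=lambda act: Q[(s, act)])
        s2 = max(0, s - 1) if a == "left" else min(N - 1, s + 1)
        r = 1.0 if s2 == GOAL else 0.0  # the reward signal
        best_next = max(Q[(s2, "left")], Q[(s2, "right")])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# after training, the greedy policy heads for the reward in every state
policy = {s: max(("left", "right"), key=lambda act: Q[(s, act)])
          for s in range(GOAL)}
print(policy)
```

The first episodes stumble around state 0 until exploration happens to move right, mirroring the "slow and clumsy" opening phase of the game-playing AI; the learned Q-values then make the rewarding behaviour routine.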

An illustration as an example from healthcare sector:

Diagnosis based on X-ray

 

Step 1
We use training data labelled with the correct diagnosis (disease/normal), and on this data the machine learning model is built.

Step 2
Now, when we load a new x-ray image onto this system, the model predicts the patient's condition based on its past learning.

Step 3
Simultaneously, the doctor also diagnoses the patient's condition by looking at the same x-ray, and gives feedback of “correctly diagnosed by ML” or “incorrectly diagnosed by ML”.

Step 4
This feedback (or reward) from the doctor makes the algorithm better at future diagnoses, to the point where doctor intervention would be minimal.
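Schematically, this feedback loop can be mimicked with a single learned threshold on an invented "opacity" feature: the system predicts, and every "incorrectly diagnosed by ML" verdict from the doctor nudges the threshold, so repeated feedback steadily reduces the need for intervention. All numbers below are made up:

```python
# Doctor-in-the-loop feedback as a threshold that adapts on errors.
# "opacity" is an invented image feature; labels are the doctor's truth.

cases = [(0.2, "normal"), (0.9, "disease"), (0.3, "normal"),
         (0.8, "disease"), (0.1, "normal"), (0.7, "disease")] * 20

threshold, step = 0.0, 0.05  # deliberately poor starting threshold

for opacity, truth in cases:
    prediction = "disease" if opacity > threshold else "normal"
    if prediction != truth:  # doctor: "incorrectly diagnosed by ML"
        # nudge the boundary away from the kind of mistake just made
        threshold += step if prediction == "disease" else -step

# after enough feedback, the threshold sits between the two groups
print(round(threshold, 2))
```

Early on the model misclassifies every "normal" case and the doctor corrects it each time; once the threshold settles, the corrections stop, which is the sense in which intervention becomes minimal.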

An example from Industry:

A company, Brite Health, leverages machine learning to better manage patient engagement in clinical trials. It provides apps for patients (or volunteers) and dashboards for site management.

The app and dashboard are trained on millions of clinical data points, engineered to identify key markers that tend to correlate with patient disengagement from research studies. The app notifies the user and informs them of the next scheduled task and site visit, which encourages patient engagement and prevents disengagement. At the site, the dashboard receives disengagement notifications for all enrolled patients and helps in monitoring them to avoid any minor or major violations.

The app also provides personalized communication and study documents for reference, through curated content and a conversational chatbot.

So this company uses supervised machine learning for patient engagement through the app and dashboard, and at the same time uses reinforcement learning through the chatbot.

Another example of the use of Machine Learning's NLP techniques is data mapping.

 

Data Mapper Tool: Process Flow


Mapping raw data to standards is one of the most challenging processes in the healthcare industry. Reusing or reapplying the information collected during previously mapped studies, and building on that knowledge inference, is the most important part of the mapping process. Mapping is usually done to CDISC standards, as this is a requirement of regulatory bodies such as the FDA when submitting data for approval of a new IND.

The tool's auto-mapping and smart-mapping features, which are based on knowledge inference derived from machine learning algorithms, reduce the time and effort required from the user. This leads to improvements in quality, efficiency and consistency. The tool provides a user-friendly interface for everything from mapping raw data to generating SDTM standards (including domain templates) in CDISC. Natural Language Processing (NLP) is the technique implemented here to predict the mapping of new source data or variables, based on information learned from existing mappings of previous data or variables.

In this process we have: a standards repository containing SDTM, ADaM and other CDISC standards documents; study documents such as the specification and protocol; study data from different sources; a SAS program generator, which generates SAS programs from the mapping metadata; and libraries, which provide a place where mapping metadata are available for the machine learning algorithms to learn from.

Machine learning algorithms can be applied to the different types of metadata captured at dataset, variable and value level. The output below compares Model similarity against NGram similarity for tables mapping.

Predictions: ['ae' 'cm' 'lb' 'fa' 'eg' 'ie']   Expected: ['AE', 'CM', 'LB', 'FA', 'EG', 'IE']

Search_Term     Model_Matched_Term  Model_Similarity  NGram_Matched_Term  NGram_Similarity
adverse2        ae                  0.717276          ae                  0.583333
chemistry       lb                  0.515414          lb                  1.000000
conmed          cm                  0.703553          cm                  1.000000
electrocardiac  eg                  0.699650          eg                  0.521739
follow          fa                  0.683542          fa                  1.000000
inclusion       ie                  0.537428          ie                  1.000000
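A toy sketch of the kind of n-gram matching shown above. The synonym list mapping descriptive labels to SDTM domains is invented for illustration, and the scores will not match the tool's output; a real mapper would learn these associations from previously mapped studies:

```python
# Character-bigram similarity for mapping raw names to SDTM domains.
# The candidate label -> domain pairs below are invented examples.

def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Jaccard similarity over character bigrams."""
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb)

candidates = {
    "adverse events": "AE",
    "concomitant medications": "CM",
    "laboratory tests": "LB",
}

def best_domain(raw_name):
    # pick the candidate label with the highest bigram overlap
    label = max(candidates, key=lambda c: similarity(raw_name, c))
    return candidates[label]

for raw in ("adverse2", "conmed", "chemistry"):
    print(raw, "->", best_domain(raw))
```

Even a crude overlap score like this maps `adverse2` to AE and `conmed` to CM; the production tool layers model-based similarity and learned study metadata on top of the same idea.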

 

Challenges in Pharma Industry to Implement Data Science

  • Lack of Data Standards:
    • Disparity of Data Sources: The most prominent issue that all pharmaceutical companies face while preparing their data for analytics is the disparity of data. Most of the data is stored in silos and is accessed via different platforms, all using their individual data models and structures. Therefore, having access to all the data at any given point is extremely critical to running a viable analytics process.
    • Ambiguity Around Accuracy: Another challenge faced by most organizations using analytics is ambiguity around the accuracy of analytics reports, along with their time-based relevance. Because data is stored in silos and collected from disparate sources, it is difficult to be sure whether all the data was accessed or how fresh it was.
    • Time-consuming Analytics Process: When so many data sources are used, it is difficult to harmonize all the data and run a set of analytics across the data set. Organizations that do not have a proper data analytics system in place, or even those who opt for a point solution, end up having to manually collate analytics reports and insights. Such a process is time-consuming and may fail to uncover insights that may have useful business implications.

It is obvious that the entire data capturing, handling, and analytics process needs to shift. In fact, the approach towards managing data is already changing. So much so, that the latest technology and approach are impacting the way organizations conduct their business.

When leveraged correctly, data yields insights that directly impact business growth. Any data analytics solution that helps organizations save money and increase the bottom line, while being cost-effective in doing so, is sure to find its way onto an organization's wish list.

  • Anxiety over change holds back progress:
    ML/AI carries a psychological fear of job losses, when in reality ML/AI skills open up additional job prospects. We need to embrace the change and add new skills to sharpen our careers. In one way it is opening new doors of opportunity; in another, ML/AI is helping by taking the repetitive and redundant tasks out of the workflow so that we can focus on tasks where human intervention is indispensable.
  • Skills shortage hits data science:
    In Pharma, the majority of professionals have a medical background when it comes to conducting core tasks, whereas the data scientist role needs a combination of computer applications, IT skills and domain knowledge to implement ML/AI in the pharma industry. So there is a shortage of the right skills for the time being, but over time the supply of these skills will increase and the gap will be filled.

 

Regulatory Processes for Approvals of AI/ML Products

The regulatory process is mostly concerned with drug approval, but the emerging use of AI in drug discovery is prompting important questions:

  • How should AI and AI-derived innovation be regulated?
  • Another concern is patentability.

To address these questions, the European Patent Office (EPO) has taken the initiative by publishing a draft of its updated guidelines on patenting, which includes a new section devoted to AI. In line with the EPO, the US FDA is also actively developing a regulatory framework to promote innovation in artificial intelligence for healthcare. The following are developments in the regulatory framework:

  • April 2018: FDA permitted marketing of an artificial intelligence-based device to detect diabetes-related eye problems
  • May 2018: FDA permitted marketing of an artificial intelligence algorithm for aiding providers in detecting wrist fractures
  • January 2019: FDA started the Pre-Cert version 1.0 pilot program for more streamlined and efficient regulatory oversight of Software as a Medical Device (SaMD) under the FDA's Digital Health Innovation Action Plan.

 

Quanticate's statistical programming team has AI solutions to support our work and delivery to clients. If you have a need for these types of services, please submit an RFI and a member of our Business Development team will be in touch with you shortly.

References

[1] https://www.wired.com/2015/02/google-ai-plays-atari-like-pros/