Real World Data Analysis in Clinical Trials: A Programmer's Perspective

Written by Clinical Programming Team | Fri, Sep 12, 2025

Across pharma, real world data is increasingly used alongside traditional clinical studies to answer practical questions faster. Drawing on EHRs, claims, registries, and patient-generated sources, real world data analytics helps identify eligible populations, reflect routine care, and track outcomes beyond the study window. Used with clear protocols and governance, these real-world data solutions can support safety surveillance, health-economics work, and, in some cases, regulatory decision-making.

When performing real world data analysis, the whole patient record is usually spread across multiple tables and has to be assembled into one coherent history. This can lead to some surprising discoveries, such as patients with records before birth or after death, patients changing sex multiple times, and invalid codes. Such discrepancies are usually fairly easy to classify as data issues and easy to handle. On the other hand, programmers might encounter real world data events that cannot be classified as easily and need special approaches, such as a patient receiving multiple prescriptions (Rxs) for the same drug on one day or receiving a new Rx before the previous one runs out, patients visiting the doctor’s office or Emergency Room during a hospital stay, patients changing wards or level of care multiple times during one hospital visit, and valid but non-billable codes appearing in insurance claims data. In this post we share some commonly encountered problems (and our solutions) related to real world data analysis.

What is Real World Data (RWD)?

Let’s begin by defining real world data (RWD). In drug development and the pharmaceutical industry, RWD is data about patients and their medical history, collected from various sources. RWD is purely observational, meaning it does not involve any additional clinic visits or treatments beyond those that would generally occur during someone’s lifetime. Traditional clinical data, on the other hand, requires someone to sign up to a study and follow a very strict process of visits and treatments, and it only paints a picture of events during the study period.

RWD is of increasing interest across the industry because more data is available, and storage has improved¹. This gives us longer-term data which can lead to richer insights.

Sources of Real World Data

There are several different sources of RWD, each with advantages and disadvantages. Sometimes different data sources are used together, which can help to offset these disadvantages². Some RWD sources are:

Electronic Health Records (EHR)
These are the medical records that are made every time someone sees a doctor or claims a prescription. While EHR data is usually detailed and available over a long time period, it can be difficult to analyse due to not being standardised.

Insurance Claims
Especially in locations such as the US, this is data sourced from health insurance companies about any claims made by a person, and what the claim was for. Claims data is often standardised and detailed, but it may only cover the duration of one insurance plan and exclude anything that was paid out-of-pocket³.

Lab Test Results
These are the dates and results of various laboratory and/or microbiology tests. Lab data can provide detailed insights, but is only a snapshot in time, so it is often used in conjunction with other sources.

Patient Generated Data
This is data generated from medical wearables (such as smart watches or glucose monitors) or from surveys completed by the patient. This data can be useful in specific contexts, such as follow up, or for continuous monitoring, but lacks the clinical details regarding diagnoses and drugs of other sources⁴.

Registries
These are databases that collect data only on a specific population, such as those with a specific disease. Registries can be useful for rare diseases, providing detailed, long-term data even on smaller sample sizes⁵. This does however mean they may take a long time to reach a large enough sample to analyse or may be missing wider patient information.

Benefits of Using RWD in Drug Development

RWD can help in drug development by providing an alternative, or additional, source of data to clinical trials. The uses and benefits of RWD are wide-ranging. Some general benefits of RWD are:

Cost-Saving
RWD often already exists in some format, whether it be for insurance claims or supporting treatment plans, meaning data only needs to be collated and repurposed, rather than recollected. It has been estimated that increasing usage of RWD could save companies $1 billion USD a year⁶.

Increased Diversity
Clinical trials can still suffer from a lack of diversity; RWD can help with this by drawing on more sources. Regulatory agencies have published frameworks that provide guidance on increasing diversity in this way⁷.

Reduced Timelines
Using RWD can mean not having to carry out as many pre-trial steps as with standard clinical trials. This was useful during the COVID-19 pandemic, where RWD could provide insights on prevention and treatment methods as soon as the data was available⁸.

Increased Health Economics Insights
Health economics looks at factors influencing the health industry’s costs and outcomes. RWD is helpful here because it contains various demographic insights and details on spending or claims. This means the burden of disease (the wider economic cost of a disease) can be analysed – such as in recent studies on generalised pustular psoriasis (GPP), a type of psoriasis⁹.

Observational Longitudinal Databases

To assess what is happening in the real world, rather than using clinical trials to collect data, the researcher may use data which has come directly from the market. Observational longitudinal databases are created to allow for analysis of such data. These contain de-identified medical records for a large number of patients (can exceed 100 million patients) over an extended period of time (at least a couple of years) which is a much larger scale than regularly used in clinical trials. This allows for analyses of rare diseases, treatment pattern changes, etc. However, unlike in clinical trials, the researcher will not be able to randomly assign patients to a given therapy nor collect all characteristics that may be of interest. The de-identification process of this data needs to follow country-specific guidelines for protecting individual health care information, such that available data does not include any information that could potentially be used to identify a person.

The world of real-world data is a long way from being standardised. Some efforts are being made to counteract this, but they consistently sacrifice some database-specific advantages. What this means for a regular statistical programmer is that each database you work with has its own structure, advantages and limitations, and requires an individual approach. Common databases that real world evidence CROs use to support their clients’ RWD analysis can be grouped into two main categories:

Administrative Healthcare databases (claims databases) include records (claims) created by medical facilities and sent to an insurance company in order to get paid for the services performed (note that claims are created for billing purposes, not for research). Administrative Healthcare databases are usually:

clean (e.g. low numbers of invalid codes)
consistent (they arise from standardised claims documents)
restricted in scope (e.g. height/weight information will not be included as this is not something that insurance companies need)

Electronic Medical Records (EMR/EHR) databases contain patients’ medical records from many facilities (sometimes grouped together into networks) aggregated by one entity. Such databases are usually:

messy:
- data may come from many different sources and can be hard to shoehorn into a rigid database
- different approaches to define certain events may be used by data providers
- codes may be invalid
- records may be incomplete, with missing or illegible values
- someone may have forgotten to report an event (e.g. an injection), as there is no financial risk involved in omitting it
surprising:
- complicated algorithms are used to map the data to the database, and these may occasionally give unexpected results
- information that claims data would not hold may be available for a patient, such as lung volume or pain score (the size of the population with such information is a completely different story, but the data is there and could probably support some interesting projects).

Even databases of the same type can have many significant differences. As a team working on multiple databases, we have to explore standard approaches to solving commonly observed problems.

Key Applications and Use Cases of RWD

The nature of RWD means it is very different to clinical trial data; therefore, it can be used in different ways to aid drug development. Real world data might be used directly, or it might be transformed into real world evidence (RWE). As defined by the EMA, whilst RWD is the data obtained about patient health, RWE is the information derived from analysing RWD¹⁰. Some ways that RWD can be used in drug development are:

Enabling Faster Study Timelines
Having RWD available allows target populations for a trial to be established, and eligible patients recruited more quickly, therefore saving time throughout the whole process.

Monitoring Real-Life Outcomes
In drug development, sometimes situations occur where a drug appears to work in clinical trials yet doesn’t work as well in the real world – known as the “efficacy-effectiveness gap”¹¹. This could be due to missed doses, interactions with other medications/behaviours, or just a consequence of the now expanded sample size. If we can identify these situations, we can work on solutions.

Label Expansions
While monitoring real life outcomes, sometimes it is found that medications also provide unexpected benefits, such as improving symptoms of other conditions. This could form the basis of another trial or a regulatory case of expanding the label to another disease or population group¹².

Enhanced Safety Monitoring
RWD means we can see trends that appear after the clinical trial has ended; a clinical trial may only last a year, but RWD could provide 10-15 years of data, covering a wider population than in a clinical trial. This can sometimes lead to drugs being withdrawn due to longer term health impacts or becoming a second-choice treatment if another drug is released.

Rare Diseases
It can be hard to run a clinical trial on rare diseases, due to a lack of patients. RWD provides access to much more patient information and can supplement a clinical trial – this is a common use of a patient registry.

Case Studies and Examples

Even though RWD is still a growing area in the industry, it has already made an impact, as the following real world data examples show.

In 2021, the FDA approved Prograf (tacrolimus) to prevent organ rejection in lung transplant patients. It had previously been approved for liver transplants, and then for kidney and heart transplants as well. The approval used RWD from the Scientific Registry of Transplant Recipients (SRTR), which was compared with a suitable control. This shows the benefits of disease registries and the opportunities for label expansions¹³.

In the UK, there is a centralised database for Cystic Fibrosis (CF) patients, where data is entered at every specialised care centre and clinic. This data can be used for a variety of purposes such as identifying trends not noticeable across one hospital, monitoring the safety and effectiveness of treatments or identifying patients who could take part in research. Having this information available for rare diseases is great for reducing study timelines, enhanced safety monitoring and exploring real life outcomes¹⁴.

Occasionally, RWE can lead to the withdrawal of drugs from the market. This happened in 2004 with a drug named Vioxx (rofecoxib), a painkiller for arthritis. It was launched in 1999, but it was later found that it increased the risk of cardiovascular events. The initial trial for Vioxx only lasted 3-6 months, therefore the longer-term risks were not known. This highlights the benefits of the enhanced safety and outcomes monitoring that RWD makes possible in support of post-marketing pharmacovigilance¹⁵.

Real World Data Problems with Standard Solutions

There are definitely some problems that are easy to solve. Usually, it is enough to be aware of such issues and write a program that handles such cases properly. Even with such measures in place, the programmer must always review each intermediate step carefully (look for fields with missing values, incorrect dates, etc.), as the content of the database might change over time. The following are some common data-driven problems.

Handling Date of Birth

Due to regulatory authority restrictions, most databases mask the exact date of birth to avoid identification. Patients above a certain age are usually flagged (no age provided) or have their age capped at a specific value. All remaining patients may only have the year of birth recorded (or, in special cases, year and month of birth). This causes problems with studies that require a more precise age (e.g. newborns). If date of birth can be found in multiple places in the database, a patient may end up with more than one distinct date of birth. This might mean that some of their records do not belong to them. In such cases our approach has been to exclude such patients from the analysis.
The same applies to gender: even though sex changes do happen, seeing more than one gender in a patient’s records is much more likely to be the result of a data issue than of a physical sex change.
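
As a rough illustration of the exclusion rule above, the sketch below flags patients who carry more than one distinct year of birth or sex across their records. The record layout and the field names (patient_id, birth_year, sex) are assumptions made for this example, not the schema of any particular database.

```python
from collections import defaultdict


def find_conflicting_patients(records, field):
    """Return IDs of patients with more than one distinct non-missing value of `field`."""
    seen = defaultdict(set)
    for rec in records:
        value = rec.get(field)
        if value is not None:  # a missing value is not treated as a conflict
            seen[rec["patient_id"]].add(value)
    return {pid for pid, values in seen.items() if len(values) > 1}


# Hypothetical rows collected from several tables of one database.
demo_records = [
    {"patient_id": "P1", "birth_year": 1980, "sex": "F"},
    {"patient_id": "P1", "birth_year": 1980, "sex": "F"},
    {"patient_id": "P2", "birth_year": 1975, "sex": "M"},
    {"patient_id": "P2", "birth_year": 1957, "sex": "M"},  # conflicting year of birth
    {"patient_id": "P3", "birth_year": None, "sex": "F"},
    {"patient_id": "P3", "birth_year": 1990, "sex": "M"},  # conflicting sex
]

conflicting_birth = find_conflicting_patients(demo_records, "birth_year")
conflicting_sex = find_conflicting_patients(demo_records, "sex")
excluded = conflicting_birth | conflicting_sex
print(sorted(excluded))
```

Whether a missing value should count as a conflict is a project-level decision; here it is simply ignored.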

Problems with Date of Death

Since in both types of databases (claims and EMR) we see events taking place mostly within medical facilities, only deaths that occurred in these facilities can be captured directly. Most of the databases that supply death data for their patients use a different source for this information. Linking to that external source is another point where discrepancies may occur: a mismatch between a patient’s records and the death data can result in a date of death being assigned to a patient who is alive. To counteract that, we compare all data available in the database with the date of death. If there are records dated substantially after the date of death, we treat these patients as being alive (assuming incorrect assignment of the death date).
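
The check described above might be sketched as follows. The function name and the 30-day tolerance are assumptions for illustration only; what counts as "substantially after" should be agreed per project, since some records (e.g. late-processed claims) can legitimately be dated shortly after death.

```python
from datetime import date, timedelta


def resolve_death_date(death_date, record_dates, tolerance_days=30):
    """Return the death date if it is plausible, otherwise None (treat patient as alive).

    tolerance_days is a hypothetical threshold for "substantially after".
    """
    if death_date is None:
        return None
    cutoff = death_date + timedelta(days=tolerance_days)
    # Any activity beyond the cutoff suggests the death date was linked to the wrong patient.
    if any(d > cutoff for d in record_dates):
        return None
    return death_date


plausible = resolve_death_date(
    date(2020, 1, 10), [date(2019, 12, 1), date(2020, 1, 20)]
)
implausible = resolve_death_date(
    date(2020, 1, 10), [date(2019, 12, 1), date(2020, 6, 1)]
)
print(plausible, implausible)
```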

Problems with Codes

A real-world data programmer comes across multiple coding systems. Diagnoses, procedures and medications are all provided as codes (e.g. ICD-10, NDC, HCPCS, Read codes). When looking for a particular diagnosis, for example, one needs to be aware of which coding system has been used in the database and what the code value for the diagnosis is (there may be multiple values). Even the same coding system can have a database-specific format (truncation, dots, dashes, leading zeros etc.). There might also be codes that closely resemble a code from the system but do not actually exist (e.g. looking for ICD-9: 250.* (diabetes) and finding 250.07, a non-existent sub-code). Similarly, a given code value might belong to more than one coding system, so one needs to be sure that the code found in the database comes from the intended system.
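
A minimal sketch of how formatting differences and non-existent sub-codes could be handled is shown below. It assumes a reference list of valid codes is available; the set used here is a tiny illustrative subset of ICD-9-CM diabetes codes, and the function names are hypothetical. Leading zeros are deliberately kept, because they are significant in some systems.

```python
import re


def normalise_code(raw):
    """Strip dots, dashes and whitespace and upper-case the code (leading zeros kept)."""
    return re.sub(r"[.\-\s]", "", raw).upper()


# Illustrative subset only; a real project would load the full reference table
# for the specific coding system and version used by the database.
VALID_ICD9_DIABETES = {"25000", "25001", "25002", "25003"}


def classify_codes(raw_codes, prefix, valid_codes):
    """Split codes that match `prefix` into (valid, suspicious) lists of raw values."""
    valid, suspicious = [], []
    for raw in raw_codes:
        code = normalise_code(raw)
        if not code.startswith(prefix):
            continue  # not in the family we are looking for
        if code in valid_codes:
            valid.append(raw)
        else:
            suspicious.append(raw)  # looks like the family but is not in the reference
    return valid, suspicious


found_valid, found_suspicious = classify_codes(
    ["250.00", "25001", " 250.07", "401.9"], "250", VALID_ICD9_DIABETES
)
print(found_valid, found_suspicious)
```

In a database that mixes coding systems, the lookup key would need to include the coding system as well as the code value.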

Best Practices for Implementing RWD Strategies

While the data itself cannot be changed, there are various approaches that can make using RWD easier – and more likely to be approved by regulators. When creating a cohort definition, it is important to specify exactly which data sources will be used and have a clear definition of treatments and populations – especially if looking to replicate a clinical trial¹⁶.

It is also important to work with others to share knowledge. It is helpful to create a few team or company-wide standard approaches – ways of dealing with specific situations – such as those we cover below. This means that everyone on the same team will be working in the same way, which is not only easier for regulatory submissions, but also for team quality checks. Outside of programming, it is also helpful to speak to other research functions, such as clinicians or statisticians. Clinicians will be able to tell you more about what the medical parts of the data mean, and help with study design, and statisticians can help in determining what variables are needed for any further analysis.

Real World Data Problems with Stories

The previous problems were easy to solve mainly because the origin of each problem was quite straightforward. Here we present problems that might result either from data issues or from rare patient characteristics or treatments. These challenges rarely have fire-and-forget solutions, and clinician input is usually needed in order to find the best project-specific approach.

Non-Billable Codes in Insurance Claims Data

We have already covered the issue of invalid codes and now progress onto the more complicated case of non-billable ICD codes. ICD codes have a hierarchical structure: for example, ICD-9 824 is a general code whilst 824.8 is a billable code describing the given condition more precisely. As the name suggests, only billable codes – the most specific ones – can be used for reimbursement purposes. As a result, only such codes should be expected in claims databases. The presence of a non-billable code on a claim might be a data issue. In rare cases, however, when the most accurate diagnosis is hard to pick, a physician may not be able to provide a billable code and may decide to record only a general (non-billable) code instead. When encountering non-billable codes, the programmer can either treat them all as invalid or use them as correct codes (e.g. for rare diseases).
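
One way to separate the three cases (billable, valid but non-billable, invalid) is sketched below. It assumes that a code is non-billable when the reference list contains a more specific code beneath it; many reference files carry an explicit billable flag, which would be preferable where available. The reference set is an illustrative subset and the function name is hypothetical.

```python
# Illustrative subset of a reference table, stored without dots.
REFERENCE_CODES = {"824", "8248", "8249"}


def classify_claim_code(code, valid_codes):
    """Return 'invalid', 'non-billable' (a more specific code exists) or 'billable'."""
    if code not in valid_codes:
        return "invalid"
    has_children = any(
        other != code and other.startswith(code) for other in valid_codes
    )
    return "non-billable" if has_children else "billable"


statuses = {c: classify_claim_code(c, REFERENCE_CODES) for c in ["824", "8248", "82488"]}
print(statuses)
```

What to do with the non-billable group (drop it or keep it) remains the project-specific decision described above.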

Outpatient and Emergency Room Visits

Some of the studies require a programmer to provide Health Care Resource Utilization (HCRU), which includes visit counts. Unfortunately, most of the databases do not provide a clear grouping of records into a visit.

The definition of a single event needs to be agreed upon between the programmers and statisticians. Consider a patient with a couple of claims (ignoring claims coming from hospitals) on a single day. How should one decide how many visits actually happened that day?

Usually, when a patient wants to see a doctor, they only go for one visit on a given day.
- Maybe making the assumption of one visit per day is sufficient, regardless of the number of claims present?

But what if a patient visits both a GP and a specialist?
- Maybe the number of distinct providers (physicians) on that day should be used instead?

But what if the doctor identifier is not available or masked to only show speciality?

Here is the algorithm we found to be both specific enough and general enough for most projects – if there are multiple claims per date for a patient:

Group the claims by unique providers
If the above is not available, use speciality/facility instead
If none of these are available, group such claims into one visit

Now imagine the same situation, but with all claims coming from the Emergency Room (ER). This time the assumption of counting visits by unique providers/specialities does not seem right. We found it reasonable to assume at most one ER visit a day, as a patient may end up seeing multiple physicians during one ER visit.
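
The visit-counting rules above could be sketched as follows. The claim layout (patient_id, service_date, setting, provider_id, specialty) is an assumption for this example, as is the choice to fall back to the next rule whenever any claim on that day lacks the field in question.

```python
from collections import defaultdict


def claim(pid, day, setting, provider=None, specialty=None):
    """Build a hypothetical claim row for the demo."""
    return {"patient_id": pid, "service_date": day, "setting": setting,
            "provider_id": provider, "specialty": specialty}


def count_outpatient_visits(claims):
    """Count visits among one patient's non-ER claims on one day."""
    if not claims:
        return 0
    # Prefer provider; fall back to speciality; otherwise one visit.
    for field in ("provider_id", "specialty"):
        values = [c.get(field) for c in claims]
        if all(values):  # usable only if present on every claim that day
            return len(set(values))
    return 1


def count_visits(claims):
    """Return {(patient_id, service_date): {'ER': n, 'OP': n}}."""
    by_day = defaultdict(list)
    for c in claims:
        by_day[(c["patient_id"], c["service_date"])].append(c)
    result = {}
    for key, day_claims in by_day.items():
        er = [c for c in day_claims if c["setting"] == "ER"]
        op = [c for c in day_claims if c["setting"] != "ER"]
        # At most one ER visit per day, regardless of the number of physicians seen.
        result[key] = {"ER": 1 if er else 0, "OP": count_outpatient_visits(op)}
    return result


demo_claims = [
    claim("P1", "2024-03-01", "OP", "DR_A", "GP"),
    claim("P1", "2024-03-01", "OP", "DR_A", "GP"),
    claim("P1", "2024-03-01", "OP", "DR_B", "cardiology"),
    claim("P1", "2024-03-02", "ER", "DR_C"),
    claim("P1", "2024-03-02", "ER", "DR_D"),
    claim("P2", "2024-03-01", "OP", None, "cardiology"),
    claim("P2", "2024-03-01", "OP", None, "cardiology"),
    claim("P3", "2024-03-01", "OP"),
    claim("P3", "2024-03-01", "OP", "DR_E"),
]
visits = count_visits(demo_claims)
print(visits)
```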

Hospital Stays

We define an inpatient (IP) stay as a patient spending at least one night in a medical facility. For such stays there are two types of data: one summary record containing general information about a given stay (length of stay, date of admission/discharge, primary/principal diagnosis, discharge status etc.) and then records for all services provided. Theoretically, using the summary record, one should be able to easily provide all inpatient-related statistics, but unfortunately there are some special cases that must be considered before showing final numbers. First of all, one needs to understand correctly how the database groups inpatient records:

What if a patient changes ward during the visit?
What if a patient changes level of care?
What if a patient is admitted through ER?
What if a patient is transferred to another facility?
What if a patient is readmitted on the day after discharge?

The final algorithm has to account for all of this information when analysing hospital stays (counting visits, calculating length of stay etc.). One important note here is that it might (and usually does) happen that grouping performed by the vendor (summary records) results in some overlapping visits. The algorithm has to account for these too.
Depending on the purpose of the study:

To compare inpatient-related variables between different databases, inpatient stay definitions need to be “normalised” between the databases (e.g. using the OMOP standard).
To perform an in-depth analysis of a patient’s history of hospital stays (readmissions, transfers etc.), one might have to modify the vendor’s grouping algorithm so that it handles cases of interest correctly (meaning in line with study documentation).
If we only need basic IP-related numbers (e.g. to compare patients on different drugs), the vendor’s definition will usually be enough.
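
As one possible way of dealing with the overlapping summary records mentioned above, the sketch below merges a patient's stays that overlap or that start within a chosen number of days of the previous discharge. The max_gap_days parameter is an assumption: setting it to 1 treats a readmission on the day after discharge as a continuation of the same stay, which may or may not match the study documentation.

```python
from datetime import date, timedelta


def merge_stays(stays, max_gap_days=1):
    """Merge (admit, discharge) pairs that overlap or are at most max_gap_days apart."""
    merged = []
    for admit, discharge in sorted(stays):
        if merged and admit <= merged[-1][1] + timedelta(days=max_gap_days):
            # Overlapping or adjacent: extend the current stay if needed.
            merged[-1][1] = max(merged[-1][1], discharge)
        else:
            merged.append([admit, discharge])
    return [(a, d) for a, d in merged]


def length_of_stay(stay):
    """Number of nights between admission and discharge."""
    admit, discharge = stay
    return (discharge - admit).days


demo_stays = [
    (date(2024, 1, 1), date(2024, 1, 5)),
    (date(2024, 1, 4), date(2024, 1, 8)),   # overlaps the first record
    (date(2024, 1, 9), date(2024, 1, 12)),  # admitted the day after discharge
    (date(2024, 2, 1), date(2024, 2, 3)),
]
merged_stays = merge_stays(demo_stays)
strict_stays = merge_stays(demo_stays, max_gap_days=0)
print([length_of_stay(s) for s in merged_stays])
```

Transfers between facilities and changes in level of care would need additional fields (facility, discharge status) that this sketch does not model.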

Drug Usage Analysis

There is a big difference between claims and EHR databases with regards to drug records: the former shows fills (purchased drugs) while the latter shows prescriptions received. There is no guarantee that the patient actually used the medication (in both types of data). Apart from drugs prescribed, databases also provide information about drugs received in medical facilities. For example, finding information on injectable drugs might require using procedure codes in addition to drug codes.

Checking how many patients use a given drug is usually straightforward. Analysing treatment patterns, on the other hand, can be quite complex. The first piece of information needed for such an analysis is how many days the prescription covers (usually it is provided as a days-supplied variable, but sometimes it can only be calculated from the number of pills prescribed and the dosage information). Next, the approach to counting administrations in medical facilities has to be decided (assume that one administration only covers the day of administration, impute a fixed number of covered days, or exclude them from the analysis). After this part is done, one needs to establish how to handle multiple prescriptions on a single day:

If a patient gets many prescriptions for one compound (might be different products), then one needs to decide if that means that the patient is extending days covered or that the patient is increasing their dose. It is crucial to get a clinician’s input on this as drug regimens can vary widely between medications.
If a patient gets more than one compound, then one needs to decide if it should be treated as a combination therapy or if the two compounds should be treated in a similar fashion as in the point above (e.g. analgesics acting in a similar way).
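
To make the first two decisions concrete, the sketch below derives days covered for a prescription and then combines several same-day prescriptions for one compound under two alternative policies. The field names (days_supply, quantity, units_per_day) and the policy labels are hypothetical, and the choice between policies is exactly the point where clinician input is needed.

```python
import math


def days_covered(rx):
    """Use the days-supplied value if present, else derive it from quantity and dose."""
    if rx.get("days_supply"):
        return rx["days_supply"]
    # Assumes quantity and units_per_day are in the same unit (e.g. tablets).
    return math.floor(rx["quantity"] / rx["units_per_day"])


def combine_same_day(rxs, policy):
    """Days covered by several same-day prescriptions for one compound.

    'extend'        : the patient is assumed to take them one after another.
    'dose_increase' : the patient is assumed to take them concurrently.
    """
    days = [days_covered(rx) for rx in rxs]
    if policy == "extend":
        return sum(days)
    if policy == "dose_increase":
        return max(days)
    raise ValueError(f"unknown policy: {policy}")


same_day_rxs = [
    {"days_supply": 30},
    {"days_supply": None, "quantity": 60, "units_per_day": 2},
]
extended = combine_same_day(same_day_rxs, "extend")
concurrent = combine_same_day(same_day_rxs, "dose_increase")
print(extended, concurrent)
```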

After all these considerations, even more complicated items are left:

Account for overlapping days of supply of a given drug and decide whether the algorithm should allow for carrying over remaining pills. There are several reasons why a patient might buy a new drug before the previous one is finished:
- To have a new package ready when the old one runs out,
- Due to lost/missing pills,
- To increase the dosage.
Account for an allowable grace period during which it is assumed the treatment is still working but a patient is not actively taking the drug.
Classify a patient initiating a competitor medication after the original one. At what point should they be classified as switching treatment versus having discontinued one treatment and started another?
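
The overlap and grace-period considerations can be illustrated with a small episode builder for a single drug. This is a sketch under stated assumptions rather than a standard algorithm: the 30-day default grace period, the carry_over switch and the convention that an episode's end is the first day without supply are all choices made for this example.

```python
from datetime import date, timedelta


def build_episodes(fills, grace_days=30, carry_over=True):
    """Group (fill_date, days_supply) pairs into continuous treatment episodes.

    Each episode is (start, end), where end is the first day without supply.
    """
    episodes = []
    for fill_date, days_supply in sorted(fills):
        if episodes:
            start, end = episodes[-1]
            gap = (fill_date - end).days  # negative when refilled early
            if gap <= grace_days:
                # With carry-over, an early refill is consumed after the current supply.
                supply_start = max(fill_date, end) if carry_over else fill_date
                new_end = supply_start + timedelta(days=days_supply)
                episodes[-1] = (start, max(end, new_end))
                continue
        # First fill, or gap longer than the grace period: start a new episode.
        episodes.append((fill_date, fill_date + timedelta(days=days_supply)))
    return episodes


demo_fills = [
    (date(2024, 1, 1), 30),
    (date(2024, 1, 25), 30),  # early refill, 6 days of supply still in hand
    (date(2024, 3, 10), 30),
    (date(2024, 6, 1), 30),
]
with_carry_over = build_episodes(demo_fills)
strict = build_episodes(demo_fills, grace_days=10, carry_over=False)
print(with_carry_over)
print(strict)
```

Distinguishing a switch from a discontinuation followed by a new start would require running a similar logic across drugs and comparing episode boundaries, which is not shown here.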

Challenges and Limitations of RWD

Although RWD has significant benefits, it is not flawless, and it does have downsides. One of these, as mentioned throughout this article, is the lack of standardisation and varying data quality. Depending on the source, there can be missing data, items filled in incorrectly or just mismatches between various real world data sources.

Another concern with the collection of RWD is privacy and security. Although patients are de-identified, health information is still sensitive data that could be vulnerable to cyberattacks or patient reidentification. The risk of reidentification is reduced by guidelines that restrict how the very old or very young are identified and that conceal location information, such as zip codes, when there is a small local population¹⁷.

There can also be issues due to the data collected, such as bias – although we have covered how RWD can reduce recruitment bias in clinical trials, RWD itself may still not be entirely representative. One area where this might be an issue is around socio-economic status – when data comes from insurance claims, those who do not have, or cannot afford, medical insurance may not be represented here. Similarly, data collected from medical wearables may not be fully representative, as not everyone has access¹⁸. RWD can also be affected by confounding variables, where health outcomes are affected by factors that aren’t being analysed. This is much more difficult to control than it is in standard clinical trials¹⁹.

Regulatory and Ethical Considerations

Most regulatory authorities have guidance on the use of RWD and are open to submissions based on RWE. The FDA notes that the expansion of RWD has increased the potential for robust RWE to support decisions²⁰. The UK MHRA does not consider RWD to be more or less valuable than standard clinical trials, provided that the data is robust and the trial is well designed. Sponsors should therefore not be afraid of real world data and should use it where appropriate²¹.

Similarly to clinical trials, RWD must still follow strict privacy laws, especially in the case of potential patient re-identification. Companies must ensure they follow all HIPAA (US), GDPR (UK & EU) or similar privacy laws, while still retaining as much of the data for analysis as possible – a fine balancing act²².

Informed consent can also be an issue in RWD – sometimes the data is collected for other purposes, or the documentation is very long and is too much to read²³. Depending on the research and the regulator, there are ways around this, such as:

Simplified documentation can be used
Ethics committees can waive informed consent requirements for retrospective studies if participants have already been de-identified
Data already in the public domain can be used²⁴

Sponsors are also encouraged to register their studies with regulatory authorities early, to increase transparency in real world data collection and analysis. Study registration also helps increase collaboration and helps ensure that studies are compliant²⁵⁻²⁶.

Future Trends in RWD

Because RWD is a growing area in the pharmaceutical industry, its potential is still being uncovered. The growing abilities of AI mean it can be used to predict outcomes or for signal detection – analysing various text fields, such as unstructured EHR notes or social media posts, and converting them into standardised medical terms²⁷. Signal detection also provides benefits in pharmacovigilance studies. However, this does uncover new ethical and privacy concerns, so we need to keep up to date with any new regulations, such as the EU AI Act (July 2024)²³.

RWD also enables the creation of synthetic control arms, which are especially useful in oncology or rare disease studies. This is where historical patient RWD is used to create a control group, meaning the full group of interest can receive the trial product. AI can make this easier by minimising selection bias and enabling enhanced patient matching between groups, and these abilities will increase as technology improves²⁸.

As technology and medical wearable usage increase, so does the number of digital biomarkers that can be collected outside of clinical settings. This is leading to decentralised and hybrid trials, where monitoring can continue at home²⁹. If pharma embraces this aspect of real-world data, we can get more detailed clinical trials that can potentially reach more diverse populations than traditional trials³⁰.

Conclusion

We have covered a wide variety of topics that a real-world data programmer will encounter across different databases. As you will probably have realised by now, there are a lot of “probably” and “maybe” situations and solutions in this article. This reflects the breadth of the area we are tackling. You should also be aware that most of the questions (barring only simple ones) in our field do not have clear-cut answers. This does not mean they are not covered in the “database user manual”, but that with years of work experience in this area, many consultations with programmers, statisticians and vendor support we still end up with answers that only have a strong maybe at best in them.

Quanticate's statistical programming team turn fragmented EHR, claims, registry, and wearable data into decision-ready evidence, with strong cohort design, comparator selection, data linkage, OMOP mapping, and validated pipelines that protect privacy and compliance. To see how our real world data analytics and real world data solutions can optimise feasibility, safety surveillance, HEOR, and support regulatory submissions, submit an RFI today.

References

1. https://hawkpartners.com/healthcare/healthcare-decisioning-the-role-of-rwd-and-rwe/
2. https://www.biostatistics.ca/real-world-evidence-rwe-and-real-world-data-rwd/
3. https://www.datavant.com/real-world-data-rwd/claims-data
4. https://pro.carenity.com/2024/04/10/what-are-real-world-data-rwd-vs-patient-generated-data-vs-self-reported-data/
5. https://www.syneoshealth.com/insights-hub/clinical-registries-and-rwd-where-we-are-integration-process
6. https://becarispublishing.com/doi/10.2217/cer-2019-0162
7. https://www.ppd.com/blog/real-world-data-role-diversity-clinical-research/
8. https://www.datavant.com/real-world-data
9. https://evidence-hub.aetion.com/rwd-describe-patient-burden-rare-disease
10. https://www.ema.europa.eu/en/documents/other/guide-real-world-evidence-provided-ema-support-regulatory-decision-making_en.pdf
11. https://www.sciencedirect.com/science/article/pii/S2589537024004942
12. https://dualitytech.com/blog/accelerating-drug-development-with-rwd/
13. https://www.fda.gov/drugs/news-events-human-drugs/fda-approves-new-use-transplant-drug-based-real-world-evidence
14. https://www.cysticfibrosis.org.uk/about-us/uk-cf-registry
15. https://www.ccjm.org/content/ccjom/71/12/933.full.pdf
16. https://www.nadiabarozzi.com/2025/06/05/methodological-considerations-in-rwd-studies-by-intended-use/
17. https://pmc.ncbi.nlm.nih.gov/articles/PMC8339486/
18. https://www.re-thinkingthefuture.com/technologies/gp4847-real-world-evidence-rwe-in-pharma-its-significance-applications-challenges-and-benefits/
19. https://openheart.bmj.com/content/5/1/e000788
20. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence
21. https://www.gov.uk/government/publications/mhra-guidance-on-the-use-of-real-world-data-in-clinical-studies-to-support-regulatory-decisions/mhra-guideline-on-randomised-controlled-trials-using-real-world-data-to-support-regulatory-decisions
22. https://evidence-hub.aetion.com/de-identification-in-healthcare-the-legal-and-strategic-imperative-for-global-compliance
23. https://umcgresearch.org/more-europa-ethical-considerations-in-rwd
24. https://pmc.ncbi.nlm.nih.gov/articles/PMC11318787/
25. https://catalogues.ema.europa.eu/catalogue-rwd-studies
26. https://insights.axtria.com/articles/fdas-final-guidance-for-the-industry-on-the-use-of-real-world-data-and-real-world-evidence-to-support-regulatory-decision-making-for-drugs-and-biological-products
27. https://www.iqvia.com/blogs/2025/07/enhancing-signal-detection-with-real-world-data-a-new-era-in-pharmacovigilance
28. https://www.indegene.com/what-we-think/blogs/leveraging-rwd-for-synthetic-control-arms
29. https://acrpnet.org/2025/08/13/digital-biomarkers-are-reshaping-how-evidence-is-generated-in-clinical-research
30. https://acrpnet.org/2025/08/19/unlocking-the-power-of-digital-biomarkers-in-clinical-trials