The Evidence Base Post 09 December 2025

Lessons from a pilot study on AI coding and analysis of qualitative patient data

The Evidence Base

AI is rapidly reshaping how evidence is generated and analyzed across healthcare. While its use in quantitative data is more established, its applications in qualitative research, particularly patient experience data, remain at an early stage. Learning from emerging examples is therefore vital to understanding its practical applications and defining what it can meaningfully deliver.

That was the focus of the study, “Exploring the Promise of Generative Artificial Intelligence (AI) for Coding and Analysing Qualitative Patient Data: Results from a Pilot Study in Non-Hodgkin’s Lymphoma,” presented at ISPOR Europe 2025. In this discussion, study author Karen Bailey, Research Scientist, PPD™ Evidera™ Patient-Centered Research at Thermo Fisher Scientific, shares insights on using generative AI to code and analyze qualitative interviews from patients with non-Hodgkin’s lymphoma (NHL). Their work compared AI-assisted and human coding to assess accuracy, efficiency, and regulatory suitability, providing early evidence of both the opportunities and limitations of AI in qualitative research.

Generative AI is increasingly being explored for analyzing qualitative patient data. What motivated your team to investigate its potential, and what specific challenges in qualitative research were you hoping to address?

Most people who conduct qualitative research know it can be very lengthy and resource intensive, and certain aspects of coding can become quite repetitive, especially in studies with 60 or more transcripts. It can get tedious for coders. With the advent of AI, we were curious: can it help make the process more efficient, or even improve it? Quality control, in particular, is extremely time consuming and requires significant senior input, which makes it costly. The idea was to see whether AI could take on some of this work and make analysis both more time efficient and potentially higher quality.

Your study evaluated both AI and “intentional AI” coding approaches within the ATLAS.ti software. How do these methods differ, and what did each reveal about the current capabilities of AI?

The software offers two AI capabilities. The first is the standard AI function, where you simply press a button and it runs automatically. It takes about 50 seconds and generates a set of inductive codes based on the uploaded transcripts. There’s no human input beyond initiating the process.

The “intentional AI” function allows for some human input into the coding process. It asks for “intentions,” which are essentially prompts similar to what people use in tools like ChatGPT. Before running the coding, we provide information such as a summary of all interview questions and context like “please code all symptoms within symptom categories” or “please code all impacts within categories.” You can give it whatever guidance you feel is useful, and then it generates some suggested coding categories for you to approve, then applies the coding based on that input.

Could you walk us through the design of your pilot, including the data collection, coding, and comparisons between human and AI analysis?

We took a study that had been conducted in 2023 using the traditional qualitative research approach. It followed the typical structure we use for sponsor projects: concept elicitation interviews to explore symptoms related to NHL, followed by cognitive interviews to assess several patient-reported outcome (PRO) measures the client was considering for use in clinical trials. These included widely used cancer instruments such as the European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire Core 30 (EORTC QLQ-C30), as well as NHL-specific measures like the EORTC QLQ-NHL-HG29, EORTC QLQ-NHL-LG20 and FACT-Lym.

Data collection was carried out in the usual way – interviews were conducted, audio recorded, transcribed, and then coded using our standard process. That involves developing a code book, having coders apply the codes, checking intercoder agreement, and then conducting senior review of all coding, which is the most time-intensive step.

For the pilot, we took those exact same transcripts and applied the AI functions in the way I’ve just described. We compared human coding with AI-assisted coding in terms of time required, quality of coding, and output against the study objectives. We also conducted a deeper dive into a single transcript to understand those differences in more detail.

How did the performance of AI-assisted coding compare with human analysis in terms of accuracy, efficiency, and quality control?

Both AI approaches generated a very large number of codes – far more than the human coders. Human analysis produced around 1300 codes, whereas both AI methods produced roughly 2300 codes. It’s also worth noting that the intentional AI function could only process 16 transcripts. When we tried to apply it to all 30 transcripts, it repeatedly returned errors. We gradually reduced the number and eventually discovered it could manage only 16. Even with that smaller set, it still produced over 2000 codes.

That volume alone created a major quality control issue. Reviewing and validating more than 2000 codes simply isn’t feasible. For the standard AI function (the one where you just press a button) we chose not to complete full QC and instead examined one transcript in detail within the intentional AI output. Unfortunately, that deeper review revealed several issues. For example, the AI frequently coded interviewer questions but not the participant’s responses, sometimes applying ten or more irrelevant codes to a single interviewer prompt. We also had to manually delete many of these. It often double-coded statements as both symptoms and impacts, and it struggled to identify which PRO measure the participant was discussing. Overall, there were quite a number of errors that impacted accuracy and the usability of the output.

I also found that using the AI made me feel quite removed from the data. With the standard AI function, where you simply press a button and it generates the codes, it felt completely disconnected from the way qualitative research is normally done.

"You don’t build familiarity with the transcripts, you don’t get a sense of the latent meaning, and you lose the depth that comes from really engaging with the data. As a qualitative researcher, that felt quite problematic."

The quality control process was also surprisingly repetitive and, at times, frustrating. I mentioned earlier how human coding can be repetitive, but going through and checking the AI-generated codes felt even more so, mainly because many of the codes were incorrect and had to be removed or adjusted. It left me thinking that the traditional approach would have been quicker and more meaningful. I think we’re still very much in the early stages and I’m sure technology will develop, but at the moment, based on the pilot with Atlas.ti, we can’t fully trust what it produces.

Despite these challenges, what aspects of the AI approach showed promise for supporting future qualitative research workflows?

We did see some potential in the way AI handled more inductive coding, that bottom-up approach that draws directly from the participant’s own language. Our studies are typically very deductive, with predefined, a priori codes based on the interview guide and sponsor-driven hypotheses. We’re good at that, but it does mean we can sometimes miss more nuanced, inductively driven concepts.

The AI showed some promise here. For example, with the concept of fatigue, which is quite a broad, wide-ranging idea, it not only coded “fatigue” but also identified related notions such as “persistent tiredness” and “inadequate rest.” Those additional layers suggest it was picking up nuances that human coders hadn’t captured.

One potential use case might be to run the AI on one or two transcripts early in a study to see what it surfaces. It could help inform the initial codebook by highlighting concepts or distinctions we may not have considered, which could be quite useful.

Data protection and regulatory compliance are critical when working with patient data. How do frameworks such as GDPR influence the responsible use of generative AI in this field?

This discussion is specific to ATLAS.ti, and I should caveat that I can’t really speak in detail about other software like NVivo or MAXQDA. That said, these platforms are highly active in this space, and I would hope they are working to address some of these concerns so next iterations are better. One of the major features they share is how they handle data privacy. When you upload transcripts and run the AI function, ATLAS.ti and NVivo both use OpenAI technology, but under private agreements where the data is not used to train the underlying large language model (LLM). The transcripts are only sent for the brief period needed to generate the output and are not stored. This design is clearly intended to address regulatory concerns around sharing personal health information from patients.

The conundrum, though, is that because the LLM cannot be trained on the data, it can’t improve. There’s no opportunity for it to learn from corrections or from the human quality control process. Ideally, we’d want much more iterative back-and-forth – similar to how you can prompt and refine outputs in tools like ChatGPT. For example, you’d want the system to recognize when you remove inappropriate codes, or correct double coding, and adjust its behavior accordingly. ATLAS.ti can’t do that at the moment, and I suspect the privacy constraints are part of what limits that capability.

“Ultimately, what we’d really like is the ability to take our codebook, have the AI apply it accurately, or help check the coding. Those are the labor-intensive parts of qualitative research, and reducing that burden, while maintaining quality, would be extremely valuable.”

Interviewee

Karen Bailey, PhD
Research Scientist, PPD™ Evidera™ Patient-Centered Research, Thermo Fisher Scientific

Karen Bailey, PhD, is a Research Scientist with the Patient-Centered Research team at Evidera, a Thermo Fisher Scientific business, based in London, UK. Dr Bailey brings 15 years of research experience, delivering qualitative and quantitative patient-centered studies within real-world settings and clinical trials across multiple therapeutic areas, including rare diseases, oncology, mental health, drug and alcohol use and social inequalities. She has conducted numerous systematic and targeted literature reviews, patient interviews and focus groups, and dispensed PRO measures focusing on burden of illness, health related quality of life, treatment outcome and experience, and service evaluation. Dr Bailey also has experience implementing and conducting exit and embedded clinical trial interview studies, using mixed method research design and longitudinal qualitative research approaches. Dr Bailey’s research has been disseminated at national and international seminars and conferences, through oral and poster conference presentations, and in peer-reviewed journal articles. She has also participated in thought leadership activities for webinars and conferences on topics related to qualitative research and patient engagement.

Before joining Evidera, Dr Bailey worked with Open Health, Patient Centered Outcomes (PCO), most recently as an Associate Director, bringing in new business and managing several qualitative and quantitative projects from inception to completion. Prior to completing her PhD, Dr Bailey was Deputy Director for the non-profit organization, Against Violence and Abuse (AVA), working with senior stakeholders in NHS Trusts, UK government bodies and non-profit organizations to develop and embed strategies for improving safeguarding responses with a focus on data monitoring and the use of patient reported outcomes to evaluate progress.

Dr Bailey earned her PhD in Addiction Sciences from the Institute of Psychiatry, Psychology and Neuroscience, King’s College London. She also has an MSc degree in Mental Health Services and Population Research from King’s College and an MA degree in Theory and Practice of Human Rights from the University of Essex. She received her Bachelor of Arts degree in Philosophy from the University of Durham.

Acknowledgments

Karen would like to thank her study co-authors Anne Skalicky, Carla Dias Barbosa, Sonya Stanczyk and Paul Cordero.

Disclaimers

Karen Bailey, Anne Skalicky, Carla Dias Barbosa, and Sonya Stanczyk are employees of PPD™ Evidera™ Patient-Centered Research, Thermo Fisher Scientific. Paul Cordero is employee of Sanofi and may hold stock and/or stock options in the company. The opinions expressed in this feature are those of the author and do not necessarily reflect the views of The Evidence Base® or Becaris Publishing Ltd.

Sponsorship for this Peek Behind the Poster was provided by Thermo Fisher Scientific.