How do you evaluate a healthcare call-center customer service workshop?

Use a pre/post design with a matched comparison group when possible, define the target quality metrics before training, connect attendance records to AQM outcomes, control for baseline performance and queue/workgroup context, and check whether improvement creates operational tradeoffs such as longer average call duration.

What is automated quality monitoring in a patient access center?

Automated quality monitoring uses speech analytics and call metadata to evaluate a larger share of contact-center interactions against defined quality standards. It is most useful when the rubric, eligible call types, exclusions, validation, and coaching workflow are explicit.

Why use difference-in-differences for training evaluation?

Difference-in-differences helps estimate the treatment effect of a training intervention by comparing the change among trained agents with the change among comparable untrained agents over the same pre/post period.

What did the Penn customer service workshop evaluation find?

The poster reported statistically significant gains in Courtesy Phrases and Positive Language after workshop attendance, with the largest Courtesy Phrase gains among low-baseline agents and no statistically significant average-call-duration penalty.

Penn AQM Customer Service Workshop Evaluation

Operating Context

Healthcare access centers are often measured on speed, volume, and service quality at the same time. That creates a hard management problem: leaders need agents to be efficient, but they also need patients to hear respectful, clear, human language during scheduling, messaging, prescription, and access-support interactions.

The Penn Medicine Patient Engagement Center had Verint Automated Quality Monitoring (AQM) results that could identify center-wide customer service opportunities. The practical question was whether those data could inform a training intervention and whether the intervention could be evaluated with enough rigor to say something more useful than “people attended a workshop.”

This project became the poster “Evaluating a Customer Service Workshop: A Difference-in-Differences Analysis of Contact Center Agent Performance,” presented at the Patient Access Collaborative Annual Symposium.

In Brief

The project evaluated whether a customer service workshop improved contact center agent performance using Verint AQM data, attendance records, queue metadata, and a difference-in-differences design. The poster reported statistically significant improvements in Courtesy Phrases and Positive Language, with the strongest gains among low-baseline agents and no statistically significant Average Call Duration penalty.

What We Built

The work turned AQM output into a training evaluation system. It connected quality measurement, workshop attendance, baseline agent performance, and operational metadata so the team could evaluate whether training changed measurable customer service behaviors.

The intervention and evaluation included:

leadership review of AQM results to select the key performance indicators with the greatest center-wide improvement opportunity
a training workgroup that translated selected AQM KPIs into workshop content
a one-hour live, instructor-led workshop using call clips, live evaluations, and group activities
a 30-minute self-paced eLearning module with exercises and an assessment where trainees scored call recordings against the AQM standard
a quasi-experimental difference-in-differences evaluation comparing workshop attendees with contemporaneous matched untrained agents
statistical analysis of primary outcomes and operational tradeoffs

The primary KPIs were Courtesy Phrases and Positive Language. Average Call Duration was used as a secondary balancing metric to test whether service-language gains came with a measurable call-time penalty.

Will and Kristi deserve real credit for the training and approach. Will’s education background shaped how the AQM data was translated into teachable behaviors, and Kristi’s training instincts made the workshop practical for agents rather than just analytically tidy.

Poster And Presentation

Full Penn Medicine poster: Evaluating a Customer Service Workshop

At the symposium, the work became a presentation and conversation with access leaders.

Cole Lyons and Penn Medicine colleagues presenting the customer service workshop poster

Methods

The evaluation used a quasi-experimental difference-in-differences design. Workshop attendees were compared with contemporaneous matched untrained agents across the same pre/post period.
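
As a minimal sketch of the arithmetic behind that comparison, assume each agent has one mean score for the pre period and one for the post period. The data layout, the function name `raw_did`, and the numbers are illustrative assumptions, not the poster's code or data.

```python
from statistics import mean


def raw_did(treated, comparison):
    """Raw difference-in-differences from per-agent (pre, post) score pairs.

    Hypothetical layout: each argument is a list of (pre_score, post_score)
    tuples, one tuple per agent.
    """
    treated_change = mean(post - pre for pre, post in treated)
    comparison_change = mean(post - pre for pre, post in comparison)
    # The comparison group's change stands in for what would have happened
    # to trained agents over the same period without the workshop.
    return treated_change - comparison_change


# Illustrative numbers only (not poster data).
trained = [(60.0, 66.0), (70.0, 74.0), (80.0, 82.0)]
untrained = [(62.0, 63.0), (71.0, 71.0), (79.0, 80.0)]
effect = raw_did(trained, untrained)
print(f"Raw DiD estimate: {effect:+.2f} points")
```

Subtracting the comparison group's change is what separates a workshop effect from center-wide drift, such as a seasonal shift in call mix that moves every agent's scores.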

The poster describes the method in five parts:

Design: difference-in-differences comparing workshop attendees with matched untrained agents over the same pre/post period.
Cohort and matching: agents with sufficient Verint AQM evaluation volume were included, and comparison agents were matched on baseline performance, queue/workgroup context, and evaluation volume.
Outcomes: primary KPIs were Courtesy Phrases and Positive Language; Average Call Duration was a secondary balancing metric for efficiency tradeoff.
Analysis: raw difference-in-differences estimated treatment effects, corroborated by Mann-Whitney U testing and OLS regression adjusted for baseline score and evaluation volume.
Data: Verint Automated Quality Monitoring evaluations, workshop attendance records, and queue metadata were aggregated to the agent-week level to stabilize estimates.
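
The agent-week aggregation in the last item can be sketched as follows. This is a minimal illustration that assumes evaluation-level records arrive as `(agent_id, call_date, score)` tuples and that ISO calendar weeks are an acceptable week definition. Neither detail comes from the poster, and `aggregate_agent_week` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def aggregate_agent_week(evaluations):
    """Collapse evaluation-level scores to one mean score per agent per ISO week.

    `evaluations` is an iterable of (agent_id, call_date, score) tuples
    (an assumed layout). Returns
    {(agent_id, iso_year, iso_week): (mean_score, n_evaluations)}.
    """
    buckets = defaultdict(list)
    for agent_id, call_date, score in evaluations:
        iso_year, iso_week, _ = call_date.isocalendar()
        buckets[(agent_id, iso_year, iso_week)].append(score)
    return {key: (mean(scores), len(scores)) for key, scores in buckets.items()}


# Illustrative records only (dates and scores are made up).
rows = [
    ("agent_a", date(2024, 3, 4), 80.0),
    ("agent_a", date(2024, 3, 6), 90.0),
    ("agent_a", date(2024, 3, 11), 70.0),
    ("agent_b", date(2024, 3, 5), 60.0),
]
weekly = aggregate_agent_week(rows)
for key, (score, n) in sorted(weekly.items()):
    print(key, f"mean={score:.1f}", f"n={n}")
```

Keeping the evaluation count next to each weekly mean makes it possible to enforce a minimum-volume rule or to weight weeks by how many calls were scored.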

The analytical challenge was balancing the time available to complete the project with enough rigor to make the results useful. The design had to fit real operational bandwidth: available AQM volume, attendance records, queue context, comparison agents, and the team’s capacity to validate the work.

Results And Evidence

The poster reported statistically significant and practically meaningful gains in the two primary AQM standards.

Outcome	Estimated Effect	95% Confidence Interval	P Value	Interpretation
Courtesy Phrases	+3.65 points per 100	+2.97 to +4.33	p < 0.001	Larger, clearly significant lift after workshop attendance.
Positive Language	+1.35 points per 100	+0.88 to +1.81	p < 0.001	Smaller but detectable improvement.
Average Call Duration	No statistically significant penalty	-0.25 to +5.79	p = 0.072	Service-language gains were not paired with a statistically significant ACD increase.
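
The reported intervals can be sanity-checked with a little arithmetic. The sketch below assumes each interval is symmetric and based on a normal approximation, which the summary above does not state. `implied_stats` is an illustrative helper, and the implied midpoint for Average Call Duration is a value derived under that assumption, not an estimate reported by the poster.

```python
from math import erfc, sqrt

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def implied_stats(ci_low, ci_high):
    """Back out midpoint, standard error, z, and two-sided p from a 95% CI.

    Assumption: the interval is symmetric and normal-approximation based.
    """
    midpoint = (ci_low + ci_high) / 2
    std_err = (ci_high - ci_low) / (2 * Z_95)
    z = midpoint / std_err
    p_two_sided = erfc(abs(z) / sqrt(2))
    return midpoint, std_err, z, p_two_sided


# Interval bounds copied from the results table above.
for label, low, high in [
    ("Courtesy Phrases", 2.97, 4.33),
    ("Positive Language", 0.88, 1.81),
    ("Average Call Duration", -0.25, 5.79),
]:
    mid, se, z, p = implied_stats(low, high)
    print(f"{label}: midpoint={mid:.2f}, SE={se:.2f}, z={z:.2f}, p={p:.3g}")
```

Under that assumption, the midpoints of the first two intervals reproduce the reported effects, and the Average Call Duration interval implies a p value close to the reported 0.072, so the table is internally consistent.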

The opportunity-gradient analysis showed that agents with the lowest pre-training Courtesy Phrase scores improved most:

Pre-Training Baseline Group	Courtesy Phrase Improvement
Low baseline	+6.47
Mid baseline	+2.98
High baseline	+1.42
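
One way to produce a gradient like this is to rank agents by pre-training score and compare the mean change in each third. The sketch below is illustrative: the poster summary does not say how the baseline groups were cut, so the equal-thirds rule, the function name, and the sample numbers are all assumptions.

```python
from statistics import mean


def improvement_by_baseline_tercile(agents):
    """Split agents into low/mid/high thirds by pre score; mean change per third.

    `agents` is a list of (pre_score, post_score) tuples (assumed layout).
    Equal-sized thirds by rank are an assumption, not the poster's rule.
    """
    ranked = sorted(agents, key=lambda pair: pair[0])
    n = len(ranked)
    groups = {
        "low": ranked[: n // 3],
        "mid": ranked[n // 3 : 2 * n // 3],
        "high": ranked[2 * n // 3 :],
    }
    return {
        name: mean(post - pre for pre, post in group)
        for name, group in groups.items()
        if group
    }


# Illustrative numbers only (not poster data).
sample = [(50.0, 57.0), (55.0, 61.0), (70.0, 73.0),
          (72.0, 75.0), (90.0, 91.0), (92.0, 94.0)]
gradient = improvement_by_baseline_tercile(sample)
print(gradient)
```

A raw pre/post change by baseline group is vulnerable to regression to the mean, because low scorers tend to drift upward even without training. A fuller version would subtract the change among comparison agents in the same baseline group.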

The implementation scale also matters:

576 agents
29 departments
96 workshops
6 weeks
100% completion

The poster’s main conclusion was that training produced a statistically and practically significant lift in Courtesy Phrases, with a smaller but detectable effect on Positive Language. Low-baseline agents improved most, suggesting that targeted coaching may generate the largest returns where baseline performance gaps are most visible.

Standards, Governance, And Validation

The quality of this work depended on discipline about denominators (which calls and agents are eligible to be scored) and about comparisons (who trained agents are measured against). Automated quality metrics become misleading when they are treated as universal truth rather than as structured measurements with eligibility rules, model behavior, and operational context.

The validation model was:

use the existing AQM standards, with no post hoc outcome invention
compare trained agents against both their own pre-period performance and contemporaneous untrained agents over the same period
match comparison agents on baseline performance, queue/workgroup context, and evaluation volume
use Average Call Duration as a balancing metric so improvement in language did not hide an efficiency penalty
aggregate to the agent-week level to stabilize estimates
use multiple statistical checks so the interpretation does not depend on one simple comparison
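
As one example of a second statistical check, a Mann-Whitney U test can compare per-agent change scores between trained and comparison agents without assuming normally distributed scores. The sketch below is a hand-rolled, standard-library version so it runs without dependencies. Applying the test to change scores is an assumption about the setup, and in practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would be the more usual choice.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Ties receive average ranks and the variance uses the standard tie
    correction. No continuity correction is applied, so treat this as a
    sketch for moderately large samples rather than an exact test.
    Returns (U statistic for x, two-sided p value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / sqrt(var_u)
    return u_x, erfc(abs(z) / sqrt(2))


# Illustrative per-agent change scores (post minus pre); not poster data.
trained_change = [6.0, 4.0, 2.0, 5.0, 7.0]
comparison_change = [1.0, 0.0, 1.0, 3.0, 2.0]
u_stat, p_value = mann_whitney_u(trained_change, comparison_change)
print(f"U={u_stat}, p={p_value:.4f}")
```

If a rank-based test, a regression, and the raw difference-in-differences all point the same way, the conclusion does not hinge on one set of distributional assumptions.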

I apply the same operating logic to healthcare AI more broadly. Automated systems are useful when they make more of the work observable and when the measurement design is honest about what is being scored, who is included, and what tradeoffs are being watched.

Implementation Playbook

The reusable workflow for evaluating a call-center customer service intervention is:

Start with an existing quality standard. Do not invent the metric after the training is built.
Use AQM or speech analytics to identify the largest center-wide opportunity.
Build the training around observable behaviors, not vague service ideals.
Preserve a comparison group if operationally possible.
Match comparison agents on baseline performance and work context.
Link training attendance to quality outcomes at the agent level.
Aggregate results over a stable time unit, such as agent-week.
Test the primary service outcomes and at least one operational balancing metric.
Segment by baseline performance to see where training has the largest marginal effect.
Convert the results into coaching strategy.
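
The matching step in this workflow can be sketched as a greedy nearest-neighbor match within the same queue. This is one simple approach under stated assumptions, not the algorithm used for the poster: the dictionary keys, the `max_gap` caliper, and the rule that baseline gap is compared before volume gap are all hypothetical.

```python
def match_comparison_agents(trained, pool, max_gap=5.0):
    """Greedy 1:1 nearest-neighbor match within the same queue.

    Agents are dicts with 'id', 'queue', 'baseline', and 'volume' keys
    (an assumed layout). `max_gap` is a hypothetical caliper on the
    baseline-score difference; trained agents with no candidate inside
    it stay unmatched.
    """
    available = list(pool)
    matches = {}
    for agent in trained:
        candidates = [c for c in available if c["queue"] == agent["queue"]]
        if not candidates:
            continue
        # Closest baseline first; evaluation volume breaks ties.
        best = min(
            candidates,
            key=lambda c: (abs(c["baseline"] - agent["baseline"]),
                           abs(c["volume"] - agent["volume"])),
        )
        if abs(best["baseline"] - agent["baseline"]) <= max_gap:
            matches[agent["id"]] = best["id"]
            available.remove(best)  # match without replacement
    return matches


# Illustrative agents only.
trained_agents = [
    {"id": "t1", "queue": "scheduling", "baseline": 62.0, "volume": 40},
    {"id": "t2", "queue": "scheduling", "baseline": 80.0, "volume": 35},
    {"id": "t3", "queue": "rx", "baseline": 70.0, "volume": 20},
]
untrained_pool = [
    {"id": "c1", "queue": "scheduling", "baseline": 62.0, "volume": 38},
    {"id": "c2", "queue": "scheduling", "baseline": 81.0, "volume": 50},
    {"id": "c3", "queue": "rx", "baseline": 90.0, "volume": 22},
    {"id": "c4", "queue": "scheduling", "baseline": 62.0, "volume": 10},
]
pairs = match_comparison_agents(trained_agents, untrained_pool)
print(pairs)
```

Greedy matching depends on the order in which trained agents are processed, so reporting how many agents went unmatched is part of an honest description of the comparison cohort.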

Reusable Checklist

For an access-center training evaluation, I would use this checklist:

Define the AQM KPI before training starts.
Identify the center-wide opportunity with leadership.
Create a training workgroup with operational and training leaders.
Translate the metric into teachable behaviors and call examples.
Build both live and asynchronous learning when scale requires it.
Track workshop attendance precisely.
Build a comparison cohort before the post-period is analyzed.
Predefine the balancing metric, such as Average Call Duration.
Analyze total effect and effect by baseline performance.
Use the findings to target coaching, refine the rubric, and decide whether the intervention should continue.

Any reusable worksheet for this kind of evaluation should include tabs for KPI selection, cohort matching, attendance tracking, outcome definitions, balancing metrics, and interpretation notes.
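
For the attendance-tracking piece, the core operation is labeling each scored call as pre or post relative to the agent's workshop date. The sketch below is a minimal illustration: the record layout, the function name `label_period`, and the choice to drop calls scored on the workshop day are assumptions.

```python
from datetime import date


def label_period(evaluations, reference_dates):
    """Tag each evaluation 'pre' or 'post' relative to a reference date.

    `evaluations`: list of (agent_id, call_date, score) tuples.
    `reference_dates`: {agent_id: date}. For trained agents this would be
    the workshop date; for comparison agents it would need to be assigned,
    for example from the matched trained agent (an assumption).
    Calls on the reference date itself are dropped because they cannot be
    placed cleanly, and agents with no reference date are skipped.
    """
    labeled = []
    for agent_id, call_date, score in evaluations:
        reference = reference_dates.get(agent_id)
        if reference is None or call_date == reference:
            continue
        period = "pre" if call_date < reference else "post"
        labeled.append((agent_id, period, score))
    return labeled


# Illustrative records only (dates and scores are made up).
evals = [
    ("a", date(2024, 3, 1), 70.0),
    ("a", date(2024, 3, 10), 75.0),
    ("a", date(2024, 3, 5), 72.0),
    ("b", date(2024, 3, 2), 60.0),
]
labeled_rows = label_period(evals, {"a": date(2024, 3, 5)})
print(labeled_rows)
```

Imprecise attendance dates blur the pre/post boundary and bias the estimate toward zero, which is why the checklist asks for attendance to be tracked precisely.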

My Operating View

This project is the kind of healthcare analytics I care about: operational, measurable, and close enough to the work that leaders can act on it. The analysis connected directly to a training decision, a real access-center workforce, and a public standard of evidence.

My view is that access centers should adopt speech analytics carefully. The technology can make service quality visible at a scale manual review cannot match, but that visibility only matters when the measurement feeds training, feedback, coaching, and a fair evaluation design.

The most important lesson is the opportunity gradient. Low-baseline agents improved most. That is exactly what a good quality program should want to know. If the effect is concentrated where the gap is largest, the next operational move is targeted coaching, rubric refinement, and support for the agents and departments with the clearest upside.

The lesson is also cultural. Many organizations say they want rigorous training evaluation until the analysis might show that the training did not work. Kristi’s response when I asked about that risk has stayed with me: “What do you mean worried? If the training doesn’t work, wouldn’t we want to know and change it?” That stance is rare and important. It speaks to the training team, and it also speaks to the culture I have seen at Penn Medicine from the top down: serious people are willing to learn from the measurement, even when the answer might force a change.

That operating habit is older than the healthcare setting. The Walmart oil bay time study is the early-career version: the average hid the real process. The JHP knowledge pipeline is the AI-governance version: the fluent answer mattered less than whether the answer could be traced to the right source.

Conference And Publication Context

The work was presented at the Patient Access Collaborative Annual Symposium during the poster programming. I also served as a table leader for Session 5: Access Center Deep Dive, which focused on access-center operations.

Penn Medicine team at the Patient Access Collaborative symposium

The related publication work is still in development, so implementation details stay high-level and close to the public poster, approved methods, and documented results.

References

NIST supports the need for validated AI-supported operational measurement. Microsoft Research supports human-AI interaction principles around correction, transparency, and human judgment. AHRQ supports the broader importance of patient experience and communication measurement. Verint provides vendor-category context for customer engagement and automated quality monitoring.

Until the publication is complete, implementation mechanics stay high-level and close to the poster-backed results, Verint AQM data, the difference-in-differences design, and the Patient Access Collaborative presentation context.
