University of Pennsylvania Health System, Patient Engagement Center
Penn AQM Customer Service Workshop Evaluation
I was the primary analyst for a Penn Medicine Patient Engagement Center evaluation of a customer service workshop using Verint Automated Quality Monitoring data, attendance records, queue metadata, and a difference-in-differences design. The work was presented at the Patient Access Collaborative Annual Symposium and is being developed for external publication.
- Verint Automated Quality Monitoring
- Speech analytics
- Workshop attendance records
- Queue metadata
- Difference-in-differences
- healthcare operations
- patient access
Project note
In Brief
I was the primary analyst for a Penn Medicine Patient Engagement Center evaluation of a customer service workshop using Verint Automated Quality Monitoring data, attendance records, queue metadata, and a difference-in-differences design. The work was presented at the Patient Access Collaborative Annual Symposium and is being developed for external publication.
Relevant To
- access-center leaders
- quality management teams
- patient access operations teams
- healthcare analytics leaders
- workforce training and coaching teams
Search Context
- how to evaluate a call center customer service workshop
- difference-in-differences contact center agent performance
- automated quality monitoring healthcare call center
- Verint AQM patient access quality analytics
- healthcare access center speech analytics training evaluation
4 cited sources
Operating Context
Healthcare access centers are often measured on speed, volume, and service quality at the same time. That creates a hard management problem: leaders need agents to be efficient, but they also need patients to hear respectful, clear, human language during scheduling, messaging, prescription, and access-support interactions.
The Penn Medicine Patient Engagement Center had Verint Automated Quality Monitoring (AQM) results that could identify center-wide customer service opportunities. The practical question was whether those data could inform a training intervention and whether the intervention could be evaluated with enough rigor to say something more useful than “people attended a workshop.”
This project became Evaluating a Customer Service Workshop: A Difference-in-Differences Analysis of Contact Center Agent Performance, presented at the Patient Access Collaborative Annual Symposium.
In Brief
The project evaluated whether a customer service workshop improved contact center agent performance using Verint AQM data, attendance records, queue metadata, and a difference-in-differences design. The poster reported statistically significant improvements in Courtesy Phrases and Positive Language, with the strongest gains among low-baseline agents and no statistically significant Average Call Duration penalty.
What We Built
The work turned AQM output into a training evaluation system. It connected quality measurement, workshop attendance, baseline agent performance, and operational metadata so the team could evaluate whether training changed measurable customer service behaviors.
The intervention and evaluation included:
- leadership review of AQM results to select the key performance indicators with the greatest center-wide improvement opportunity
- a training workgroup that translated selected AQM KPIs into workshop content
- a one-hour live, instructor-led workshop using call clips, live evaluations, and group activities
- a 30-minute self-paced eLearning module with exercises and an assessment where trainees scored call recordings against the AQM standard
- a quasi-experimental difference-in-differences evaluation comparing workshop attendees with contemporaneous matched untrained agents
- statistical analysis of primary outcomes and operational tradeoffs
The primary KPIs were Courtesy Phrases and Positive Language. Average Call Duration was used as a secondary balancing metric to test whether service-language gains came with a measurable call-time penalty.
Will and Kristi deserve real credit for the training and approach. Will’s education background shaped how the AQM data was translated into teachable behaviors, and Kristi’s training instincts made the workshop practical for agents rather than just analytically tidy.
Poster And Presentation

At the symposium, the work became a presentation and conversation with access leaders.

Methods
The evaluation used a quasi-experimental difference-in-differences design. Workshop attendees were compared with contemporaneous matched untrained agents across the same pre/post period.
The poster describes the method in five parts:
- Design: difference-in-differences comparing workshop attendees with matched untrained agents over the same pre/post period.
- Cohort and matching: agents were included with sufficient Verint AQM volume, and comparison agents were matched using baseline performance, queue/workgroup context, and evaluation volume.
- Outcomes: primary KPIs were Courtesy Phrases and Positive Language; Average Call Duration was a secondary balancing metric for efficiency tradeoff.
- Analysis: raw difference-in-differences estimated treatment effects, corroborated by Mann-Whitney U testing and OLS regression adjusted for baseline score and evaluation volume.
- Data: Verint Automated Quality Monitoring evaluations, workshop attendance records, and queue metadata were aggregated to the agent-week level to stabilize estimates.
The analytical challenge was balancing the time available to complete the project with enough rigor to make the results useful. The design had to fit real operational bandwidth: available AQM volume, attendance records, queue context, comparison agents, and the team’s capacity to validate the work.
Results And Evidence
The poster reported statistically significant and practically meaningful gains in the two primary AQM standards.
| Outcome | Estimated Effect | 95% Confidence Interval | P Value | Interpretation |
|---|---|---|---|---|
| Courtesy Phrases | +3.65 points per 100 | +2.97 to +4.33 | p < 0.001 | Larger, clearly significant lift after workshop attendance. |
| Positive Language | +1.35 points per 100 | +0.88 to +1.81 | p < 0.001 | Smaller but detectable improvement. |
| Average Call Duration | No statistically significant penalty | -0.25 to +5.79 | p = 0.072 | Service-language gains were not paired with a statistically significant ACD increase. |
The opportunity-gradient analysis showed that agents with the lowest pre-training Courtesy Phrase scores improved most:
| Pre-Training Baseline Group | Courtesy Phrase Improvement |
|---|---|
| Low baseline | +6.47 |
| Mid baseline | +2.98 |
| High baseline | +1.42 |
The implementation scale also matters:
- 576 agents
- 29 departments
- 96 workshops
- 6 weeks
- 100% completion
The poster’s main conclusion was that training produced a statistically and practically significant lift in Courtesy Phrases, with a smaller but detectable effect on Positive Language. Low-baseline agents improved most, suggesting that targeted coaching may generate the largest returns where baseline performance gaps are most visible.
Standards, Governance, And Validation
The quality of this work depended on denominator and comparison discipline. Automated quality metrics become misleading when they are treated as universal truth rather than as structured measurements with eligibility rules, model behavior, and operational context.
The validation model was:
- use the existing AQM standards, with no post hoc outcome invention
- compare trained agents against contemporaneous untrained agents and their own pre/post trend
- match comparison agents on baseline performance, queue/workgroup context, and evaluation volume
- use Average Call Duration as a balancing metric so improvement in language did not hide an efficiency penalty
- aggregate to the agent-week level to stabilize estimates
- use multiple statistical checks so the interpretation does not depend on one simple comparison
I apply the same operating logic to healthcare AI more broadly. Automated systems are useful when they make more of the work observable and when the measurement design is honest about what is being scored, who is included, and what tradeoffs are being watched.
Implementation Playbook
The reusable workflow for evaluating a call-center customer service intervention is:
- Start with an existing quality standard. Do not invent the metric after the training is built.
- Use AQM or speech analytics to identify the largest center-wide opportunity.
- Build the training around observable behaviors, not vague service ideals.
- Preserve a comparison group if operationally possible.
- Match comparison agents on baseline performance and work context.
- Link training attendance to quality outcomes at the agent level.
- Aggregate results over a stable time unit, such as agent-week.
- Test the primary service outcomes and at least one operational balancing metric.
- Segment by baseline performance to see where training has the largest marginal effect.
- Convert the results into coaching strategy.
Reusable Checklist
For an access-center training evaluation, I would use this checklist:
- Define the AQM KPI before training starts.
- Identify the center-wide opportunity with leadership.
- Create a training workgroup with operational and training leaders.
- Translate the metric into teachable behaviors and call examples.
- Build both live and asynchronous learning when scale requires it.
- Track workshop attendance precisely.
- Build a comparison cohort before the post-period is analyzed.
- Predefine the balancing metric, such as Average Call Duration.
- Analyze total effect and effect by baseline performance.
- Use the findings to target coaching, refine the rubric, and decide whether the intervention should continue.
Any reusable worksheet for this kind of evaluation should include tabs for KPI selection, cohort matching, attendance tracking, outcome definitions, balancing metrics, and interpretation notes.
My Operating View
This project is the kind of healthcare analytics I care about: operational, measurable, and close enough to the work that leaders can act on it. The analysis connected directly to a training decision, a real access-center workforce, and a public standard of evidence.
My opinion is that access centers should be careful with speech analytics. The technology can make service quality visible at a scale manual review cannot match. That visibility only matters when measurement becomes training, feedback, coaching, and a fair evaluation design.
The most important lesson is the opportunity gradient. Low-baseline agents improved most. That is exactly what a good quality program should want to know. If the effect is concentrated where the gap is largest, the next operational move is targeted coaching, rubric refinement, and support for the agents and departments with the clearest upside.
The lesson is also cultural. Many organizations say they want rigorous training evaluation until the analysis might show that the training did not work. Kristi’s response when I asked about that risk has stayed with me: “What do you mean worried? If the training doesn’t work, wouldn’t we want to know and change it?” That stance is rare and important. It speaks to the training team, and it also speaks to the culture I have seen at Penn Medicine from the top down: serious people are willing to learn from the measurement, even when the answer might force a change.
That operating habit is older than the healthcare setting. The Walmart oil bay time study is the early-career version: the average hid the real process. The JHP knowledge pipeline is the AI-governance version: the fluent answer mattered less than whether the answer could be traced to the right source.
Conference And Publication Context
The work was presented at the Patient Access Collaborative Annual Symposium during the poster programming. I also served as a table leader for Session 5: Access Center Deep Dive, which focused on access-center operations.

The related publication work is still in development, so implementation details stay high-level and close to the public poster, approved methods, and documented results.
References
NIST supports the need for validated AI-supported operational measurement. Microsoft Research supports human-AI interaction principles around correction, transparency, and human judgment. AHRQ supports the broader importance of patient experience and communication measurement. Verint provides vendor-category context for customer engagement and automated quality monitoring.
Until the publication is complete, implementation mechanics stay high-level and close to the poster-backed results, Verint AQM data, the difference-in-differences design, and the Patient Access Collaborative presentation context.
Frequently Asked Questions
- How do you evaluate a healthcare call-center customer service workshop?
- Use a pre/post design with a matched comparison group when possible, define the target quality metrics before training, connect attendance records to AQM outcomes, control for baseline performance and queue/workgroup context, and check whether improvement creates operational tradeoffs such as longer average call duration.
- What is automated quality monitoring in a patient access center?
- Automated quality monitoring uses speech analytics and call metadata to evaluate a larger share of contact-center interactions against defined quality standards. It is most useful when the rubric, eligible call types, exclusions, validation, and coaching workflow are explicit.
- Why use difference-in-differences for training evaluation?
- Difference-in-differences helps estimate the treatment effect of a training intervention by comparing the change among trained agents with the change among comparable untrained agents over the same pre/post period.
- What did the Penn customer service workshop evaluation find?
- The poster reported statistically significant gains in Courtesy Phrases and Positive Language after workshop attendance, with the largest Courtesy Phrase gains among low-baseline agents and no statistically significant average-call-duration penalty.
Cited Sources
- AI Risk Management Framework National Institute of Standards and Technology
Governance reference for validity, reliability, transparency, human accountability, and monitoring in AI-supported operational measurement.
- Human-AI Interaction Guidelines Microsoft Research
Design reference for keeping AI-supported systems understandable, correctable, and aligned with human judgment.
- About the CAHPS Program and Surveys Agency for Healthcare Research and Quality
Patient experience measurement context for communication, access, and service-quality work.
- Verint Customer Engagement Verint
Vendor context for the broader customer engagement and workforce analytics platform category. Penn's poster specifically references Verint Automated Quality Monitoring evaluations.