
Measuring Training Effectiveness in Immersive Environments

How the XRisis project combined System Usability Scale methodology, multi-dimensional added value assessment, and qualitative debrief analysis to rigorously evaluate immersive training outcomes beyond technical feasibility demonstrations.

Published by Anastasiia P.
Funded by the European Union

This project has received funding from the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Grant agreement number: 101070192

System Usability Scale: Rigorous Usability Assessment

The XRisis validation employed the System Usability Scale, a validated psychometric instrument comprising ten statements assessed on five-point Likert scales, providing standardised usability measurement comparable across different systems and studies. Participants responded to statements including "I thought the system was easy to use", "I found the system unnecessarily complex", "I would imagine that most people would learn to use this system very quickly", and "I needed to learn a lot of things before I could get going with this system", with responses aggregated through the standard SUS scoring formula, which produces scores ranging from 0 to 100. The methodology provides several advantages over ad-hoc usability questioning: standardised wording enables comparison with benchmark data from thousands of previous studies across diverse application domains, the ten-item structure captures multiple dimensions of usability (learnability, efficiency, memorability, error frequency, satisfaction) rather than collapsing assessment into a single overall impression, and the validated scoring algorithm reduces response bias by alternating positively and negatively framed statements.

XRisis achieved a System Usability Scale score of 59%, falling below the approximately 68% threshold generally considered acceptable for consumer applications but within the range typical for complex professional tools during initial deployment, particularly those requiring specialised domain knowledge and serving specific operational contexts rather than general public audiences. The score distribution revealed important nuances: six of ten respondents (including both participants and facilitators) scored 51% or higher, indicating that the majority found the system tolerably usable, whilst two users scored 71% or higher, demonstrating that certain user profiles achieved genuinely positive experiences and suggesting that refined versions could reach broader acceptance. Two users scored below 51%, representing genuinely poor usability experiences that indicated specific friction points requiring resolution rather than acceptable variance in user preferences, driving targeted investigation of what factors differentiated negative from positive user experiences.

The results validated the team's assessment that interface complexity represented the primary usability barrier: open-ended feedback consistently mentioned navigation difficulties, unclear task expectations, cognitive overload from simultaneous environmental exploration and scenario task execution, and insufficient onboarding creating initial frustration that undermined subsequent engagement. The System Usability Scale methodology provided objective measurement that transcended individual opinions: rather than debating whether usability was adequate based on subjective impressions, the team had quantitative evidence grounding discussions about whether identified issues required immediate resolution or could be addressed in future iterations, how severe usability gaps were relative to functionality priorities, and whether the platform had achieved sufficient quality for commercial deployment or needed further refinement.
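For readers unfamiliar with how the ten questionnaire items become a single figure, the sketch below shows the standard SUS calculation: odd-numbered (positively worded) items contribute their response minus one, even-numbered (negatively worded) items contribute five minus their response, and the sum is multiplied by 2.5 to land on the 0-100 scale. The example responses are purely illustrative and are not actual XRisis data.

```python
def sus_score(responses):
    """Compute a System Usability Scale score (0-100) from ten Likert
    responses, each an integer from 1 (strongly disagree) to 5
    (strongly agree), given in questionnaire order."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses between 1 and 5")

    total = 0
    for i, r in enumerate(responses, start=1):
        if i % 2 == 1:          # odd items are positively worded
            total += r - 1      # e.g. "I thought the system was easy to use"
        else:                   # even items are negatively worded
            total += 5 - r      # e.g. "I found the system unnecessarily complex"
    return total * 2.5          # scale the 0-40 sum onto 0-100

# Illustrative response set (not real data) from a respondent landing
# near the reported 59 mark.
print(sus_score([4, 3, 4, 3, 3, 3, 4, 3, 3, 3]))  # -> 57.5
```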
The approach demonstrated the value of established measurement instruments over custom evaluation frameworks: whilst domain-specific assessment tools might capture nuances particular to humanitarian training, standardised instruments enable comparisons with other XR platforms, training technologies, and interactive systems, positioning results within a broader context of what constitutes acceptable versus exceptional usability in interactive application design. Future validation efforts will continue employing the System Usability Scale whilst supplementing it with domain-specific instruments assessing humanitarian training effectiveness, creating a comprehensive evaluation that combines generic usability measurement with specialised pedagogical outcome assessment.

Multi-Dimensional Added Value Assessment

The validation framework disaggregated overall platform assessment into five specific components, enabling granular understanding of which capabilities delivered value versus which required reconsideration or elimination. Participants rated added value to emergency management competency development separately for the informational briefing from the AI avatar (3.2 out of 5), the interactive response strategy tool (3.4), team collaboration in the VR coordination office (3.6), soft skills practice with AI avatars (4.2), and the facilitator debrief in the VR environment (3.3), producing an overall average of 3.5 out of 5, equivalent to 70% added value. This differentiation proved strategically essential: rather than concluding that "the platform works moderately well" based on aggregate scores, the team gained precise insight that the implementation simulation (Pilot 3) substantially outperformed other components whilst the theoretical briefing (Pilot 1) underdelivered relative to investment, fundamentally informing resource allocation decisions in favour of capabilities with demonstrated high value.

The assessment methodology explicitly framed questions around added value to emergency management competencies rather than general satisfaction or enjoyment, maintaining focus on learning outcomes rather than entertainment value or technological novelty and ensuring evaluations reflected pedagogical effectiveness rather than mere enthusiasm for innovative technology. Participants brought relevant comparison context: six of eight had participated in three or more previous simulation exercises using conventional modalities, enabling informed judgement about whether XR capabilities provided genuine incremental benefits versus merely a different delivery of equivalent learning experiences. The five-point scale provided sufficient granularity to distinguish between "somewhat valuable" (3 out of 5) and "substantially valuable" (4-5 out of 5) whilst avoiding excessive precision that would suggest unrealistic measurement accuracy given the subjective judgement inherent to assessing training value. Open-ended elaboration questions supplemented the numerical ratings, inviting participants to explain what specifically contributed to value perceptions or what factors limited effectiveness, generating actionable insights about interface friction, scenario realism, AI behaviour quality, and workflow alignment that numerical scores alone could not provide.

The multi-dimensional approach recognised that platforms rarely succeed or fail uniformly: most complex systems contain components delivering excellent value alongside components with poor value propositions, requiring granular assessment that enables selective refinement rather than binary accept-or-reject decisions treating platforms as monolithic entities. Results enabled evidence-based priority decisions: investing substantially in refining Pilot 3 capabilities (the highest-rated component) whilst potentially eliminating or radically simplifying Pilot 1 (the lowest-rated component), reallocating resources toward capabilities with proven value rather than distributing effort equally across all features regardless of validated effectiveness.
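To show how the component ratings above roll up into the headline figure, here is a minimal sketch. The component names are paraphrased, and the aggregation (an unweighted mean rounded to one decimal before converting to a percentage) is an assumption about the reporting convention; it happens to reproduce the reported 3.5 out of 5 and 70%.

```python
# Component ratings reported in the validation (out of 5). Names are
# paraphrased and the aggregation below is an illustrative assumption,
# not the project's actual analysis script.
component_ratings = {
    "Informational briefing from AI avatar": 3.2,
    "Interactive response strategy tool": 3.4,
    "Team collaboration in VR coordination office": 3.6,
    "Soft skills practice with AI avatars": 4.2,
    "Facilitator debrief in VR environment": 3.3,
}

mean_rating = sum(component_ratings.values()) / len(component_ratings)  # ~3.54
reported = round(mean_rating, 1)                                        # 3.5
added_value_pct = reported / 5 * 100                                    # 70.0

print(f"Overall added value: {reported}/5 ({added_value_pct:.0f}%)")

# Rank components to surface refinement priorities, highest value first.
for name, rating in sorted(component_ratings.items(),
                           key=lambda item: item[1], reverse=True):
    print(f"  {rating:.1f}  {name}")
```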
The assessment methodology balanced participant perspectives with facilitator insights: whilst participants evaluated from the learner experience standpoint, facilitators assessed from the delivery effectiveness perspective, capturing both sides of the training value equation and identifying cases where capabilities served one constituency well whilst disappointing the other. The approach demonstrated that rigorous training effectiveness assessment requires moving beyond binary success-failure judgements or simplistic aggregate satisfaction scores toward a nuanced understanding of which specific capabilities deliver which specific values for which specific user groups in which specific contexts, creating the detailed evidence base required for informed platform evolution decisions.

Qualitative Debrief Analysis and Thematic Synthesis

Structured verbal debriefs with participants, and separate sessions with facilitators and project team members, generated qualitative insights complementing the quantitative metrics and revealing nuances impossible to capture through questionnaires alone. Facilitators created psychologically safe debrief environments where participants felt comfortable expressing critical feedback without concern about offending project team members or appearing insufficiently enthusiastic about innovation, explicitly framing evaluation as improvement-focused rather than judgement-focused and emphasising that identifying problems provided more value than polite praise. The debrief methodology employed open-ended questioning, such as "What aspects of the VR simulation added most value to your learning?" and "What frustrated you or created unnecessary difficulty?", rather than yes-no questions or rating scales, inviting narrative responses that participants could elaborate with examples, comparisons to prior experiences, and suggestions for improvement.

Facilitators documented responses verbatim where possible, preserving participant language rather than immediately synthesising it into facilitator interpretations, enabling subsequent analysis to identify themes and patterns from the actual words used rather than filtered summaries that might unconsciously incorporate facilitator biases. Separate debrief sessions for participants and facilitators enabled candid discussion within each group about observations they might hesitate to share in mixed company, with participants expressing frustrations about technical issues without worrying about defending facilitator decisions and facilitators analysing pedagogical effectiveness without participants feeling evaluated on their performance. Thematic analysis identified recurring patterns across responses: multiple participants independently raised similar concerns about interface complexity, AI speech recognition failures, and unnecessarily elaborate environments, providing confidence that these represented genuine issues rather than individual preferences, whilst single-instance feedback flagged potential concerns worth monitoring even if not yet widespread.

The analysis revealed important dissociations between what participants enjoyed and what they found pedagogically valuable: some participants enthusiastically described elaborate virtual environments as impressive and engaging yet acknowledged they contributed little to actual learning, whilst others expressed initial frustration with AI dialogue systems that ultimately proved most valuable for developing negotiation skills through realistic, unpredictable interactions. Debrief conversations captured contextual details that quantitative metrics obscure: for example, understanding that IT literacy concerns reflected worries about future deployment to diverse staff populations rather than personal difficulties enabled appropriately targeted responses about accessible design and comprehensive induction, rather than misinterpreting the feedback as capability criticism. The qualitative methodology also enabled hypothesis generation about why certain quantitative patterns emerged: the surprisingly high variance in user satisfaction (half rating it the maximum 5 out of 5, others providing moderate scores) likely reflected different prior expectations, with participants familiar with cutting-edge XR experiences applying higher comparison standards than those for whom XRisis represented their first substantive exposure to immersive technology.
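As a concrete illustration of the tallying step in such a thematic analysis, the sketch below counts how many separate debrief excerpts an analyst has coded against each theme, since recurrence across independent debriefs is what separates a genuine issue from an individual preference. The excerpts and theme labels are invented for illustration and are not actual participant data.

```python
from collections import Counter

# Hypothetical, hand-coded debrief excerpts: each entry pairs an invented,
# paraphrased remark with the themes an analyst assigned to it. Theme labels
# mirror issues discussed in the article; the quotes are not real data.
coded_debriefs = [
    ("Took me a while to work out where to click in the menu",
     {"interface complexity"}),
    ("The avatar kept mishearing my answers",
     {"AI speech recognition"}),
    ("The office looked great but I'm not sure it helped me learn",
     {"environment elaboration"}),
    ("I wasn't sure what the scenario expected from me at first",
     {"interface complexity", "onboarding"}),
    ("Negotiating with the avatar felt surprisingly realistic",
     {"AI dialogue value"}),
]

# Count how many excerpts raised each theme.
theme_counts = Counter(theme for _, themes in coded_debriefs for theme in themes)

for theme, count in theme_counts.most_common():
    flag = "recurring" if count > 1 else "single mention"
    print(f"{count}x  {theme}  ({flag})")
```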
Integration of qualitative and quantitative evidence created a comprehensive evaluation that neither methodology alone could provide: numbers grounded anecdotal impressions in measurable outcomes whilst narratives explained patterns and outliers that numbers alone would leave mysterious, enabling evidence-based decision-making informed by both measurable performance and a rich contextual understanding of user experiences, organisational constraints, and deployment realities.

Alignment with Humanitarian Training Objectives

The validation methodology deliberately assessed XRisis against Action Contre la Faim's established learning objectives rather than generic training effectiveness criteria, ensuring the evaluation reflected domain-specific requirements. The Emergency Readiness and Response Unit defined clear competency targets: participants completing training should be able to recall key emergency concepts, execute organisational procedures accurately, analyse evolving crisis situations effectively, design appropriate response strategies, manage implementation challenges, and demonstrate effective team communication and decision-making under pressure. Evaluation instruments mapped these competencies to specific platform capabilities: Pilot 1's arrival briefing addressed knowledge recall objectives; Pilot 2's collaborative planning addressed analysis and strategy design objectives; Pilot 3's implementation scenarios addressed challenge management and team communication objectives. The alignment ensured that validation assessed whether immersive technology enhanced specific learning outcomes that mattered to organisational preparedness, rather than measuring general capabilities that might be impressive yet irrelevant to actual training needs.

Action Contre la Faim's Monitoring, Evaluation, Accountability and Learning framework influenced the evaluation design, incorporating the organisation's established approaches to assessing training programme quality, effectiveness, relevance, and efficiency and creating consistency with how it evaluates all learning initiatives rather than applying separate standards for technology-enabled delivery. The methodology acknowledged that training effectiveness encompasses more than immediate learning outcomes: cost-effectiveness relative to conventional alternatives, scalability across global operations, accessibility for diverse staff populations, integration with existing learning infrastructure, and sustainability beyond initial implementation all contribute to the overall value proposition that the evaluation needed to address.

Participant selection criteria (drawing from the actual emergency roster rather than volunteers or specially recruited test users) ensured the validation engaged genuine target audience members rather than convenient proxies who might respond differently to training approaches, strengthening confidence that results would generalise to operational deployment. The timing of the validation (after participants had completed standard onboarding and before they deployed to emergency operations) matched the platform's intended application point in the employee development lifecycle, avoiding artificial timing that might inflate or deflate effectiveness measures. The evaluation framing explicitly positioned immersive simulation as complementing rather than replacing conventional training modalities, assessing incremental value rather than demanding that the platform serve all learning needs in isolation and recognising that comprehensive training programmes combine multiple delivery approaches optimised for specific learning objectives.
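The competency-to-pilot mapping described above lends itself to a simple traceability structure. The sketch below is a hypothetical paraphrase of that mapping, not drawn from ACF's actual instruments, showing how each evaluation item can be traced back to the pilot whose data should evidence it.

```python
# Hypothetical paraphrase of the competency-to-pilot mapping; labels are
# taken from the article's wording, not from ACF's evaluation instruments.
pilot_objectives = {
    "Pilot 1 (arrival briefing)": [
        "recall key emergency concepts",
    ],
    "Pilot 2 (collaborative planning)": [
        "analyse evolving crisis situations",
        "design appropriate response strategies",
    ],
    "Pilot 3 (implementation scenarios)": [
        "manage implementation challenges",
        "demonstrate team communication and decision-making under pressure",
    ],
}

# Invert the mapping so each evaluation item points at the pilot whose
# observation data should evidence it.
objective_to_pilot = {objective: pilot
                      for pilot, objectives in pilot_objectives.items()
                      for objective in objectives}

for objective, pilot in objective_to_pilot.items():
    print(f"{objective}  <-  {pilot}")
```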
The alignment between humanitarian training objectives and evaluation methodology created organisational confidence that validation evidence would support operational decision-making about whether, when, and how to deploy immersive simulation capabilities within broader staff development strategies, positioning XRisis not as an interesting research demonstration but as a serious contender for operational adoption, pending refinements that address the identified usability and effectiveness gaps.