
Validation Methodology for AI-Assisted Voice Interaction in Cultural Heritage Contexts: Instruments, Protocols, and Heritage-Specific Considerations

A comprehensive methodological paper presenting the VAARHeT validation framework, encompassing the System Usability Scale, added value instruments, Net Promoter Score, and Nielsen severity assessment adapted for heritage XR contexts, with validated procedures enabling replication by cultural institutions evaluating voice technology adoption.

Published by Dr Cordula Hansen (Technical Art Services), Guillaume Auvray (XR Ireland), and Eva Koljera (Āraiši Ezerpils Archaeological Park)
Funded by the European Union

This project has received funding from the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Grant agreement number: 101070521

Abstract

Validating AI-assisted voice interaction systems in cultural heritage contexts requires evaluation frameworks that balance technical performance metrics with domain-specific requirements, including factual accuracy, cultural sensitivity, institutional risk tolerance, and visitor demographic diversity, which differ substantially from commercial application assessment protocols. This paper presents the comprehensive validation methodology developed for VAARHeT, a project funded through the EU Horizon VOXReality programme. The methodology encompasses System Usability Scale assessment, added value rating instruments measuring heritage-specific educational contribution, Net Promoter Score likelihood-to-recommend measurement, qualitative feedback collection through structured interviews, Nielsen severity-rated usability issue identification, and task completion observation protocols adapted for museum operational environments. Validation with 39 museum visitors recruited through heritage institution networks, across three voice-activated XR scenarios (a mobile AR welcome avatar, a Meta Quest 3 VR archaeological building reconstruction, and an ActiveLook AR wearable live tour translation), revealed application-dependent value propositions, with added value ratings ranging from 3.2 to 4.2 out of 5. It also demonstrated the critical importance of accuracy thresholds: 75 percent factual correctness proved insufficient for museum deployment despite being potentially acceptable for commercial applications. Methodological contributions include heritage-specific evaluation instruments balancing technological capability assessment with educational effectiveness measurement; ethical approval procedures for on-site validation in operational museum environments with authentic visitor populations; participant recruitment protocols ensuring representative demographic sampling rather than convenience sampling from technology-enthusiast early adopters; facilitation procedures for moderated usability testing across mobile AR, VR headset, and AR wearable modalities; data collection and anonymisation protocols meeting European research ethics and GDPR compliance requirements; and analysis frameworks integrating quantitative metrics with qualitative thematic coding, revealing insights that numerical assessment alone would not capture. Validated instruments, consent forms, test scenarios, and analysis procedures are published open access through the Zenodo repository, enabling replication by cultural heritage institutions evaluating XR and AI technology adoption. This contributes to evidence-based heritage technology evaluation standards and supports sector-wide capability building for rigorous assessment beyond vendor demonstrations or pilot enthusiasm, which may not predict sustained operational value or visitor acceptance patterns.

Background: Heritage XR Evaluation Challenges and Existing Assessment Frameworks

Cultural heritage institutions evaluating extended reality technology adoption face assessment challenges distinct from those of commercial consumer applications or enterprise productivity tools. Mission priorities around education quality, historical accuracy, cultural sensitivity, institutional reputation protection, and public trust maintenance are inadequately captured by pure usability or engagement metrics. Existing usability evaluation frameworks, including the System Usability Scale (Brooke, 1996), heuristic evaluation (Nielsen, 1994), and cognitive walkthrough methods, provide valuable baseline approaches, yet require adaptation for heritage contexts where educational effectiveness, factual correctness, and cultural appropriateness prove equally or more important than interaction efficiency or subjective satisfaction, which consumer applications optimise without equivalent concern for informational accuracy or institutional credibility. The voice user interface evaluation literature (Harris, 2020; Jerald, 2015) emphasises speech recognition accuracy, intent understanding, dialogue naturalness, and error recovery as critical success factors. Heritage deployment introduces additional requirements: specialised terminology handling for archaeological and conservation vocabularies, minority European language support serving regional populations, and accuracy thresholds exceeding commercial tolerances given misinformation risks that heritage educational missions cannot accommodate. Existing heritage technology evaluation studies often employ qualitative methods, including stakeholder interviews, expert reviews, and small-sample user testing, which provide valuable insights yet limit statistical generalisability and quantitative comparison across alternative approaches; larger-scale quantitative assessments may sacrifice contextual depth and heritage-specific criteria for measurement convenience and sample size. VAARHeT methodology development addressed these gaps through a comprehensive mixed-methods approach, integrating quantitative metrics that enable statistical analysis and cross-application comparison with qualitative investigation revealing contextual factors, expectation patterns, and domain-specific requirements that numerical scoring alone cannot adequately characterise. Ethical considerations for heritage validation prove particularly complex. Visitor populations potentially include children, elderly participants, individuals with accessibility requirements, international visitors with varying language capabilities, and cultural diversity requiring sensitivity to religious beliefs, cultural practices, or historical trauma associations that heritage content might engage. This demands rigorous informed consent, participant wellbeing monitoring, and inclusive research design ensuring that technology evaluation does not inadvertently exclude or disadvantage populations that heritage institutions commit to serving equitably.

Validation Framework: Research Questions and Evaluation Instruments

VAARHeT validation addressed three primary research questions informing instrument selection and protocol design. First, do voice-activated XR applications demonstrate acceptable usability and learnability for museum visitor populations representative of heritage sector demographic diversity, including age variance, educational background distribution, and prior technology experience ranging from daily VR users to complete novices? Second, what added value do voice interaction capabilities provide for heritage educational effectiveness, visitor engagement quality, and museum operational efficiency compared to conventional interpretation alternatives such as guided tours, audio guides, and static exhibits? Third, what specific improvements require prioritisation for commercial viability, given validation-identified usability friction, technical limitations, and institutional requirement gaps? These questions informed an instrument selection combining established standardised measures, enabling benchmarking against published norms, with custom heritage-specific assessments capturing domain requirements that general instruments would not adequately evaluate. The System Usability Scale provided validated usability measurement through ten items, including "I think I would like to use this system frequently", "I found the system unnecessarily complex", "I thought the system was easy to use", "I think I would need support of a technical person to use this system", and "I found the various functions in this system were well integrated". Items are scored on five-point agreement scales with alternating positive and negative phrasing preventing response pattern bias, and aggregated through the standard SUS calculation yielding 0-100 scores, with 68 representing the average acceptable usability threshold. Added value rating employed a custom five-point scale from 1 (no added value) to 5 (substantial added value), asking participants "How much value did this specific component add to your museum experience or heritage learning?" for each pilot feature, including avatar information delivery, VR educational content, and live tour translation. This enables granular assessment distinguishing high-value from low-value capabilities within overall applications, which holistic satisfaction ratings would not reveal. Standard Net Promoter Score measurement asked "On a scale of 0-10, how likely are you to recommend this experience to friends or family?", with 9-10 responses classified as promoters indicating strong endorsement likelihood, 7-8 as passives suggesting satisfaction without enthusiasm, and 0-6 as detractors representing negative experience or indifference. NPS is calculated as the percentage of promoters minus the percentage of detractors, producing an industry-standard metric enabling comparison across heritage technology implementations and general visitor experience quality benchmarks. Task completion observation protocols documented whether participants successfully completed required tasks (open the application, configure settings, place AR content, retrieve information, activate VR content, read translation text) independently without assistance, completed them with tester help, abandoned them without completion, or experienced technical failures preventing completion regardless of effort, providing behavioural evidence complementing self-reported satisfaction and usability perceptions.
The Nielsen severity framework rated identified usability issues on a 0-4 scale: 0 represents no problem; 1 a cosmetic issue not requiring a fix unless time allows; 2 a minor problem deserving low priority; 3 a major issue requiring high-priority resolution; and 4 a usability catastrophe that is imperative to fix before acceptable deployment. This enables development prioritisation, focusing critical resources on blocking deficiencies rather than incremental improvements that might enhance but not fundamentally enable viable deployment.
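As a concrete illustration of the scoring arithmetic described above, the following sketch computes SUS scores, NPS, and component-level added value means. The function names and example responses are hypothetical; only the formulas (odd/even SUS item reversal scaled by 2.5, promoters minus detractors) follow the instruments as described.

```python
# Minimal scoring sketch; responses are assumed already coded 1-5 (SUS, added value)
# and 0-10 (NPS). Names and example data are illustrative, not project scripts.

def sus_score(responses):
    """Standard Brooke SUS: odd items contribute (score - 1), even items (5 - score);
    the sum is multiplied by 2.5 to give a 0-100 score."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

def nps(ratings):
    """Net Promoter Score: percent promoters (9-10) minus percent detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)

def added_value_summary(component_ratings):
    """Mean 1-5 added value rating per pilot component (e.g. avatar, VR content)."""
    return {name: sum(vals) / len(vals) for name, vals in component_ratings.items()}

# Example usage with hypothetical responses
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))          # -> 80.0
print(nps([10, 9, 8, 7, 6, 10, 3, 9]))                    # -> 25.0
print(added_value_summary({"avatar_information": [3, 4, 2, 4]}))
```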

Implementation: Participant Recruitment and Test Environment Configuration

Participant recruitment through the Āraiši museum's professional networks and local community connections targeted a minimum of 30 participants, as required by the VOXReality programme terms, and ultimately achieved 39 participants, exceeding the minimum threshold and providing statistical robustness and demographic diversity. Inclusion criteria specified individuals aged 25-75 who would consider visiting museums with friends or family (matching the primary visitor persona), possessed basic office ICT knowledge (daily mobile device use, email communication, web browsing), demonstrated bilingual Latvian-English or German capability, enabling interaction with the English-language application prototypes whilst possessing the linguistic knowledge needed for translation quality assessment, and reported no prior medical conditions precluding VR participation, including epilepsy, severe vertigo, or motion sickness susceptibility. Gender parity was desirable, though the 59 percent female participation reflected the actual museum visitor distribution, supporting representativeness despite the asymmetric gender balance. Recruited cohort demographics showed a majority aged 30-50, predominantly professional occupations, higher education backgrounds, daily mobile phone use, approximately 50 percent prior VR experience, one-third prior AR exposure, over 75 percent chatbot familiarity with 50 percent reporting daily use, and nearly universal classification as regular or very frequent museum visitors, providing an informed perspective on heritage visitor experience quality expectations. The test environment at Āraiši Archaeological Park in Cēsis, Latvia provided an authentic operational museum setting during normal visitor hours from 14-16 July 2025. Testing was conducted in the visitor centre's indoor spaces for the AR avatar and translation pilots and in a dedicated exhibition room for the seated VR experience, balancing controlled conditions enabling consistent evaluation against realistic environmental factors, including ambient noise, lighting variation, and operational museum activity, that laboratory testing would not replicate. Test sessions lasted 45-60 minutes per participant, covering informed consent, a demographic questionnaire, application testing, post-test evaluation surveys, and a debrief discussion, with scheduling distributed across morning and evening periods to accommodate diverse participant availability whilst managing the operational impact of testing during high-season visitor periods. Equipment included Meta Quest 3 VR headsets factory-reset between participants for hygiene and privacy, Samsung Galaxy Note10+ 5G Android mobile phones with applications pre-installed, avoiding download requirements and network dependency, and ActiveLook Engo AR wearable glasses paired via Bluetooth for the translation pilot, all owned and managed by XR Ireland under GDPR-compliant data collection protocols ensuring that participant interaction recordings, usage metrics, and survey responses were stored on EU-jurisdiction infrastructure with anonymisation preventing individual identification beyond validation analysis requirements.
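Expressed as a simple screening checklist, the inclusion criteria above might look like the following sketch; the field names and the eligible helper are hypothetical illustrations rather than the actual recruitment tooling.

```python
# Hypothetical screening checklist mirroring the stated inclusion criteria.
from dataclasses import dataclass

@dataclass
class Candidate:
    age: int
    visits_museums_socially: bool      # would visit with friends/family (visitor persona)
    basic_ict_skills: bool             # daily mobile use, email, web browsing
    bilingual_lv_en_or_de: bool        # Latvian plus English or German
    vr_contraindication: bool          # epilepsy, severe vertigo, motion sickness

def eligible(c: Candidate) -> bool:
    """Apply the stated inclusion criteria for the 25-75 visitor persona."""
    return (25 <= c.age <= 75
            and c.visits_museums_socially
            and c.basic_ict_skills
            and c.bilingual_lv_en_or_de
            and not c.vr_contraindication)

print(eligible(Candidate(34, True, True, True, False)))   # -> True
```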

Data Collection Protocols and Analysis Procedures

Data collection combined digital observation forms, completed by testers during participant sessions, with post-test questionnaires completed by participants in conversation with researchers, enabling clarification questions and free-text elaboration beyond checkbox selections. Observation forms documented task completion outcomes (completed independently, completed with help, not completed through abandonment, not completed due to technical failure), estimated time-on-task, spontaneous participant comments revealing expectations or frustrations, and tester notes about interaction difficulties, confusion points, or unexpected behaviours informing usability issue identification. Task definitions specified clear success criteria. AR avatar testing required participants to open the application, select a language, place the avatar in physical space, retrieve facility information, obtain event schedules, receive attraction listings, and get navigation directions, with success determined by achieving the desired outcome regardless of interaction efficiency or assistance requirements. VR augmentation testing required comfortable headset donning, viewing the initial animation, and triggering at least four of the six building component content categories through voice commands, with success based on content access validating voice interaction functionality. Translation testing required wearable donning, mobile app startup with Bluetooth pairing, language selection, and text comprehension on both the wearable and mobile displays, with success determined by text legibility and translation understanding sufficient to follow the tour. Technical failures included software crashes, network timeouts, complete speech recognition failures preventing any transcription, intent classification returning nonsensical matches, AI generation producing unintelligible responses, translation output semantically unrelated to the input, and hardware malfunctions such as Bluetooth pairing failures or display rendering issues. Post-test questionnaires implemented five-point Likert scales for structured assessment (strongly disagree, disagree, neutral, agree, strongly agree for perception statements; not acceptable at all, not acceptable, don't know, acceptable, very acceptable for quality ratings), with a consistent left-to-right positive progression preventing confusion, whilst alternating statement valence (positive/negative phrasing) reduced acquiescence bias from participants defaulting to agreement without careful consideration. Free-text questions used open-ended prompts, including "What was your first impression?", "What did you like most?", "What did you like least?", and "What improvements would you suggest?", enabling participants to raise concerns, appreciate aspects, or propose enhancements that structured questions might not anticipate. Responses were analysed through inductive thematic coding identifying common patterns, divergent perspectives, and unexpected observations requiring attention in development refinement or deployment planning.
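The observation-form fields and outcome categories described above could be captured in a small record structure such as the sketch below; the type names are illustrative assumptions, not the project's actual data schema.

```python
# Illustrative record structure for the digital observation form; names are assumed.
from dataclasses import dataclass, field
from enum import Enum

class Outcome(Enum):
    COMPLETED_INDEPENDENTLY = "completed independently"
    COMPLETED_WITH_HELP = "completed with help"
    ABANDONED = "not completed (abandoned)"
    TECHNICAL_FAILURE = "not completed (technical failure)"

@dataclass
class TaskObservation:
    participant_code: str              # pseudonymised identifier, e.g. "P07"
    pilot: str                         # "AR avatar", "VR augmentation", "translation"
    task: str                          # e.g. "place avatar", "Bluetooth pairing"
    outcome: Outcome
    time_on_task_s: float | None = None
    tester_notes: list[str] = field(default_factory=list)

obs = TaskObservation("P07", "AR avatar", "place avatar",
                      Outcome.COMPLETED_WITH_HELP, 95.0,
                      ["unsure where to tap to anchor the avatar"])
```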
Analysis procedures included descriptive statistics for quantitative metrics (means, medians, standard deviations, percentile distributions), SUS score calculation following the standard Brooke formula with alternating item reversal, NPS computation as the percentage of promoters minus the percentage of detractors, task completion percentages with separate categorisation of help-required versus failure outcomes, and Nielsen severity assessment through researcher consensus rating of identified issues based on frequency, impact magnitude, and user recovery difficulty.
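A minimal sketch of this descriptive analysis is shown below; the outcome labels mirror the observation-form categories, while the data values and helper names are hypothetical.

```python
# Illustrative descriptive analysis; data and function names are hypothetical.
from collections import Counter
from statistics import mean, median, stdev

OUTCOMES = ("independent", "with_help", "abandoned", "technical_failure")

def completion_rates(observations):
    """Percentage of sessions in each task-completion category."""
    counts = Counter(observations)
    n = len(observations)
    return {o: 100.0 * counts.get(o, 0) / n for o in OUTCOMES}

def describe(scores):
    """Mean, median, and standard deviation for a metric such as SUS or added value."""
    return {"mean": mean(scores), "median": median(scores),
            "sd": stdev(scores) if len(scores) > 1 else 0.0}

# Hypothetical AR placement task outcomes for a handful of participants
print(completion_rates(["independent", "with_help", "independent",
                        "technical_failure", "independent"]))
print(describe([72.5, 55.0, 60.0, 47.5, 85.0]))
```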

Heritage-Specific Methodological Adaptations and Domain Considerations

Several methodological adaptations addressed heritage context characteristics distinct from commercial application evaluation. Participant recruitment through museum networks rather than general population sampling ensured heritage sector relevance, with inclusion criteria specifying museum visit propensity and cultural heritage interest providing intrinsic motivation and appropriate expectations, rather than evaluating participants indifferent to heritage content who might rate applications negatively due to topic disinterest regardless of technical quality or interaction design. Validation timing during operational museum hours in authentic visitor spaces, rather than controlled laboratory environments, tested real-world deployment viability, including ambient noise from other visitors, natural lighting variation affecting mobile display visibility, spatial constraints in actual museum facilities limiting VR room-scale movement compared to dedicated testing rooms, and operational interruptions from museum activities, providing realistic assessment of technology integration within institutional workflows rather than idealised controlled conditions potentially masking deployment friction. Facilitator training enabling Āraiši museum personnel to conduct validation sessions alongside XR Ireland researchers built institutional capability for sustained evaluation and feedback collection beyond the initial validation event, supporting long-term adoption through embedded assessment competency rather than external dependency on vendor or consultant evaluation services that many heritage institutions cannot sustain. Content accuracy validation through heritage expert review of AI-generated responses, translation quality, and educational narrative correctness ensured that technical functionality assessment did not overlook factual errors or cultural inappropriateness that general usability evaluation might miss when evaluators lack the domain expertise to verify informational content beyond surface assessment of whether applications operated without crashes or technical failures. Ethical approval through the Maynooth University research ethics committee, rather than internal institutional review, addressed academic research standards, whilst GDPR compliance protocols exceeded minimum legal requirements given sensitivity to participant privacy, data sovereignty concerns, and institutional accountability for visitor information protection that heritage sector organisations increasingly prioritise in light of public trust responsibilities. Multilingual evaluation accommodated non-English participants through questionnaire translation and bilingual facilitator support, whilst language pair testing deliberately expanded beyond the primary German-English specification to opportunistically assess Latvian and other combinations when participants requested extended testing, generating broader evidence about minority language performance challenges that English-only evaluation would not reveal.

Instrument Validity and Reliability Assessment for Heritage Contexts

Instrument selection prioritised established validated measures where available, whilst developing custom assessments for heritage-specific dimensions that standard instruments do not adequately address. System Usability Scale selection leveraged an extensive validation literature demonstrating reliability across diverse application domains, technology types, and user populations, with Cronbach's alpha typically exceeding 0.85, indicating high internal consistency, and test-retest reliability studies showing stability over time. This enables confident interpretation that SUS scores reflect genuine usability characteristics rather than measurement error or participant response variability unrelated to actual system quality. The ten-item SUS proved efficient for heritage validation contexts requiring brief instruments that minimise participant burden, given the 45-60 minute total session duration covering multiple applications, avoiding the fatigue effects that excessively long questionnaires might introduce and that could compromise response quality in later sections. Development of the custom added value rating followed established principles for Likert scale construction: clear endpoint definitions (1 = no added value, meaning conventional alternatives provide equivalent benefit; 5 = substantial added value, meaning immersive delivery provides unique capabilities unavailable through conventional approaches), balanced positive-negative item distribution preventing acquiescence bias, and component-specific rather than holistic assessment enabling differential evaluation distinguishing successful from unsuccessful features within overall applications. Pilot testing with a small convenience sample (n = 5) during early development assessed question comprehension, response distribution (avoiding ceiling or floor effects), and completion time feasibility before final validation deployment. Net Promoter Score provided a standardised satisfaction proxy with extensive commercial usage enabling comparison against published benchmarks, whilst acknowledging that heritage sector norms differ from commercial contexts: positively biased, self-selected museum audiences potentially yield higher baseline scores, requiring careful interpretation when comparing against commercial application NPS distributions. Task completion observation protocols adapted standard usability testing practices for heritage environments, including realistic scenarios matching actual visitor use cases, naturalistic facilitation minimising artificial task constraints whilst maintaining consistency for cross-participant comparison, and assistance protocols defining when testers should intervene to help versus allowing participants to struggle, revealing genuine usability friction requiring interface resolution. The Nielsen severity rating framework required researcher training for consistent application, with inter-rater reliability assessment showing substantial agreement (kappa coefficient of 0.72) between independent raters categorising identical usability issues. This validates that severity ratings reflected genuine issue characteristics rather than individual rater subjectivity, though the moderate inter-rater variance required consensus discussion for critical deployment decisions dependent on accurate severity assessment.
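For reference, a minimal sketch of the two reliability statistics mentioned here is given below, implemented with numpy only; the example item scores and severity ratings are illustrative rather than the validation data, and sklearn.metrics.cohen_kappa_score would be an equivalent shortcut for the second calculation.

```python
# Illustrative reliability statistics; all data below is hypothetical.
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: 2-D array, rows = participants, columns = scale items."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def cohen_kappa(rater_a, rater_b, categories=range(5)):
    """Agreement between two raters assigning Nielsen severity codes 0-4."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    po = np.mean(a == b)                                   # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in categories)  # chance agreement
    return (po - pe) / (1 - pe)

demo_items = np.array([[4, 3, 4, 5], [2, 2, 3, 2], [5, 4, 4, 4], [3, 3, 2, 3]])
print(cronbach_alpha(demo_items))
# Hypothetical severity ratings for eight usability issues by two raters
print(cohen_kappa([3, 2, 4, 1, 3, 2, 0, 3], [3, 3, 4, 1, 2, 2, 0, 3]))
```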

Ethical Procedures and Participant Protection Protocols

Ethical approval through the Maynooth University research ethics committee addressed comprehensive participant protection, including informed consent procedures, data collection minimisation, anonymisation protocols, storage security, retention limitations, and researcher training for respectful engagement and responsive accommodation of participant needs. Consent procedures provided written and verbal explanation of several elements: the research purposes (evaluating voice-activated XR applications for heritage contexts to inform commercial platform development); the data collection scope (demographic information, task completion observations, survey responses, and voice interaction recordings for validation analysis); participant rights, including voluntary participation without coercion, withdrawal at any time without explanation or prejudice, access to collected data, correction requests, deletion demands, and complaint procedures through the university research ethics office and data protection officer; a risks and benefits assessment acknowledging minimal anticipated risks given non-invasive observation and questionnaire methods, whilst candidly communicating potential cybersickness or technostress from VR and AR wearable use; and data handling practices, including anonymisation through participant code assignment, EU-jurisdiction storage, access restrictions to authorised research personnel, a five-year retention limit following validation completion, and automatic deletion preventing indefinite storage that would create ongoing privacy exposure. Participants completed written consent forms before testing commenced, with the opportunity to ask clarification questions and to decline specific pilot participation (for example, opting out of VR due to motion sensitivity) whilst participating in other applications matching individual comfort levels and availability. Video recording consent was obtained separately, given the enhanced privacy implications of visual documentation beyond anonymous data collection, with participants explicitly approving video use for internal analysis, public dissemination, or complete restriction, enabling granular privacy control matching personal preferences about appearance recording and potential public presentation. Researcher training emphasised respectful engagement; neutral facilitation avoiding leading questions or responses that might influence participant assessment; recognition of signs indicating participant discomfort or a desire to discontinue, requiring proactive check-ins and accommodation; cultural sensitivity given international participant backgrounds and potential language barriers; and data handling discipline maintaining anonymisation, preventing casual discussion of individual participant responses, and adhering to approved data management protocols throughout collection, analysis, and reporting. Child and vulnerable population protocols addressed potential participation by individuals requiring enhanced protection, though the actual validation targeted adult populations aged 25-75, limiting complexity compared to studies involving minors or individuals with cognitive limitations requiring guardian consent and specialised assessment instruments.
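A minimal sketch of the participant-code pseudonymisation mentioned above is shown below; the field names, the code format, and the split into an analysis record and a separately stored linking key are illustrative assumptions rather than the project's actual procedure.

```python
# Illustrative pseudonymisation sketch: analysis records carry only a random code,
# while the linking key (code -> identity) is kept separately under restricted access.
import secrets

def assign_codes(participants):
    """Split each record into an anonymised analysis row and a linking-key row."""
    key, anonymised = [], []
    for p in participants:
        code = f"P{secrets.token_hex(3).upper()}"   # e.g. P4F2A9C (hypothetical format)
        key.append({"code": code, "name": p["name"], "email": p["email"]})
        anonymised.append({"code": code, "age_band": p["age_band"], "pilot": p["pilot"]})
    return anonymised, key

rows, linking_key = assign_codes([{"name": "Example Name", "email": "example@example.org",
                                   "age_band": "35-44", "pilot": "VR augmentation"}])
```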

Participant Demographics and Representativeness Assessment

Achieved participant demographics demonstrated strong alignment with the validated museum visitor persona whilst revealing some selection bias and representation gaps requiring acknowledgement. Age distribution showed no participants aged 18-24, 28 percent aged 25-34, 31 percent aged 35-44, 23 percent aged 45-54, 13 percent aged 55-64, and 5 percent aged 65-plus, indicating concentration in the 25-54 range matching primary museum visitor demographics yet under-representing both the youngest adult cohort and elderly populations, who might exhibit different technology adoption patterns, usability challenges, or value assessments compared to the middle-aged majority. Gender distribution of 59 percent female, 38 percent male, and 3 percent diverse or prefer-not-to-say broadly reflected the museum visitor population, whilst acknowledging a slight female bias that could influence aggregate results if gender correlates with technology adoption patterns; subgroup analysis, however, revealed minimal gender-based variance, suggesting the bias proved inconsequential for validation conclusions. Educational and professional backgrounds concentrated in university degrees and professional occupations, with minimal representation of manual labour, service industry, or non-professional employment categories that might exhibit different technology literacy, voice interaction comfort, or museum engagement patterns, requiring cautious generalisation to the broader heritage visitor population beyond the validated sample. Technology experience self-reports showed 100 percent daily mobile device use, 49 percent prior VR experience (15 percent using VR weekly or more frequently), 36 percent prior AR application exposure primarily through gaming, 77 percent chatbot experience with 54 percent daily use, and 95 percent classifying themselves as regular or very frequent museum visitors (visiting three or more times annually). The sample was therefore potentially skewed toward a technology-comfortable, heritage-enthusiast population rather than occasional museum visitors or technology-averse individuals requiring maximum accessibility and minimal complexity for successful adoption. The museum visitor frequency bias was deliberate given validation objectives around heritage-specific requirements, benefiting from participants with informed opinions about museum experience quality standards and educational effectiveness expectations that infrequent visitors might lack; commercial deployment targeting broader populations should nonetheless acknowledge that the validation sample's technology comfort potentially exceeded the general visitor average. Language capabilities showing 100 percent Latvian native speakers with strong English comprehension limit generalisation to monolingual populations or visitors with minimal shared language capability, who would require more robust multilingual interface support or translation functionality that the sample's linguistic competence potentially underestimated as a critical deployment requirement.

Results: Instrument Reliability and Outcome Distributions

System Usability Scale internal consistency analysis yielded a Cronbach's alpha of 0.79, indicating acceptable reliability, with item-total correlations ranging from 0.42 to 0.71, suggesting adequate homogeneity without excessive redundancy. This validates SUS appropriateness for heritage XR usability assessment, though the alpha falls slightly below the optimal 0.85-plus threshold, potentially reflecting genuine usability differences across pilot applications rather than measurement unreliability. The SUS score distribution showed a mean of 59, median of 60, standard deviation of 18, minimum of 25, and maximum of 90 on the 0-100 scale, with an approximately normal distribution enabling parametric statistical testing; a slight negative skew from high-performing outliers suggests floor effects were absent whilst ceiling effects remained possible for the most usable implementations. Between-pilot variance analysis revealed significant differences (one-way ANOVA, F = 8.23, p < 0.01), with VR augmentation averaging a SUS of 67 versus 54 for the AR avatar and 51 for translation, demonstrating instrument sensitivity in distinguishing usability differences across applications rather than yielding uniform scores regardless of actual interface quality. Added value rating distributions showed substantial variance: Pilot 2 VR education concentrated in the 4-5 range (74 percent of responses), Pilot 1 avatar was more dispersed across the 2-4 range (56 percent neutral or positive, 44 percent low value), and Pilot 3 translation was bimodal, with clusters at high value for German-English users and low value for participants affected by Latvian translation quality or wearable hardware issues. This validates fine-grained sensitivity revealing component-specific strengths and weaknesses that holistic satisfaction measurement aggregating across features would obscure. Net Promoter Score distributions demonstrated the expected tri-modal patterns, with promoter clusters at 9-10, passive concentrations at 7-8, and detractor groupings at 0-6, though VR augmentation showed an atypical distribution with 64 percent promoters, 33 percent passives, and only 3 percent detractors, compared to translation's nearly uniform distribution across the full 0-10 range, reflecting high response variance from hardware and language quality issues affecting participants variably. Task completion metrics showed ceiling effects for simple tasks (100 percent language selection, 95-100 percent application launch), validating that baseline functionality proved accessible, whilst more complex tasks (AR avatar placement 74 percent, unlabelled VR content triggering 47-50 percent, Bluetooth pairing 65 percent) demonstrated genuine difficulty, discriminating between more and less usable interface implementations. Nielsen severity rating inter-rater agreement achieved a kappa of 0.72, indicating substantial though imperfect consistency. Disagreements concentrated on distinguishing severity 2 (minor) from severity 3 (major) issues and required discussion and consensus, whereas severity 4 catastrophes and severity 0 non-issues showed universal agreement, validating framework appropriateness whilst highlighting the importance of multi-rater assessment for intermediate severity categories where judgement rather than objective criteria determines classification.
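The between-pilot comparison reported here corresponds to a standard one-way ANOVA; the sketch below reproduces the procedure with scipy, using hypothetical per-pilot SUS lists standing in for the actual data.

```python
# One-way ANOVA across the three pilots; the SUS values below are illustrative only.
from scipy.stats import f_oneway

vr_sus        = [72.5, 65.0, 70.0, 60.0, 67.5]   # Pilot 2: VR augmentation
avatar_sus    = [55.0, 50.0, 60.0, 52.5, 47.5]   # Pilot 1: AR avatar
translate_sus = [45.0, 52.5, 50.0, 55.0, 47.5]   # Pilot 3: AR wearable translation

f_stat, p_value = f_oneway(vr_sus, avatar_sus, translate_sus)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```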

Discussion: Heritage-Specific Considerations and Adaptation Requirements

Heritage XR validation requires several methodological adaptations beyond general usability testing protocols, with domain characteristics necessitating modified approaches for valid assessment. Accuracy evaluation proves critical yet challenging. Usability testing traditionally assumes functional correctness, focusing on whether users can accomplish tasks rather than whether system outputs are factually accurate; heritage contexts demand both usability and correctness assessment given institutional credibility dependencies and educational mission requirements for historical accuracy. VAARHeT methodology incorporated expert content review in which museum staff verified AI-generated responses, translation output, and educational narrative correctness independently of usability assessment. This revealed that applications might achieve acceptable usability (tasks completable, interactions understandable, interfaces navigable) whilst producing unacceptable factual errors that pure usability evaluation would not detect without domain expertise for content verification. Cultural sensitivity assessment extends beyond technical functionality or user preference to institutional appropriateness matching heritage values, representation ethics, and community sensitivities across diverse cultural heritage contexts, with evaluation instruments needing open-ended questions enabling participants to raise concerns about cultural representation, historical interpretation framing, or community impact that structured Likert scales might not elicit. Longitudinal assessment proves valuable for heritage contexts where novelty effects from immersive technology might artificially inflate initial enthusiasm; sustained usage can reveal this enthusiasm as unsustainable when practical limitations, content staleness, or operational friction accumulate over repeated exposure, requiring follow-up evaluation after the initial deployment period to assess whether acceptance patterns, usage frequency, and institutional integration deepen or diminish across multi-month operational experience. Comparative assessment against conventional heritage interpretation alternatives (guided tours, audio guides, printed materials, static exhibits) clarifies the value proposition, determining whether XR implementations provide incremental benefits justifying incremental costs or merely offer novel experiences without proportional educational effectiveness or visitor satisfaction improvement; methodology benefits from parallel evaluation measuring visitor learning outcomes, engagement quality, and satisfaction across both immersive and conventional interpretation modalities rather than assessing XR in isolation without a baseline comparison. Institutional impact evaluation extends beyond visitor experience to museum operational metrics, including staff time allocation changes, visitor throughput improvements, seasonal programme extension, multilingual accessibility enhancement, and revenue implications from enhanced visitor attraction or programme differentiation, requiring holistic assessment beyond pure technology performance measurement toward organisational value creation that many validation studies neglect when focusing narrowly on user interface quality or technical capability demonstration.

Transferability and Replication Guidance for Heritage Institutions

Cultural heritage institutions wishing to replicate the VAARHeT validation methodology for their own XR technology evaluation can adapt the presented frameworks, instruments, and protocols with consideration for context-specific requirements. Institutions should customise participant inclusion criteria to match their actual visitor demographics, operational contexts, and strategic priorities rather than copying VAARHeT specifications developed for a Latvian archaeological park, which may differ from urban art museums, historic houses, or maritime heritage sites serving different populations. Sample size determination should balance statistical power requirements for quantitative analysis against practical recruitment feasibility and resource constraints: a minimum of 30 participants provides adequate initial assessment for pilot deployment decisions, whilst larger samples enable more robust statistical inference and subgroup analysis if resources permit and research questions demand precision beyond preliminary viability evaluation (an illustrative power calculation is sketched below). Instrument adaptation should preserve the core structure of validated measures (for example, maintaining the SUS ten-item format without modification to ensure comparability against published norms) whilst supplementing with custom heritage-specific assessments addressing institutional priorities and domain requirements that general instruments would not capture. The testing environment should replicate the operational deployment context rather than controlled laboratory conditions, accepting environmental variability and logistical complexity as trade-offs for ecological validity, ensuring that validation evidence predicts actual deployment performance rather than laboratory-optimised best-case scenarios potentially unrepresentative of real museum operational constraints. Ethical approval pathways vary across European jurisdictions: some countries require university research ethics committee review, whilst others accept institutional internal review or professional association ethical guidelines, and institutions are advised to consult local regulations and research ethics expertise to ensure compliance with applicable frameworks rather than assuming universal procedures. Data protection implementation must address GDPR requirements for participant information, including lawful basis establishment (typically consent for research validation), data minimisation collecting only necessary information, anonymisation or pseudonymisation preventing unnecessary identification, EU-jurisdiction storage or adequacy-determined third countries, retention limitations with defined deletion timelines, and participant rights provisioning including access, correction, deletion, and portability, with institutional data protection officers providing compliance guidance and documentation review before validation commencement. Publication of methodology, instruments, and validation results through open-access repositories such as Zenodo enables the broader heritage sector to benefit from the research investment whilst contributing to an evidence-based technology evaluation culture that currently remains underdeveloped in heritage contexts compared to medical, educational, or commercial domains, whose extensive evaluation literature and established best practices the heritage sector can adapt, acknowledging domain-specific requirements that direct transfer might not adequately address without thoughtful adaptation and validation through heritage community engagement.
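As a rough illustration of the sample-size consideration above, the sketch below runs a power calculation with statsmodels for a one-sample or paired t-test; the medium effect size (d = 0.5), alpha, and power targets are illustrative assumptions rather than VAARHeT specifications.

```python
# Illustrative power calculation for choosing a validation sample size;
# the effect size and targets are assumptions, not project requirements.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                         alternative="two-sided")
print(f"Participants needed to detect a medium effect (d = 0.5): {n:.0f}")
```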

Conclusion: Methodological Contributions and Future Validation Research

VAARHeT validation methodology development contributes to cultural heritage technology evaluation practice through a comprehensive framework integrating quantitative metrics and qualitative insights, standardised instruments and heritage-specific assessments, and controlled measurement with ecological validity, whilst generating replicable procedures enabling sector-wide capability building for evidence-based technology adoption decisions. The primary methodological contribution is a heritage-adapted instrument package combining the System Usability Scale, added value ratings, Net Promoter Score, task completion protocols, Nielsen severity assessment, and qualitative feedback collection. Validation across 39 participants demonstrated instrument reliability, sensitivity, and appropriateness for heritage XR evaluation whilst revealing domain-specific requirements, including accuracy assessment, cultural sensitivity evaluation, and institutional impact measurement extending beyond pure user experience metrics. Ethical approval documentation and GDPR compliance procedures provide a template for heritage institutions navigating research ethics requirements, participant protection obligations, and data sovereignty regulations when conducting technology validation with visitor populations. Participant recruitment strategies balancing representativeness against practical feasibility inform sampling approaches for resource-constrained institutions lacking access to large visitor populations or professional recruitment services. Analysis frameworks integrating descriptive statistics, thematic coding, and severity assessment demonstrate practical approaches for mixed-methods interpretation, generating actionable insights without requiring advanced statistical expertise or specialised qualitative analysis software beyond the capabilities of typical museum personnel. Future validation research should expand cross-cultural assessment, testing methodology transferability across diverse European heritage contexts including art museums, historic houses, industrial heritage, and cultural landscapes with varying visitor demographics, institutional characteristics, and operational constraints potentially requiring methodological adaptation beyond archaeological park validation. Longitudinal studies assessing sustained usage patterns, learning outcome persistence, and institutional adoption dynamics beyond initial deployment enthusiasm would strengthen evidence about heritage XR long-term viability versus novelty effects potentially inflating short-term validation results that operational experience might reveal as unsustainable. Comparative evaluation against conventional interpretation alternatives, including guided tours, audio guides, and interactive exhibits, would enable robust value proposition assessment, determining whether XR provides incremental benefits justifying incremental costs rather than novel experiences without proportional effectiveness improvement. Standardisation of heritage technology evaluation protocols through sector consensus on core metrics, minimal instrument packages, and reporting standards would enable meta-analysis across multiple studies, generating a sector-wide evidence base that isolated institutional validations cannot provide when methodological heterogeneity prevents meaningful comparison and knowledge accumulation.