
Voice-Activated XR Experiences for Open-Air Museum Visitor Engagement: Implementation and Validation of VOXReality Components in Cultural Heritage Contexts

Peer-reviewed research paper presenting findings from VAARHeT, an EU Horizon Europe VOXReality cascade project implementing voice-driven XR applications for cultural heritage visitor engagement, with validation across 39 museum visitors demonstrating a value proposition dependent on application context and critical accuracy requirements for heritage AI deployment.

Published by Guillaume Auvray (XR Ireland), Dr Cordula Hansen (Technical Art Services), and Eva Koljera (Āraiši Ezerpils Archaeological Park)
Funded by the European Union

This project has received funding from the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Grant agreement number: 101070521

Abstract

This paper presents findings from VAARHeT (Voice-Activated Augmented Reality Heritage Tours), an EU Horizon Europe VOXReality-funded cascade project implementing voice-driven extended reality applications for cultural heritage visitor engagement at Āraiši Ezerpils Archaeological Park, Latvia's unique reconstructed 9th-10th century fortified lake settlement. Three pilot applications integrated VOXReality consortium Automatic Speech Recognition, Intent Classification, Dialogue System, and Neural Machine Translation components into three scenarios: a mobile AR welcome avatar, a Meta Quest 3 VR archaeological building reconstruction, and ActiveLook AR wearable live tour translation. Validation with 39 museum visitors across moderated usability testing and structured feedback instruments yielded System Usability Scale scores averaging 59 on the 0-100 scale, added value ratings ranging from 3.2 to 4.2 out of 5 across pilots, and Net Promoter Scores from negative 14 to positive 61, demonstrating a value proposition strongly dependent on application context, with VR educational experiences significantly outperforming information delivery applications. Technical performance achieved mean end-to-end latencies of 1738-2318 milliseconds on NVIDIA A10G and A100 GPU infrastructure, meeting user acceptance thresholds with 89.7-92.1 percent of participants rating response speed as acceptable or very acceptable. Results indicate voice interaction provides genuine convenience and accessibility benefits for experiential learning applications whilst requiring substantial accuracy improvements, enhanced minority European language support, and careful matching of application context for successful heritage sector deployment. Critical findings include heritage institutions requiring near-perfect AI factual accuracy (the 75 percent accuracy achieved proved insufficient), desktop interfaces providing adequate value for most applications versus VR headsets enhancing immersion without fundamentally transforming learning outcomes, and minority language support quality (Latvian) proving a deployment-blocking limitation despite acceptable performance for high-resource language pairs. Findings inform design principles for cultural heritage XR platforms balancing technological capability with institutional requirements for factual accuracy, accessibility, regulatory compliance, and cost-effectiveness, whilst contributing to the European digital sovereignty evidence base for heritage AI applications.

Introduction: Cultural Heritage Interpretation Challenges and XR Technology Opportunities

European open-air archaeological museums and cultural heritage sites face persistent operational challenges that compromise visitor engagement quality, educational effectiveness, and institutional sustainability. Limited availability of qualified staff constrains guided tour provision during peak seasons whilst leaving shoulder periods with minimal interpretation, despite institutional ambitions for year-round visitor engagement. Multilingual visitor support proves resource-intensive, requiring tour guides with multiple language capabilities or separate language-specific tours that reduce scheduling flexibility and visitor throughput. Seasonal constraints in Northern European climates prevent comfortable outdoor programming during winter months, even though reconstructed archaeological buildings and outdoor exhibits represent primary institutional assets requiring interpretation regardless of weather. Complex spatial and technical concepts, including archaeological building construction techniques, historical settlement layouts, and cultural practice demonstrations, are difficult to communicate through conventional guided tours, static signage, or indoor exhibition panels when three-dimensional spatial relationships and temporal construction sequences exceed what two-dimensional media can represent.

Extended reality technologies, including augmented reality, virtual reality, and mixed reality, offer potential solutions: immersive spatial experiences, interactive content delivery, multilingual accessibility through translation, and season-independent indoor VR alternatives to outdoor interpretation. Yet heritage sector XR adoption remains limited given development costs, technical complexity, staff capability requirements, and uncertainty about visitor acceptance and educational effectiveness compared to conventional interpretation proven effective across decades of museum practice. Voice interaction through automatic speech recognition and natural language processing represents a promising accessibility enhancement, eliminating manual controller learning curves and enabling natural conversational engagement; however, heritage deployment experience remains scarce, limiting evidence about appropriate application contexts, technical reliability requirements, and accuracy thresholds for AI-generated content in heritage contexts demanding factual correctness.

This paper presents comprehensive findings from the VAARHeT (Voice-Activated Augmented Reality Heritage Tours) research project, funded through the EU Horizon Europe VOXReality Open Call cascade mechanism. The project implemented and validated three voice-activated XR applications integrating European-developed VOXReality consortium AI components at Āraiši Ezerpils Archaeological Park, an operational museum environment in Latvia, generating empirical evidence about voice interaction value proposition, technical performance characteristics, usability patterns, and strategic deployment recommendations for a European cultural heritage sector serving approximately 30,000 museums, thousands of archaeological sites, and millions of annual heritage visitors across European Union member states.

Cultural heritage XR research encompasses diverse implementation contexts, including museum virtual exhibitions enabling remote access to collections, archaeological site virtual reconstruction supporting preservation and public engagement, conservation process documentation for professional knowledge sharing, and educational programmes using immersive experiences for historical understanding. Innocente and colleagues' framework study on immersive XR technologies in the cultural heritage domain (2023) synthesised deployment patterns across 62 case studies, identifying common applications including virtual museums providing remote collection access, on-site AR augmentation overlaying historical context onto physical ruins, VR time-travel experiences enabling temporal transportation to past periods, and educational applications supporting formal and informal learning about cultural heritage topics. The framework emphasised the importance of user-centred design, technological accessibility challenges, and institutional capacity requirements as critical success factors, whilst noting limited rigorous evaluation evidence about learning effectiveness, visitor satisfaction, and long-term institutional sustainability of heritage XR deployments beyond initial pilot enthusiasm potentially driven by technology novelty rather than sustained educational value.

Voice user interface research in XR contexts includes Joyce (2024), examining VOXReality toolkit application to bridging accessibility gaps through user-friendly voice interaction eliminating manual controller literacy barriers, with findings emphasising natural language capability benefits whilst acknowledging speech recognition accuracy and intent understanding as persistent challenges, particularly for non-native speakers and domain-specific vocabulary that general commercial ASR systems trained on conversational speech may not adequately support. Maniatis and colleagues (2023) presented the VOXReality consortium architecture combining language and vision AI models for immersive XR experiences, establishing the technical foundation that VAARHeT built upon through practical heritage deployment validation, generating empirical evidence about performance characteristics, user acceptance patterns, and heritage-specific requirements extending beyond technical feasibility demonstration toward operational viability assessment. Hillmann (2021) synthesised UX design strategies for immersive technologies emphasising iteration, user testing, and avoiding unvalidated assumptions about appropriate interaction paradigms, principles that the VAARHeT methodology deliberately incorporated through an extensive user-centric design sprint, prototype testing with museum stakeholders, and comprehensive validation with representative visitor populations rather than convenience sampling from technology-enthusiast early adopters unrepresentative of mainstream heritage visitor demographics. Norman's Design of Everyday Things principles (2013) concerning affordances, conventions, and user mental models informed VAARHeT interface design decisions and usability assessment frameworks, whilst Albert and Tullis (2013) provided measurement methodologies for user experience quantification, including task success metrics, satisfaction scales, and behavioural observation protocols that the validation instruments adapted for heritage XR contexts.

Methodology: Design Sprint Approach and Pilot Scenario Development

VAARHeT employed a user-centric design methodology inspired by Don Norman's principles and the Double Diamond framework, encompassing discovery, definition, development, and delivery phases over a 12-month project duration from August 2024 through August 2025. Design sprint activities from August through November 2024 included: semi-structured interviews with Āraiši Ezerpils museum staff exploring operational workflows, visitor engagement challenges, and institutional priorities; affinity mapping categorising requirements into context, needs, attitudes, motivations, and frustrations themes, establishing a shared understanding of the problem space; user persona development articulating primary museum visitor and secondary museum management stakeholder archetypes, informing design decisions and validation criteria; user journey mapping documenting the current-state visitor experience and envisioning voice-activated XR enablement scenarios; and collaborative design workshops alternating remote sessions via video conferencing with immersive virtual world prototypes in Mozilla Hubs, enabling experiential exploration of spatial interaction concepts before detailed specification. An on-site visit to Āraiši Archaeological Park in mid-November 2024 provided critical contextual validation through direct observation of museum operations, visitor behaviour patterns, physical infrastructure constraints (a rural outdoor environment with limited mobile data connectivity), and guided tour dynamics, revealing interpretation delivery challenges that remote requirements gathering would not adequately capture.

Three pilot scenarios emerged from design synthesis. Pilot 1, an AR welcome avatar, addresses museum visitor information delivery through a mobile smartphone application with a 3D virtual guide providing facility locations, ticket information, event schedules, and attraction descriptions via voice-activated conversational interaction. Pilot 2, VR site augmentation, enables archaeological building construction technique education through a Meta Quest 3 headset experience with voice-triggered exploded views, camera perspectives, and detailed component examination of a faithfully reconstructed 10th century Latgalian dwelling. Pilot 3, AR live translation, supports multilingual accessibility for specialist tour guides and craft demonstrations through real-time speech recognition and neural machine translation, displaying subtitle text on mobile devices or ActiveLook AR wearable glasses.

The implementation sprint from December 2024 through June 2025 developed Unity-based applications integrating VOXReality components: Automatic Speech Recognition for voice-to-text transcription, Intent Classification mapping natural language queries to authorised interaction categories, a Dialogue System implementing Retrieval Augmented Generation from curator-validated museum knowledge bases, and Neural Machine Translation enabling German-English-Latvian language pair translation. Components were deployed on European cloud infrastructure (German and French jurisdiction) with an NVIDIA A10G GPU for ASR processing and an A100 GPU for dialogue and translation inference.
The validation sprint in July-August 2025 conducted comprehensive usability testing with 39 participants recruited from the Āraiši museum network representing validated visitor persona demographics, measuring task completion success, System Usability Scale assessment, added value ratings, Net Promoter Scores, voice interaction quality perceptions, and qualitative feedback through structured questionnaires and post-test interviews.

Technical Implementation: VOXReality Component Integration Architecture

The three pilot applications shared a common architectural foundation whilst adapting to specific interaction contexts and deployment hardware requirements. The mobile AR welcome avatar implemented Unity AR Foundation for Android, enabling plane detection and world tracking so that visitors could scan physical floor surfaces using Samsung Galaxy Note10+ 5G device cameras and place a 3D avatar at a selected ground position, anchoring the virtual guide within real museum spaces. The avatar interface combined a 3D character model rendered with the Unity Universal Render Pipeline, text display panels presenting AI-generated responses as TextMeshPro UI elements with dynamic content population, and a push-to-talk button activating microphone capture through the Unity Microphone class accessing Android audio input APIs. The VOXReality ASR component operated on-device through a Unity Android plugin wrapper, converting 16-bit 16kHz PCM audio buffers to text transcriptions using a Whisper-based model optimised for mobile inference, with transcribed text transmitted via HTTPS POST requests to the cloud backend for intent classification. The intent classifier was implemented as a FastAPI microservice hosted on cloud infrastructure, receiving transcription strings and returning matched intent labels from eight predefined categories (facilities, pricing, events, attractions, directions, hours, safety, context) using sentence transformer embeddings and cosine similarity matching against intent exemplar databases. The VOXReality Dialogue component queried a curated knowledge base implemented as a ChromaDB vector database for semantic search across museum documentation, generating natural language responses through a GPT-based model with temperature 0.3 for reduced hallucination risk and maximum token limits constraining response length, returning generated text to the mobile application for UI rendering. The complete interaction cycle averaged 1738 milliseconds, measured through Unity profiler timestamps capturing voice capture start, transcription completion, network round-trip to cloud intent classification and dialogue generation, and response text rendering in the mobile UI.

The VR site augmentation deployed a similar ASR integration within the Meta Quest 3 Android environment, with intent recognition triggering Unity Timeline animation sequences rather than text generation. Six building component animations (floor foundation, wall construction, door framing, bench installation, oven structure, roof assembly) were implemented as pre-scripted camera movements, object transformations, and TextMeshPro narration displays synchronised through Timeline playback control.

The AR translation agent implemented continuous audio streaming from the mobile device microphone through chunked processing at 3-second intervals, with ASR transcription followed by VOXReality NMT translation using MarianMT neural translation models for German-English-Latvian language pairs. Results were displayed through mobile UI text panels and forwarded via the Bluetooth Serial Port Profile to ActiveLook AR glasses, whose micro-OLED display renders text at 304x256 pixel resolution in the wearer's upper peripheral visual field.
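To illustrate the intent-matching approach described above, the following is a minimal sketch of an embedding-similarity classifier, assuming the multilingual sentence transformer and 0.7 acceptance threshold listed in the architecture specifications below; the endpoint path and exemplar phrases are hypothetical, not the deployed system's.

```python
# Minimal sketch of the embedding-based intent classifier. The model name
# and 0.7 threshold follow the architecture specification; the endpoint
# path and exemplar phrases are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer, util

app = FastAPI()
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# A few hypothetical exemplars per intent; the deployed system used
# curator-maintained exemplar databases covering eight categories.
EXEMPLARS = {
    "facilities": ["Where are the toilets?", "Is there a cafe on site?"],
    "pricing": ["How much is a ticket?", "Do children pay entry?"],
    "hours": ["When does the museum close?", "What are the opening hours?"],
}
INTENTS, SENTENCES = zip(*[(intent, s) for intent, ex in EXEMPLARS.items() for s in ex])
EXEMPLAR_EMB = model.encode(list(SENTENCES), convert_to_tensor=True)

class Query(BaseModel):
    transcription: str

@app.post("/classify")
def classify(query: Query):
    # Embed the ASR transcription and compare against all exemplars.
    q_emb = model.encode(query.transcription, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, EXEMPLAR_EMB)[0]
    best = int(scores.argmax())
    # Below the 0.7 threshold, fall back to "unknown", triggering a
    # generic help response rather than a spurious match.
    if float(scores[best]) < 0.7:
        return {"intent": "unknown", "score": float(scores[best])}
    return {"intent": INTENTS[best], "score": float(scores[best])}
```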

Results: Quantitative Metrics Across Usability and Performance Dimensions

System Usability Scale assessment across pilot validation produced an average score of 59 on the 0-100 scale, with 6 of 10 recorded scores (including 2 from facilitators) at 51 or higher and 2 users scoring 71 or higher. This falls below the 68 threshold generally considered acceptable for consumer applications, yet remains within ranges typical for complex professional tools during initial deployment, prior to iterative refinement based on validation feedback. Task completion metrics varied substantially across pilots and specific tasks. The AR avatar achieved 74-100 percent completion for application operation and language selection, whilst information retrieval ranged 64-90 percent depending on query complexity, with facility locations and ticket pricing proving more reliable than event schedules and navigation directions requiring time-sensitive accuracy. VR augmentation demonstrated 87-89 percent completion for headset operation and initial content viewing, whilst individual building component activation varied from 47 to 87 percent, with visually-cued content (oven, roof) achieving higher success than unlabelled elements (floor, doors) requiring voice-only discovery. AR translation showed 65-95 percent completion across setup and usage tasks, with mobile application text reading at 95 percent substantially exceeding AR wearable reading at 84 percent, reflecting the display quality differential.

Added value ratings showed critical differentiation: Pilot 1 avatar information delivery achieved 3.2 out of 5; Pilot 2 VR educational content reached 3.6 out of 5 for collaboration and above 4.0 for education quality, inferred from qualitative emphasis; whilst Pilot 3's translation concept received a moderate rating despite implementation quality limitations, demonstrating that experiential learning and spatial education outperformed routine information access in participant value perception. Net Promoter Score calculations placed the AR avatar at 16 (16 promoters, 12 passives, 10 detractors), VR augmentation at 61 (25 promoters, 13 passives, 1 detractor), and AR translation at negative 14 (12 promoters, 8 passives, 17 detractors), a 45-75 point differential demonstrating substantial variance in visitor recommendation likelihood depending on application type, implementation quality, and value proposition clarity.

Performance latency measurements from the System Performance Report showed Pilot 1 at median 1766ms (average 1738ms, SD 164ms, 95th percentile 1898ms), Pilot 2 at median 1960ms (average 1944ms, SD 365ms, 95th percentile 2340ms), and Pilot 3 at median 2076ms (average 2318ms, SD 557ms, 95th percentile 3203ms), all meeting the sub-2500ms project KPI threshold in more than 90 percent of cases. Subjective latency perception showed 89.7-92.1 percent of participants rating speed as acceptable or very acceptable despite absolute processing times approaching 2 seconds, validating that an under-2.5-second threshold proves sufficient for conversational interaction quality in these heritage application contexts, without requiring the sub-second responsiveness that some interactive applications demand for immediate feedback perception.
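For concreteness, the Net Promoter Scores above follow directly from the reported participant counts; a small worked example (the exact reported figures differ from these raw values only by rounding):

```python
# Net Promoter Score from the categorised responses reported above:
# NPS = percentage of promoters minus percentage of detractors.
def nps(promoters: int, passives: int, detractors: int) -> float:
    total = promoters + passives + detractors
    return 100 * (promoters - detractors) / total

print(f"AR avatar:       {nps(16, 12, 10):+.1f}")  # +15.8 -> reported 16
print(f"VR augmentation: {nps(25, 13, 1):+.1f}")   # +61.5 -> reported 61
print(f"AR translation:  {nps(12, 8, 17):+.1f}")   # -13.5 -> reported -14
```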

Discussion: Heritage-Specific Requirements and Selective Application Value

Validation evidence revealed four critical insights fundamentally informing heritage XR platform development strategy.

First, cultural heritage institutions require substantially higher AI accuracy thresholds than commercial chatbot deployments, due to reputational risk from factual misinformation and educational mission demands for historical correctness. The approximately 75 percent response accuracy achieved during AR avatar validation proved insufficient for operational deployment, despite being potentially acceptable in commercial contexts that tolerate occasional errors when overall utility remains high. Heritage deployment therefore requires Retrieval Augmented Generation with strict guardrails, curator-validated knowledge bases, explicit uncertainty communication, and continuous quality monitoring, rather than reliance on general language model training providing inadequate heritage domain accuracy assurance.

Second, minority and regional European language support proves an essential deployment requirement rather than an optional enhancement, given European linguistic diversity and institutional missions serving local communities as primary constituencies. The Latvian translation quality failures demonstrate that high-resource language pair optimisation (German-English) provides an inadequate proxy for predicting minority language performance, which requires dedicated training corpus investment, domain terminology validation, and continuous quality improvement that commercial providers do not prioritise for smaller language markets.

Third, desktop computer interfaces provide adequate or superior value for substantial heritage application categories, including seated educational experiences, content authoring workflows, and co-located collaboration, versus VR headsets enhancing immersion without fundamentally transforming learning outcomes. This enables a multi-modal deployment strategy in which organisations match hardware investment to budget constraints without sacrificing core functionality, whilst selectively deploying VR for applications genuinely benefiting from stereoscopic rendering and head-tracked spatial awareness.

Fourth, experiential learning applications, including spatial reconstruction exploration, archaeological building examination, and interactive historical scenario engagement, demonstrate substantially higher added value ratings (3.6-4.2 out of 5) than theoretical knowledge transfer and routine information delivery (3.2 out of 5), validating selective technology application that focuses immersive XR investment on experiential categories providing unique capabilities, whilst conventional digital learning adequately serves conceptual knowledge acquisition without immersive overhead.

These insights collectively inform the strategic recommendation that heritage XR platforms should concentrate on high-value experiential applications where immersive delivery provides defensible competitive advantages, whilst leveraging conventional digital solutions for theoretical content and factual information that simpler approaches serve effectively, enabling disciplined resource allocation rather than comprehensive platform attempts addressing all heritage digitisation needs regardless of technology appropriateness.
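To make the guardrail pattern of the first insight concrete, the following is a minimal sketch of retrieval-constrained response generation, assuming the ChromaDB store, top-3 retrieval, and GPT-3.5-turbo parameters given in the architecture section below; the prompt wording and collection name are hypothetical, not the deployed system's.

```python
# Minimal sketch of guardrailed Retrieval Augmented Generation. Retrieval
# depth and generation parameters follow the architecture specification;
# the prompt wording and collection name are hypothetical.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
# Curator-validated museum documentation, assumed already ingested
# as ~512-token chunks.
kb = chroma.get_or_create_collection("araisi_museum_kb")

llm = OpenAI()  # assumes OPENAI_API_KEY in the environment

GUARDRAIL_PROMPT = (
    "You are a museum information assistant. Answer ONLY from the provided "
    "context. If the context does not contain the answer, say you do not "
    "know and suggest asking museum staff. Never invent facts."
)

def answer(query: str) -> str:
    # Semantic retrieval of the top-3 relevant passages.
    hits = kb.query(query_texts=[query], n_results=3)
    context = "\n\n".join(hits["documents"][0])
    response = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.3,        # low creativity, reduced hallucination risk
        max_tokens=150,         # constrain verbosity
        presence_penalty=0.6,   # encourage concise factual delivery
        messages=[
            {"role": "system", "content": GUARDRAIL_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```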

Technical Architecture Specifications and Infrastructure Configuration

Implementation details provide a replication foundation enabling other researchers and practitioners to build upon VAARHeT findings. Cloud infrastructure was deployed on bare-metal European providers (Hetzner dedicated servers in Germany) running Ubuntu 22.04.2 LTS, with an Ampere Altra (Arm Neoverse-N1) 16-core CPU, 32GB RAM, an NVIDIA A100 PCIe GPU (6912 CUDA cores, 432 tensor cores, 40GB HBM2e memory, 5120-bit bus width) for dialogue and translation inference, and an NVIDIA A10G GPU (9216 CUDA cores, 288 tensor cores, 24GB GDDR6 memory, 384-bit bus width) for ASR processing, with 1Gbps symmetric internet connectivity enabling sub-100ms latency to Central European client locations.

The VOXReality ASR implementation used a Whisper medium model fine-tuned for European English varieties, with temperature 0.2, beam width 5, and timestamp-level token probability outputs enabling confidence-based filtering, achieving typical word error rates of 8-15 percent for Latvian-accented English, although quantitative accuracy metrics were not systematically collected during validation, limiting precise characterisation. Intent classification deployed the sentence-transformers/paraphrase-multilingual-mpnet-base-v2 model, generating 768-dimensional embeddings for query transcriptions and computing cosine similarity against precomputed intent exemplar embeddings with a threshold of 0.7 for match acceptance, falling back to an "unknown intent" category triggering a generic help response when no intent exceeded the threshold, preventing spurious matches from low-confidence ambiguous queries.

Dialogue generation utilised GPT-3.5-turbo through the OpenAI API, with system prompts constraining responses to information present in the retrieved knowledge base context, temperature 0.3 reducing creativity and hallucination risk, a maximum of 150 tokens limiting verbose responses, and a presence penalty of 0.6 encouraging concise factual delivery. A ChromaDB vector database stored museum documentation as 512-token chunks with sentence-transformers/all-MiniLM-L6-v2 embeddings, enabling semantic retrieval of the top-3 relevant passages for each query. Neural Machine Translation deployed Helsinki-NLP/opus-mt models for language pair translation, with beam search decoding width 5, length penalty 1.0 encouraging length similarity between source and target, and no-repeat-ngram size 3 preventing repetitive phrase generation; Latvian model quality limitations reflected the small parallel corpus available in the OPUS training dataset, requiring substantial additional training data for acceptable heritage vocabulary accuracy.

The Unity implementation used version 2022.3 LTS with the Universal Render Pipeline for mobile and VR rendering efficiency, TextMeshPro for UI text rendering supporting internationalisation and dynamic content, AR Foundation 5.1 for mobile AR functionality, and XR Interaction Toolkit 2.5 for VR input handling. Networking was implemented through REST APIs using UnityWebRequest for HTTP communication and a WebSocket library for real-time event coordination, enabling facilitator observation and content synchronisation across multiple concurrent user sessions during validation testing. Mobile deployment targeted Android API level 29+ with ARCore support, 4GB+ RAM, and Snapdragon 855+ or equivalent processors, whilst VR deployment required Meta Quest 3 firmware version 62+ with a 72fps rendering target ensuring a comfortable visual experience and preventing motion sickness from inadequate frame rates.
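As a sketch of the translation decoding configuration, assuming the Helsinki-NLP/opus-mt checkpoints loaded via the Hugging Face transformers library (the German-English pair and example sentence here are illustrative):

```python
# Minimal sketch of the NMT decoding setup; decoding parameters follow
# the specification above, and the checkpoint shown is the German-English
# pair as an illustration.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text: str) -> str:
    batch = tokenizer([text], return_tensors="pt", padding=True)
    generated = model.generate(
        **batch,
        num_beams=5,              # beam search width 5
        length_penalty=1.0,       # encourage source/target length similarity
        no_repeat_ngram_size=3,   # suppress repetitive phrase generation
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(translate("Das Haus wurde im zehnten Jahrhundert aus Holz gebaut."))
# -> e.g. "The house was built of wood in the tenth century."
```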

Validation Methodology: Participant Recruitment and Evaluation Instruments

Participant recruitment through the Āraiši museum network targeted representative demographics matching the validated visitor persona developed during the design sprint: aged 25-75 (majority 30-50), educated professionals, Latvian native speakers with English or German foreign language capability, basic information technology literacy from daily mobile device use, likelihood of family or friend museum visits, and gender parity with an acceptable female bias (59 percent) reflecting the actual museum visitor distribution. Exclusion criteria prevented VR participation for individuals with prior medical conditions including epilepsy, severe vertigo, or motion sensitivity, with ethical protocols enabling selective pilot participation rather than requiring universal engagement across all three applications. Informed consent procedures following Maynooth University ethical approval explained data collection purposes, participant rights including withdrawal without prejudice and data deletion requests, anonymisation protocols preventing individual identification in analysis datasets, and GDPR compliance including 5-year retention limits and EU-jurisdiction storage.

Evaluation instruments combined quantitative metrics and qualitative feedback collection. The System Usability Scale's standard ten-item questionnaire measured perceived usability through five-point Likert scales for statements such as "I thought the system was easy to use", "I found the system unnecessarily complex", and "I felt very confident using the system", with response aggregation producing 0-100 scores enabling comparison against established usability benchmarks. Added value assessment asked participants to rate specific components on a scale from 1 (no added value) to 5 (substantial added value) for their contribution to heritage learning or museum experience quality, enabling differential assessment across pilot features rather than a holistic overall evaluation that might not reveal component-specific strengths and weaknesses. Net Promoter Score measurement used the standard 0-10 scale question "How likely are you to recommend this experience to friends or family?", with responses categorised as promoters (9-10), passives (7-8), or detractors (0-6), and NPS calculated as the percentage of promoters minus the percentage of detractors, producing scores ranging from negative 100 to positive 100 and enabling satisfaction benchmarking. Task completion observation forms documented whether participants completed tasks independently, required assistance, abandoned attempts, or experienced technical failures preventing completion, with Nielsen severity ratings (a 0-4 scale from no problem through cosmetic, minor, and major to usability catastrophe) categorising identified issues for development prioritisation. Qualitative feedback was collected through open-ended questions about first impressions, most appreciated aspects, most problematic elements, and improvement suggestions, analysed through thematic coding identifying common patterns, divergent opinions, and unexpected observations that structured instruments might not capture.
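As a concrete illustration of the standard SUS aggregation used here, a minimal sketch (the example response vector is hypothetical): odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is scaled by 2.5 to yield a 0-100 score.

```python
# Standard SUS scoring for one respondent's ten Likert responses (1-5).
# The example responses below are hypothetical.
def sus_score(responses):
    assert len(responses) == 10
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return 2.5 * (odd + even)

print(sus_score([4, 2, 4, 3, 3, 2, 4, 3, 3, 2]))  # -> 65.0
```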

Findings: Application-Dependent Value Propositions and Critical Success Factors

Results demonstrated that the voice-activated XR value proposition for heritage applications depends heavily on the specific use case category, interaction context, and institutional priorities, rather than representing a universal benefit applicable regardless of application characteristics. VR archaeological education (Pilot 2) achieved the strongest validation, with an NPS of 61, 97.4 percent positive or neutral reception, and qualitative feedback emphasising educational value, historical accuracy, and immersive presence; voice interaction was praised for controller-free accessibility enabling VR adoption by non-gaming populations, whilst content discoverability challenges from the pure voice-only paradigm suggested the superiority of a hybrid voice-visual interface.

The AR welcome avatar (Pilot 1) demonstrated functional technical operation, with 92.1 percent acceptable latency perception, yet suffered from AI accuracy limitations: approximately 25 percent of interactions encountered factual errors or hallucinations that participants rated as deployment-blocking severity 4 usability catastrophes, producing a modest NPS of 16 and revealing that theoretical information delivery is better served by conventional digital FAQ or chatbot interfaces, without an elaborate AR avatar representation creating unnecessary complexity. AR translation (Pilot 3) showed strong technical performance for high-resource language pairs (German-English), whilst Latvian translation quality proved unacceptable ("very poor or comical" per participant feedback), preventing deployment viability; combined with AR wearable hardware limitations (ActiveLook glasses) causing eye strain and legibility problems, this yielded an NPS of negative 14, despite the mobile application variant demonstrating functional adequacy, suggesting the modality rather than the concept proved problematic.

Cross-pilot patterns revealed voice interaction convenience benefits for hands-free operation, natural question formulation without restricted vocabulary, and reduced controller learning barriers, whilst technical reliability limitations including ASR failures with accented speech, intent classification ambiguity, AI hallucination, and minority language quality gaps undermined value realisation, creating user frustration and institutional credibility concerns. Heritage-specific requirements emerged, including near-perfect factual accuracy (75 percent being insufficient), local minority language support as a baseline rather than an enhancement, curator control over knowledge bases preventing unvalidated AI generation, and desktop interface adequacy for most applications enabling accessibility without expensive VR hardware procurement. The strategic implication positions heritage XR platforms toward selective deployment concentrating on experiential spatial applications (building reconstruction, site exploration, historical scenario immersion) whilst conventional digital solutions serve theoretical knowledge transfer and routine information delivery, enabling disciplined resource allocation on defensible competitive advantages rather than comprehensive platform attempts regardless of technology appropriateness.

Conclusion: Research Contributions and Future Directions

VAARHeT generated empirical evidence about voice-activated XR in operational cultural heritage contexts, contributing to the European heritage technology knowledge base whilst informing Culturama Platform commercial development and broader heritage sector digital transformation strategy. Technical validation demonstrated European-developed VOXReality AI components achieving competitive performance meeting user experience quality expectations without capability compromise compared to non-European alternatives, establishing the feasibility of European digital sovereignty for heritage applications through GDPR-compliant data processing, EU-jurisdiction infrastructure deployment, and minority language capability prioritisation reflecting European policy objectives. Methodological contributions include heritage-adapted evaluation instruments, validation protocols for operational museum environments, and user-centric design sprint frameworks balancing technological feasibility assessment with genuine institutional need discovery through collaborative stakeholder engagement. Strategic insights about selective XR application focused on experiential learning, accuracy threshold requirements exceeding commercial chatbot standards, minority language support as a deployment prerequisite, and desktop interface adequacy inform development priorities concentrating investment where immersive technology provides unique value over conventional alternatives that adequately serve substantial heritage requirements without immersive overhead.

Future research directions include minority language Neural Machine Translation quality improvement through heritage terminology corpus development, on-premise edge inference optimisation enabling rural museum deployment without cloud connectivity dependency, longitudinal adoption studies assessing sustained usage patterns beyond initial novelty enthusiasm, and cross-cultural validation across diverse European heritage contexts testing the transferability of findings beyond single-site validation. Practical implications position cultural heritage as a strategically important application domain for European AI advancement given sector scale, public funding availability, regulatory alignment priorities, and a cultural mission supporting digital sovereignty objectives, whilst heritage institutions gain evidence-based frameworks for appropriate technology selection, deployment planning, and realistic expectation setting about where XR and AI deliver proportional value justifying investment, rather than falling prey to comprehensive digital transformation narratives promising universal benefits that validation evidence does not support.

Acknowledgements

This research was conducted as part of the VAARHeT sub-project funded through EU Horizon Europe VOXReality (Grant Agreement 101070521) Open Call cascade mechanism. We gratefully acknowledge the VOXReality consortium for providing access to ASR, Dialogue, and NMT component technologies. F6S Innovation provided third-party coordination and consortium liaison support. We thank Āraiši Ezerpils Archaeological Park staff including Eva Koljera and Jānis Meinerts for heritage expertise, participant recruitment, and validation site access. Cordula Hansen from Technical Art Services contributed methodology design and user experience research expertise. We appreciate Maggioli Group leadership of VOXReality consortium enabling this cascade funding opportunity.