
Voice-Driven VR Site Augmentation for Archaeological Building Education

How XR Ireland developed a voice-activated VR experience enabling museum visitors to explore a 10th century Latgalian building reconstruction through natural speech commands, achieving a Net Promoter Score of 61 whilst revealing critical content discoverability challenges in pure voice-only interaction paradigms.

Published by Anastasiia P.
Funded by the European Union


This project has received funding from the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Grant agreement number: 101070521

Archaeological Interpretation Challenge and VR Educational Opportunity

Open-air archaeological museums presenting reconstructed historical buildings face fundamental interpretation challenges that conventional guided tours and static exhibits struggle to address. At Āraiši Ezerpils Archaeological Park, visitors encounter faithful reconstructions of 9th-10th century Latgalian lake settlement dwellings built from authentic archaeological evidence using traditional construction methods, yet the architectural complexity and sophisticated building techniques of these Iron Age communities are difficult to communicate through visual observation or verbal description alone. The intricate relationships between structural elements remain largely invisible to visitors viewing completed reconstructions from exterior vantage points: foundation pile systems anchoring buildings into the lakebed; log walls assembled with specific notching and joining techniques that achieve stable vertical structures without metal fasteners; roof framing whose complex timber geometry distributes load across spanning elements; interior features such as clay ovens, wooden benches, and storage solutions reflecting cultural practices; and the sequential construction process progressing from ground preparation through structural framing to weatherproofing and finishing. Traditional interpretation approaches, such as physical exploded-view scale models displayed in indoor exhibition spaces, provide partial understanding but cannot convey full three-dimensional spatial relationships or demonstrate the dynamic construction sequences by which individual components assemble into an integrated structural system.
Live explanations from museum guides are highly effective but scale poorly: staffing is limited during peak visitor periods, guide availability drops in shoulder seasons, linguistic barriers affect international visitors when tours run only in Latvian or a few foreign languages, and complex spatial and temporal processes are inherently harder to describe verbally than to demonstrate visually. Seasonal limitations are particularly acute for open-air museums in Northern European climates, where winter temperature extremes rule out comfortable outdoor guided tours and living history demonstrations despite the institutional desire to maintain year-round visitor engagement and revenue.

The VAARHeT sub-project Pilot 2 scenario addressed these interpretation challenges through voice-activated VR site augmentation: visitors explore 3D digital reconstructions of archaeological buildings using natural language commands that trigger educational content, exploded structural views, varied camera perspectives, and detailed close-ups of construction techniques. Deployed on Meta Quest 3 headsets in climate-controlled indoor exhibition spaces, the experience enables season-independent interpretation that complements outdoor site exploration. XR Ireland integrated the VOXReality Automatic Speech Recognition and Intent Classification components with a Unity-based VR environment rendering a high-fidelity 3D model of the Chieftain's house from the Āraiši lake settlement. Visitors can speak naturally ("show me how the floor was built", "explode the walls", "zoom in on the oven") rather than learning manual VR controller interactions, reducing technical literacy barriers whilst validating the value of voice interaction for spatial exploration, as distinct from the pure information-retrieval context tested in Pilot 1.

VR Experience Design and Voice-Activated Content Architecture

The VR site augmentation experience placed visitors in a simplified staging environment presenting a detailed 3D reconstruction of the 10th century Latgalian dwelling at 1:1 scale, conveying spatial awareness and realistic proportions impossible to achieve through photographs, diagrams, or conventional museum displays. The digital reconstruction modelled six major building components as discrete interactive elements: the floor foundation system, including pile anchoring and platform construction; log wall assemblies with notching and joining details; door and window aperture framing; interior wooden bench and seating installations; the clay oven structure and chimney integration; and roof framing with weatherproofing material application. Each component was represented with archaeological accuracy validated by Āraiši Ezerpils heritage experts, prioritising historical correctness over aesthetic stylisation or simplified representation. Visitors wearing Meta Quest 3 headsets entered the VR environment through automatic application launch, encountering a welcome panel that explained the voice interaction mechanics and prompted practice questions to establish comfort with the conversational modality before accessing educational content. VOXReality Automatic Speech Recognition ran on-device within the Quest 3's Android-based operating system, converting visitor speech to text locally; results were transmitted to cloud infrastructure for intent classification on NVIDIA A100 GPU servers, achieving a median end-to-end latency of 1960 milliseconds from question completion to the start of the triggered animation.
Intent recognition mapped natural language queries to six authorised interaction intents, one per building component: "floor" triggered foundation and pile system visualisation with explanatory narration; "walls" activated a log construction demonstration showing the assembly sequence; "doors" focused the camera on aperture framing and closure mechanisms; "benches" highlighted interior furnishing and social space organisation; "oven" demonstrated clay oven construction and functional heating infrastructure; and "roof" revealed timber framing geometry and weatherproofing material application through animated exploded views separating the structural layers. Animation sequences combined smooth camera transitions between overview positions and detailed close-ups, visual highlighting drawing attention to specific structural elements, exploded assembly views separating components to reveal joining techniques and material interfaces, and pre-scripted educational narration presented as on-screen text panels explaining historical context, construction methods, material properties, and the cultural significance of architectural features. The deliberate choice of pre-scripted content over AI-generated explanations reflected museum stakeholders' priorities for historical accuracy and factual correctness, preventing hallucination or speculative interpretation that could undermine educational integrity. It also simplified the precise synchronisation of voice-triggered events, camera animations, 3D model transformations, and explanatory text, avoiding the coordination complexity that real-time AI generation would introduce.
Four of the six content categories (walls, benches, oven, roof) were explicitly referenced through text labels visible within the VR environment, prompting visitors about available queries. The remaining two (floor, doors) were intentionally left unlabelled to assess content discoverability in a pure voice-only scenario without visual cueing: would natural exploration and curiosity lead visitors to experiment with questions and discover the hidden content, or does discoverability require explicit signposting to prevent frustration from incomplete content access?

Validation Results: Task Success and Interaction Performance

Usability testing measured task completion across VR headset operation and voice-triggered content access with 38 participants (one opted out of the VR pilot due to prior motion sensitivity concerns). Headset donning and comfort achieved 86.8% completion without help and 13.2% requiring tester assistance with strap adjustment or lens spacing, indicating generally accessible hardware ergonomics across diverse user anthropometrics. Viewing at least one animation through voice interaction achieved 89.5% completion without help and 10.5% with assistance, demonstrating that most visitors learned the voice activation mechanics during the tutorial phase and could apply them to content access. Individual building component activation showed substantial variance, revealing the content discoverability challenge. The floor event achieved only 47.4% completion without help, 21.1% with assistance, 26.3% abandonment without triggering the content, and 5.3% technical failure, making it the lowest-performing category and validating concerns about voice-only discoverability without visual cueing. The doors event similarly struggled, with 50% completion, 21.1% requiring help, 26.3% abandonment, and 2.6% technical failure, confirming a pattern in which unlabelled content remained undiscovered by roughly 30% of visitors unwilling or unable to experiment enough to find hidden interaction opportunities. The walls event performed substantially better at 84.2% completion, 7.9% with help, zero abandonment, and 7.9% technical failure, whilst benches reached 81.6% completion, 10.5% with help, 2.6% abandonment, and 5.3% technical failure.
The oven and roof events, both visually cued within the environment, achieved the highest success at 86.8% completion without help (10.5% and 5.3% assistance respectively), zero or minimal abandonment, and 2.6-7.9% technical failure, demonstrating that visual signposting substantially improved content discoverability whilst preserving the voice activation pattern. The 2.6-7.9% technical failure rates reflected three causes: ASR transcription errors that prevented intent matching; unintentional triggering, where background conversation or thinking-aloud speech accidentally activated content; and network or server failures that disrupted cloud-based intent classification without local recovery or graceful degradation. Nielsen severity assessment identified six major usability issues, all rated severity 3 (important to fix, high priority): floor event non-activation in 30% of attempts, requiring improved findability through visual references or proactive prompting; walls event recognition failures needing ASR accuracy enhancement; doors event non-activation in 30% of cases, the same discoverability problem as the floor event; benches event recognition failures indicating speech processing limitations; occasional roof event failures revealing similar ASR challenges; and wrong-event triggering in approximately 25% of sessions, where conversational speech or intent classifier ambiguity activated unintended content, highlighting the need for push-to-talk activation rather than a continuous listening mode that captures unintentional speech.
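The per-event outcome shares above can be tabulated and screened programmatically. The sketch below uses the reported percentages (assignment of the two assistance and failure figures to oven versus roof is an assumption from the text); the 60% threshold is an illustrative cut-off, not a project KPI.

```python
# Per-event outcome shares (%) from the validation results.
# Columns: completed unaided, completed with help, abandoned, technical failure.
events = {
    "floor":   (47.4, 21.1, 26.3, 5.3),
    "doors":   (50.0, 21.1, 26.3, 2.6),
    "walls":   (84.2,  7.9,  0.0, 7.9),
    "benches": (81.6, 10.5,  2.6, 5.3),
    "oven":    (86.8, 10.5,  0.0, 2.6),  # oven/roof split is assumed
    "roof":    (86.8,  5.3,  0.0, 7.9),
}

def flag_discoverability(events, threshold=60.0):
    """Return events whose unaided completion falls below a threshold,
    i.e. candidates for added visual cueing."""
    return [name for name, (unaided, *_rest) in events.items()
            if unaided < threshold]

low = flag_discoverability(events)  # exactly the two unlabelled categories
```

Running the screen flags floor and doors, the two deliberately unlabelled categories, matching the qualitative conclusion that visual cueing drove the discoverability gap.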

User Experience Assessment and Educational Value Perception

Post-test feedback was substantially more positive for the VR site augmentation than for the AR welcome avatar, with participants particularly valuing the educational content quality, the sense of immersive presence, and the calm atmospheric presentation aligned with contemplative learning objectives. First impressions emphasised educational effectiveness ("very interesting and educational", "feels like I'm there", "good explanation of the house and building process"), and visual quality and content accuracy received consistent praise, validating the investment in museum expert-curated narratives and archaeologically accurate 3D reconstruction despite a simplified staging environment that limited broader landscape representation. Critical first impressions identified immersion-breaking elements: landscape styling that did not match the Latvian environmental context (vegetation, lighting, and atmospheric conditions differing from the local geography); participants needing help formulating appropriate questions to access desired content, indicating insufficient onboarding or signposting; and a mono-audio configuration issue ("sound only on one ear was strange") that reduced sensory immersion. In structured assessment, 66.7% strongly agreed the VR experience was a positive museum addition, 30.8% agreed, zero were neutral, 2.6% disagreed, and none strongly disagreed, an overwhelming majority substantially exceeding the equivalent AR avatar metric. Appropriateness to the museum context achieved even stronger validation, with 61.5% strong agreement, 35.9% agreement, zero neutral, zero disagreement, and only 2.6% strong disagreement, demonstrating near-universal acceptance of the VR educational application as a culturally fitting heritage interpretation modality.
The Net Promoter Score reached 61 (25 promoters rating 9-10, 13 passives rating 7-8, and a single detractor rating 0-6), placing the experience in the "great" category (30-70) and indicating a high likelihood of visitor recommendation; minor improvements could push performance toward the "excellent" category above the 70 threshold. Voice interaction naturalness drew 23.1% strong agreement, 43.6% agreement, 17.9% neutral, 12.8% disagreement, and 2.6% strong disagreement, whilst efficiency achieved 23.1% strong agreement, 64.1% agreement, 5.1% neutral, 5.1% disagreement, and 2.6% strong disagreement, indicating majority acceptance despite a notable minority who found the interaction unnatural or inefficient. Content relevance and timing reached 33.3% strong agreement, 38.5% agreement, 20.5% neutral, 7.7% disagreement, and zero strong disagreement, whilst information accuracy achieved 46.2% strong agreement, 43.6% agreement, 10.3% neutral, and zero disagreement, validating the pre-scripted content approach as preserving factual correctness without the hallucination risks that undermined Pilot 1's RAG-generated responses. Response speed was rated very acceptable by 25.6%, acceptable by 64.1%, uncertain by 2.6%, not acceptable by 7.7%, and completely unacceptable by none, an 89.7% acceptable-or-better rating that met project KPI thresholds and demonstrated adequate technical performance from the user's subjective perspective.
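For readers unfamiliar with the metric, the reported NPS of 61 follows directly from the distribution above. A minimal sketch of the standard calculation (the whole-number truncation matches the reported score; function name is ours):

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6),
    reported as a truncated whole number."""
    n = len(ratings)
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return int(100 * (promoters - detractors) / n)

# Pilot distribution: 25 promoters, 13 passives, 1 detractor.
ratings = [9] * 25 + [7] * 13 + [5]
nps = net_promoter_score(ratings)  # (25 - 1) / 39 * 100 -> 61
```

Passives count toward the denominator but neither add to nor subtract from the score, which is why 13 mid-range ratings still leave the result firmly in the "great" (30-70) band.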

Behavioural Observations and Usability Friction Points

Direct observation of participant behaviour revealed interaction patterns, expectation mismatches, and friction points that structured surveys alone would not capture. Approximately 40% of participants attempted to ask questions in Latvian despite the English-only deployment being clearly explained during the pre-test briefing, demonstrating a strong preference for mother-tongue interaction that persisted despite known system limitations and reinforcing minority language support as a critical success factor rather than an optional enhancement. Many participants asked questions beyond the six facilitated intents, attempting open-ended architectural queries, historical lifestyle questions, or explorations of cultural practice that the intent classifier could not map to authorised responses. This revealed an expectation that voice interaction implied comprehensive conversational capability, when the implementation supported only narrow pre-defined query categories, and created disappointment when the sophisticated-seeming interface proved more restricted than implied. Visitors frequently looked around the VR environment without speaking, adopting the passive observation behaviour typical of conventional museum visits where exhibits present information without requiring active queries; voice interaction thus introduced an unfamiliar engagement paradigm requiring explicit instruction or interface prompting to transform passive viewers into active questioners driving their own educational narrative. Multiple participants attempted hand gestures despite the VR controllers remaining inactive and extensive pre-test emphasis on voice-only interaction, demonstrating deeply ingrained expectations that VR involves manual manipulation and suggesting voice-only interaction is a sufficiently novel paradigm that users default to familiar patterns until repeated experience establishes new mental models.
Some visitors spoke continuously without realising their speech triggered system responses, causing accidental content activation when conversational remarks or thinking-aloud vocalisations matched intent patterns; others remained silent, waiting for system initiative and uncertain when speaking was appropriate. Both behaviours highlight ambiguity about conversational turn-taking that manual push-to-talk activation could resolve, at the cost of reduced naturalness. Participants reported surprise that they could speak to the system despite the tutorial explanation, embarrassment about speaking volume ("Am I too loud?") reflecting social self-consciousness in a semi-public testing environment, and uncertainty about when to speak, reflecting incomplete understanding of the system's listening state. These observations show that whilst voice interaction removed the manual controller learning burden (a major accessibility advantage for non-gaming populations), it introduced novel interaction literacy requirements around conversational initiative, query formulation, speech volume and clarity calibration, and system state awareness (is it listening? did it understand? will it respond?) that proved non-trivial for approximately 15-25% of visitors, who would need iterative exposure or enhanced interface feedback communicating system status more explicitly than the prototype provided.

Educational Value Assessment and Content Quality Validation

Participant feedback analysis identified educational value and content quality as the primary success factors, substantially exceeding voice interaction mechanics in importance for overall satisfaction and recommendation likelihood. Visitors most appreciated the visualisation quality that enabled mental construction of the building's structure, the sense of presence and spatial immersion of experiencing the historical dwelling at human scale in its original lakeside context, and the clarity and accuracy of explanations that made complex construction methods accessible without prior architectural or archaeological knowledge. The calm presentation, free of time pressure or competing stimuli, enabled contemplative learning at a self-directed pace, in contrast to live tours where group dynamics and guide pacing may not match individual learning preferences. The novelty of voice interaction and VR generated positive engagement among visitors unfamiliar with immersive experiences, differentiating the museum visit in ways that could drive repeat visitation and positive word-of-mouth promotion. Critically, the experience's strengths aligned with Āraiši Ezerpils Archaeological Park's institutional priorities: educational value, family-friendly presentation accessible to adults and children alike, and sensitivity to historical context through archaeologically accurate representation that respects cultural heritage rather than sensationalising or fictionalising the past for entertainment value.
Negative feedback concentrated on interaction modality limitations rather than content quality: restricted content access ("we could only get limited content by asking specific questions"), uncertainty about the system's listening state ("I wasn't sure if there was anyone there to hear me"), language limitation frustration ("it would be better in my native language"), and occasional wrong content triggering ("it did not show me the right content") from intent classification errors or accidental speech recognition. Notably absent were the concerns about educational accuracy, visual quality, or cultural appropriateness that dominated AR avatar feedback, indicating the VR application met heritage sector requirements for factual correctness and institutional integration; the voice interaction mechanics, not the content, represented the primary improvement opportunity. The voice interaction assessment documented pros, notably the elimination of the manual controller learning curve, improving accessibility for visitors without gaming experience or VR familiarity. The cons were a heavy dependency on speech recognition accuracy, where failures blocked content access entirely (unlike manual interfaces offering alternative pathways), and content discoverability that depended on visual cues or on visitor creativity and question-asking persistence, which roughly 30% of the population proved unwilling or unable to sustain for complete content exploration.

Strategic Insights for Voice-Visual Interface Design in Heritage XR

Validation evidence yielded a critical design principle for heritage XR applications requiring content navigation and discovery. Pure voice-only interaction without visual interface elements creates discoverability problems: visitors cannot determine what content exists or how to access it without extensive experimentation, which many abandon before discovering the available material. Hybrid voice-visual interfaces, combining visual menus or content availability indicators with voice activation, preserve discoverability whilst keeping the convenience of hands-free selection and control. The floor and doors categories, deliberately unlabelled to test voice-only discoverability, achieved only 47-50% unassisted activation success compared to 81-87% for the visually cued oven and roof categories, a degradation of roughly 30-40 percentage points when visual affordances were removed, quantifying the cost of pure voice interaction for content navigation. This finding informs a broader interaction design recommendation: voice should augment rather than replace visual interface elements wherever users need to understand available functionality and navigate content hierarchies, with voice activation providing an efficient execution mechanism once visual discovery has established awareness of the interaction possibilities.
The accidental triggering problem, where conversational speech or background remarks activated unintended content in approximately 25% of sessions, highlighted a fundamental ambiguity of continuous-listening voice interfaces: systems cannot reliably distinguish intentional commands from casual speech, thinking aloud, or conversations with co-present companions. The options are push-to-talk activation, sacrificing naturalness for intentional control; sophisticated speaker diarisation and intent confidence thresholding to filter low-probability matches; or accepting false positives as an inherent voice interface characteristic, handled gracefully through easy content dismissal or session reset rather than prevention. Museum stakeholders preferred tightly managed information flow with predictable behaviour over natural interaction flexibility, suggesting heritage contexts prioritise reliability and control over conversational freedom, in contrast to consumer entertainment applications where unpredictability may enhance engagement through novelty and surprise. Recommendations from the analysis included: visual referencing for all discoverable content, so visitors understand the interaction possibility space before attempting voice activation; push-to-talk or explicit activation gestures to prevent conversational speech from triggering unintended responses; progressive disclosure, introducing a limited content subset before expanding to full capability once the visitor demonstrates interaction competency; and hybrid modality support, with voice as the primary interaction and a manual fallback via simple button or gaze-based selection for visitors uncomfortable with speech or experiencing recognition failures.

Performance Metrics and Technical Achievement Analysis

Technical performance monitoring across validation sessions provided quantitative evidence of system capability and reliability under real-world operational conditions with authentic museum visitor populations. End-to-end latency, from voice capture through ASR transcription, cloud transmission, intent classification, Unity event triggering, and animation playback commencement, achieved a median of 1960 milliseconds, mean of 1944 milliseconds, standard deviation of 365 milliseconds, mean absolute deviation of 286 milliseconds, and 95th percentile of 2340 milliseconds, meeting the project KPI of under 2500 milliseconds in 90%+ of cases. Variance was somewhat higher than in Pilot 1, likely attributable to more complex Unity scene state management and animation synchronisation adding processing overhead beyond pure text response generation (from the System Performance Report technical analysis). Speech recognition accuracy for English input from Latvian-accented speakers proved generally robust, with participant feedback praising the system's tolerance for local accent characteristics, though quantitative word error rate metrics were not captured during validation, limiting precise accuracy characterisation. Intent classification mapped natural language questions to authorised categories in the majority of cases: 84-87% success for the high-performing content triggers (walls, benches, oven, roof) when visual cues established visitor awareness of query possibilities, against 47-50% for the unlabelled floor and doors categories, indicating substantial query formulation difficulty when visitors lacked explicit prompting about appropriate phrasing.
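The latency KPI ("under 2500 ms in 90%+ of cases") is straightforward to check against logged measurements. The sketch below uses a small fabricated sample chosen only to roughly echo the reported shape (median near 1960 ms, p95 near 2340 ms); it is not the pilot's raw data, and the function name is ours.

```python
import statistics

def latency_kpi_met(latencies_ms, limit_ms=2500, required_share=0.90):
    """Check the project KPI: end-to-end latency under `limit_ms`
    in at least `required_share` of interactions."""
    within = sum(l < limit_ms for l in latencies_ms) / len(latencies_ms)
    return within >= required_share

# Illustrative sample only, shaped to resemble the reported
# distribution; not the pilot's actual measurements.
sample = [1700, 1820, 1900, 1950, 1960, 1980, 2050, 2150, 2300, 2340]
ok = latency_kpi_met(sample)     # True: every value is under 2500 ms
med = statistics.median(sample)  # 1970.0 for this sample
```

Tracking the KPI this way per session, rather than only as a global mean, also surfaces the variance issue the report notes: a distribution can pass on average yet fail the 90th-percentile requirement.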
GPU infrastructure on the NVIDIA A100 maintained stable performance for intent classification and response coordination while supporting 5-6 concurrent users during validation sessions without degradation, demonstrating horizontal scaling viability for institutional deployment serving multiple visitors simultaneously. Vertical scaling down to consumer-grade hardware for on-premise museum installation would, however, require optimisation: the performance analysis identified processing power as the primary bottleneck for rural heritage sites that lack the high-bandwidth internet connectivity needed for cloud-based inference. Session duration averaged 8-12 minutes from headset donning through tutorial and content exploration to conclusion, comfortably within the 10-15 minute target that enables visitor throughput in smaller exhibition spaces and prevents the fatigue or discomfort that longer immersive experiences can cause, particularly for VR novices susceptible to motion disorientation or sensory overload. No participant reported cybersickness symptoms (nausea, dizziness, disorientation) severe enough to terminate a session prematurely, validating conservative scene design choices: a stationary viewer position without artificial locomotion, smooth camera movements avoiding rapid perspective changes, and static environmental elements without the excessive visual complexity or motion stimuli that can trigger vestibular-visual sensory conflict.

Comparative Value Proposition and Commercial Viability Assessment

The VR site augmentation pilot achieved substantially stronger validation outcomes than the AR welcome avatar across all assessment dimensions, positioning it as the highest-potential application for continued development and commercial deployment. The Net Promoter Score of 61 versus 16 for the avatar represents a 45-point differential, moving the experience from "needs improvement" to "great" and indicating strong visitor satisfaction, with only a single detractor among 39 participants compared to 10 for the avatar application. The added value rating of 3.6 out of 5 for collaborative work, and an implicit 4.0+ for educational content delivery (inferred from the emphasis in qualitative feedback), substantially exceeded the avatar's 3.2 out of 5, validating the hypothesis that immersive spatial experiences for educational content provide a clearer value proposition than AI conversation for routine information access. Praise concentrated on content quality, historical accuracy, and immersive presence rather than voice interaction sophistication, suggesting the VR delivery modality itself (spatial 3D reconstruction at human scale, immersive focus without real-world distraction, contemplative self-paced exploration) provided the primary value, with voice interaction delivering a secondary convenience benefit through controller-free operation rather than serving as the primary value proposition.
This finding proved strategically significant for Culturama Platform development: heritage XR applications should prioritise content quality, archaeological accuracy, and immersive spatial representation as the fundamental value drivers, with voice interaction positioned as an accessibility enhancement and usability optimisation rather than the headline feature or core differentiator, so that development investment concentrates on educational effectiveness and cultural authenticity rather than interaction modality sophistication. The commercial viability assessment, weighing development cost, hardware requirements, operational complexity, and demonstrated user value, suggested that VR educational experiences for complex spatial or technical heritage concepts (archaeological building construction, landscape evolution, historical settlement layouts, conservation processes) represent defensible market positioning, where immersive delivery provides capabilities conventional media cannot replicate; simpler heritage content is better served by photographs, diagrams, or conventional museum interpretation without the immersive overhead. Target customer segments include open-air archaeological museums with reconstructed buildings requiring construction technique interpretation, heritage sites with invisible or destroyed structures benefiting from virtual reconstruction, conservation organisations demonstrating restoration methodologies, and educational institutions teaching archaeological or architectural history, where immersive 3D models enhance spatial understanding beyond two-dimensional drawings or physical scale models.
Pricing must account for Meta Quest 3 hardware costs (approximately 500-600 EUR per unit), content development investment for 3D reconstruction and animation production, server infrastructure if cloud-based intent classification is retained, and ongoing maintenance including content updates and technical support. This requires analysing whether per-visitor value delivery justifies the total cost of ownership compared with conventional interpretation alternatives such as printed guides, audio tours, or digital touchscreen exhibits.
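The per-visitor economics can be framed as a simple total-cost-of-ownership calculation. A sketch under stated assumptions: apart from the 500-600 EUR headset figure, every number below is illustrative, not project data:

```python
def cost_per_visitor(hardware_eur: float, content_eur: float,
                     annual_opex_eur: float, years: int,
                     visitors_per_year: int) -> float:
    """Total cost of ownership spread across expected visitors."""
    total = hardware_eur + content_eur + annual_opex_eur * years
    return total / (visitors_per_year * years)

# Illustrative scenario: two Quest 3 units at ~550 EUR each, hypothetical
# content production and operating costs, amortised over three years.
per_visitor = cost_per_visitor(hardware_eur=2 * 550, content_eur=15_000,
                               annual_opex_eur=2_000, years=3,
                               visitors_per_year=4_000)
print(f"{per_visitor:.2f} EUR per visitor")  # → 1.84 EUR per visitor
```

A comparable figure for printed guides or audio tours would make the break-even comparison the paragraph above calls for.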

Technical Architecture Recommendations and Deployment Patterns

Validation feedback and performance analysis informed refined technical architecture recommendations for heritage VR deployments that balance educational effectiveness, interaction accessibility, and operational sustainability. Visual content referencing for all interactive elements, through in-environment labels, menu systems, or progressive unlocking in which the initial content introduction prompts visitors about additional available queries, eliminates discoverability friction whilst preserving the convenience of voice activation once awareness is established; this hybrid pattern combining visual discovery with voice execution proved more effective than the pure voice-only paradigm tested. Push-to-talk activation using a simple button press, hand gesture, or controller trigger prevents conversational speech from accidentally triggering content. It introduces minor interaction friction, but validation evidence suggests this is an acceptable trade-off for the predictability and control that museum stakeholders prioritise, particularly for experiences deployed in indoor spaces where multiple visitors using the application simultaneously create ambient speech that could trigger neighbouring users' systems. Desktop application variants, enabling access to the VR experience through conventional monitors with mouse-keyboard or gamepad interfaces, would dramatically expand accessibility for heritage institutions unable to justify VR headset procurement whilst maintaining the core educational value through 3D visualisation and voice-activated or manual content control; facilitator observation suggested that desktop mode proved entirely adequate for educational objectives despite reduced immersive presence compared with head-mounted stereoscopic rendering.
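The push-to-talk pattern can be sketched as a small gate that only forwards audio to the recogniser while the trigger is held, so ambient conversation is dropped rather than interpreted as a command. A minimal sketch; class and method names are illustrative, not from the VAARHeT implementation:

```python
from dataclasses import dataclass, field

@dataclass
class PushToTalkGate:
    """Buffers audio frames only while the trigger is held, preventing
    ambient speech from accidentally firing voice commands."""
    listening: bool = False
    buffer: list = field(default_factory=list)

    def trigger_down(self) -> None:
        """Button press, hand gesture, or controller trigger engaged."""
        self.listening = True
        self.buffer.clear()

    def trigger_up(self) -> list:
        """Release: return the captured utterance for recognition."""
        self.listening = False
        captured, self.buffer = self.buffer, []
        return captured

    def on_audio_frame(self, frame: str) -> None:
        if self.listening:   # frames outside the press are discarded
            self.buffer.append(frame)

gate = PushToTalkGate()
gate.on_audio_frame("ambient chatter")   # ignored: trigger not held
gate.trigger_down()
gate.on_audio_frame("show the roof")
print(gate.trigger_up())  # → ['show the roof']
```

The same gate structure works whether the downstream recogniser runs in the cloud or on-device.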
Moving intent classification inference from cloud servers to on-device execution, using optimised Small Language Models compiled for mobile or VR headset platforms, would enable offline operation addressing rural museum connectivity limitations whilst reducing per-interaction server costs and improving long-term economic sustainability. This path requires model compression research and accuracy validation to ensure classification quality is maintained when moving from large cloud-hosted models to resource-constrained edge deployment. A multilingual content architecture supporting parallel language versions of educational narration and UI elements, with runtime language selection, enables progressive market expansion serving diverse European heritage institutions and international visitor populations; it requires investment in translation quality assurance and cultural localisation so that explanatory content remains appropriate and accurate across linguistic contexts, beyond literal word-for-word translation. These architectural refinements, whilst requiring development investment beyond the VAARHeT project scope, establish a clear pathway from research validation to a commercially deployable product that incorporates the lessons learned whilst preserving the validated core value proposition: immersive educational content delivery for complex spatial heritage interpretation.

Implications for Culturama Platform Experiential Learning Focus

The VR site augmentation validation fundamentally informed Culturama Platform development strategy by clearly demonstrating that immersive XR technology provides unique value for experiential heritage interpretation, enabling spatial understanding, temporal reconstruction, and interactive exploration impossible through conventional media. This validated strategic concentration on this application category rather than pursuing a comprehensive platform attempting to serve all heritage digitisation needs regardless of technology appropriateness. The Net Promoter Score of 61 and the overwhelmingly positive reception (97.4% considering the experience a positive or neutral addition to the museum) provided confidence that heritage institutions would adopt, and visitors would value, VR educational experiences for appropriate content categories: archaeological building reconstruction, historical settlement layouts, landscape evolution demonstrating environmental change, conservation and restoration process visualisation, and craft technique demonstration showing sequential procedural steps. Voice activation proved valuable for reducing VR interaction literacy barriers, enabling accessibility for visitors without gaming backgrounds or familiarity with immersive technology. However, validation revealed that hybrid voice-visual interfaces, rather than pure voice-only paradigms, better serve heritage application requirements for content discoverability, interaction predictability, and fallback alternatives when speech recognition limitations prevent voice-only operation.
The finding that desktop interfaces likely provide adequate value for most educational objectives, whilst VR headsets enhance immersion without fundamentally transforming learning outcomes, informs a deployment strategy prioritising multi-modal support: heritage institutions can choose hardware matching their budget constraints and operational capabilities without sacrificing core educational functionality, creating inclusive market positioning that serves resource-constrained regional museums alongside well-funded national institutions. Content quality and archaeological accuracy emerged as paramount success factors, substantially exceeding interaction sophistication in importance for visitor satisfaction and institutional acceptance. This validates development investment priorities that concentrate on curator-validated knowledge, expert-reviewed 3D reconstructions, and authentic historical representation rather than pursuing cutting-edge AI generation or novel interaction paradigms that could compromise the factual correctness heritage sector institutions consider a non-negotiable baseline requirement. Finally, achieving Technology Readiness Level 7 through operational environment validation with authentic visitor populations positioned the VR educational application concept substantially ahead of pure laboratory demonstration or controlled user testing, providing commercially relevant evidence about deployment viability, user acceptance, and operational integration within real museum workflows and visitor experience pathways, the kind of evidence that investment and partnership development conversations require.