Voice AI Space Amsterdam Meetup: The Future of

I attended the Voice AI Space Amsterdam Meetup — the latest stop in a global tour that has already hit Barcelona, London, Dubai, Paris, and NYC. The Voice AI community is growing fast, and Amsterdam was the natural next stop.

Voice AI Space presenter on stage — voiceaispace.com central hub for everything Voice AI

The event was hosted at a venue with impressive production quality — blue-lit stage, professional setup, and a packed room of voice AI builders, researchers, and entrepreneurs.

The Format

Schedule slide — 5:00 PM Networking, 6:00 PM Talks and Demos, 7:15 PM Networking

The meetup followed a proven formula:

5:00 PM — Doors open, networking
6:00 PM — Talks and demos from builders
7:15 PM — Food, drinks, and more networking

A Global Community

Presenter showing Voice AI Space events across the globe — 11 cities highlighted with Amsterdam as newest

Voice AI Space has already run meetups in London, Paris, Barcelona, Bengaluru, Tokyo, Colombo, New York, San Francisco, Singapore, and Dubai — with Amsterdam as the latest addition. The photos from previous events showed packed rooms across continents. This is a truly global community with serious momentum.

Organizer on stage presenting the schedule for the evening

The Talks

Full-body shot of Voice AI Space organizer presenting the platform vision

Matt Coler — Sarcasm Detection in Voice AI (University of Groningen)

Matt Coler on stage — "Your voice AI knows what you say. But not what you mean."

Matt leads a research team at the University of Groningen (Faculty Campus Fryslân) working on one of voice AI’s hardest open problems: getting systems to understand not just what people say, but what they mean.

Close-up of opening slide — University of Groningen, Faculty Campus Fryslân

Sarcasm is a particularly useful lens for probing the limits of pragmatic inference, because catching it means combining speech, text, and vision to recover what any single channel misses. This is fundamental research that will shape how natural future voice assistants feel.

The Multimodal Framework

Multimodal sarcasm analysis — Visual (lifted chin, fixed gaze) + Audio (reduced intensity, low-pitch) + Text, showing interpretive vs literal affective stance

The core insight: sarcasm is not just a label — it is a conflict between what is said and what is meant, only visible through multimodal analysis. Matt showed a framework decomposing utterances into three channels:

Visual — facial cues (lifted chin, minimal brow movement, fixed gaze) → implied affective stance: annoyance, disengagement
Audio — spectrogram analysis (reduced intensity variation, low-pitch variability)
Text — literal content (“Oh my god! You almost gave me a heart attack!”)

The literal reading suggests surprise and alarm. The multimodal reading reveals annoyance and disengagement. Only by combining all three channels can you detect the gap.
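To make the "gap between channels" idea concrete, here is a minimal sketch. It is not the team's model: the `ChannelReading` type, the `arousal` scores, and the 0.5 threshold are all made up for illustration, and a real system would learn these signals rather than hand-code them.

```python
from dataclasses import dataclass


@dataclass
class ChannelReading:
    channel: str    # "text", "audio", or "visual"
    stance: str     # coarse human-readable label
    arousal: float  # hypothetical score in [0, 1], invented for this sketch


def stance_gap(readings):
    """Largest arousal difference between the text channel and any other channel."""
    text = next(r for r in readings if r.channel == "text")
    return max(abs(text.arousal - r.arousal) for r in readings if r.channel != "text")


# Illustrative values only, loosely following the example on the slide.
readings = [
    ChannelReading("text", "surprise, alarm", 0.9),
    ChannelReading("audio", "flat delivery", 0.2),
    ChannelReading("visual", "annoyance, disengagement", 0.3),
]

gap = stance_gap(readings)
is_incongruent = gap > 0.5  # arbitrary threshold chosen for the sketch
print(f"gap={gap:.2f}, incongruent={is_incongruent}")
```

The point of the sketch is only that the signal lives in the disagreement between channels, not in any one channel's label.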

Matt explaining the Chandler example — "Same words. Same speaker. Different tone, different face — completely opposite meaning."

Using a Friends clip, Matt demonstrated how identical words from the same speaker carry completely opposite meaning depending on tone and facial expression. This is the fundamental challenge: sarcasm is where the literal reading and intended meaning are furthest apart.

The Structure of the Problem

Matt explaining modality interactions — Reinforcement, Complementarity, Incongruity — "fusion is not concatenation"

A key insight: modalities do not just add up — they interact. Matt identified three types of interaction:

Reinforcement — prosody and words converge, amplifying the signal
Complementarity — each channel contributes evidence the others cannot
Incongruity — channels diverge in stance (the hallmark of sarcasm)

The critical takeaway: fusion ≠ concatenation. You need interaction-aware fusion with cross-modal attention to capture these dynamics.
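The difference between concatenation and cross-modal attention can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions: the vectors are invented, there are no learned projections, and the function names are mine rather than anything from the talk.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def concat_fusion(text_vecs, audio_vecs):
    """Baseline: mean-pool each modality on its own, then concatenate."""
    def mean_pool(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return mean_pool(text_vecs) + mean_pool(audio_vecs)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )
    return attended


# Toy data: 2 text tokens and 3 audio frames, 2 dimensions each.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_frames = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

fused_concat = concat_fusion(text_tokens, audio_frames)
attended = cross_attention(text_tokens, audio_frames, audio_frames)
print(fused_concat)
print(attended)
```

With concatenation, every text token sees the same pooled audio summary. With cross-attention, each token gets its own weighted view of the audio frames, which is what lets a model notice that a particular word was delivered in a tone that contradicts it.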

The Research: State-of-the-Art Results

Trimodal ablation study — Text+Audio+Visual achieves 74.96% F1 vs 68.84% text-only (ICASSP 2025, Raghuvanshi, Gao et al.)

Matt presented results from an ICASSP 2025 ablation study (Raghuvanshi, Gao et al.) showing that every channel contributes, but none is sufficient alone:

Modality	Precision (%)	Recall (%)	F1 (%)
Text	70.14	69.12	68.84
Audio	67.29	67.28	67.24
Visual	65.41	65.32	65.20
Text + Visual	71.90	71.89	71.86
Audio + Visual	67.67	67.67	67.64
Text + Audio	70.97	69.46	68.95
Text + Audio + Visual	75.23	75.08	74.96

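The trimodal combination beats the best single modality (text, 68.84% F1) by about 6 percentage points and the best bimodal combination (text + visual, 71.86% F1) by about 3 points. The gains are also not additive: adding audio to text alone moves F1 only from 68.84% to 68.95%, while adding audio to text + visual moves it from 71.86% to 74.96%. That pattern supports the claim that the channels interact rather than simply stack.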
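The margins can be checked directly from the F1 column of the table. The snippet below is just arithmetic on the reported numbers; the variable names are mine.

```python
# F1 scores (%) copied from the ablation table above.
f1 = {
    "Text": 68.84,
    "Audio": 67.24,
    "Visual": 65.20,
    "Text + Visual": 71.86,
    "Audio + Visual": 67.64,
    "Text + Audio": 68.95,
    "Text + Audio + Visual": 74.96,
}

trimodal = f1["Text + Audio + Visual"]
unimodal = {k: v for k, v in f1.items() if "+" not in k}
bimodal = {k: v for k, v in f1.items() if k.count("+") == 1}

best_uni = max(unimodal, key=unimodal.get)
best_bi = max(bimodal, key=bimodal.get)

gain_over_uni = round(trimodal - unimodal[best_uni], 2)
gain_over_bi = round(trimodal - bimodal[best_bi], 2)

# What audio adds in two different contexts.
audio_gain_on_text = round(f1["Text + Audio"] - f1["Text"], 2)
audio_gain_on_text_visual = round(trimodal - f1["Text + Visual"], 2)

print(best_uni, gain_over_uni)
print(best_bi, gain_over_bi)
print(audio_gain_on_text, audio_gain_on_text_visual)
```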

Three-paper research trajectory: AMuSeD (74.29%, bimodal) → ICASSP 2025 (74.96%, trimodal) → MCSD 1.0 (76.64%, cross-lingual Mandarin validation)

Matt Coler on stage presenting the research trajectory — bimodal to trimodal to cross-lingual validation

The team’s three-step research trajectory — from bimodal to trimodal to cross-lingual — shows systematic progress:

AMuSeD (IEEE TAFFC 2024) — text + audio, 74.29% F1 (+4.8% over prior state of the art)
ICASSP 2025 — text + audio + vision, 74.96% F1 (+1.66% over prior best trimodal)
MCSD 1.0 (Interspeech 2025) — text + audio + vision + MC, 76.64% F1 with cross-lingual validation in Mandarin

AMuSeD framework — Attentive Multimodal Sarcasm Detection using cross-attention for acoustic cues relative to lexical content

The framework — AMuSeD (Attentive Multimodal Sarcasm Detection) — uses cross-attention so the system interprets acoustic cues relative to lexical content. Sentiment-emotion divergence is modeled as a structured, learnable signal. The core principle: sarcasm exploits the gap between channels.

MCSD 1.0: Cross-Lingual Validation

MCSD 1.0 — "There was no multimodal sarcasm dataset in Mandarin. So we built one." 10.57 hours of video, 139 speakers, open access on DataverseNL

Matt Coler on stage with the MCSD 1.0 slide — "Let's not stop at detection..."

A powerful move: the team built MCSD 1.0, the first multimodal Chinese sarcasm dataset, because “if you can only evaluate on English-language TV sitcoms, you’re not building a pragmatic inference system.” The dataset contains 10.57 hours of video across 139 speakers, with a Fleiss’ κ of 0.74, which counts as substantial inter-annotator agreement on the commonly used Landis and Koch scale. It was published at Interspeech 2025 and is available as open access on DataverseNL (University of Groningen, UG Campus Fryslân), with 170+ downloads already.
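For readers unfamiliar with the agreement statistic, here is a small reference implementation of Fleiss' κ. The rating matrix is toy data invented for the example. It is not taken from MCSD 1.0, and the talk did not say how many annotators labeled each utterance.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of raters
    who put item i in category j. Every item needs the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_chance = sum(p * p for p in proportions)

    return (p_observed - p_chance) / (1 - p_chance)


# Toy data: 4 utterances, 3 raters, columns = [sarcastic, not sarcastic].
toy_ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
perfect_ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]

kappa_toy = fleiss_kappa(toy_ratings)
kappa_perfect = fleiss_kappa(perfect_ratings)
print(round(kappa_toy, 3), kappa_perfect)
```

A value of 1.0 means perfect agreement and 0 means agreement no better than chance, so the reported 0.74 sits well toward the reliable end for a label as subjective as sarcasm.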

The corpus structure includes:

Unique utterance ID for each transcribed segment
Manually verified transcription of the spoken utterance (in Mandarin)
Pseudonymized speaker ID
Emotional label (sarcastic / not sarcastic) for each transcription
Aligned start and end timestamps
Reference to the original publicly available video
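One way to picture a corpus row with those fields is the record type below. The field names and sample values are hypothetical placeholders chosen for this sketch; the actual column names in the DataverseNL release may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    utterance_id: str   # unique ID for the transcribed segment
    transcript: str     # manually verified Mandarin transcription
    speaker_id: str     # pseudonymized speaker ID
    sarcastic: bool     # label: sarcastic / not sarcastic
    start_s: float      # aligned start time, in seconds
    end_s: float        # aligned end time, in seconds
    source_video: str   # reference to the original public video

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def total_hours(utterances):
    """Total labeled audio in hours, summed over segments."""
    return sum(u.duration_s for u in utterances) / 3600


# Placeholder rows, not real corpus entries.
sample = [
    Utterance("utt-0001", "(transcript)", "spk-01", True, 12.0, 15.5, "(video reference)"),
    Utterance("utt-0002", "(transcript)", "spk-02", False, 20.0, 22.5, "(video reference)"),
]

print(sample[0].duration_s, total_hours(sample))
```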

Where the Field Is Heading: Explainability

Slide: "Detection is not enough. The system needs to explain what it understood." — SarcasmMiner, Grounded Audio-Visual Reasoning

Matt’s final thesis: detection is not enough — the system needs to explain what it understood. Classifying sarcasm is one thing; grounding that classification in evidence is what separates a deployed system from a research benchmark. This is where SarcasmMiner comes in — grounded audio-visual reasoning on the MUStARD++ benchmark.

SarcasmMiner: Grounded Reasoning vs Shortcuts

SarcasmMiner case study — SFT baseline and GRPO both predict sarcasm incorrectly; only the multi-modal contextual model gets it right

The case study showed how different models handle the same utterance (“So, this spring, I get to go to the International Space Station. Oh, my word, a trip to the heavens.” — ground truth: Non-Sarcasm):

SFT Baseline (word-based): Incorrectly predicts sarcasm — “A trip to the heavens” is treated as hyperbolic praise
GRPO (speaker-based): Also wrong — uses character knowledge as a shortcut (Wolowitz’s comedic personality)
SarcasmMiner (multi-modal contextual): Correctly predicts non-sarcasm — tone is high-pitched and energetic, facial expression shows genuine awe

Matt pointing at the SFT baseline error — showing how word-based models fail on genuine enthusiasm

Only the multimodal contextual model avoids the trap of using shortcuts (word patterns or character stereotypes) and instead grounds its reasoning in the actual audio-visual evidence.

The Bigger Picture: Beyond Sarcasm

Matt presenting pragmatic inference — sarcasm, hyperbole, understatement, humor, implicature, irony — "We trained ourselves to speak like machines. It's time machines learned how humans actually speak."

The Bigger Picture slide — types of non-literal language: sarcasm ("Oh great, another Monday"), hyperbole, understatement, humor, implicature, irony

Matt closed with a powerful framing: “We trained ourselves to speak like machines. It’s time machines learned how humans actually speak.” Sarcasm is just one test case — human language is full of non-literal expression:

Sarcasm: “Oh great, another Monday” — means the opposite
Hyperbole: “I’ve told you a dozen times” — means “many times”
Understatement: “It’s a bit warm” said in a heat wave
Humor: “I am dying” — means “this is very funny”
Implicature: “It’s cold in here” — means “close the window”
Irony: “What a beautiful day” said in a storm

The broader mission is pragmatic inference — teaching machines to understand what we mean, not just what we say. Each of these categories exploits a different gap between literal content and intended meaning, requiring different detection strategies and multimodal cues.

The Team: Speech Tech Lab, University of Groningen

Speech Tech Lab team — Dr Matt Coler (Associate Professor, Director MSc Speech Technology), Dr Shekhar Nayak, Ms Xiyuan Gao (PhD defense 06/2025), Mr Zhu Li

The research comes from the Speech Tech Lab at the University of Groningen, led by Dr Matt Coler, who is Associate Professor, co-supervisor, and Director of the MSc Speech Technology program. The team includes Dr Shekhar Nayak (Assistant Professor, co-supervisor), Ms Xiyuan Gao (PhD candidate, defense June 2025), and Mr Zhu Li (PhD candidate). Their work includes AMuSeD, MCSD 1.0, and SarcasmMiner, with publications in IEEE TAFFC and at ICASSP 2025 and Interspeech 2025.

Beyond Sitcoms: The Real-World Stakes

Matt Coler presenting the Mueller Report sarcasm case — "A two-year federal investigation. The same detection problem."

Matt made the case that sarcasm detection is not just an academic exercise — it has real legal and political consequences. He used the Mueller Report (Appendix C, Question II(d)) as a case study: a statement claimed to be made “in jest and sarcastically, as was apparent to any objective observer” — yet the video shows a straight face, no laughter, and no sarcastic cues.

Matt with full Mueller Report slide — "The multimodal record contradicts the claim. A system that reliably detects pragmatic stance from voice, face, and context would resolve this."

The conclusion: the multimodal record contradicts the claim. A system that reliably detects pragmatic stance from voice, face, and context would resolve this — turning sarcasm detection from a curiosity into a tool for justice.

When a two-year federal investigation hinges on whether something was sarcastic, you realize how important this detection problem truly is.

Wide shot of Matt Coler on the full stage with dual-screen setup

Thomas Kluiters — Reson8: Customizing Speech Recognition for the Real World

Thomas Kluiters presenting "Why customize?" — showing how different ASR providers mangle the company name "Solvimon"

Thomas Kluiters, Head of AI at Reson8 (a local Amsterdam voice AI company), delivered a deeply practical talk on customizing speech recognition for production use cases. His opening example was immediately relatable: the word “Solvimon” (a billing platform) transcribed by four different ASR providers:

Provider	Output
Target	Solvimon ✅
11Labs	Solvymon ❌
Gladia	Solveemon ❌
Deepgram	sofiemann ❌

The message: off-the-shelf ASR fails on domain-specific vocabulary. If your product relies on correctly recognizing proper nouns (company names, medication names, technical terms), you must customize.
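A quick way to quantify how far each output is from the target is character-level edit distance. The sketch below uses a standard Levenshtein implementation and compares case-insensitively, which is my choice rather than anything stated in the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character from b
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


target = "Solvimon"
outputs = {"11Labs": "Solvymon", "Gladia": "Solveemon", "Deepgram": "sofiemann"}

distances = {
    provider: edit_distance(target.lower(), hyp.lower())
    for provider, hyp in outputs.items()
}

for provider, d in distances.items():
    # Character error rate relative to the target length.
    print(f"{provider}: distance={d}, CER={d / len(target):.2f}")
```

Even a distance of 1 is a miss when the word is a customer's company name, which is why exact-match accuracy on a domain term list is often the more useful metric here.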

Thomas on stage at Voice AI Space — presenting the customization problem with blue stage lighting

Healthcare: Where ASR Errors Are Dangerous

Reson8 healthcare example — "Continue Xarelto 20 mg nightly; hold aspirin" misrecognized as Zaralto, Zyrtec, and "zero alto"

Thomas drove the point home with a healthcare use case: the medication Xarelto (a blood thinner) transcribed by three ASR systems:

ASR A: “Zaralto” — wrong but close
ASR B: “Zyrtec” — completely different medication (an antihistamine)
ASR C: “zero alto” — nonsensical

In a clinical setting, mistaking a blood thinner for an antihistamine is not a transcription error — it is a patient safety risk. This is why customization matters.

Three Approaches to Customization

Reson8 "How to customize?" — Fine-tune vs LLM post-process vs Decoder bias with trade-offs for each

Thomas compared three approaches:

Approach	Pros	Cons
Fine-tune	Best accuracy	Needs in-domain audio; risk of catastrophic forgetting
LLM post-process	Easy to add	Latency and hallucination risk
Decoder bias	Text-only context	Architecture dependent

LLM Post-Processing: Prompt After ASR

Reson8 "Prompt after ASR" — Python code using OpenAI API with GPT-4o for medical ASR spelling correction

Thomas showed a practical code example: using GPT-4o to correct ASR output specifically for medication names. The prompt instructs the model to “Correct ASR spelling only. Use medication names exactly: Xarelto, aspirin. Do not add words. Return transcript only.”

Key trade-offs:

✅ Works well for formatting
⚠️ Can invent plausible entities (hallucination)
❌ Never let it add missing content
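The "never let it add content" rule can be enforced in code rather than trusted to the prompt. Below is a minimal sketch of such a guard. The `llm` argument is a hypothetical stand-in for whatever client call you use (the slide used the OpenAI API with GPT-4o), so no real API is invoked here, and the same-word-count check is a deliberately strict assumption of mine.

```python
ALLOWED_TERMS = ["Xarelto", "aspirin"]


def build_prompt(transcript, terms):
    """Prompt wording follows the instruction quoted from the slide."""
    return (
        "Correct ASR spelling only. Use medication names exactly: "
        + ", ".join(terms)
        + ". Do not add words. Return transcript only.\n\n"
        + transcript
    )


def is_safe_correction(original, corrected, max_changed=2):
    """Accept only outputs with the same word count and a few respelled words."""
    src, out = original.split(), corrected.split()
    if len(src) != len(out):
        return False  # the model added or dropped words
    changed = sum(a.lower() != b.lower() for a, b in zip(src, out))
    return changed <= max_changed


def correct_transcript(transcript, llm, terms=ALLOWED_TERMS):
    """Run the correction and fall back to the raw transcript if the guard fails."""
    candidate = llm(build_prompt(transcript, terms)).strip()
    return candidate if is_safe_correction(transcript, candidate) else transcript


# Fake models for demonstration; a real one would call an LLM API.
def fake_llm(prompt):
    return "Continue Xarelto 20 mg nightly; hold aspirin"


def hallucinating_llm(prompt):
    return "Continue Xarelto 20 mg nightly; hold aspirin for two days"


raw = "Continue Zaralto 20 mg nightly; hold aspirin"
accepted = correct_transcript(raw, fake_llm)
rejected = correct_transcript(raw, hallucinating_llm)
print(accepted)
print(rejected)
```

The strict word-count check would also reject a legitimate fix such as "zero alto" to "Xarelto", since that merges two words into one. A production guard would need a more forgiving alignment, probably combined with a lookup against an approved term list.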

Thomas presenting the LLM post-processing approach on stage

Model Architectures for ASR

Reson8 Model Architectures — Whisper large-v3 (balanced), Parakeet-TDT-0.6B-v3 (encoder heavy), Qwen3-ASR-1.7B/Voxtral (decoder heavy), Voxtral-Mini-3B (speech-conditioned LLM)

Thomas provided a taxonomy of current ASR architectures, starting with the fundamentals: encoders encode what is heard, decoders decode what is said, and customization is generally applied at the decoder layer.

The key architectural parameters:

Whisper large-v3 (balanced): Encoder 637M params, Decoder 907M params — robust offline baseline, prompting supports short context only, long prompts unreliable
NVIDIA Parakeet-TDT-0.6B-v3 (encoder heavy): Encoder 609M params, Decoder only 17M params — fast encoder-heavy ASR with efficient decoder search, biasing happens in search (FastConformer architecture with SpecAug → 8x Conv Subsampling → Linear → Dropout → Conformer Blocks)
Qwen3-ASR-1.7B / Voxtral (decoder heavy): Small encoder, large decoder — leverages LLM text generation capabilities for transcription
Voxtral-Mini-3B (speech-conditioned LLM): Full language model conditioned on speech input

Each architecture implies different customization strategies. Encoder-heavy models like Parakeet respond better to decoder biasing (vocabulary boosting at search time), while decoder-heavy models are more amenable to prompt-based corrections. Balanced models like Whisper support short prompts but break with complex instructions.
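To give a feel for vocabulary boosting, here is a simplified stand-in: rescoring a finished n-best list with a bonus for in-domain terms. Real decoder biasing typically acts on partial hypotheses during the search itself, so treat this as an illustration of the idea only. The hypotheses, scores, and boost value are invented.

```python
def rescore(hypotheses, boost_terms, boost=2.0):
    """Pick the best hypothesis after adding `boost` per boosted term present.

    hypotheses: list of (text, log_score) pairs, higher score is better.
    """
    boosted = {t.lower() for t in boost_terms}

    def biased_score(item):
        text, score = item
        words = set(text.lower().split())
        return score + boost * len(words & boosted)

    return max(hypotheses, key=biased_score)[0]


# Made-up n-best list for the medication example from the talk.
nbest = [
    ("continue zyrtec 20 mg nightly", -4.1),
    ("continue zero alto 20 mg nightly", -4.6),
    ("continue xarelto 20 mg nightly", -5.0),
]

without_bias = rescore(nbest, boost_terms=[])
with_bias = rescore(nbest, boost_terms=["Xarelto"])
print(without_bias)
print(with_bias)
```

The appeal of this family of techniques is that it needs only a text list of terms and no in-domain audio. The cost is that a boost set too high will drag unrelated audio toward the boosted words.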

The clear message from Reson8: the future of production voice AI is not just about picking the best model — it is about building customization pipelines that adapt general-purpose ASR to your specific domain, vocabulary, and safety requirements.

ai-coustics — Real-Time Audio Intelligence

ai-coustics presentation — "Real-time audio intelligence that makes voice AI work in production. Not just in the lab." — speaker on stage with engaged audience

ai-coustics slide — "A FEW WORDS ABOUT AI-COUSTICS: Real-time audio intelligence that makes voice AI work in production. Not just in the lab."

ai-coustics presented their approach to making voice AI production-ready through real-time audio intelligence. Their pitch cuts straight to the core problem: most voice AI works brilliantly in controlled lab environments but falls apart in the real world — background noise, reverb, cross-talk, and varying microphone quality all degrade performance. ai-coustics bridges that gap with audio enhancement that runs in real-time, making downstream voice AI systems (ASR, voice agents, telephony) significantly more robust.

Ahmed Mamdouh Khalil — The Rise and Fall of ASCENSCIA

A refreshingly honest post-mortem: Ahmed shared lessons learned from building an AI voice assistant for scientific labs. The “rise and fall” framing is rare at tech meetups, where most people only share success stories. Understanding why voice AI products fail in specialized domains is just as valuable as knowing what works.

Kevin Litnaël — Live Vibe-Coding a Voice AI Agent with Vapi

Kevin Litnaël, Account Executive at Vapi, delivered the most interactive session of the evening: building a fully functional voice AI agent live on stage using Claude and Vapi’s platform.

Vapi assistant dashboard — GPT-4.1 Mini model, Deepgram nova-3 transcriber, Elliot voice, ~$0.09/min cost, ~1,220ms latency

The agent was configured with GPT-4.1 Mini as the LLM, Deepgram nova-3 for transcription, and the “Elliot” voice — running at approximately $0.09/min with 1,220ms first-response latency. Kevin walked through the full Vapi dashboard showing how each component plugs together.
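Those two dashboard figures are enough for a back-of-the-envelope budget. The sketch below assumes the $0.09/min rate applies uniformly and that every conversational turn incurs roughly the quoted first-response latency. Both are simplifications of mine, and the call volumes are invented.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.09")  # approximate figure shown on the dashboard
FIRST_RESPONSE_LATENCY_MS = 1220   # approximate figure shown on the dashboard


def estimated_cost(calls, avg_minutes):
    """Rough spend in dollars for a number of calls of a given average length."""
    return RATE_PER_MINUTE * calls * Decimal(str(avg_minutes))


def waiting_seconds(turns):
    """Time a caller spends waiting if each turn costs the quoted latency."""
    return turns * FIRST_RESPONSE_LATENCY_MS / 1000


monthly = estimated_cost(calls=1000, avg_minutes=3)
wait = waiting_seconds(turns=10)
print(f"${monthly} for 1,000 three-minute calls; {wait} s of waiting over 10 turns")
```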

Vapi advanced settings — Start Speaking Plan, Smart Endpointing with LiveKit, stop speaking plan (2 words, 0.5s voice, 0.8s backoff)

He then fine-tuned the conversational dynamics: configuring Smart Endpointing (powered by LiveKit), stop speaking plan parameters (Number of Words: 2, Voice Seconds: 0.5, Back Off Seconds: 0.8), and voicemail detection.

Vapi "Call the Agent" slide — phone number +31 97010208315 with Vapi logo

The highlight was the live demo: Kevin gave the audience a real Dutch phone number (+31 97010208315) and invited everyone to call the agent simultaneously, and the platform handled the concurrent calls gracefully. Vapi is emerging as a leading platform for building voice-first AI agents with minimal infrastructure.

Phanos Anastasiou — ChickyTutor Demo

Phanos demoed ChickyTutor, an AI language tutor designed for everyone. Voice-first language learning is a natural fit — pronunciation feedback, conversational practice, and real-time correction all benefit enormously from speech AI.

Deepak Singla — Co-founder and CEO of Fini — "Fini, a multi-modal AI agent for enterprise fintech support"

Deepak Singla, Co-founder and CEO of Fini, demonstrated their multi-modal AI agent built for enterprise fintech support. The live demo showcased a fascinating use case: YEGO, a scooter rental company using Fini’s voice agent to handle customer calls for scenarios like remote unlock, unpaid fines, blocked accounts, missing payment methods, and faulty credit resolution.

Fini Simulations dashboard: YEGO Callers Database showing 6 test scenarios, including remote_unlock_success, remote_unlock_fail_mark_faulty_credit, unpaid_fine, blocked_account, and missing_payment_method

The simulation environment at simulations.usefini.com showed how Fini tests voice agents against a database of caller personas — each with different phone numbers, scooter IDs, lock states, account statuses, and credit balances. This kind of structured evaluation is exactly what enterprise voice AI deployments need.

Fini Workflow Visualizer (Beta) — visual pipeline showing Start → Prelatch Contact Silent → context_tool → Target State for voice → branching to Account Blockers and Custom Router → remote_unlock_tool

The Workflow Visualizer (in beta) revealed Fini’s agent architecture: a visual pipeline starting from initial contact, through context gathering, target state classification, then branching into specialized handlers (Account Blockers, Custom Router) before executing actions like remote_unlock_tool. This visual approach to voice agent orchestration makes complex multi-turn conversations debuggable and maintainable.
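The combination of caller personas and a branching workflow lends itself to table-driven tests. The sketch below is a hypothetical reconstruction of that idea, not Fini's implementation. The scenario and handler names echo what was on screen, but the `Caller` fields and the routing rules are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    scenario: str
    account_blocked: bool = False
    unpaid_fine: bool = False
    has_payment_method: bool = True


def route(caller):
    """Send callers with any account blocker to a dedicated handler first."""
    if caller.account_blocked or caller.unpaid_fine or not caller.has_payment_method:
        return "account_blockers"
    return "remote_unlock_tool"


# Each persona is paired with the handler we expect the agent to reach.
personas = [
    (Caller("remote_unlock_success"), "remote_unlock_tool"),
    (Caller("unpaid_fine", unpaid_fine=True), "account_blockers"),
    (Caller("blocked_account", account_blocked=True), "account_blockers"),
    (Caller("missing_payment_method", has_payment_method=False), "account_blockers"),
]

results = {caller.scenario: route(caller) == expected for caller, expected in personas}
print(results)
```

Keeping the expected outcome next to each persona means a change to the routing logic shows up as a failed scenario rather than as a surprise on a live call.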

Fini YEGO demo projected on stage — full database view showing caller scenarios on the big screen

Voice AI Space: The Central Hub

Close-up of the Voice AI Space branding — Products, News, Knowledge, Jobs, Events at voiceaispace.com

voiceaispace.com positions itself as the central hub for everything Voice AI, covering products, news, knowledge, jobs, and events. With meetups now held in 11 cities worldwide, they are building real community infrastructure for the voice AI ecosystem.

Key Takeaways

Voice AI is maturing fast — from research labs to production deployments in fintech, education, and enterprise support
Sarcasm and pragmatic inference remain unsolved — understanding intent beyond words is the next frontier
European voice AI is thriving — local companies like Reson8 building competitive speech recognition
Honest failure stories create more value than polished success narratives at meetups
Vibe-coding voice agents is now possible — tools like Vapi + Claude enable rapid prototyping on stage

The voice AI space (pun intended) is one of the fastest-growing niches in AI right now. With multimodal models adding native speech capabilities and response latency steadily falling (the Vapi demo ran at roughly 1.2 seconds to first response), we may be approaching a tipping point where voice-first interfaces become the default for many applications.
