Tone of voice and multilingual generation: How AI learns what 'friendly' means in Spanish
Sarah Miguel Cournane, Data Scientist at Hello Retail, argues that tone-of-voice in AI-generated content is more complex than a single dial. Friendly in Spanish and friendly in German draw on different cultural registers, different slang, and different conventions of formality. Building AI that generates natural-sounding product communications across multiple languages means treating tone and language together as a matrix, each combination carrying its own rules and requiring its own verification.
The tone-of-voice matrix: Language and register intersect
Most conversations about multilingual AI content focus on translation quality. Sarah shifts the frame. When Hello Retail set out to generate product emails across languages, the team quickly found that tone-of-voice and language are inseparable variables. You cannot simply pick a tone (friendly, formal, neutral) and apply it uniformly across Spanish, German, Danish, and Swedish. What reads as warmly informal in one language sounds flippant or even rude in another.
Sarah points to a concrete example from her experience: Spanish does not commonly use the second-person pronoun equivalent of “your highness” in formal writing, while other European languages still carry that register in formal contexts. Danish casual address differs from formal Swedish. The humor and slang available to a friendly Spanish email have no direct equivalents in a friendly German one.
The practical implication is that the question “what tone should this email use?” is two-dimensional. What does friendly mean in Spanish? What does formal mean in Swedish? Each language-and-register combination is its own target. Sarah describes this as a matrix, and that framing shapes how Hello Retail approached the engineering problem.
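One way to picture the matrix is as a lookup keyed by language and register together. The sketch below is illustrative only and is not Hello Retail's implementation: the `ToneSpec` class, its `guidance` and `reviewed` fields, and the `build_matrix` helper are hypothetical names, and the language and register lists simply mirror the ones mentioned above.

```python
from dataclasses import dataclass
from itertools import product

# Languages and tones named in the article; the codes are ISO 639-1.
LANGUAGES = ("es", "de", "da", "sv")
REGISTERS = ("friendly", "neutral", "formal")


@dataclass(frozen=True)
class ToneSpec:
    """One cell of the matrix: a single language-and-register target."""
    language: str
    register: str
    guidance: str = ""      # hypothetical: style notes passed to the generator
    reviewed: bool = False  # hypothetical: has a fluent reviewer signed off?


def build_matrix(languages=LANGUAGES, registers=REGISTERS):
    """Create one independent spec per (language, register) combination."""
    return {
        (lang, reg): ToneSpec(lang, reg)
        for lang, reg in product(languages, registers)
    }


matrix = build_matrix()
print(len(matrix))  # 4 languages x 3 registers = 12 cells
```

Keying on the pair rather than on the register alone makes it hard to reuse "friendly" rules written for one language in another by accident.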
Consumers respond strongly to this distinction. CSA Research found that 76% of online shoppers prefer to buy products with information in their own language, and that preference extends beyond translation to register and cultural appropriateness. An email that is technically in Spanish but tonally off will underperform just as reliably as one that is grammatically broken.
Building quality verification when you cannot read every email
Once the matrix framing is clear, the engineering challenge becomes concrete: how do you verify quality at scale? Sarah describes a system that relies on generative AI to produce the email content, with a substantial verification layer built around it.
With a thousand emails generated at a time, manually reviewing every message in every language is impossible. Sarah is direct about this. She does not speak all the target languages, so the team had to build automated checks that could catch problems without relying on human fluency in each locale.
Several recurring failure modes surfaced during development. Repetition was one: the model would reuse the same words or phrases within a short email, making the output feel mechanical. At the friendly end of the register dial, outputs would sometimes drift into a tone that felt childish rather than warm. At the formal end, the model would occasionally reach for vocabulary associated with legal documents, which is almost never the right register for a product recommendation email.
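Two of these failure modes, repetition and legal-sounding vocabulary, lend themselves to simple automated checks. The following is a minimal sketch under stated assumptions, not the checks Hello Retail built: the function names, the thresholds, and the `LEGALESE` word list are all invented for illustration, and the examples are in English only so they are easy to follow. In practice each locale would need its own lists and thresholds.

```python
import re
from collections import Counter

# Hypothetical word list; a real one would be curated per language.
LEGALESE = {"hereby", "pursuant", "herein"}


def repeated_words(text, min_length=5, max_count=2):
    """Return longer words that appear more than max_count times."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return sorted(w for w, c in counts.items() if c > max_count)


def flagged_terms(text, terms):
    """Return the words in text that appear in a flagged-term list."""
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(words & {t.lower() for t in terms})


email = "Great deals on great shoes. These great shoes ship free."
print(repeated_words(email))                                         # ['great']
print(flagged_terms("We hereby recommend these shoes.", LEGALESE))   # ['hereby']
```

Checks like these do not require the reviewer to read the target language, which is the property the team needed. They are also blunt: a tone that drifts toward childish is harder to catch with word lists alone.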
Each failure mode required a specific constraint added to the generation pipeline. And this is where the trade-off becomes consequential.
The trade-off between constraints and creativity
As more rules govern how the language should sound, the model’s generative space shrinks. Sarah describes this as a snowball effect. Constraints on vocabulary, sentence structure, and register each reduce the range of outputs the model will produce. At some point, the model starts struggling to hit a friendliness or formality target because it is also trying to satisfy a long list of other requirements simultaneously.
This is a genuine tension in applied language generation. Sarah frames it plainly: the more you restrict the system, the less room it has to be expressive. The goal is to find the point where quality verification catches real problems without constraining the output so aggressively that the emails feel flat or formulaic.
There is no clean solution, only calibration. Subtle changes compound. A tweak to vocabulary rules shifts tone. A constraint on sentence structure changes rhythm. What started as a list of simple guardrails becomes a system of interdependencies, and the team has to evaluate the whole matrix of outputs, not just individual rules.
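The snowball effect can be made visible by counting how many candidate outputs survive as each rule is stacked on the previous ones. This is a toy sketch, assuming constraints can be expressed as pass/fail functions over the text; the candidate sentences, the three rules, and the `surviving_counts` helper are all made up for illustration.

```python
def surviving_counts(candidates, constraints):
    """Return how many candidates remain after each constraint is added in turn."""
    remaining = list(candidates)
    counts = []
    for check in constraints:
        remaining = [text for text in remaining if check(text)]
        counts.append(len(remaining))
    return counts


# Hypothetical candidate outputs for one email.
candidates = [
    "Hey! You'll love these sneakers!",
    "We think these sneakers suit you.",
    "Pursuant to your purchase, we recommend sneakers.",
    "Sneakers sneakers sneakers, yay yay!",
]

# Hypothetical rules, each reasonable on its own.
constraints = [
    lambda t: "pursuant" not in t.lower(),  # no legal vocabulary
    lambda t: t.count("!") <= 1,            # at most one exclamation mark
    lambda t: len(t.split()) >= 6,          # minimum length in words
]

print(surviving_counts(candidates, constraints))  # [3, 2, 1]
```

Each rule removes something undesirable, yet together they leave a single, fairly flat sentence. Tracking this kind of count across the whole matrix is one way to notice when the rules are starting to squeeze out the warmth they were meant to protect.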
Why getting tone right is worth the engineering investment
The stakes here are practical. Email remains one of the highest-ROI channels in ecommerce, and Salesforce’s State of the Connected Customer report found that 73% of customers expect companies to understand their unique needs and expectations. That expectation extends to how a brand communicates in email, not just what it recommends.
For a platform generating triggered and newsletter emails across multiple European markets, getting tone-of-voice right is a core requirement. A recommendation engine can identify the correct product for a given shopper; the email still has to earn the click. Language that feels off-register undermines the work done upstream by Product Intelligence. McKinsey research found that faster-growing companies generate 40% more of their revenue from personalization than their slower-growing counterparts, a gap that points directly to execution quality. In email, execution quality starts with language that sounds right.
Sarah’s matrix framing gives engineers a cleaner mental model for structuring the verification problem. Rather than asking “is this email friendly?”, the more useful question is “is this email appropriately friendly for a Spanish-speaking shopper?” Those are different specifications, and they produce different quality checks and different failure criteria.
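In code, the difference between the two questions shows up as which checks run. The sketch below registers checks per language-and-register cell and reports the ones an email fails. It is an assumption-laden illustration, not the production design: the `CHECKS` registry, the check names, and the rules themselves are hypothetical and deliberately crude. They lean only on the fact that Spanish "usted" is the formal form of address and "tú" the informal one.

```python
import re


def words(text):
    """Lowercased set of word tokens; \\w matches accented letters in Python 3."""
    return set(re.findall(r"\w+", text.lower()))


# Hypothetical registry: each cell of the matrix owns its own named checks.
# A check returns True when the email passes.
CHECKS = {
    ("es", "friendly"): {
        "avoids_usted": lambda t: "usted" not in words(t),
    },
    ("es", "formal"): {
        "avoids_tu_pronoun": lambda t: "tú" not in words(t),
    },
}


def failed_checks(email, language, register):
    """Return the names of the checks this email fails for its cell."""
    cell = (language, register)
    if cell not in CHECKS:
        # An unconfigured cell should fail loudly, not pass silently.
        raise KeyError(f"no checks registered for {cell}")
    return [name for name, check in CHECKS[cell].items() if not check(email)]


print(failed_checks("Hola, ¿cómo está usted?", "es", "friendly"))  # ['avoids_usted']
print(failed_checks("Hola, mira estas zapatillas.", "es", "friendly"))  # []
```

The same email can pass in one cell and fail in another, which is exactly what "different specifications" means here.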
The broader lesson is that scaling AI-generated content across languages requires more than a capable generative model. It requires a verification architecture that evaluates outputs against locale-specific criteria, catches failure modes that human reviewers cannot scale to check, and does so without constraining the model so tightly that quality drops in a different direction.
To hear Sarah explore how Product Intelligence and AI work together across the full personalization stack, watch the full Conversations episode with Sarah Miguel Cournane.