How Gmail Actually Knows What You Want to Type (Spoiler: It's Not Mind Reading)
You know that moment when you're typing an email and Gmail just... knows exactly what you want to say next? You start typing "Thanks for the" and boom – "meeting today" pops up like it's been reading your mind. Plot twist: it's not psychic, it's just really good data preparation.
I've been diving deep into how systems like Smart Compose actually work, and honestly? The real magic isn't in the fancy neural networks everyone talks about. It's in the unglamorous but absolutely crucial work that happens before any AI model even sees the data. Let me walk you through what really goes on behind the scenes.
Why Data Prep is Where the Real Work Happens
Here's the thing – you can have the most sophisticated AI architecture in the world, but if you feed it garbage data, you'll get garbage suggestions. Imagine trying to learn English by only reading text messages from teenagers in 2007 ("omg ur so rite lol"). You'd probably end up suggesting "lol" after every sentence, which... isn't exactly professional email material.
The data preparation pipeline is like having a really good editor who cleans up everything before it goes to print. And for something as context-sensitive as email suggestions, this step isn't just important – it's make-or-break.
The Real Journey: From Messy Emails to Smart Suggestions
Let me break down exactly how raw email data becomes those eerily accurate suggestions, step by step.
Step 1: Gathering the Goods (Data Collection)
First, you need data. Lots of it. For Smart Compose, Google likely uses:
Email data: Think about it – to suggest good email content, you need to understand how people actually write emails. Google trains on its own (heavily anonymized) Gmail data, but outside Google, the famous Enron email dataset (yes, that Enron) is the go-to for research because it's one of the few large-scale email corpora that's publicly available: roughly 500,000 emails from about 150 users.
General text data: You also need broader language understanding. This comes from web crawls, books, news articles – basically anywhere people write well-formed text. Common Crawl, for example, contains petabytes of web data that models train on.
Here's what a raw email might look like in your dataset:
From: sarah.johnson@company.com
To: team@company.com
Subject: Re: Q3 Planning Meeting
Date: 2024-01-15 14:30:00
Hi everyone,
Thanks for the productive meeting this morning. As discussed, I'll be sending out the action items by EOD.
Quick reminder that the budget proposals are due by Friday. Please let me know if you need any help with the templates.
Best,
Sarah
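If you want to pull those headers and that body apart programmatically, Python's standard-library email module already does the parsing for you. Here's a minimal sketch, with the sample message above pasted in as a string:

from email import message_from_string

raw = """From: sarah.johnson@company.com
To: team@company.com
Subject: Re: Q3 Planning Meeting
Date: 2024-01-15 14:30:00

Hi everyone,

Thanks for the productive meeting this morning. As discussed, I'll be sending out the action items by EOD.
"""

msg = message_from_string(raw)          # parse headers + body
print(msg["From"], "->", msg["To"])     # sarah.johnson@company.com -> team@company.com
print(msg["Subject"])                   # Re: Q3 Planning Meeting
body = msg.get_payload()                # the plain-text body, ready for cleaning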
Step 2: Cleaning Up the Mess (Text Cleaning & Normalization)
Raw email data is... well, it's messy. Really messy. Here's what needs to happen:
Privacy scrubbing: You can't just use people's actual email addresses and phone numbers. Everything gets masked:
# Before
"Contact me at john.smith@gmail.com or call 555-123-4567"
# After
"Contact me at [EMAIL] or call [PHONE]"
Language filtering: You need to focus on one language at a time. Using a library like langdetect, you'd filter out:
# Keep this (English)
"Let's schedule a meeting for next week"
# Filter out this (Spanish)
"Vamos a programar una reunión para la próxima semana"
Noise removal: Emails are full of stuff that doesn't help with language modeling:
# Before
"Hey!!! 😊 Can we meet @ 3pm???"
# After
"Hey! Can we meet at 3pm?"
Normalization: Making everything consistent:
# Expanding contractions
"I'll be there" → "I will be there"
"Can't make it" → "Cannot make it"
# Consistent casing
"URGENT: Please Reply" → "urgent: please reply"
Step 3: Teaching Machines to Read (Tokenization)
Computers don't understand words – they understand numbers. So "Hello world" needs to become something like [15496, 995]. But how do you decide what gets what number?
Word-level tokenization is the obvious choice:
"Let's meet tomorrow" → ["Let's", "meet", "tomorrow"] → [1, 2, 3]
But this creates problems. What happens when someone writes "Let'sss meet tomorrowwww"? Those exact words aren't in your vocabulary, so a word-level model can only shrug and emit an unknown-token placeholder.
Subword tokenization (specifically Byte Pair Encoding) is way smarter:
"Let's meet tomorrow" → ["Let", "'s", "meet", "tom", "orrow"] → [1, 2, 3, 4, 5]
Now even if someone writes "tomorrowish", your model can break it down into "tom", "orrow", "ish" and still understand it.
Here's a real example using the Hugging Face tokenizer:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Thanks for the meeting today"
tokens = tokenizer.encode(text, return_tensors='pt')
print(tokens) # tensor([[16509, 329, 262, 3249, 1909]])
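Want to see the "never-seen-before word" trick in action? Ask the tokenizer to split an out-of-vocabulary word directly. I won't swear to the exact pieces GPT-2 picks here, but the point is that it always finds some decomposition instead of giving up:

print(tokenizer.tokenize("tomorrowish"))
# Splits into known subword pieces (something like ['tom', 'orrow', 'ish'])
# rather than mapping the whole word to an unknown-token placeholder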
Step 4: Adding Context (Metadata Processing)
The real magic happens when you add context. The same phrase might be appropriate in one email but completely wrong in another.
Consider these scenarios:
# Scenario 1: Email to your boss
{
    "sender": "employee@company.com",
    "recipient": "boss@company.com",
    "subject": "Project Update",
    "context": "Hi Mr. Johnson, I wanted to update you on the",
    "completion": "progress of the new website redesign"
}

# Scenario 2: Email to your friend
{
    "sender": "you@gmail.com",
    "recipient": "friend@gmail.com",
    "subject": "Weekend plans",
    "context": "Hey! I wanted to update you on the",
    "completion": "party plans for Saturday night"
}
Same starting phrase, completely different completions based on context.
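So how does that context actually reach the model? One common trick – and this is an illustration, not Google's exact recipe – is to flatten the metadata and the message-so-far into a single input sequence, with made-up separator markers like <sep> keeping the fields apart, so the model can condition on all of it at once:

def build_model_input(example):
    # Flatten metadata + message-so-far into one sequence the model can condition on.
    # The <sep> markers are invented separators, purely for illustration.
    return (
        f"subject: {example['subject']} <sep> "
        f"to: {example['recipient']} <sep> "
        f"{example['context']}"
    )

scenario_1 = {
    "recipient": "boss@company.com",
    "subject": "Project Update",
    "context": "Hi Mr. Johnson, I wanted to update you on the",
}

print(build_model_input(scenario_1))
# subject: Project Update <sep> to: boss@company.com <sep> Hi Mr. Johnson, I wanted to update you on the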
Step 5: Creating Training Examples (Data Formatting)
Now comes the clever part. You take each email and create multiple training examples from it. Here's how:
Original email: "Thanks for the meeting today. I'll send out the notes by end of day."
Training examples:
{"context": "Thanks", "target": "for"}
{"context": "Thanks for", "target": "the"}
{"context": "Thanks for the", "target": "meeting"}
{"context": "Thanks for the meeting", "target": "today"}
{"context": "Thanks for the meeting today", "target": "."}
{"context": "Thanks for the meeting today.", "target": "I"}
{"context": "Thanks for the meeting today. I", "target": "'ll"}
Each example teaches the model to predict the next word given the previous context.
Getting Your Hands Dirty: A Real Python Pipeline
Here's what a simplified version of this pipeline might look like:
import pandas as pd
import re
from transformers import GPT2Tokenizer
from langdetect import detect

class EmailDataPipeline:
    def __init__(self):
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def clean_email(self, text):
        # Mask emails and phone numbers
        text = re.sub(r'\S+@\S+', '[EMAIL]', text)
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
        # Basic cleaning (keep [ and ] so the masks above survive)
        text = re.sub(r'[^\w\s.,!?\[\]-]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text.lower()

    def is_english(self, text):
        try:
            return detect(text) == 'en'
        except Exception:
            # langdetect throws on empty or undetectable text
            return False

    def create_training_examples(self, email_data):
        examples = []
        for _, row in email_data.iterrows():
            if not self.is_english(row['body']):
                continue

            cleaned_text = self.clean_email(row['body'])
            tokens = self.tokenizer.encode(cleaned_text)

            # Sliding-window examples: predict each token from its prefix
            for i in range(1, len(tokens)):
                examples.append({
                    'email_id': row['email_id'],
                    'sender': row['sender'],
                    'recipient': row['recipient'],
                    'subject': row['subject'],
                    'context': tokens[:i],
                    'target': tokens[i],
                })
        return examples

# Usage
pipeline = EmailDataPipeline()
email_df = pd.read_csv('email_data.csv')
training_examples = pipeline.create_training_examples(email_df)
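One more step before any training happens: you split those examples into training, validation, and test sets so you can tell whether the model is actually learning patterns or just memorizing emails. Here's a quick sketch using a plain random shuffle – a real pipeline would more likely split by user or by thread to avoid leaking near-duplicate emails across sets:

import random

def split_examples(examples, train_frac=0.8, val_frac=0.1, seed=42):
    # Shuffle once, then carve off train / validation / test slices
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (
        shuffled[:n_train],                 # train
        shuffled[n_train:n_train + n_val],  # validation
        shuffled[n_train + n_val:],         # test
    )

train_set, val_set, test_set = split_examples(training_examples)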
The Devil's in the Details: Advanced Tricks
Once you have the basics down, there are some advanced techniques that can make a huge difference:
Data augmentation: Creating variations of your existing data:
# Original
"Thanks for the meeting today"
# Paraphrased versions
"Thank you for today's meeting"
"Appreciate the meeting today"
"Thanks for meeting today"
Handling formality levels: Your model should suggest different things for different contexts:
# Formal email
"I would appreciate the opportunity to discuss this further"
# Casual email
"Let's chat about this more"
Thread awareness: Understanding email conversations:
# If the previous email said "Can we meet at 3pm?"
# The model should know to suggest replies like:
"Yes, 3pm works for me"
"Sorry, I'm not available at 3pm"
The Reality Check: What Actually Happens at Scale
Here's what this looks like when you're dealing with real-world scale:
- Data volume: Gmail processes billions of emails. Your pipeline needs to handle terabytes of data efficiently.
- Privacy: Everything is anonymized and aggregated. No one at Google is reading your emails to improve Smart Compose.
- Continuous learning: The pipeline runs continuously, incorporating new patterns as language evolves.
- Quality control: Extensive filtering to avoid suggesting inappropriate content.
Wrapping Up: The Unglamorous Hero
Next time Gmail suggests exactly what you want to type, remember that it's not just AI magic. It's the result of a carefully crafted pipeline that:
- Collected millions of emails (anonymized, of course)
- Cleaned and normalized messy text data
- Broke everything down into tokens machines can understand
- Added crucial context about senders, recipients, and subjects
- Created millions of training examples
- Split everything into proper training/validation/test sets
The neural network that actually generates suggestions? That's just the final step. The real work – the work that makes the difference between "this is pretty good" and "this is actually useful" – happens in data preparation.
And honestly? That's where most of the engineering challenges are too. Building a transformer model is well-documented. Building a data pipeline that can handle billions of emails, respect privacy, filter appropriately, and maintain quality at scale? That's where the real engineering happens.
So yeah, Smart Compose isn't reading your mind. It's just really, really good at reading patterns in data. And the people who built those data pipelines? They're the unsung heroes making AI actually useful in our daily lives.