Chat Application System Design (like WhatsApp)
Problem Statement
Design a real-time messaging system that supports billions of users with instant message delivery, group conversations, and multimedia sharing capabilities.
Requirements
Functional Requirements
- Send and receive messages in real-time
- One-on-one and group messaging (up to 256 members)
- Message delivery status (sent, delivered, read)
- Online/offline/last seen status
- Message history and search
- File and media sharing (images, videos, documents)
- Message encryption for privacy
- Push notifications for offline users
Non-Functional Requirements
- Availability: 99.99% uptime
- Scalability: 2B total users, 50B messages per day
- Performance: Message delivery < 100ms
- Consistency: Messages delivered in order
- Security: End-to-end encryption
- Storage: 5 years message retention
Capacity Estimation
Traffic
- 2B total users, 1B daily active users
- Average 50 messages per user per day = 50B messages/day
- Peak rate: 3x the average rate = 1.74M messages/second (average ≈ 580K messages/second)
- Connection management: 500M concurrent connections
Storage
- Text Message: ~100 bytes average
- Media Message: ~1MB average (after compression)
- Daily Storage: 50B × 100 bytes = 5TB/day (text only)
- With Media: ~50TB/day, assuming roughly 0.1% of messages carry a ~1MB media file
- 5 Years: ~91PB total storage needed
Bandwidth
- Text Messages: 1.74M × 100 bytes = 174MB/second
- Media Messages: ~0.1% of messages × 1MB ≈ 1.7GB/second during peak
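These figures are easy to sanity-check with a short script; the constants mirror the assumptions above (1B DAU, 50 messages per user, 100-byte texts, 3x peak factor, ~50TB/day retained for 5 years):

```python
# Back-of-envelope check for the traffic and storage estimates above.
DAU = 1_000_000_000
MSGS_PER_USER = 50
MSG_BYTES = 100
PEAK_FACTOR = 3
SECONDS_PER_DAY = 86_400

daily_messages = DAU * MSGS_PER_USER                        # 50B messages/day
peak_rate = daily_messages * PEAK_FACTOR / SECONDS_PER_DAY  # ~1.74M msg/s
daily_text_bytes = daily_messages * MSG_BYTES               # 5 TB/day
five_year_pb = 50e12 * 365 * 5 / 1e15                       # ~91 PB at 50TB/day

print(f"{daily_messages/1e9:.0f}B msgs/day, peak {peak_rate/1e6:.2f}M msg/s")
print(f"text: {daily_text_bytes/1e12:.0f} TB/day; 5-year total ~{five_year_pb:.0f} PB")
```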
High-Level Architecture
[Load Balancer] → [API Gateway] → [Microservices]
↓
[WebSocket Manager] ← → [Message Service] ← → [Message Queue]
↓
[User Service] ← → [Media Service] ← → [File Storage]
↓
[Notification Service] ← → [Database Cluster]
Core Components
1. Connection Management Service
CLASS ConnectionManager:
INITIALIZE:
user_connections = MAP<user_id, websocket_connection>
connection_metadata = MAP<connection_id, user_info>
FUNCTION handle_connection(websocket, user_id):
connection_id = generate_unique_id()
user_connections[user_id] = websocket
connection_metadata[connection_id] = {
user_id: user_id,
connected_at: current_timestamp(),
last_activity: current_timestamp()
}
UPDATE user_presence(user_id, 'online')
FUNCTION send_message_to_user(user_id, message):
IF user_id EXISTS IN user_connections:
SEND message TO user_connections[user_id]
RETURN success
RETURN failure // User offline
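A minimal runnable sketch of this registry, with the WebSocket transport abstracted as a plain send callable so the logic can be exercised without real sockets (class and method names are illustrative):

```python
import time

class ConnectionManager:
    """In-memory registry mapping user_id -> send callable (sketch only).

    A real deployment would hold WebSocket objects and shard this map
    across servers; here the transport is any callable taking a message.
    """
    def __init__(self):
        self.user_connections = {}     # user_id -> send(message) callable
        self.connection_metadata = {}  # user_id -> connection info

    def handle_connection(self, user_id, send_fn):
        self.user_connections[user_id] = send_fn
        self.connection_metadata[user_id] = {
            "connected_at": time.time(),
            "last_activity": time.time(),
        }

    def handle_disconnect(self, user_id):
        self.user_connections.pop(user_id, None)
        self.connection_metadata.pop(user_id, None)

    def send_message_to_user(self, user_id, message):
        send_fn = self.user_connections.get(user_id)
        if send_fn is None:
            return False  # user offline -> caller should queue / push-notify
        send_fn(message)
        return True
```

Using a list's `append` as the send callable is enough to test the online/offline paths without any network dependency.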
2. Message Service
CLASS MessageService:
INITIALIZE:
message_queue = MessageQueue()
database = MessageDatabase()
FUNCTION send_message(sender_id, recipient_id, content, message_type):
message = {
    id: generate_unique_id(),
    conversation_id: get_conversation_id(sender_id, recipient_id),
    sender_id: sender_id,
    recipient_id: recipient_id, // Needed later by the delivery service
    content: content,
    message_type: message_type,
    timestamp: current_timestamp(),
    status: 'sent'
}
// Store in database
database.store_message(message)
// Queue for delivery
message_queue.enqueue('message_delivery', message)
RETURN message
3. Database Schema
// Users table
TABLE users:
id: BIGINT (PRIMARY KEY, AUTO_INCREMENT)
phone_number: STRING (UNIQUE, NOT NULL)
username: STRING
profile_image_url: TEXT
last_seen: TIMESTAMP
status: ENUM('online', 'offline', 'away')
created_at: TIMESTAMP (DEFAULT: CURRENT_TIMESTAMP)
// Conversations table
TABLE conversations:
id: UUID (PRIMARY KEY)
type: ENUM('direct', 'group')
name: STRING // For group chats
created_by: BIGINT
created_at: TIMESTAMP
updated_at: TIMESTAMP
// Messages table (partitioned by conversation_id)
TABLE messages:
id: UUID (PRIMARY KEY)
conversation_id: UUID (FOREIGN KEY)
sender_id: BIGINT (FOREIGN KEY)
content: TEXT
message_type: ENUM('text', 'image', 'video', 'document', 'audio')
file_url: TEXT // For media messages
created_at: TIMESTAMP
edited_at: TIMESTAMP (NULLABLE)
deleted_at: TIMESTAMP (NULLABLE)
INDEXES:
(conversation_id, created_at)
(sender_id)
PARTITIONING: HASH(conversation_id) INTO 100 PARTITIONS
// Message delivery status
TABLE message_status:
message_id: UUID (FOREIGN KEY)
user_id: BIGINT (FOREIGN KEY)
status: ENUM('sent', 'delivered', 'read')
updated_at: TIMESTAMP
PRIMARY KEY: (message_id, user_id)
// Conversation participants
TABLE conversation_participants:
conversation_id: UUID (FOREIGN KEY)
user_id: BIGINT (FOREIGN KEY)
role: ENUM('admin', 'member')
joined_at: TIMESTAMP
left_at: TIMESTAMP (NULLABLE)
PRIMARY KEY: (conversation_id, user_id)
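As a concrete illustration, the messages and message_status tables can be expressed as DDL. The sketch below uses SQLite via Python's stdlib purely for demonstration — SQLite has no ENUM, UUID type, or hash partitioning, so enums become CHECK constraints, UUIDs become TEXT, and partitioning is left to the production store:

```python
import sqlite3

# Illustrative DDL for the message tables above, with SQLite as a stand-in.
DDL = """
CREATE TABLE messages (
    id              TEXT PRIMARY KEY,
    conversation_id TEXT NOT NULL,
    sender_id       INTEGER NOT NULL,
    content         TEXT,
    message_type    TEXT CHECK (message_type IN
                        ('text','image','video','document','audio')),
    file_url        TEXT,
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
    edited_at       TEXT,
    deleted_at      TEXT
);
CREATE INDEX idx_messages_conv_time ON messages (conversation_id, created_at);

CREATE TABLE message_status (
    message_id TEXT NOT NULL,
    user_id    INTEGER NOT NULL,
    status     TEXT CHECK (status IN ('sent','delivered','read')),
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (message_id, user_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute(
    "INSERT INTO messages (id, conversation_id, sender_id, content, message_type)"
    " VALUES ('m1', 'c1', 42, 'hello', 'text')"
)
row = conn.execute(
    "SELECT content FROM messages WHERE conversation_id = 'c1'"
).fetchone()
```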
Detailed Questions & Answers
Q1: How do you ensure real-time message delivery?
Answer: Multi-layered approach with WebSockets and fallback mechanisms:
1. WebSocket Connection Management
CLASS WebSocketManager:
INITIALIZE:
connections = MAP<user_id, websocket>
heartbeat_interval = 30 seconds
FUNCTION maintain_connection(user_id, websocket):
TRY:
WHILE connection_active:
// Send heartbeat
SEND ping TO websocket
WAIT heartbeat_interval
CATCH connection_closed:
CALL handle_disconnect(user_id)
FUNCTION send_message(user_id, message):
IF user_id EXISTS IN connections:
TRY:
SEND message TO connections[user_id]
RETURN success
CATCH connection_error:
// Connection broken, clean up
CALL handle_disconnect(user_id)
RETURN failure
2. Message Queue for Reliable Delivery
CLASS MessageDeliveryService:
INITIALIZE:
message_queue = MessageQueue(servers: [kafka1, kafka2])
FUNCTION deliver_message(message):
recipient_id = message.recipient_id
// Try WebSocket first
IF websocket_manager.send_message(recipient_id, message):
update_status(message.id, 'delivered')
RETURN
// User offline - queue for later delivery
message_queue.enqueue('offline_messages', {
user_id: recipient_id,
message: message,
retry_count: 0
})
// Send push notification
notification_service.send_push(recipient_id, message)
3. Offline Message Handling
CLASS OfflineMessageHandler:
FUNCTION handle_user_online(user_id):
// Deliver queued messages when user comes online
queued_messages = get_queued_messages(user_id)
FOR EACH message IN queued_messages:
success = websocket_manager.send_message(user_id, message)
IF success:
mark_message_delivered(message.id)
remove_from_queue(user_id, message.id)
Q2: How do you handle group messaging efficiently?
Answer: Optimized fanout strategies based on group size:
1. Small Groups (< 100 members) - Direct Fanout
CLASS GroupMessageHandler:
FUNCTION send_group_message(group_id, sender_id, message):
participants = database.get_active_participants(group_id)
// Create message record
message_record = database.store_group_message(group_id, sender_id, message)
// Fan out to all participants
delivery_tasks = []
FOR EACH participant_id IN participants:
IF participant_id != sender_id: // Don't send to sender
task = deliver_to_participant(participant_id, message_record)
ADD task TO delivery_tasks
// Execute deliveries concurrently
EXECUTE ALL delivery_tasks IN PARALLEL
2. Large Groups (100+ members) - Queue-Based Fanout
CLASS LargeGroupHandler:
FUNCTION send_large_group_message(group_id, sender_id, message):
// Store message once
message_record = database.store_group_message(group_id, sender_id, message)
// Queue fanout job
fanout_job = {
group_id: group_id,
message_id: message_record.id,
sender_id: sender_id,
total_participants: COUNT(database.get_participants(group_id))
}
message_queue.enqueue('group_fanout', fanout_job)
// Return immediately to sender
RETURN message_record
FUNCTION process_group_fanout(job):
participants = database.get_active_participants(job.group_id)
batch_size = 1000
// Process in batches to avoid overwhelming the system
FOR i = 0 TO LENGTH(participants) STEP batch_size:
batch = participants[i : i + batch_size]
fanout_to_batch(job.message_id, batch)
WAIT 0.1 seconds // Small delay between batches
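The batch loop above can be sketched in Python, with the delivery step injected as a callable and the inter-batch pause made configurable (all names here are illustrative):

```python
import time

def chunks(items, size):
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_group_fanout(message_id, participants, deliver_batch,
                         batch_size=1000, pause=0.1):
    """Fan a stored message out to participants in batches.

    deliver_batch(message_id, batch) is the injected delivery step; the
    short pause between batches throttles load on downstream services.
    """
    sent = 0
    for batch in chunks(participants, batch_size):
        deliver_batch(message_id, batch)
        sent += len(batch)
        time.sleep(pause)
    return sent
```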
3. Group Message Status Tracking
FUNCTION track_group_message_status(message_id, group_id):
total_participants = database.count_participants(group_id)
status_summary = database.query("
    SELECT status, COUNT(*) AS count
    FROM message_status
    WHERE message_id = :message_id
    GROUP BY status
", message_id)
RETURN {
total: total_participants,
delivered: status_summary['delivered'] OR 0,
read: status_summary['read'] OR 0,
pending: total_participants - SUM(status_summary.values())
}
Q3: How do you implement message ordering and consistency?
Answer: Vector clocks and sequence numbers for ordering:
1. Message Sequence Numbers
CLASS MessageOrderingService:
INITIALIZE:
sequence_counters = MAP<conversation_id, counter>
lock = MUTEX
FUNCTION get_next_sequence(conversation_id):
ACQUIRE lock:
IF conversation_id NOT IN sequence_counters:
// Initialize from database
last_seq = database.get_last_sequence(conversation_id)
sequence_counters[conversation_id] = last_seq + 1
ELSE:
sequence_counters[conversation_id] = sequence_counters[conversation_id] + 1
RETURN sequence_counters[conversation_id]
FUNCTION send_ordered_message(conversation_id, sender_id, content):
sequence_num = get_next_sequence(conversation_id)
message = {
id: generate_unique_id(),
conversation_id: conversation_id,
sender_id: sender_id,
content: content,
sequence_number: sequence_num,
timestamp: current_timestamp()
}
RETURN database.store_message(message)
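A runnable sketch of the allocator, using a mutex exactly as above; the seed map stands in for reading the last persisted sequence from the database on first use. Note this only stays correct if each conversation is owned by a single server (or the counter is moved into the database):

```python
import threading

class SequenceAllocator:
    """Per-conversation monotonically increasing sequence numbers (sketch)."""
    def __init__(self, last_persisted=None):
        # last_persisted simulates the database read on first use
        self._counters = dict(last_persisted or {})
        self._lock = threading.Lock()

    def next_sequence(self, conversation_id):
        with self._lock:
            seq = self._counters.get(conversation_id, 0) + 1
            self._counters[conversation_id] = seq
            return seq
```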
2. Client-Side Ordering
CLASS MessageBuffer:
INITIALIZE:
expected_sequence = MAP<conversation_id, next_expected_sequence>
buffered_messages = MAP<conversation_id, LIST<message>>
FUNCTION receive_message(message):
conversation_id = message.conversation_id
expected_seq = expected_sequence[conversation_id] OR 1
IF message.sequence_number == expected_seq:
// Message is in order, process it
process_message(message)
expected_sequence[conversation_id] = expected_seq + 1
// Check if any buffered messages can now be processed
process_buffered_messages(conversation_id)
ELSE:
// Out of order message, buffer it
IF conversation_id NOT IN buffered_messages:
buffered_messages[conversation_id] = []
buffered_messages[conversation_id].ADD(message)
FUNCTION process_buffered_messages(conversation_id):
    buffered = buffered_messages[conversation_id] OR []
    // Sort buffered messages, then drain everything that is now in order
    SORT buffered BY sequence_number ASCENDING
    WHILE buffered NOT EMPTY AND buffered[0].sequence_number == expected_sequence[conversation_id]:
        process_message(REMOVE_FIRST(buffered))
        expected_sequence[conversation_id] = expected_sequence[conversation_id] + 1
    // Anything left in the buffer is still waiting on a gap
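The client-side logic can be made concrete with a small heap-based buffer; `delivered` stands in for handing messages to the UI, and the structure is otherwise a sketch:

```python
import heapq, itertools

class MessageBuffer:
    """Client-side reordering: hand messages over strictly in sequence
    order, buffering any that arrive early until the gap is filled."""
    def __init__(self, first_expected=1):
        self.expected = first_expected
        self._heap = []                # (seq, tiebreak, message)
        self._tie = itertools.count()  # avoids comparing dicts on seq ties
        self.delivered = []            # in-order output (stand-in for the UI)

    def receive(self, message):
        seq = message["seq"]
        if seq < self.expected:
            return  # duplicate or already shown; drop it
        heapq.heappush(self._heap, (seq, next(self._tie), message))
        # Drain everything now contiguous with what has been shown
        while self._heap and self._heap[0][0] == self.expected:
            _, _, msg = heapq.heappop(self._heap)
            self.delivered.append(msg)
            self.expected += 1
```

The heap keeps the lowest buffered sequence number at the front, so filling a gap drains every message that became deliverable in one pass.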
Q4: How do you handle media file sharing at scale?
Answer: Distributed storage with CDN and optimization:
1. Media Upload Service
CLASS MediaUploadService:
INITIALIZE:
cloud_storage_client = CloudStorageClient()
cdn_url = 'https://cdn.chatapp.com'
FUNCTION upload_media(user_id, file_data, file_type):
// Generate unique file name
file_id = generate_unique_id()
file_extension = get_extension(file_type)
storage_key = "media/" + user_id + "/" + file_id + "." + file_extension
// Compress/optimize based on type
processed_data = process_media(file_data, file_type)
// Upload to cloud storage with encryption
cloud_storage_client.upload_object({
bucket: 'chatapp-media',
key: storage_key,
data: processed_data,
content_type: file_type,
encryption: 'AES256',
metadata: {
uploaded_by: user_id,
upload_time: current_timestamp()
}
})
// Generate CDN URL
media_url = cdn_url + "/" + storage_key
// Store metadata in database
media_record = database.store_media_metadata(
file_id, user_id, storage_key, media_url, file_type, SIZE(processed_data)
)
RETURN media_record
FUNCTION process_media(file_data, file_type):
IF file_type STARTS_WITH 'image/':
RETURN compress_image(file_data)
ELSE IF file_type STARTS_WITH 'video/':
RETURN compress_video(file_data)
ELSE:
RETURN file_data // Documents, audio - no compression
2. Media Message Handling
FUNCTION send_media_message(sender_id, recipient_id, file_data, file_type):
// Upload media first
media_record = media_service.upload_media(sender_id, file_data, file_type)
// Create message with media reference
message = {
id: generate_unique_id(),
conversation_id: get_conversation_id(sender_id, recipient_id),
sender_id: sender_id,
message_type: 'media',
media_id: media_record.id,
media_url: media_record.url,
media_type: file_type,
file_size: media_record.size,
thumbnail_url: media_record.thumbnail_url, // For images/videos
timestamp: current_timestamp()
}
RETURN send_message(message)
3. Progressive Media Loading
CLASS MediaMessageHandler:
FUNCTION display_media_message(message):
media_container = CREATE_UI_ELEMENT('div')
IF message.media_type STARTS_WITH 'image/':
// Show thumbnail first, full image on click
thumbnail = CREATE_UI_ELEMENT('img')
thumbnail.source = message.thumbnail_url
thumbnail.on_click = show_full_image(message.media_url)
media_container.ADD_CHILD(thumbnail)
ELSE IF message.media_type STARTS_WITH 'video/':
// Show video player with poster
video = CREATE_UI_ELEMENT('video')
video.poster = message.thumbnail_url
video.controls = true
video.preload = 'metadata' // Only load metadata initially
video.source = message.media_url
media_container.ADD_CHILD(video)
RETURN media_container
FUNCTION preload_media(conversation_id):
// Preload media for recent messages in background
recent_messages = get_recent_media_messages(conversation_id, 10)
FOR EACH message IN recent_messages:
IF message.media_type STARTS_WITH 'image/':
// Triggers browser/client caching
PRELOAD_IMAGE(message.media_url)
Q5: How do you implement end-to-end encryption?
Answer: Signal Protocol implementation with key management:
1. Key Generation and Exchange
CLASS E2EEncryption:
INITIALIZE:
identity_key = GENERATE_PRIVATE_KEY(X25519)
public_identity_key = GET_PUBLIC_KEY(identity_key)
FUNCTION generate_prekeys(count):
prekeys = []
FOR i = 0 TO count - 1:
private_key = GENERATE_PRIVATE_KEY(X25519)
public_key = GET_PUBLIC_KEY(private_key)
prekey_record = {
id: i,
private_key: private_key,
public_key: public_key,
used: false
}
prekeys.ADD(prekey_record)
RETURN prekeys
FUNCTION initiate_session(recipient_identity_key, recipient_prekey):
// Generate ephemeral key pair
ephemeral_key = GENERATE_PRIVATE_KEY(X25519)
// Perform X3DH (Extended Triple Diffie-Hellman); a one-time prekey would add a fourth DH
dh1 = DIFFIE_HELLMAN_EXCHANGE(identity_key, recipient_prekey)
dh2 = DIFFIE_HELLMAN_EXCHANGE(ephemeral_key, recipient_identity_key)
dh3 = DIFFIE_HELLMAN_EXCHANGE(ephemeral_key, recipient_prekey)
// Derive root key
root_key_material = CONCATENATE(dh1, dh2, dh3)
root_key = DERIVE_KEY_HKDF(root_key_material, algorithm: SHA256, length: 32)
RETURN {
root_key: root_key,
ephemeral_public: GET_PUBLIC_KEY(ephemeral_key)
}
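The "derive root key" step can be sketched with a stdlib-only HKDF (RFC 5869 extract-and-expand). The DH outputs are stand-in byte strings here, since Python's standard library has no X25519; the zero salt and the info label are illustrative choices:

```python
import hashlib, hmac

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step: PRK = HMAC(salt, input key material)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step: chain HMAC blocks until `length` bytes."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def derive_root_key(dh1: bytes, dh2: bytes, dh3: bytes) -> bytes:
    # Concatenate the three DH outputs, then extract-and-expand to 32 bytes
    ikm = dh1 + dh2 + dh3
    prk = hkdf_extract(b"\x00" * 32, ikm)
    return hkdf_expand(prk, b"chatapp-root-key", 32)
```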
2. Message Encryption/Decryption
CLASS MessageEncryption:
INITIALIZE:
session_key = PROVIDED_SESSION_KEY
cipher = CREATE_CIPHER(session_key)
FUNCTION encrypt_message(plaintext):
encrypted_data = cipher.encrypt(plaintext)
RETURN {
ciphertext: BASE64_ENCODE(encrypted_data),
message_keys: get_current_message_keys()
}
FUNCTION decrypt_message(encrypted_message):
TRY:
ciphertext = BASE64_DECODE(encrypted_message.ciphertext)
plaintext = cipher.decrypt(ciphertext)
RETURN plaintext
CATCH decryption_error:
THROW DecryptionError("Failed to decrypt message")
FUNCTION get_current_message_keys():
// Implement Double Ratchet algorithm
// This is simplified - real implementation would be more complex
RETURN {
chain_key: derive_chain_key(),
message_key: derive_message_key()
}
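The simplified key-derivation comment above can be made concrete as one symmetric-ratchet step: each chain key yields a message key plus the next chain key via HMAC with distinct constants. The 0x01/0x02 inputs follow the Signal specification's convention; everything else is a sketch that omits the DH ratchet and header handling:

```python
import hashlib, hmac

def ratchet_step(chain_key: bytes):
    """One symmetric-ratchet step (simplified from the Double Ratchet):
    derive this message's key, then advance the chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key

def derive_message_keys(initial_chain_key: bytes, count: int):
    """Walk the chain forward `count` steps, collecting message keys."""
    keys, ck = [], initial_chain_key
    for _ in range(count):
        mk, ck = ratchet_step(ck)
        keys.append(mk)
    return keys
```

Because the derivation is one-way, compromising a later chain key does not reveal earlier message keys — the forward-secrecy property the ratchet exists to provide.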
Q6: How do you handle message status updates efficiently?
Answer: Batch processing and real-time status propagation:
1. Status Update Batching
CLASS StatusUpdateService:
INITIALIZE:
status_batch = MAP<string, status_update>
batch_size = 1000
flush_interval = 5 seconds
FUNCTION update_message_status(message_id, user_id, status):
key = message_id + ":" + user_id
status_batch[key] = {
message_id: message_id,
user_id: user_id,
status: status,
timestamp: current_timestamp()
}
IF SIZE(status_batch) >= batch_size:
flush_status_updates()
FUNCTION flush_status_updates():
IF status_batch IS EMPTY:
RETURN
updates = VALUES(status_batch)
CLEAR status_batch
// Batch update database
database.batch_update_status(updates)
// Notify senders about status changes
notify_status_changes(updates)
FUNCTION notify_status_changes(updates):
// Group by message sender for efficient notifications
sender_updates = MAP<sender_id, LIST<update>>
FOR EACH update IN updates:
message = database.get_message(update.message_id)
sender_id = message.sender_id
IF sender_id NOT IN sender_updates:
sender_updates[sender_id] = []
sender_updates[sender_id].ADD(update)
// Send status updates to message senders
FOR EACH (sender_id, user_updates) IN sender_updates:
websocket_manager.send_status_update(sender_id, user_updates)
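The batching idea reduces to a map keyed by (message_id, user_id) with last-write-wins coalescing; below is a sketch with the flush step injected as a callable. A production version would also guard against status downgrades (read → delivered arriving out of order) and flush on a timer as well as on size:

```python
class StatusBatcher:
    """Coalesce per-(message, user) status updates and flush in bulk.

    Keyed by (message_id, user_id) so a 'delivered' quickly followed by
    a 'read' collapses into one row; flush_fn stands in for the batch DB
    write plus the sender notifications.
    """
    def __init__(self, flush_fn, batch_size=1000):
        self._pending = {}
        self._flush_fn = flush_fn
        self._batch_size = batch_size

    def update(self, message_id, user_id, status):
        self._pending[(message_id, user_id)] = status
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        updates, self._pending = self._pending, {}
        self._flush_fn(updates)
```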
2. Read Receipt Handling
CLASS ReadReceiptManager:
FUNCTION mark_conversation_read(user_id, conversation_id, last_read_message_id):
// Update user's read position
database.update_read_position(user_id, conversation_id, last_read_message_id)
// Find all unread messages up to this point
unread_messages = database.get_unread_messages(
user_id, conversation_id, last_read_message_id
)
// Batch mark as read
message_ids = EXTRACT_IDS(unread_messages)
status_service.batch_mark_read(message_ids, user_id)
// Notify other participants in group chats
IF is_group_conversation(conversation_id):
notify_group_read_status(conversation_id, user_id, last_read_message_id)
Q7: How do you implement push notifications for offline users?
Answer: Multi-provider notification system with smart delivery:
1. Notification Service Architecture
CLASS PushNotificationService:
INITIALIZE:
apns_client = APNSClient() // iOS
fcm_client = FCMClient() // Android
web_push_client = WebPushClient() // Web browsers
FUNCTION send_message_notification(user_id, message):
user_devices = database.get_user_devices(user_id)
IF user_devices IS EMPTY:
RETURN false
// Prepare notification content
notification = prepare_notification(message)
// Send to all user devices
delivery_tasks = []
FOR EACH device IN user_devices:
task = send_to_device(device, notification)
delivery_tasks.ADD(task)
results = EXECUTE_ALL_PARALLEL(delivery_tasks)
// Track delivery success rates
successful = COUNT_SUCCESSFUL(results)
metrics.record_notification_delivery(user_id, SIZE(user_devices), successful)
RETURN successful > 0
FUNCTION prepare_notification(message):
sender_info = database.get_user_info(message.sender_id)
// Customize notification based on message type
IF message.message_type == 'text':
body = TRUNCATE(message.content, 100)
ELSE IF message.message_type == 'media':
body = "📸 " + sender_info.name + " sent a photo"
ELSE:
body = sender_info.name + " sent a message"
RETURN {
title: sender_info.name,
body: body,
badge: get_unread_count(message.recipient_id),
data: {
conversation_id: message.conversation_id,
message_id: message.id,
type: 'new_message'
}
}
FUNCTION send_to_device(device, notification):
TRY:
IF device.platform == 'ios':
RETURN apns_client.send_notification(device.token, notification)
ELSE IF device.platform == 'android':
RETURN fcm_client.send_notification(device.token, notification)
ELSE IF device.platform == 'web':
RETURN web_push_client.send_notification(device.token, notification)
CATCH exception:
// Log error and potentially remove invalid tokens
handle_notification_error(device, exception)
RETURN false
2. Smart Notification Delivery
CLASS SmartNotificationManager:
FUNCTION should_send_notification(user_id, message):
// Check user notification preferences
prefs = database.get_notification_preferences(user_id)
IF NOT prefs.enabled:
RETURN false
// Check do not disturb hours
user_timezone = prefs.timezone OR 'UTC'
current_time = GET_CURRENT_TIME(user_timezone)
IF prefs.do_not_disturb:
dnd_start = prefs.dnd_start_time
dnd_end = prefs.dnd_end_time
IF is_in_dnd_period(current_time, dnd_start, dnd_end):
RETURN false
// High-priority messages (e.g., group mentions) always notify
IF is_high_priority_message(message):
    RETURN true
// Check if user is currently active
last_activity = presence_service.get_last_activity(user_id)
IF last_activity AND (current_time - last_activity) < 30 seconds:
    RETURN false // User is likely actively using the app
RETURN true
FUNCTION batch_process_notifications():
// Process notifications in batches to avoid overwhelming the system
pending_notifications = database.get_pending_notifications(limit: 10000)
FOR EACH batch IN CHUNKS(pending_notifications, 100):
process_notification_batch(batch)
WAIT 0.1 seconds // Rate limiting
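One detail worth getting right in is_in_dnd_period is that DND windows like 22:00–07:00 cross midnight, which a naive start-to-end comparison silently mishandles. A small sketch:

```python
from datetime import time

def in_dnd_period(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the do-not-disturb window.

    Handles windows that cross midnight (e.g., 22:00-07:00), which is
    the common case for DND settings and an easy bug to miss.
    """
    if start <= end:                  # same-day window, e.g. 13:00-15:00
        return start <= now < end
    return now >= start or now < end  # crosses midnight, e.g. 22:00-07:00
```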
Advanced Scenarios
Scenario 1: Handling Network Partitions
Challenge: Users in different regions lose connectivity to each other but can still connect locally.
Solution:
- Message Queuing: Store messages locally until connectivity restored
- Conflict Resolution: Use vector clocks to resolve message ordering conflicts
- Eventual Consistency: Sync messages when partition heals
- Status Indicators: Show users when messages are pending sync
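The vector-clock comparison behind the conflict-resolution bullet can be sketched in a few lines; a "concurrent" result is a true conflict that needs a deterministic tiebreak (e.g., lowest sender id wins), while the merge is simply an element-wise max after sync:

```python
def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max of two vector clocks (the state after sync)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def vc_compare(a: dict, b: dict) -> str:
    """'before', 'after', 'equal', or 'concurrent' (a true conflict)."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: apply a tiebreak rule
```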
Scenario 2: Celebrity User with Millions of Followers
Challenge: Popular user joins group chat, causing massive fanout.
Solution:
- Rate Limiting: Limit message frequency for high-follower users
- Tiered Delivery: Prioritize active users for immediate delivery
- Async Fanout: Use background jobs for massive group message delivery
- Circuit Breakers: Prevent system overload with automatic throttling
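The rate-limiting bullet can be sketched as a token bucket: the sender spends one token per message, and tokens refill continuously up to a burst cap. Injecting the clock makes the refill logic testable; the numbers are illustrative:

```python
import time

class TokenBucket:
    """Token-bucket limiter for high-fanout senders: `rate` tokens/sec
    refill up to `capacity`; each message spends one token."""
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```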
Scenario 3: Regulatory Compliance (GDPR, CCPA)
Challenge: Handle data deletion requests while maintaining message integrity.
Solution:
- User Data Mapping: Track all user data locations across services
- Cascading Deletion: Remove user data from all systems simultaneously
- Message Anonymization: Replace deleted user content with placeholders
- Audit Logging: Track all data operations for compliance reporting
This comprehensive chat application design covers real-time messaging, scalability, security, and advanced scenarios commonly discussed in system design interviews.