Unicode and Character Encoding: UTF-8, UTF-16, and ASCII Explained
Understand Unicode, UTF-8, UTF-16, and ASCII character encoding. Learn how to handle international text, emojis, and special characters in web applications.
What is Character Encoding?
Character encoding is the process of assigning numbers to characters so computers can store and process text. Different encoding systems determine which characters are available and how they're represented in binary.
ASCII: The Foundation
Overview
ASCII (American Standard Code for Information Interchange) was created in 1963 and uses 7 bits to represent 128 characters, including English letters, digits, punctuation, and control characters.
ASCII Examples
Character | Decimal | Binary   | Hex
----------|---------|----------|------
'A'       | 65      | 01000001 | 0x41
'a'       | 97      | 01100001 | 0x61
'0'       | 48      | 00110000 | 0x30
' '       | 32      | 00100000 | 0x20
'!'       | 33      | 00100001 | 0x21
Limitations
- Only supports English characters
- No accented letters (é, ñ, ü)
- No symbols from other writing systems
- No emojis or modern symbols
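These limits can be checked with a minimal Python sketch that uses only the standard library. `ord` returns the number behind a character, and encoding to ASCII fails for anything outside the 128-character range. The helper name `is_ascii_encodable` is hypothetical and exists only for illustration.

```python
def is_ascii_encodable(text: str) -> bool:
    """Return True if every character in text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# ord() returns the number assigned to a character; chr() reverses it.
print(ord("A"), format(ord("A"), "08b"), hex(ord("A")))  # 65 01000001 0x41
print(chr(97))  # a

# Only the first sample stays within ASCII.
for sample in ["Hello!", "café", "中文", "🚀"]:
    print(sample, is_ascii_encodable(sample))
```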
Unicode: The Universal Character Set
What is Unicode?
Unicode is a universal character set standard that assigns a unique code point to each character across the world's writing systems. As of Unicode 15.0 it defines over 149,000 characters from 161 scripts, plus emojis, mathematical symbols, and more.
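As a rough illustration, Python's standard `unicodedata` module can look up the code point and official name of a character. The helper `describe` is a hypothetical name, and the names returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def describe(char: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    code_point = ord(char)
    # Fall back to a placeholder for characters without a name.
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{code_point:04X} {name}"


for char in "A∞🚀":
    print(char, describe(char))
```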
Unicode Code Points
Character | Unicode | Description
----------|---------|--------------------------------
A         | U+0041  | Latin Capital Letter A
é         | U+00E9  | Latin Small Letter E with Acute
中        | U+4E2D  | CJK Unified Ideograph (Chinese)
🚀        | U+1F680 | Rocket emoji
₹         | U+20B9  | Indian Rupee Sign
∞         | U+221E  | Infinity symbol
Unicode Planes
Unicode is organized into 17 planes, each containing 65,536 code points:
- BMP (Plane 0): U+0000 to U+FFFF - Most common characters
- SMP (Plane 1): U+10000 to U+1FFFF - Historic scripts, emojis
- Planes 2-16: Additional characters and private use
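Because each plane spans exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below assumes single-character input; `plane_of` and `in_bmp` are hypothetical helper names.

```python
def plane_of(char: str) -> int:
    """Return the Unicode plane number (0-16) of a single character."""
    return ord(char) >> 16  # each plane spans 0x10000 (65,536) code points


def in_bmp(char: str) -> bool:
    """True if the character lives in the Basic Multilingual Plane."""
    return plane_of(char) == 0


for char in ["A", "中", "🚀"]:
    print(f"{char} U+{ord(char):04X} plane {plane_of(char)}")
```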
UTF-8: The Web Standard
How UTF-8 Works
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. It's backward compatible with ASCII and is the dominant encoding for web pages (98% of all websites).
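To make the variable-length scheme concrete, here is a hand-written encoder sketched in Python. It is illustrative only (real code should call `str.encode("utf-8")`), and `utf8_encode` is a hypothetical name. The loop cross-checks each result against the built-in codec.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand (illustrative only)."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for char in ["A", "中", "🚀"]:
    encoded = utf8_encode(ord(char))
    assert encoded == char.encode("utf-8")  # matches the built-in codec
    print(char, encoded.hex(" "))
```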
UTF-8 Encoding Examples
Character | Unicode | UTF-8 Bytes | Byte Count
----------|---------|---------------------|------------
A | U+0041 | 41 | 1 byte
é | U+00E9 | C3 A9 | 2 bytes
中 | U+4E2D | E4 B8 AD | 3 bytes
🚀 | U+1F680 | F0 9F 9A 80 | 4 bytes
UTF-8 Advantages
- ASCII compatible: ASCII files are valid UTF-8
- Space efficient: Common characters use fewer bytes
- Self-synchronizing: Easy to find character boundaries
- No byte order issues: No BOM (Byte Order Mark) needed
- Widely supported: Standard on web, Linux, and macOS
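The self-synchronizing property can be sketched in a few lines of Python: continuation bytes always match the bit pattern 10xxxxxx, so any other byte marks the start of a character. The function names are hypothetical, and the sketch assumes the input is already valid UTF-8.

```python
def char_start_offsets(data: bytes) -> list[int]:
    """Return byte offsets where a character starts in valid UTF-8 data."""
    # A byte is a continuation byte only if its top two bits are 10.
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]


def next_boundary(data: bytes, offset: int) -> int:
    """Move forward from an arbitrary offset to the next character start."""
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


data = "a\u00e9中🚀".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
print(char_start_offsets(data))   # [0, 1, 3, 6]
print(next_boundary(data, 4))     # 6 (offset 4 is inside the 3-byte character)
```

This is why a reader that lands in the middle of a UTF-8 stream can recover within at most three bytes.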
UTF-8 in Practice
<!-- HTML -->
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>My Page 你好 🌍</title>
</head>
<body>
<p>Hello World! Bonjour! 你好! مرحبا!</p>
</body>
</html>
// JavaScript
const text = "Hello 世界 🚀";
console.log(text.length); // 11 (UTF-16 code units; 🚀 takes two)
console.log([...text].length); // 10 (code points)
# Python
text = "Hello 世界 🚀"
print(len(text)) # 10 (code points)
print(len(text.encode('utf-8'))) # 17 (bytes)
UTF-16: Windows and Java Standard
How UTF-16 Works
UTF-16 uses 2 or 4 bytes per character. Characters in the BMP use 2 bytes, while characters outside the BMP use 4 bytes (surrogate pairs).
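The surrogate-pair arithmetic is short enough to sketch in Python. The helper `to_surrogate_pair` is a hypothetical name. It subtracts 0x10000 to leave a 20-bit value, then splits that value into two 10-bit halves.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(ord("🚀"))
print(f"{high:04X} {low:04X}")  # D83D DE80

# Cross-check against Python's built-in big-endian UTF-16 codec.
assert "🚀".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```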
UTF-16 Encoding Examples
Character | Unicode | UTF-16 Bytes | Byte Count
----------|---------|-----------------|------------
A | U+0041 | 00 41 | 2 bytes
é | U+00E9 | 00 E9 | 2 bytes
中 | U+4E2D | 4E 2D | 2 bytes
🚀 | U+1F680 | D8 3D DE 80 | 4 bytes (surrogate pair)
UTF-16 Pros and Cons
Advantages:
- Fixed 2-byte width for most characters (BMP)
- Native format for JavaScript strings and Windows
- More compact than UTF-8 for text dominated by BMP characters above U+07FF, such as Chinese, Japanese, and Korean (2 bytes instead of 3)
Disadvantages:
- Not ASCII compatible
- Wastes space for primarily ASCII text
- Byte order issues (requires BOM or agreement)
- Surrogate pairs complicate string operations
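The byte order issue can be seen directly with Python's built-in codecs. This sketch relies on the standard codec names `utf-16-be`, `utf-16-le`, and `utf-16`; the sample text is arbitrary.

```python
import codecs

text = "A中"

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first
print(big_endian.hex(" "))     # 00 41 4e 2d
print(little_endian.hex(" "))  # 41 00 2d 4e

# The plain "utf-16" codec prepends a BOM (U+FEFF) when encoding and
# uses it to pick the byte order when decoding.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert with_bom.decode("utf-16") == text

# Decoding with the wrong byte order silently yields different characters.
print(repr(big_endian.decode("utf-16-le")))
```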
Common Encoding Problems
1. Mojibake (Garbled Text)
Occurs when text is decoded with the wrong encoding:
Original: "café"
Encoded as UTF-8, decoded as Latin-1: "cafÃ©"
Encoded as UTF-8, decoded as Windows-1252: "cafÃ©"
2. Invalid Byte Sequences
# Python example
bad_bytes = b'\xff\xfe'  # 0xFF and 0xFE never appear in valid UTF-8
try:
    text = bad_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(f"Cannot decode: {e}")
# Use the errors parameter to handle invalid bytes
text = bad_bytes.decode('utf-8', errors='replace')  # '\ufffd\ufffd' (two replacement characters)
text = bad_bytes.decode('utf-8', errors='ignore')   # '' (invalid bytes skipped)
3. BOM (Byte Order Mark) Issues
A UTF-8 BOM (bytes EF BB BF) at the start of a file can cause problems in some systems:
// JavaScript - Remove BOM
// After decoding, the BOM appears as the single character U+FEFF
function removeBOM(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}
// Check for BOM in file (fileContent: the file's text, already decoded)
if (fileContent.startsWith('\uFEFF')) {
  fileContent = fileContent.slice(1);
}
Best Practices
1. Always Use UTF-8
- Default to UTF-8 for web applications
- Set charset in HTML meta tags
- Configure web server to send UTF-8 headers
- Save source files as UTF-8
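On the file-handling side, a minimal Python sketch of these defaults passes the encoding explicitly on every read and write instead of relying on the platform default. The file name `sample.txt` is hypothetical. The second half shows the standard `utf-8-sig` codec, which drops a leading UTF-8 BOM if one is present.

```python
import codecs
import os
import tempfile

text = "Hello 世界 🚀"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # Pass encoding explicitly; the platform default is not always UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # Simulate a file saved with a UTF-8 BOM (EF BB BF) by another tool.
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + text.encode("utf-8"))
    with open(path, "r", encoding="utf-8") as f:
        with_bom = f.read()      # starts with U+FEFF
    with open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()   # "utf-8-sig" drops a leading BOM if present

print(round_tripped == text)          # True
print(with_bom.startswith("\ufeff"))  # True
print(without_bom == text)            # True
```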
2. Validate Input Encoding
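// Node.js - Detect encoding (buffer: a Buffer holding the raw input bytes)
const chardet = require('chardet');
const iconv = require('iconv-lite');
// Detection is a statistical guess and may return null; fall back to UTF-8
const encoding = chardet.detect(buffer) || 'UTF-8';
// Decode to a JavaScript string, then re-encode as UTF-8 if needed
const text = iconv.decode(buffer, encoding);
const utf8Buffer = Buffer.from(text, 'utf8');
3. Handle String Length Correctly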
// JavaScript - Correct string length handling
const text = "Hello 🌍";
// Wrong: counts UTF-16 code units
console.log(text.length); // 8
// Better: counts code points
// (emojis built from several code points, such as ZWJ sequences, still count as more than one)
console.log([...text].length); // 7
console.log(Array.from(text).length); // 7
// Iterate by code point
for (const char of text) {
  console.log(char); // Surrogate pairs stay together
}
4. Database Configuration
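-- MySQL: Use utf8mb4 (not utf8!)
-- MySQL's "utf8" is an alias for utf8mb3, which stores at most 3 bytes per character and cannot hold emojis
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE TABLE users (
  name VARCHAR(100) CHARACTER SET utf8mb4
);
-- PostgreSQL: UTF-8 is the usual default, but it depends on how the cluster was initialized, so set it explicitly
-- (add TEMPLATE template0 if the template database uses a different encoding)
CREATE DATABASE mydb ENCODING 'UTF8';
Testing Unicode Support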
Test your application with diverse Unicode characters:
Test strings to use:
- "Hello World" (ASCII)
- "Café résumé" (Latin extended)
- "你好世界" (Chinese)
- "مرحبا بالعالم" (Arabic, RTL)
- "🚀🌍👨‍👩‍👧‍👦" (Emojis, including a ZWJ sequence)
- "𝕳𝖊𝖑𝖑𝖔" (Mathematical alphanumeric)
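The test strings listed above can be exercised with a small round-trip check, sketched here in Python. `TEST_STRINGS` and `survives_round_trip` are hypothetical names. The family emoji is written with explicit escapes so that its zero-width joiners (U+200D) are not lost in copy and paste.

```python
# Sample inputs covering ASCII, Latin, CJK, RTL, emoji, and non-BMP text.
TEST_STRINGS = [
    "Hello World",
    "Caf\u00e9 r\u00e9sum\u00e9",
    "你好世界",
    "مرحبا بالعالم",
    "🚀🌍",
    "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",  # family ZWJ sequence
    "𝕳𝖊𝖑𝖑𝖔",
]


def survives_round_trip(text: str, encoding: str = "utf-8") -> bool:
    """True if text encodes and decodes back to itself without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


for s in TEST_STRINGS:
    print(
        f"{s!r}: {len(s)} code points, "
        f"{len(s.encode('utf-8'))} UTF-8 bytes, "
        f"utf-8 ok={survives_round_trip(s)}, "
        f"latin-1 ok={survives_round_trip(s, 'latin-1')}"
    )
```

Every string should survive UTF-8, while a legacy encoding such as Latin-1 handles only the first two. That makes the same loop a quick way to spot a component that is not Unicode-safe.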