Text Processing · February 5, 2024 · 11 min read

Unicode and Character Encoding: UTF-8, UTF-16, and ASCII Explained

Understand Unicode, UTF-8, UTF-16, and ASCII character encoding. Learn how to handle international text, emojis, and special characters in web applications.

What is Character Encoding?

Character encoding is the process of assigning numbers to characters so computers can store and process text. Different encoding systems determine which characters are available and how they're represented in binary.

ASCII: The Foundation

Overview

ASCII (American Standard Code for Information Interchange) was created in 1963 and uses 7 bits to represent 128 characters, including English letters, digits, punctuation, and control characters.

ASCII Examples

Character | Decimal | Binary    | Hex
----------|---------|-----------|-----
'A'       | 65      | 01000001  | 0x41
'a'       | 97      | 01100001  | 0x61
'0'       | 48      | 00110000  | 0x30
' '       | 32      | 00100000  | 0x20
'!'       | 33      | 00100001  | 0x21
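The rows above can be reproduced with a few lines of Python. This is a minimal sketch; `ascii_row` is a hypothetical helper name, not something from a library.

```python
def ascii_row(ch: str) -> tuple:
    """Return (decimal, 8-bit binary, hex) for a single ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        # ASCII only defines code values 0 through 127
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code, format(code, "08b"), f"0x{code:02X}"


for ch in "Aa0 !":
    print(repr(ch), *ascii_row(ch))
```

Passing a character such as 'é' raises ValueError, which mirrors the limitation described in the next section.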

Limitations

Only supports English characters
No accented letters (é, ñ, ü)
No symbols from other writing systems
No emojis or modern symbols

Unicode: The Universal Character Set

What is Unicode?

Unicode is a universal character standard that assigns a unique code point to every character across all writing systems. As of Unicode 15, it includes over 149,000 characters from 161 scripts, covering emojis, mathematical symbols, and more.

Unicode Code Points

Character | Unicode | Description
----------|---------|-------------
A         | U+0041  | Latin Capital Letter A
é         | U+00E9  | Latin Small Letter E with Acute
中        | U+4E2D  | CJK Unified Ideograph (Chinese)
🚀        | U+1F680 | Rocket emoji
₹         | U+20B9  | Indian Rupee Sign
∞         | U+221E  | Infinity symbol
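To look up code points like these yourself, Python's standard `unicodedata` module can be used. The sketch below assumes a hypothetical helper named `describe`; the official names it prints are uppercase and may differ slightly from the informal descriptions in the table.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX OFFICIAL NAME'."""
    # unicodedata.name() accepts a default for characters without a name
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


for ch in "A\u00e9中🚀₹∞":
    print(ch, describe(ch))
```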

Unicode Planes

Unicode is organized into 17 planes, each containing 65,536 code points:

BMP (Plane 0): U+0000 to U+FFFF - Most common characters
SMP (Plane 1): U+10000 to U+1FFFF - Historic scripts, emojis
Planes 2-16: Additional characters and private use
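Because every plane spans exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch (the function name `plane_of` is an assumption made for illustration):

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0 through 16) of a single character."""
    # Each plane covers 0x10000 (65,536) code points, so drop the low 16 bits
    return ord(ch) >> 16


for ch in ["A", "中", "🚀", "\U00020000", "\U0010FFFF"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```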

UTF-8: The Web Standard

How UTF-8 Works

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per code point. It is backward compatible with ASCII and is the dominant encoding for web pages, used by roughly 98% of websites.

UTF-8 Encoding Examples

Character | Unicode | UTF-8 Bytes         | Byte Count
----------|---------|---------------------|------------
A         | U+0041  | 41                  | 1 byte
é         | U+00E9  | C3 A9               | 2 bytes
中        | U+4E2D  | E4 B8 AD            | 3 bytes
🚀        | U+1F680 | F0 9F 9A 80         | 4 bytes
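The byte counts in the table follow from how UTF-8 packs the bits of a code point into a lead byte and up to three continuation bytes. The following hand-written encoder is only an illustrative sketch (in practice you would call `str.encode("utf-8")`); `utf8_encode` is a hypothetical name.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand to show the bit layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "A\u00e9中🚀":
    encoded = utf8_encode(ord(ch))
    # Cross-check the sketch against Python's built-in encoder
    assert encoded == ch.encode("utf-8")
    print(ch, encoded.hex(" ").upper(), len(encoded))
```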

UTF-8 Advantages

ASCII compatible: ASCII files are valid UTF-8
Space efficient: Common characters use fewer bytes
Self-synchronizing: Easy to find character boundaries
No byte order issues: No BOM (Byte Order Mark) needed
Widely supported: Standard on web, Linux, and macOS
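The self-synchronizing property can be seen in code: continuation bytes always have the bit pattern 10xxxxxx, so from any byte offset you can step backward to the start of the current character. A minimal sketch, with `char_start` as a hypothetical helper name:

```python
def char_start(data: bytes, index: int) -> int:
    """Back up from any byte offset to the first byte of the character containing it."""
    # Continuation bytes look like 10xxxxxx; lead bytes and ASCII never do
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "A中🚀".encode("utf-8")  # 41 | E4 B8 AD | F0 9F 9A 80
for offset in range(len(data)):
    print(offset, "->", char_start(data, offset))
```

This assumes the input is valid UTF-8; it does not validate the bytes it walks over.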

UTF-8 in Practice

<!-- HTML -->
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>My Page 你好 🌍</title>
</head>
<body>
  <p>Hello World! Bonjour! 你好! مرحبا!</p>
</body>
</html>

// JavaScript
const text = "Hello 世界 🚀";
console.log(text.length);        // 11 (UTF-16 code units; 🚀 takes 2)
console.log([...text].length);   // 10 (code points)

// Python
text = "Hello 世界 🚀"
print(len(text))                 # 10 (code points)
print(len(text.encode('utf-8'))) # 17 (bytes)

UTF-16: Windows and Java Standard

How UTF-16 Works

UTF-16 uses 2 or 4 bytes per character. Characters in the BMP use 2 bytes, while characters outside the BMP use 4 bytes (surrogate pairs).

UTF-16 Encoding Examples

Character | Unicode | UTF-16BE Bytes  | Byte Count
----------|---------|-----------------|------------
A         | U+0041  | 00 41           | 2 bytes
é         | U+00E9  | 00 E9           | 2 bytes
中        | U+4E2D  | 4E 2D           | 2 bytes
🚀        | U+1F680 | D8 3D DE 80     | 4 bytes (surrogate pair)
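The surrogate pair in the last row comes from simple arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. A sketch of that calculation (`to_surrogates` is a hypothetical name):

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above the BMP into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F680)
print(f"{high:04X} {low:04X}")  # D83D DE80

# Cross-check against the built-in big-endian UTF-16 encoder
assert "🚀".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```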

UTF-16 Pros and Cons

Advantages:

Fixed 2-byte width for most characters (BMP)
Native string format for JavaScript, Java, and Windows APIs
Efficient for languages with many non-ASCII characters

Disadvantages:

Not ASCII compatible
Wastes space for primarily ASCII text
Byte order issues (requires BOM or agreement)
Surrogate pairs complicate string operations
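The space trade-off and the byte order issue are both easy to measure. The sample strings and the `encoded_sizes` helper below are illustrative assumptions:

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("Hello, world"))  # {'utf-8': 12, 'utf-16': 24}
print(encoded_sizes("你好世界"))        # {'utf-8': 12, 'utf-16': 8}

# The same character has different byte sequences depending on byte order,
# which is why UTF-16 data needs a BOM or a prior agreement.
print("A".encode("utf-16-be").hex(" "))  # 00 41
print("A".encode("utf-16-le").hex(" "))  # 41 00
```

Python's plain "utf-16" codec writes a BOM first and then uses the platform's native byte order.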

Common Encoding Problems

1. Mojibake (Garbled Text)

Occurs when text is decoded with the wrong encoding:

Original:                                   "café"
Encoded as UTF-8, decoded as Latin-1:       "cafÃ©"
Encoded as UTF-8, decoded as Windows-1252:  "cafÃ©"
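The garbling above can be reproduced, and in this simple case reversed, by repeating the wrong decode step backward. This is a sketch that assumes you know which wrong encoding was applied, which is often not true in practice:

```python
original = "caf\u00e9"  # "café"

# Encode correctly as UTF-8, then decode with the wrong encoding
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Undo the mistake: re-encode with the wrong encoding, decode as UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The repair only works while no bytes were lost; once the garbled text has been saved through a lossy step (for example with replacement characters), the original cannot be recovered this way.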

2. Invalid Byte Sequences

# Python example
bad_bytes = b'\xff\xfe'  # 0xFF and 0xFE never occur in valid UTF-8
try:
    text = bad_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(f"Cannot decode: {e}")
    # Use the errors parameter to choose a recovery strategy
    text = bad_bytes.decode('utf-8', errors='replace')  # '��' (two U+FFFD replacement characters)
    text = bad_bytes.decode('utf-8', errors='ignore')   # '' (invalid bytes skipped)

3. BOM (Byte Order Mark) Issues

A UTF-8 BOM (bytes EF BB BF) at the start of a file can cause problems in some systems. Once decoded, it appears as the character U+FEFF at the start of the string:

// JavaScript - Remove BOM
function removeBOM(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

// Check for BOM in file
if (fileContent.startsWith('\uFEFF')) {
  fileContent = fileContent.slice(1);
}
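In Python, the same cleanup can be done at decode time with the standard "utf-8-sig" codec, which drops a leading BOM if one is present and otherwise behaves like UTF-8. The sample bytes below are made up for illustration:

```python
import codecs

# Simulated file contents that start with a UTF-8 BOM (EF BB BF)
raw = codecs.BOM_UTF8 + "name,city".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped

print(repr(with_bom[0]))  # '\ufeff'
print(without_bom)        # name,city
```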

Best Practices

1. Always Use UTF-8

Default to UTF-8 for web applications
Set charset in HTML meta tags
Configure web server to send UTF-8 headers
Save source files as UTF-8
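The same rule applies when reading and writing files in code: state the encoding explicitly rather than relying on a default, which in Python can depend on the platform's locale. A small sketch using a temporary directory (the file name is arbitrary):

```python
import tempfile
from pathlib import Path

text = "Hello 世界 🚀"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "greeting.txt"
    # Always pass encoding= so the result does not depend on the machine
    path.write_text(text, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")
    size_on_disk = path.stat().st_size

print(round_tripped == text)  # True
print(size_on_disk)           # 17
```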

2. Validate Input Encoding

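// Node.js - Detect encoding (a heuristic guess; may return null)
const chardet = require('chardet');
const iconv = require('iconv-lite');

// `buffer` is a Buffer holding the raw input bytes
const encoding = chardet.detect(buffer) || 'UTF-8';

// Decode the bytes into a JavaScript string using the detected encoding
const utf8Text = iconv.decode(buffer, encoding);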
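Detection is only a guess, so a stricter alternative is to try UTF-8 first and fall back only when decoding fails. The sketch below assumes Windows-1252 as the fallback, which is a common but arbitrary choice; `decode_utf8_or_fallback` is a hypothetical helper.

```python
def decode_utf8_or_fallback(data: bytes, fallback: str = "cp1252") -> tuple:
    """Return (text, encoding_used), preferring strict UTF-8."""
    try:
        # Strict decoding raises on any invalid byte sequence
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: legacy input is Windows-1252; undefined bytes become U+FFFD
        return data.decode(fallback, errors="replace"), fallback


print(decode_utf8_or_fallback("café".encode("utf-8")))  # ('café', 'utf-8')
print(decode_utf8_or_fallback(b"caf\xe9"))              # ('café', 'cp1252')
```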

3. Handle String Length Correctly

// JavaScript - Correct string length handling
const text = "Hello 🌍";

// Wrong: counts UTF-16 code units
console.log(text.length); // 8

// Better: counts code points (emoji built from several code points still count as more than 1)
console.log([...text].length); // 7
console.log(Array.from(text).length); // 7

// Iterate correctly
for (const char of text) {
  console.log(char); // Works with emojis
}
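"Length" depends on the unit being counted. The Python sketch below reports three different lengths for the same string; `lengths` is a hypothetical helper, and the family emoji is written with escapes to make its seven code points (four emoji joined by three zero-width joiners) visible.

```python
def lengths(text: str) -> dict:
    """Measure a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(lengths("Hello 🌍"))
# {'code_points': 7, 'utf16_units': 8, 'utf8_bytes': 10}

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(lengths(family))
# {'code_points': 7, 'utf16_units': 11, 'utf8_bytes': 25}
```

The family emoji renders as one visible symbol, yet none of the three counts is 1. Counting user-perceived characters requires grapheme cluster segmentation, which Python's standard library does not provide.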

4. Database Configuration

-- MySQL: Use utf8mb4 (not utf8!)
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE users (
  name VARCHAR(100) CHARACTER SET utf8mb4
);

-- PostgreSQL: UTF-8 is the usual default; state it explicitly to be sure
CREATE DATABASE mydb ENCODING 'UTF8';

Testing Unicode Support

Test your application with diverse Unicode characters:

Test strings to use:
- "Hello World" (ASCII)
- "Café résumé" (Latin extended)
- "你好世界" (Chinese)
- "مرحبا بالعالم" (Arabic, RTL)
- "🚀🌍👨‍👩‍👧‍👦" (Emojis with ZWJ sequences)
- "𝕳𝖊𝖑𝖑𝖔" (Mathematical alphanumeric)
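A simple starting point is a round-trip check: encode each test string, decode it again, and confirm nothing changed. This sketch only exercises Python's codecs; in a real application you would push the same strings through your own storage and display layers. `round_trips` is a hypothetical helper.

```python
TEST_STRINGS = [
    "Hello World",
    "Café résumé",
    "你好世界",
    "مرحبا بالعالم",
    "🚀🌍\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
    "𝕳𝖊𝖑𝖑𝖔",
]


def round_trips(text: str, encoding: str = "utf-8") -> bool:
    """True if encoding then decoding returns the identical string."""
    return text.encode(encoding).decode(encoding) == text


for s in TEST_STRINGS:
    print(round_trips(s, "utf-8"), round_trips(s, "utf-16"), s)
```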
