UTF-8 Encoder Free Online Tool by Testsigma
What Is UTF-8 Encoding?
UTF-8 (Unicode Transformation Format – 8-bit) is the world's dominant character encoding standard. As of 2026, it powers nearly 99% of all web pages transmitted across the internet, and it is the only encoding natively supported by the browser's built-in TextEncoder API. Unlike fixed-width encodings such as UTF-32, UTF-8 uses a variable number of bytes per character — ranging from one byte for standard ASCII characters up to four bytes for emoji, rare scripts, and supplementary Unicode symbols.
UTF-8 achieves backward compatibility with ASCII by encoding the first 128 Unicode characters in a single byte using identical binary values. This means any ASCII text file is also a valid UTF-8 file, which is a key reason the encoding has become the universal default for web development, APIs, databases, and software internationalisation.
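The ASCII compatibility described above can be verified directly with the browser-native TextEncoder API. A minimal sketch:

```javascript
// Verify that ASCII text encodes to the same byte values under UTF-8.
const encoder = new TextEncoder();

const ascii = "Hi!";                 // all code points <= U+007F
const bytes = encoder.encode(ascii); // Uint8Array of UTF-8 bytes

// Each byte equals the character's ASCII/Unicode code point.
for (let i = 0; i < ascii.length; i++) {
  console.assert(bytes[i] === ascii.charCodeAt(i));
}
console.log(Array.from(bytes)); // [72, 105, 33]
```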
UTF-8 Encode — Free Online Tool
Testsigma's UTF-8 Encoder converts any text string into its raw UTF-8 byte representation using the browser-native TextEncoder.encode() method, which returns a Uint8Array of UTF-8 bytes directly. All encoding happens client-side — no data is sent to any server. The tool is free, requires no login, and works in any modern browser.
Features of This UTF-8 Encoder
Multiple Output Formats
The tool goes beyond a basic "encoded text" result. You can view the UTF-8 byte sequence in four developer-friendly formats:
- Hex — e.g., E2 82 AC for the Euro sign € — the most common format for debugging and protocol analysis
- Decimal — raw byte values as base-10 integers, e.g., 226 130 172
- Binary — full 8-bit representation for low-level bit pattern inspection
- JavaScript Uint8Array style — e.g., Uint8Array(3) [226, 130, 172]
Configurable Output Formatting
Separator, prefix, and letter case are all adjustable so the output can be pasted directly into your codebase or test suite without manual reformatting:
- Separators: space, comma, newline, none, or a custom separator of your choice
- Byte prefixes: none, 0x, %, or \x
- Hex case: uppercase (E2 82 AC) or lowercase (e2 82 ac)
Live Auto-Encoding
The encoder updates its output in real time as you type. A manual Encode button is also available for users who prefer explicit action before output is generated.
Byte Statistics Panel
Three counters update automatically with every keystroke:
| Metric | What it tells you |
|---|---|
| Character count | Total Unicode characters in the input (using Array.from() to correctly handle surrogate pairs and emoji) |
| UTF-8 byte count | Total bytes the encoded string occupies in memory or transmission |
| ASCII character count | Number of single-byte characters (code point ≤ U+007F) to distinguish them from multibyte characters |
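All three counters can be reproduced with a few lines of browser-native JavaScript. A sketch, where `stats` is a hypothetical helper name:

```javascript
// Compute the three statistics the panel shows for a given input string.
function stats(text) {
  const chars = Array.from(text);               // correctly splits surrogate pairs/emoji
  const bytes = new TextEncoder().encode(text); // raw UTF-8 bytes
  const ascii = chars.filter(c => c.codePointAt(0) <= 0x7F).length;
  return { characters: chars.length, utf8Bytes: bytes.length, asciiChars: ascii };
}

console.log(stats("A€😀"));
// { characters: 3, utf8Bytes: 8, asciiChars: 1 }
```

Note that `Array.from` counts the emoji as one character, while `"A€😀".length` would report 4 because of UTF-16 surrogate pairs.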
Character-Level Breakdown Table
Every character in your input is listed individually in a table showing its Unicode code point, UTF-8 hex bytes, and byte length. This is especially useful for identifying which characters are consuming extra bytes — for example, € is U+20AC and expands to three bytes (E2 82 AC), while A is U+0041 and stays one byte (41).
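A per-character breakdown like this table can be built from `codePointAt` and TextEncoder. A minimal sketch, where `breakdown` is a hypothetical helper name:

```javascript
// Build a per-character breakdown: code point, UTF-8 hex bytes, byte length.
const encoder = new TextEncoder();

function breakdown(text) {
  return Array.from(text).map(ch => {
    const bytes = encoder.encode(ch);
    return {
      char: ch,
      codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
      hex: Array.from(bytes)
        .map(b => b.toString(16).toUpperCase().padStart(2, "0"))
        .join(" "),
      byteLength: bytes.length,
    };
  });
}

console.table(breakdown("A€"));
// A → U+0041, hex 41, 1 byte;  € → U+20AC, hex E2 82 AC, 3 bytes
```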
Copy, Download, and Clear Actions
- Copy output — writes the formatted byte string to your clipboard
- Download — saves the output as a .txt file for use in test fixtures, scripts, or documentation
- Clear — resets both input and output with a single click
Shareable URL State
The current input text is persisted in the page's query string (e.g., ?text=Hello). This means you can bookmark or share a specific encoding scenario with teammates without them needing to re-enter the input.
Empty-State Helper Text
When no input is present, descriptive placeholder text guides the user on what to enter and what to expect from the output panel — improving usability for first-time visitors.
How UTF-8 Encoding Works
UTF-8 encodes each Unicode character into one to four bytes based on the character's code point range:
| Code point range | Bytes used | Example |
|---|---|---|
| U+0000 – U+007F | 1 byte | A → 41 |
| U+0080 – U+07FF | 2 bytes | é → C3 A9 |
| U+0800 – U+FFFF | 3 bytes | € → E2 82 AC |
| U+10000 – U+10FFFF | 4 bytes | 😀 → F0 9F 98 80 |
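The range table above translates directly into a byte-length lookup. A sketch, where `utf8ByteLength` is a hypothetical helper name:

```javascript
// Byte length of a single code point under UTF-8, per the ranges above.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;   // U+0000 – U+007F
  if (codePoint <= 0x7FF) return 2;  // U+0080 – U+07FF
  if (codePoint <= 0xFFFF) return 3; // U+0800 – U+FFFF
  return 4;                          // U+10000 – U+10FFFF
}

console.log(utf8ByteLength("A".codePointAt(0)));  // 1
console.log(utf8ByteLength("é".codePointAt(0)));  // 2
console.log(utf8ByteLength("€".codePointAt(0)));  // 3
console.log(utf8ByteLength("😀".codePointAt(0))); // 4
```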
The bit patterns that drive this encoding are structured so that leading bytes and continuation bytes are always distinguishable. A leading byte in a two-byte sequence always starts with 110, three-byte sequences start with 1110, and four-byte sequences start with 11110. Continuation bytes always begin with 10. This self-synchronising design makes it impossible to misidentify a continuation byte as a leading byte, which prevents the corruption issues that affected earlier encodings like UTF-1.
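These bit patterns can be checked with simple bit masks. A sketch, where `classifyByte` is a hypothetical helper name:

```javascript
// Classify a UTF-8 byte by its high-order bit pattern.
function classifyByte(b) {
  if ((b & 0b10000000) === 0b00000000) return "ascii";        // 0xxxxxxx
  if ((b & 0b11000000) === 0b10000000) return "continuation"; // 10xxxxxx
  if ((b & 0b11100000) === 0b11000000) return "lead-2";       // 110xxxxx
  if ((b & 0b11110000) === 0b11100000) return "lead-3";       // 1110xxxx
  if ((b & 0b11111000) === 0b11110000) return "lead-4";       // 11110xxx
  return "invalid";
}

// € encodes as E2 82 AC: one three-byte lead followed by two continuations.
console.log([0xE2, 0x82, 0xAC].map(classifyByte));
// ["lead-3", "continuation", "continuation"]
```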
The TextEncoder interface in modern browsers exposes this behaviour directly. Its encoding property always returns the string "utf-8" — it is the only encoding the API supports, making it a reliable, zero-dependency way to produce accurate UTF-8 byte sequences in any JavaScript environment.
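This property is trivial to confirm in any console:

```javascript
// TextEncoder supports exactly one encoding; the property reflects that.
const encoder = new TextEncoder();
console.log(encoder.encoding); // "utf-8"
```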
Why Developers Use a UTF-8 Encoder
Debugging Encoding Issues in APIs and Databases
When text passes through multiple systems — a frontend form, a REST API, a relational database — encoding mismatches can silently corrupt data. Inspecting the raw bytes of a problem string lets you determine whether the corruption originated at the source, during transmission, or at storage. Viewing a character's hex bytes alongside its code point is a standard debugging technique in these scenarios.
Testing Input Validation and Security Boundaries
UTF-8 encoding is directly relevant to cross-browser and security testing. By inspecting byte values, testers can identify homoglyphs — characters that look visually identical to ASCII characters but carry different Unicode code points — and detect whether an application's input validation is checking characters or raw bytes. This matters for bypass testing, fuzzing, and verifying that character-length and byte-length constraints behave as expected.
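Byte inspection makes homoglyphs easy to spot. For example, Latin "a" (U+0061) and the visually identical Cyrillic "а" (U+0430) produce different UTF-8 bytes. A sketch, where `toHex` is a hypothetical helper:

```javascript
// Latin "a" and its Cyrillic homoglyph encode to different UTF-8 bytes.
const encoder = new TextEncoder();
const toHex = s => Array.from(encoder.encode(s))
  .map(b => b.toString(16).toUpperCase().padStart(2, "0"))
  .join(" ");

console.log(toHex("a"));      // "61"    — one ASCII byte (U+0061)
console.log(toHex("\u0430")); // "D0 B0" — two-byte Cyrillic homoglyph (U+0430)
```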
Internationalisation (i18n) and Localisation (l10n)
Software targeting a global audience must handle multibyte character sets correctly. UTF-8 is the recommended encoding for HTML, XML, JSON, and HTTP payloads, and it supports every language in active use today — from Latin scripts to Arabic, Chinese, Japanese, Korean, and Devanagari. Developers use an encoder to confirm that non-Latin strings produce the expected byte sequences before those strings are embedded into APIs, databases, or test data sets.
Verifying Protocol and Header Constraints
HTTP headers, cookies, and JWT tokens operate on byte lengths, not character counts. A string that appears to be 10 characters long can occupy 30 bytes in UTF-8 if every character falls in the three-byte range (U+0800 – U+FFFF), as the Euro sign does. Testsigma's byte count display makes this distinction visible at a glance.
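The gap between character count and byte count is easy to demonstrate:

```javascript
// Ten Euro signs: 10 UTF-16 code units, but 30 UTF-8 bytes (3 bytes each).
const value = "€".repeat(10);
console.log(value.length);                            // 10
console.log(new TextEncoder().encode(value).length);  // 30
```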
Educational and Low-Level Code Inspection
Understanding UTF-8 at the byte level is a foundational skill for anyone writing parsers, working with binary formats, or reviewing network traffic. The character breakdown table on this tool maps each character to its code point and byte sequence, making it a practical reference for learning how variable-width encoding works in practice.
UTF-8 Encoding in JavaScript — Code Examples
The browser-native TextEncoder API is the standard way to produce UTF-8 bytes in JavaScript environments. It is available in all modern browsers and in Node.js since version 11.
Basic encode
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello €");
console.log(bytes);
// Uint8Array(9) [72, 101, 108, 108, 111, 32, 226, 130, 172]
Convert bytes to a hex string
const hex = Array.from(bytes)
.map(b => b.toString(16).padStart(2, "0"))
.join(" ");
console.log(hex);
// "48 65 6c 6c 6f 20 e2 82 ac"
Percent-encoded (URL-safe) UTF-8
console.log(encodeURIComponent("Café"));
// "Caf%C3%A9"
Java
import java.nio.charset.StandardCharsets;
String text = "Hello 🌍";
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
// Outputs: 72 101 108 108 111 32 240 159 140 141
Python
text = "Hello €"
utf8_bytes = text.encode("utf-8")
hex_output = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_output)
# 48 65 6C 6C 6F 20 E2 82 AC
UTF-8 vs Other Encodings
UTF-8 is not the only Unicode encoding, but it is the most widely adopted. Understanding the trade-offs is important when choosing an encoding for a new system.
| Encoding | Byte width | ASCII compatibility | Web use |
|---|---|---|---|
| UTF-8 | 1–4 bytes (variable) | Yes — first 128 characters are identical | ~99% of web pages |
| UTF-16 | 2 or 4 bytes | No | Used internally by Windows and Java runtimes |
| UTF-32 | 4 bytes (fixed) | No | Rare; high memory cost for Latin text |
| ASCII | 1 byte | n/a (is ASCII) | Subset of UTF-8; cannot represent non-Latin text |
| ISO-8859-1 | 1 byte | Partial | Legacy; only covers Western European characters |
UTF-32 offers simple indexing because every character is exactly four bytes, but it quadruples storage costs for any text composed primarily of ASCII characters. UTF-16 is efficient for languages in the Basic Multilingual Plane but loses ASCII compatibility. UTF-8 remains the only encoding that is simultaneously backward-compatible with ASCII, space-efficient for Latin-script text, and capable of representing every Unicode code point.
Common UTF-8 Encoding Issues and How to Solve Them
Mojibake (garbled characters)
When a UTF-8 file is read using a different encoding (commonly ISO-8859-1 or Windows-1252), multibyte UTF-8 sequences are misinterpreted as multiple single-byte characters. The string "Café" may render as "Café". The fix is to declare the correct charset at every layer — in HTTP headers (Content-Type: text/html; charset=UTF-8), HTML meta tags (<meta charset="UTF-8">), database connections, and file readers.
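This failure mode can be reproduced in the browser itself: the TextDecoder API accepts legacy labels like "windows-1252", so decoding UTF-8 bytes with the wrong charset shows exactly the corruption described above. A sketch:

```javascript
// Reproduce mojibake: encode "Café" as UTF-8, then decode it as Windows-1252.
const utf8Bytes = new TextEncoder().encode("Café"); // 43 61 66 C3 A9
const wrong = new TextDecoder("windows-1252").decode(utf8Bytes);

console.log(wrong); // "CafÃ©" — the two-byte é sequence read as two characters
```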
Byte length exceeding character count
A common validation mistake is checking string.length in JavaScript for a field with a maximum byte constraint. "😀".length returns 2 in JavaScript (due to surrogate pair encoding in UTF-16), but the emoji is actually four bytes in UTF-8. Use new TextEncoder().encode(str).length for an accurate UTF-8 byte count.
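A byte-aware length check is a one-liner. A sketch, where `withinByteLimit` is a hypothetical helper name:

```javascript
// Validate a maximum *byte* length, not a character or code-unit length.
function withinByteLimit(str, maxBytes) {
  return new TextEncoder().encode(str).length <= maxBytes;
}

console.log("😀".length);              // 2 — UTF-16 code units, misleading
console.log(withinByteLimit("😀", 3)); // false — the emoji is 4 UTF-8 bytes
console.log(withinByteLimit("😀", 4)); // true
```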
BOM (Byte Order Mark) compatibility issues
Some Windows tools write a UTF-8 BOM (EF BB BF) at the start of files. While harmless in most contexts, some parsers and systems treat the BOM as literal characters, causing parsing failures. If your tool produces unexpected leading characters in output, check for a BOM in the source file.
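Detecting and stripping the BOM is a simple prefix check on the raw bytes. A sketch, where `stripBom` is a hypothetical helper name:

```javascript
// Detect and strip a UTF-8 BOM (EF BB BF) from the front of a byte array.
function stripBom(bytes) {
  const hasBom = bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x48, 0x69]); // BOM + "Hi"
console.log(new TextDecoder().decode(stripBom(withBom))); // "Hi"
```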
Invalid byte sequences
UTF-8 has strict rules for valid byte sequences. Bytes in the range 0x80–0xBF are continuation bytes and cannot appear as leading bytes. Invalid sequences can arise from data corruption, incorrect encoding conversion, or concatenation of byte strings from different sources. Always validate input encoding before processing.
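Validation can be done with TextDecoder's `fatal` option, which throws on invalid sequences instead of silently substituting U+FFFD replacement characters. A sketch, where `isValidUtf8` is a hypothetical helper name:

```javascript
// Reject invalid UTF-8 instead of silently replacing bad bytes with U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(new Uint8Array([0xE2, 0x82, 0xAC]))); // true  — valid €
console.log(isValidUtf8(new Uint8Array([0x80])));             // false — lone continuation byte
```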
Frequently Asked Questions
What is the difference between UTF-8 encoding and decoding?
Encoding converts a human-readable string into raw bytes. Decoding reverses the process — it takes a byte sequence and reconstructs the original string. Testsigma provides a dedicated UTF-8 Decoder for the reverse operation.
