HTML Tidy can clean up HTML documents by normalizing them, stripping comments, sorting element attributes alphabetically, and outputting consistent pretty-printed markup. The result is an HTML document with less unique data and more consistent, repeating patterns. This yields improved Gzip/DEFLATE compression rates compared to a less neatly organized document.
You can take these principles one step further and get even better compression rates by pre-processing and re-arranging HTML elements in ways that don't affect the document's presentation or semantics. The metadata elements in the <head> element are prime candidates for this type of optimization.
Similar data compresses well, but similar data a short byte-distance apart compresses even better. That's all you need to know about how the relevant compression algorithm works for now; I'll get back to a more technical explanation at the end of the article.
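You can see the effect with a small Python example using the standard-library zlib module (which implements DEFLATE). The element values below are made up and the exact byte counts will vary, but arranging near-identical lines next to each other usually compresses a little smaller than interleaving them with other content:

    import zlib

    # Two byte-identical sets of lines, arranged differently: one interleaves
    # two families of similar elements, the other groups each family together
    # so that near-identical lines sit a short byte-distance apart.
    metas = [f'<meta content="https://example.com/img-{n}.png" property="og:image">\n'
             for n in range(20)]
    links = [f'<link href="https://example.com/page-{n}.html" rel="prefetch">\n'
             for n in range(20)]

    interleaved = "".join(m + l for m, l in zip(metas, links)).encode()
    grouped = "".join(metas + links).encode()

    # Same lines, same uncompressed length; the grouped arrangement typically
    # compresses slightly smaller because back-references travel shorter distances.
    print(len(interleaved), len(zlib.compress(interleaved, 9)))
    print(len(grouped), len(zlib.compress(grouped, 9)))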
Webpages commonly need to include duplicated metadata to ensure compatibility with popular apps and services. Due to poor standardization efforts, this requires a lot of duplicated data.
The document order of the metadata elements <link>, <meta>, and <title> (with exceptions) doesn't affect how the document is displayed. These elements can be reordered losslessly. In my implementation, I've opted to leave the <script> and <style> elements unmodified below all the other metadata elements, as their order may be important.
The order of <link rel="stylesheet"> elements can affect the applied styles if your layout depends on stylesheet overrides. However, that means you're sending redundant data to the client, and you should stop and fix that before proceeding with the optimization discussed in this article. The <noscript>-wrapped lazy-loading <link rel="stylesheet"> anti-pattern might also break when you move <link> elements about.
A second quick note on compatibility requirements: the <meta charset="utf-8"> element must appear as the first child element of <head> to maintain compatibility with ancient libraries and tools still in use today.
Implementation
HTML elements should already have been normalized to a single line per element with alphabetically sorted element attributes. Again, this is something HTML Tidy, or a pretty-printing or HTML-minifying tool, can handle for you. You can build on that base and extract all the <link>, <meta>, and <title> elements from the <head> section of the document.
Once you've separated out that list of elements, add the <meta charset="utf-8"> element back to the document and drop it from your list. Then all you need to do is to sort your list and put everything back into the document.
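Here's a rough sketch of those steps in Python, using BeautifulSoup as the parser (one possible choice; any HTML parser that can re-serialize the document will do). The sort_key function is a deliberately naive placeholder that gets refined further down:

    from bs4 import BeautifulSoup  # pip install beautifulsoup4


    def sort_key(tag) -> str:
        # Simplest possible key: the element's own serialized markup. The
        # attributes are assumed to be alphabetized already (e.g. by HTML Tidy).
        return str(tag)


    def sort_head_metadata(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        head = soup.head
        if head is None:
            return html

        # Pull out the reorderable metadata elements. <script> and <style> are
        # left alone, so they end up below the sorted metadata.
        metadata = [tag.extract() for tag in head.find_all(["link", "meta", "title"])]

        # <meta charset="utf-8"> goes straight back in as the first child of <head>.
        charset = [t for t in metadata if t.name == "meta" and t.get("charset")]
        others = [t for t in metadata if not (t.name == "meta" and t.get("charset"))]
        for position, tag in enumerate(charset):
            head.insert(position, tag)

        # Sort the rest and put it back right after the charset declaration.
        for position, tag in enumerate(sorted(others, key=sort_key), start=len(charset)):
            head.insert(position, tag)

        return str(soup)

Depending on the parser, re-serializing can reformat the markup slightly, so this step belongs in the same pipeline as the normalization and minification passes mentioned earlier.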
When considering the sorting order, the unique <title>Example Title</title> element can be sorted as if it were <meta content="Example Title" name="title">. This element is only used once, but this sorting order will place it alongside other metadata elements that duplicate its value fully or partially.
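In the sketch above, that amounts to special-casing <title> in the sort key and deriving its key from the hypothetical equivalent <meta> element instead of its own markup:

    def sort_key(tag) -> str:
        if tag.name == "title":
            # Sort <title>Example Title</title> as though it were written as
            # <meta content="Example Title" name="title">.
            return f'<meta content="{tag.get_text()}" name="title">'
        return str(tag)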
The Open Graph Protocol annoyingly uses a <meta> element to express linking relationships instead of the dedicated <link> element. You can optimize things further by moving <meta> elements whose value is a URL down to the bottom of the <meta> elements (just before the first <link> element).
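Continuing the sketch, one way to express that extra rule is a two-part sort key: <meta> elements whose content looks like a URL get a group number that places them after the other <meta> elements (and the <title>) but before the <link> elements. The group numbers themselves are arbitrary:

    def sort_key(tag) -> tuple:
        if tag.name == "title":
            serialized = f'<meta content="{tag.get_text()}" name="title">'
        else:
            serialized = str(tag)

        content = tag.get("content", "") if tag.name == "meta" else ""
        if content.startswith(("http://", "https://")):
            group = 1  # URL-valued <meta>: below the other <meta> elements ...
        elif tag.name == "link":
            group = 2  # ... but just above the <link> elements.
        else:
            group = 0  # everything else sorts first

        return (group, serialized)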
These pre-processing optimizations should yield a measurable improvement in the document compression rates. Here's a short example of what a <head> section sorted using the logic outlined in this article could look like (with placeholder values):
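    <!-- Illustrative example: attributes alphabetized, one element per line. -->
    <head>
    <meta charset="utf-8">
    <meta content="An example page." name="description">
    <meta content="An example page." property="og:description">
    <title>Example Title</title>
    <meta content="Example Title" property="og:title">
    <meta content="width=device-width, initial-scale=1" name="viewport">
    <meta content="https://example.com/cover.png" property="og:image">
    <meta content="https://example.com/page/" property="og:url">
    <link as="image" href="https://example.com/cover.png" rel="preload">
    <link href="/css/site.css" rel="stylesheet">
    <link href="https://example.com/page/" rel="canonical">
    <script defer src="/js/site.js"></script>
    </head>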
You may want to study what you have in the <head> section of your documents and apply further optimizations. However, removing redundant tags and generally cleaning it up a bit may very well give you a better end result than implementing the micro-optimizations discussed in this article.
How much does all of this save?
I'm not gonna lie. This isn't a magical solution that will save you a whole lot of bytes. The prerequisites for implementing it, like normalizing the document and sorting element attributes, will likely yield bigger compression gains than the metadata sorting itself. You should see better results the more metadata your documents contain.
However, the methods discussed here can be automated. This is likely only relevant if you're implementing a web-optimization library, or writing an article about it for your blog.
I applied the metadata-sorting pre-processing optimizations discussed in this article to every page on this blog. It saved a median of 26 bytes from the compressed file size per page. That's about 0.58 % of the average total compressed file size. The size of the uncompressed files doesn't change, as they still contain the same number of bytes.
I'll need to briefly go into how the DEFLATE algorithm works to better explain why these optimization steps save any data at all. It all comes down to making data more similar to data stored close to it in the file.
DEFLATE compression works in two stages. The first stage looks for repeated byte-sequences. For our purposes, you can think of it as identifying repetitions of the same substrings within a text. DEFLATE replaces a duplicated sequence with a back-reference: how far back it saw the previous copy of the sequence (the distance) and how much of it is repeated (the length). This information tells the decompressor where to find the data and how much of it to copy. For example, in the text "banana banana", the second "banana" can be replaced with a back-reference of distance 7 and length 6.
The second stage uses a bit-reduction technique called Huffman coding. Essentially, frequently used bytes are assigned shorter codes and rarer ones get longer codes (requiring more bits to store them). There's more to it than that, but that's the short version.
For text documents, this essentially builds a new compressed character-encoding table per file based on what characters are common in the file. The same trick is also used to encode the distance and length information of the de-duplication back-references in the file.
In other words, by sorting the metadata elements you restructure the document in a way that increases repetition and reduces the distances between copies of the same byte-sequences. That enables DEFLATE to save a few bytes by encoding this information with more efficiently bit-reduced back-references.