Modern file systems like APFS, Btrfs, XFS, and ZFS support deduplicating whole files and chunks of files. The same applies to network sharing protocols like BitTorrent and IPFS. Can storage deduplication be used to reduce the storage requirement for something like the intro sequences of TV shows? What other options are there?
Most episodes of the same TV show will have the same intro sequence. No one wants to waste their available storage space by storing multiple copies of the same data. (Other than for backup purposes, of course.) Let’s explore the available storage deduplication methods and how they might apply to this problem.
There are a few different approaches to file system-level storage deduplication. The gist of it is that identical data chunks can be referenced by multiple files without consuming additional storage space. The exact details of how it's implemented vary a lot, and I'll touch on a few throughout this article. At the end, I'll also discuss some alternatives to deduplicating file systems.
Before a large file is written to any storage medium, it's broken up into more manageable chunks. The chunk size is most often matched to the block size of the underlying storage medium. The primary method for file system deduplication is block-level deduplication. This can be done in-band, before a block is written, by comparing its hash against a table of the hashes of every block already stored. The downside of this method is that it requires huge amounts of memory to hold the comparison tables. Alternatively, it can be done out-of-band at a later time.
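To make that concrete, here's a minimal sketch of the out-of-band variant in Python. It assumes a 4 KiB block size and a couple of hypothetical episode files: hash every fixed-size block, and any hash that occurs more than once marks a block that could be shared.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assume the file system uses 4 KiB blocks

def block_hashes(path):
    """Yield (block_index, sha256_digest) for every fixed-size block in a file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield index, hashlib.sha256(block).digest()
            index += 1

# Group identical blocks across two hypothetical episode files.
occurrences = defaultdict(list)
for path in ["episode-01.mkv", "episode-02.mkv"]:
    for index, digest in block_hashes(path):
        occurrences[digest].append((path, index))

shared = {d: locs for d, locs in occurrences.items() if len(locs) > 1}
print(f"{len(shared)} unique blocks appear more than once")
```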
This method isn't useful for our scenario. The intro sequence might not start at the same wall-clock time in each episode. Even if it did, it would still be encoded and stored at different byte offsets in each episode's file. It's extremely unlikely that two episodes would be chunked at the exact points where the intro sequence begins and ends.
There's a more fundamental problem with this approach that has to do with the files themselves. Video codecs compress the raw footage in an inherently lossy and non-deterministic way. Re-encoding the same video sequence twice is unlikely to produce the exact same bytes. The encoder's state would also be primed differently by the data that came before the intro in each episode. The video may look identical to humans, but the files would be completely different.
You could compensate for this problem by encoding the intro sequence once and splicing it in at the appropriate point in each file. Many video container formats support splicing different clips together into one file. Lossless video splicing is complicated on its own, and the task gets even harder when you also want to preserve the exact underlying file structure.
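The splicing half, at least, can be sketched with ffmpeg's concat demuxer, which joins clips with stream copying rather than re-encoding. The file names are hypothetical, the clips would need identical codec parameters and keyframe-aligned cuts, and this makes no attempt to preserve the exact structure of the original file:

```python
import os
import subprocess
import tempfile

# Hypothetical clips; they must share identical codec parameters for stream copy.
clips = ["s01e01-before-intro.mkv", "shared-intro.mkv", "s01e01-after-intro.mkv"]

# The concat demuxer reads a text file listing one clip per line.
# Absolute paths avoid ambiguity about what relative paths are resolved against.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as list_file:
    for clip in clips:
        list_file.write(f"file '{os.path.abspath(clip)}'\n")

# "-c copy" splices the existing streams together without re-encoding them.
subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", list_file.name, "-c", "copy", "s01e01-spliced.mkv"],
               check=True)
```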
This still wouldn’t be enough to align the chunks for a fixed-length file chunker. There are two options that could take care of this problem as well.
The first option would be to add zero-byte padding before the intro sequence to align it with the chunker. This would require precise knowledge of the target file system and the media container format. Btrfs, OCFS2, and XFS on Linux support an API called fideduperange that can be used to deduplicate chunk-aligned identical data.
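For illustration, a call to that ioctl from Python could look roughly like the sketch below. It assumes Linux, a supporting file system, and that the shared data is already aligned to the file system's block size; the ioctl number and struct layout reflect my reading of linux/fs.h, so verify them before relying on this.

```python
import fcntl
import os
import struct

# Assumed value of FIDEDUPERANGE: _IOWR(0x94, 54, struct file_dedupe_range).
FIDEDUPERANGE = 0xC0189436

def dedupe_range(src_path, src_offset, length, dst_path, dst_offset):
    """Ask the file system to share one block-aligned extent between two files."""
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_RDWR)
    try:
        # struct file_dedupe_range, followed by one file_dedupe_range_info.
        # Offsets and length should be aligned to the file system block size.
        arg = struct.pack("=QQHHI", src_offset, length, 1, 0, 0)
        arg += struct.pack("=qQQiI", dst_fd, dst_offset, 0, 0, 0)
        buf = bytearray(arg)
        fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
        # The kernel reports how many bytes it deduplicated in the info struct.
        bytes_deduped = struct.unpack_from("=qQQiI", buf, 24)[2]
        return bytes_deduped
    finally:
        os.close(src_fd)
        os.close(dst_fd)

# Hypothetical call: share a 16 MiB intro that starts 2 MiB into each file.
# deduped = dedupe_range("episode-01.mkv", 2 * 2**20, 16 * 2**20,
#                        "episode-02.mkv", 2 * 2**20)
```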
The second option would be to replace the fixed-length chunker with something smarter. A rolling (content-defined) chunker reads at least a minimum length into the file, and then looks for a pattern in the data at which to make the cut. It's not guaranteed to, but it's much more likely to chunk different files at the same places and thus produce duplicate blocks. No file system currently uses a rolling chunker, however.
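Here's a minimal sketch of such a content-defined chunker, using the simple "gear" rolling hash that schemes like FastCDC build on; the sizes and mask below are arbitrary illustrative values.

```python
import random

# A deterministic table of 256 pseudo-random values drives the rolling hash
# (the "gear" construction used by content-defined chunkers such as FastCDC).
random.seed(42)
GEAR = [random.getrandbits(32) for _ in range(256)]

MIN_CHUNK = 256 * 1024        # never cut before 256 KiB
MAX_CHUNK = 4 * 1024 * 1024   # always cut by 4 MiB
MASK = (1 << 20) - 1          # on average, a cut roughly every 1 MiB of eligible data

def chunk_boundaries(data):
    """Yield cut points chosen by the content of the data, not by fixed offsets."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield len(data)
```

Because the cut points depend only on the bytes around them, two files containing the same spliced-in intro would tend to be cut at the same places inside it, no matter where in each file the intro starts.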
There are also other alternatives to consider altogether. I mentioned earlier that video containers can splice different clips into one another. A few media container formats — including MKV, WebM, and QuickTime — let you reference media segments external to the file itself. You can encode the intro sequence once in a separate file (or even as part of, e.g., the first episode), and then include references to that sequence in the other files. This is a variation of the methods I've already discussed. However, you run into different problems, like limited player support for external segment references.
A better and more widely supported approach is to create playlists for each episode. There's wide support for the XSPF and M3U8 media playlist formats. You can embrace media chunks fully and save them as separate files instead of teaching the file system or the media player magic tricks. It's much easier to create a playlist of deduplicated video files than it is to carefully manipulate the video files to make them more deduplication-friendly. Each episode would consist of at least three video files: before the intro, the shared intro sequence, and after the intro. As a nice bonus, the forward button would let you skip right past the intro sequence!
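Generating the playlist itself is trivial. Here's a sketch that writes a bare-bones M3U playlist for one episode, with hypothetical file names:

```python
# Write a simple M3U playlist that stitches an episode back together
# from three separate video files.
segments = [
    "s01e01-before-intro.mkv",
    "shared-intro.mkv",
    "s01e01-after-intro.mkv",
]

with open("s01e01.m3u8", "w", encoding="utf-8") as playlist:
    playlist.write("#EXTM3U\n")
    for segment in segments:
        playlist.write(f"{segment}\n")
```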
So, how much space could you save? Let's look at an episode of Stargate SG-1. The average episode is 42 minutes and 25 seconds long. The intro sequence is about 58 seconds, or 2.28 % of the episode. That's about the amount you could hope to save under ideal conditions.
But let's assume you're willing to expend even more effort to deduplicate your media files! The Metro-Goldwyn-Mayer lion roars for 4 seconds at the start and end of each episode. That could be cut and deduplicated even within the same episode. The Double Secret Productions and Gekko Film Corp outros at the end play for 3 seconds each. Including the intro, you can reliably deduplicate at least 72 seconds (2.83 %) of each episode.
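A quick check of the arithmetic behind those percentages:

```python
episode = 42 * 60 + 25              # 2,545 seconds per average episode
intro = 58                          # opening intro sequence
logos = 4 + 4 + 3 + 3               # MGM lion twice, plus the two production outros

print(f"intro only:    {intro / episode:.2%}")            # ~2.28 %
print(f"intro + logos: {(intro + logos) / episode:.2%}")  # ~2.83 %
```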
The end credits are hard to deduplicate because they differ from episode to episode. It would also be difficult to sync up the music. However, with some effort, it should be possible to deduplicate at least parts of the end credits between some episodes.
A show like Stargate SG-1 also features a lot of computer-generated imagery that's reused across episodes. It's conceivably possible to deduplicate at least some of the spinning-Stargate and outside-view-of-a-spaceship scenes. However, the show tries to differentiate these scenes between episodes using tricks like flipping the scene horizontally.
All of this requires a huge amount of effort for very little gain. It might be worth it if you store an enormous amount of TV series at very high quality. However, it would be a huge amount of work, and there aren't any tools to automate this task.
The appeal of an efficient deduplicating file system is that it would theoretically handle all of this automatically. It sounds reasonable that a deduplicating file system should be able to handle duplicated media content. Unfortunately, as explored in this article, the problem gets messy as soon as you involve real data. Real data is often messier than you might assume at first glance.
There might be a parallel universe out there that's similar to our own, except that they never managed to significantly increase the capacity and reduce the cost of digital storage. In that parallel universe, I'm sure a whole lot more effort has been put into automating and making this type of optimization feasible. Or maybe they've focused all their energy on absurdly efficient compression and video codecs instead?
Fun bonus fact: The repeated still frames from various TV shows in the feature illustration at the top of the article may be deduplicated! The image may be served as either a PNG or lossless WebP file in a size appropriate to your device. In the smaller image sizes, the duplicated frames get deduplicated by each format’s native compression. It doesn’t work in the larger sizes because of size limitations with the look-ahead/behind buffers of the image formats.