Complex Lossless Compression (Pt 2)
The first part of this article dealt with part of creating the dictionary and is at Complex Lossless Compression (Part 1 of 2).
We can now check for other phrases and the following sequences show up.
|9||_in_the_||2||8||Lines 4 and 8|
|10||ever_||2||5||In whatever and never|
|11||_and_||2||5||Lines 3 and 8|
|13||the_||2||4||Lines 1 and 5|
|14||and||2||3||In Island and Landing|
|15||end||3||3||Line 1 and in defend and surrender|
|16||nce||2||3||In France and confidence|
Checking the rest of the sequences shows that "_growing_" and "ever_" remain. The sequence "end" remains as it has 3 entries whereas "nce" is removed as it only has 2 entries.
The sequences "_in_the_" and "in_the_" from the "we shall fight" reduction can be combined as 4 entries of "in_the_" resulting in a saving of 7 characters.
The sequence "the_" should also go based on only having 2 entries. However it also forms part of the sequences "on_the_" and "in_the_". The rules we are using would treat each of these separately and neither reaches the breakeven point of at least 2 characters reduction. The "on_the_" sequence would save 4 characters in the dictionary but require 3 new code tags. The "in_the_" sequence would save 4 characters but require 4 new code tags. As a result the "the_" sequence is removed.
This leaves the 2 entries of "_and_", 1 entry of "_and" and 2 entries of "and". This is marginal but just reaches the breakeven point of 2 characters reduction.
The Complex Lossless Dictionary
The new dictionary looks like this:
|0||_we_shall_||9||10||Lines 2 to 10|
|2||on_the_||3||7||Lines 1, 6 and 7|
|3||in_the_||4||7||Lines 2, 4, 8 and 9|
|4||fight_||7||6||Lines 2, 3, 4, 6, 7, 8 and 9|
|5||ever_||2||5||In whatever and never|
|6||and||5||3||Lines 3, 4, 8 and in Island and Landing|
|7||end||3||3||Line 1 and in defend and surrender|
This is a much smaller dictionary then the Simple Lossless with very little redundancy.
The Compressed File
Each of phrases is replaced with its code resulting in the compressed file looking like this:
Again this is very different from the original file but if the phrases are substituted for the tags then the original file is restored.
The file that is output after compression consists of the reduced file with the code tags inserted plus the dictionary of phrases and the codes that now represent them. The figures in brackets are the equivalents for the Simple Lossless compression.
The original file consisted of 391 characters of which 233 (259) have been taken out leaving 158 (132) characters. This is 26 characters more than the Simple Lossless compression. To mark the phases 35 (27) tags have been inserted which means the file is 193 (159) characters long. Thus this approach has left a larger file of 34 characters.
However the dictionary is much smaller. It has 8 (12) entries and 50 (114) characters for the phrases giving a total for the dictionary of 58 (126) characters. This is due to the consolidation of similar entries in the dictionary.
Therefore the whole file is 251 (285) characters long. Compared with the original file of 391 characters this is compression of 64% (73%) of the original size.
Therefore by applying more complex rules a higher level of compression can be achieved. There is limit due to the need to restore the file exactly as it was. To achieve greater compression then lossy techniques need to be used.
Understanding How The Techniques Work
To understand how the techniques work there are a series of articles on both techniques:
- Data Compression is an overview of lossy and lossless data compression techniques.
- Simple Lossless Compression describes the basics for understanding data compression including finding patterns and creating dictionaries using a text based example.
- Lossy Compression part 1 and part 2 deals with creating smaller files using the lossy technique resulting in some alterations in the file.