[Jump to start of article]

Home > System Issues > Lossy Compression (Part 2 of 2)

Lossy Compression (Pt 2)

The first part of this article dealt with part of creating the dictionary and is at Lossy Compression (Part 1 of 2).

Other Sequences

We can now check for other phrases and the following sequences for consideration show up.

Code Phrase Count Size Notes
8 _growing_ 2 9 Line 4
9 _in_the_ 2 8 Lines 4 and 8
10 ever_ 2 5 In whatever and never
11 _and_ 2 5 Lines 3 and 8
12 _and 1 4 Line 4
13 the_ 2 4 Lines 1 and 5
14 and 2 3 In Island and Landing
15 end 3 3 Line 1 and in defend and surrender
16 nce 2 3 In France and confidence

Checking the rest of the sequences shows that "_growing_" and "ever_" remain. The sequence "end" remains as it has 3 entries whereas "nce" is removed as it only has 2 entries.

The sequences "_in_the_" and "on_the_" from the "we shall fight" reduction can be combined as 7 entries of "on_the_" resulting in a saving of 7 characters.

The sequence "the_" forms part of the sequence "on_the_". The "on_the_" sequence would save 4 characters in the dictionary but require 7 new code tags. As a result the "the_" sequence is removed.

This leaves the 2 entries of "_and_", 1 entry of "_and" and 2 entries of "and". These are combined into the most common form.

The Lossy Dictionary

The dictionary looks like this:

Code Phrase Count Size Notes
0 ,_we_shall_ 10 11 Lines 1 to 10
1 _growing_ 2 9 Line 4
2 on_the_ 7 7 Lines 1, 2, 4, 6, 7, 8 and 9
3 fight_ 7 6 Lines 2, 3, 4, 6, 7, 8 and 9
4 ever_ 2 5 In whatever and never
5 _and_ 5 5 Lines 3, 4, 8 and in Island and Landing
6 end 3 3 Line 1 and in defend and surrender

This is a smaller dictionary then the Complex Lossless with very little redundancy.

The Compressed File

Each of phrases is replaced with its tag code resulting in the compressed file looking like this:


The Restored File

This is very different from the original file however if the phrases are substituted for the tags then an approximate, but readable version of the original file is restored.

we_shall_fight_with_growing_confidence_and__growing_strength_ on_the__air,_

The differences are highlighted in the text and are:

It is still recognisable as the original text. In many cases people would not spot the difference as many believe the key line is one of:

All of which are wrong but have the same feel to them.

Compression Achieved

The file that is output after compression consists of the reduced file with the tags inserted plus the dictionary of phrases and the codes that now represent them. The figures shown in brackets are the equivalents for the Complex Lossless compression.

The original file consisted of 391 characters of which 255 (233) have been taken out leaving 136 (158) characters. This is 26 characters more than the Simple Lossless compression. To mark the phrase positions 36 (35) tags have been inserted which means the file is 172 (193) characters long. Thus this approach has resulted in a smaller file of 21 characters.

The dictionary is also smaller. It has 7 (8) entries and 46 (50) characters for the phrases giving a total for the dictionary of 53 (58) characters. This is due to the consolidation of similar entries in the dictionary.

Therefore the whole file is 225 (251) characters long. Compared with the original file of 391 characters this is compression of 58% (64%) of the original size.

Comparison with Lossless Techniques

Even with this text based example more compression has been obtained with very little loss in quality. However in practice lossy techniques are not applied to text due to the subtle changes in meaning that can occur. However the JPEG standard for pictures allows much greater compression levels and even at 10% of the original file very little is lost as people cannot perceive the differences.

Understanding How The Techniques Work

To understand how the techniques work there are a series of articles on both techniques: