Complex Lossless Compression (Pt 1)
The article on "Simple Lossless Data Compression" showed that using straightforward rules the sample document could be compressed down to 73% of its size. However within the dictionary there are several character strings which are repeated making for a large dictionary.
The most noticeable repeated long phrase is "we shall fight" which has 3 entries in the dictionary. Among the smaller words "the" appears in 4 entries in the dictionary. If these could be reduced then a smaller dictionary would result.
Another characteristic to check is just because a string is repeated does it result in a saving in file size. An example of this is the sequence "nce" which appears twice. If it was kept in the file it would use 6 characters. However putting it in the dictionary require 3 characters for the phrases, 1 character for the dictionary code and 2 characters as code tags making a total of 6 and thus no reduction in size. This has an overhead of creating a bigger dictionary and requiring more processing for no benefit.
Revised Finding Repeating Patterns Rules
Compared with "Simple Lossless" the rules I am using are:
- Start by finding the largest sequence of characters that repeats at least once. (as before)
- Compare smaller sequences with larger one to see if the larger one can be split.
- Look at smaller sequences until you reach three character sequences. (as before)
- Only accept a sequence if there it will result in an at least two character reduction in file size.
The fourth rule means that to be viable there must be 3 repetitions of a 3 or 4 character sequence to be viable. For all greater sequence sizes 2 repetitions is enough.
The "We Shall Fight" Sequences
Applying the new rules to the example text (see Simple Lossless) for the phrase "we shall fight" results in some considerable savings. These are the repetitions of the phrase:
|1||,_we_shall_fight_on_the_||3||24||Lines 3, 6 and7|
|2||,_we_shall_fight_in_the_||2||24||Lines 8 and 9|
Following the rules shows that first we find in number 1 three examples of the long sequence ",_we_shall_fight_on_the_". The next long sequence is two examples of ",_we_shall_fight_in_the_". Comparing the phrases shows they have a common 17 character sequence.
The savings from combining them would be the 17 characters in the ",_we_shall_fight_" sequence leaving it as "in_the_". The cost would be a new dictionary entry for "on_the_" of 7 characters plus 1 for the code and 3 tags in the text making a total of 11 characters. This shows it is worthwhile as overall 6 characters are saved which is well above the target of at least 2.
The third phrase is ",_we_shall_fight_in_". The common factor of ",_we_shall_fight_" can be included in the previous entry and the three characters of "in_" put back in the main text.
The fourth phrase is ",_we_shall_fight_" just requires a tag in the text.
The next fifth phrase is ",_we_shall_" consisting of 11 characters. The phrase ",_we_shall_fight_" can be split into two. If so it will require 1 character for the code for "fight_" and 8 characters for new tags in the text. This is a saving of 3 characters and is accepted.
The next sixth phrase is ";_we_shall_" consisting of 11 characters with different punctuation. For this to be incorporated into the savings would require the punctuation to be reinserted back into the text. This would cost 8 characters but means the resulting 10 characters "_we_shall_" can be included giving a saving of 2 characters.
Lastly the seventh variation "We_shall_" can only be included by excluding the first letter also it is not preceded by a space. Therefore to included it would mean reinserting 18 characters in the text in order to include 8 characters back. This adds size so this phrase is not included. As a result of this process we have four phrases to include in the dictionary: "_we_shall_", "fight_", "on_the_" and "in_the_".
The second part of this article deals with creating the compressed file and is at Complex Lossless Compression (Part 2 of 2).
Understanding How The Techniques Work
To understand how the techniques work there are a series of articles on both techniques:
- Data Compression is an overview of lossy and lossless data compression techniques.
- Simple Lossless Compression describes the basics for understanding data compression including finding patterns and creating dictionaries using a text based example.
- Lossy Compression part 1 and part 2 deals with creating smaller files using the lossy technique resulting in some alterations in the file.