Presented by Mark Hosang
NSF Grant#
CCR0081952
1. Introduction
In natural languages, there is a certain amount of dependency between characters. Knowing the previous characters, one could make an educated guess about what the next character could be. We refer to this dependency as “correlation” and one goal of any good compression algorithm is to eliminate all dependency between characters, thus reducing the amount of data needed to convey the same information. With the increasing size of files and the low bandwidth of modem connections, compression allows one to transmit the same amount of data in a shorter span of time.
This paper presents a detailed description of a lossless compression algorithm intended for use on files with non-uniform character distributions. This algorithm takes advantage of the relatively small distances between character occurrences once we remove the less frequent characters. This allows the algorithm to create a compressed version of the file that, when decompressed, is an exact copy of the file that was compressed.
This algorithm takes a file as input and scans it to create a character frequency model for the compression phase. Compression begins by encoding the least frequent characters first, moving up to the most frequent. To encode a character, we record the character, the number of instances and the distance between each instance. Then, we remove this character from the file. Subsequent deletions reduce the distance between the more frequent characters until there is only one character left to encode. Since the file now is totally comprised of this character, all that is necessary is the character and the number of instances. The decompression algorithm works in a very similar manner but begins with the most frequent character and works to the least frequent using the distances to reconstruct the original file.
The preprocessing stage of the algorithm consists of scanning the input file, counting the number of occurrences of each character, and then sorting the characters by their frequency. The algorithm then proceeds to the actual compression phase. Starting with the most infrequent character, it traverses the file, removing that character while storing the distance (the number of characters from the last removal.) For example, if “p” was to be encoded and the file had the string “peep”, the number of characters separating the first “p” and next “p” would be two. When the algorithm reaches the most frequent character, the file is now comprised solely of this character. Thus, it is only necessary to record the number of instances.
Example of character elimination (table 2-1):
In the following example, we will now illustrate how we perform the character elimination. Each line is the removed character followed by the recorded distances and the resulting string after deletion.
|
Character Removed |
Distances |
Resulting String |
|
Starting string |
- |
The zoologist didn’t like the zephyre. |
|
‘ |
18 |
The zoologist didnt like the zephyre. |
|
. |
36 |
The zoologist didnt like the zephyre |
|
T |
0 |
he zoologist didnt like the zephyre |
|
g |
8 |
he zooloist didnt like the zephyre |
|
k |
20 |
he zooloist didnt lie the zephyre |
|
n |
15 |
he zooloist didt lie the zephyre |
|
p |
27 |
he zooloist didt lie the zehyre |
|
r |
29 |
he zooloist didt lie the zehye |
|
s |
9 |
he zooloit didt lie the zehye |
|
y |
27 |
he zooloit didt lie the zehe |
|
d |
11, 1 |
he zooloit it lie the zehe |
|
l |
6, 7 |
he zoooit it ie the zehe |
|
z |
3, 16 |
he oooit it ie the ehe |
|
h |
0, 15, 2 |
e oooit it ie te ee |
|
i |
5, 2, 2 |
e ooot t e te ee |
|
o |
2, 0, 0 |
e t t e te ee |
|
t |
2, 1, 3 |
e e e ee |
|
<space> |
1, 0, 0, 1, 1 |
eeeee |
|
e |
NA |
<empty
string> |
Table 2-1 an example of character elimination.
When creating the compressed file, we write the character, then the number of distances minus one, and lastly, the distances for that character. Since there will always be at least one distance value associated with a character, in the compressed file we can adjust the number of instances by subtracting 1 from its value such that the lowest value is now ‘0’ instead of ‘1’. This allows fewer bits to represent the same number of distance values. Furthermore, since the characters are stored in a cascading occurrence fashion, we can denote the number of instances for a character by the difference from the previous character. We repeat the character elimination until we have recorded all characters. The encoder writes the compressed file in reverse order, writing the data for the most frequent character first and then proceeding to the least frequent characters. This removal method is reversible, which means that performing these operations in the reverse order will reconstruct the file. We added the size of the uncompressed file to simplify the decompression phase, but it is not necessary for successful decompression.
Example of compressed file format:
<Uncompressed file size>
<Most frequent character><# of instances>
<2nd most frequent character><# of instances><d1, d2, …>
<3rd most frequent character><# of instances><d1, d2, …>
.
.
.
In practice, there is a significant improvement in compression if we perform a Burrows-Wheeler Transformation (BWT) on the file about to be compressed. Table 3-1 shows an average reduction of 40% in the size of the compressed file if we perform a BWT first. BWT rearranges the file such that the resulting file has long runs of characters which are easy to compress.
|
Files |
Book1 |
Book2 |
Geo |
Bib |
Pic |
Paper1 |
Paper2 |
ProgC |
|
Orig. Size |
768711 |
610856 |
102400 |
111261 |
513216 |
53161 |
82199 |
39611 |
|
Compressed |
441063 |
363324 |
77893 |
75838 |
67891 |
34947 |
50001 |
27380 |
|
After BWT |
768823 |
610908 |
102413 |
111274 |
513255 |
53174 |
82212 |
39624 |
|
Compressed |
270230 |
191985 |
65761 |
32180 |
52200 |
18804 |
28158 |
14403 |
Table 3-1 Effect on compression after we performed a BWT,
without usual source-coding phase, on the input file.
The straightforward approach for character removal is to do a single sweep of the original file for each individual character. During each sweep, a character is removed. Thus, after each sweep, the original file requires modification. This results in a time complexity of O(S*n) where S is the number of characters to be encoded and n is the length of the file. Furthermore, this approach uses a memory space of 2n, as we must create a separate modified version after each sweep.
In our approach, we use an array of size S that holds a counter for each character in the same order as the character removal. The encoder scans the file one character at a time and searches downward in the list until it finds the match to this character. On the way down the array, each counter it encounters is incremented. When it reaches the match, the counter value is stored as the next distance and we reset the counter to zero. Because the distance counters for other characters are not incremented, it is figuratively “blind” to all characters that would have been removed before its own removal. This optimized removal scheme allows us to gather all the distance information in one pass of the file and changes the memory complexity to O(n). If a frequency model was pre-established, for instance, if the file to be compressed was English text, it is then theoretically possible to accomplish streaming during the compression phase because we do not need to do any preprocessing. The structure of the array allows for an efficient hardware implementation using S processors in a cascading parallel configuration.
To deal with the issue of bit
encoding, we write every number as a byte or sequence of bytes to the
compressed file and run an arithmetic encoder after the file has been
compiled. To deal with values larger
than 254, we have a flag value, 255, that states that the next string of bytes
is what we refer to as a “compound” number.
Let us denote N as some compound number. To write this value we divide and output 255,
and then we output N mod
255. Then we set N to
. We keep repeating
this process until N is smaller than 255, at which point we output N. When the decoder reads the sequence, it knows
to stop when an odd position of the sequence is not 255, specifying the end of
the sequence.
Example of compound number encoding:
N = 65875 output flag value 255, output 85 (N mod 255)
N = 258 (
)
output flag value 255,
output 3
N = 1 output 1
Compound number stored as: 255, 85, 255, 3, 1
4. Variants
The first variant we will discuss utilizes a minimum offset value for the distances. The encoder scans the list of distances for each character, records the minimum value, and adjusts all the distances by subtracting this value. If there is only one distance for the character, we do not calculate its offset. We observed that the offset values for the first 50-75% characters are zero, thus an offset value can be specified, called the zero offset, for when the first nonzero offset occurs. It may be possible to start from the last character and work backwards, denoting when the first zero offset occurs. It would miss some nonzero offset values, but it would ensure no zero offsets. Table 4-1 illustrates the implementation of the offset variant, where the zero ratio is the number of zeroes and the number of total offsets.
|
Files |
Book1 |
Book2 |
Geo |
Paper1 |
Pic |
Obj1 |
|
Size |
270234 |
191988 |
65837 |
18804 |
52221 |
12974 |
|
Compressed Size
Difference |
-4 |
-3 |
-76 |
0 |
-21 |
-5 |
|
Zero offset |
73 |
87 |
103 |
90 |
48 |
117 |
|
Zero ratio |
76/81 |
95/98 |
183/255 |
91/93 |
62/131 |
185/254 |
Table 4-1 Results of minimal offset variant. Neg. values denote loss in compression.
The next variant reverses the order of character
elimination, removing the most frequent characters first. Though the advantage of just encoding the
number of instances for the most frequent character is lost, the number of large
distance values, greater than 255, are reduced by 50%. Thus, we use less flag bytes to encode the
file improving overall compression by .5%.
All files except bib and pic benefited from the reversal
of the character elimination, while geo had the largest increase in
compression of 9%.
4a. Pair Variants
We can modify the algorithm further to deal with pairs of characters instead of individual characters. We refer to any combination of two characters as a pair. The following sub-variants are variations of different pair removal schemes.
In one adaptation of the pair removal scheme, we interleave the pair and single character removal. Beginning with the least frequent characters/pairs for the removal, we follow the standard algorithm. When a pair and single character have an equal number of occurrences, we give preference to single character removal to shrink the size of the available alphabet and increase the speed of pair removal. After each removal, we recalculate the frequency data for both pairs and individual characters. We observed that if the pair removal was done from least to most frequent while recounting, it was almost guaranteed that their existed a pair that occurred only once. This resulted in a compressed file consisting of pair removals that had only one instance, which in turn resulted in data expansion.
When we remove the most frequent pairs first, this results in a worse compression ratio than the single character algorithm. Instead of calculating the pair frequency once, this variant requires that we recalculate the pair frequencies after each removal. We adjust any newly created pairs after a pair deletion with a rescan of the pair frequencies. The pair removal continues until we compress the entire file. We made no special considerations when dealing with file sizes that were not even. Interestingly enough, performing a BWT on geo followed by this pair removal resulted in an even poorer compression ratio than if the BWT had not been performed. Table 4-2 shows the compressed file size using the pair method illustrated above and how much larger these files were from the single character elimination set.
|
File |
Bib |
Book1 |
Book2 |
Paper1 |
Geo |
Geo (No BWT) |
|
Size |
52509 |
382720 |
276848 |
35137 |
92909 |
89058 |
|
Larger by |
10733 |
59920 |
41429 |
13783 |
23516 |
19665 |
Table 4-2 Results of Pair removal variant
A variant to the above stated approach is to have a hybrid of both the recounting pair method and the single character elimination method. In the hybrid approach, we specify a cut-off point to stop doing pair removals, and process the remaining file with the single character elimination approach. It was observed that the pair counts had an exponential distribution curve, where most counts were five or below. With a cutoff value of one, the hybrid method showed an increase in compression over the pair method by 1k in most files. The factor of reduction with a cutoff of one is doubled at five and tripled at twenty. Table 4-3 illustrates typical hybrid elimination with a cut-off point of one. As the table reveals, the pair <space><space> does not exist until we removed other pairs showing the need to recalculate pair frequencies.
|
Character Removed |
Distances |
Resulting String |
|
Starting string |
- |
the hook he found at heather’s house |
|
he |
1, 6, 9, 2 |
t hook used at atr’s house |
|
ho |
2, 18 |
t ok used at atr’s use |
|
at |
11, 1 |
t ok used r’s use |
|
<space><space> |
4, 4 |
t okusedr’s use |
|
us |
4, 6 |
t okedr’s e |
|
e |
4, 5 |
t okdr’s |
|
<space> |
1, 6 |
tokdr’s |
|
d |
3 |
tokr’s |
|
k |
2 |
tor’s |
|
o |
1 |
tr’s |
|
r |
1 |
t’s |
|
s |
2 |
t’ |
|
t |
0 |
‘ |
|
‘ |
0 |
N/A |
Table 4-3 an example of pair hybrid elimination where single Character removal
begins when the number of pair occurrences is no longer greater than 1.
In our last approach to pair removal, we utilize one-pass pair elimination. We remove a pair from the list of possible pairs after we remove it from the file. When the list of pairs is empty, we then proceed to single character removal. The list is resorted after each removal and we remove the least frequent pair. We observed that the pairs were not independent of each other. Thus, removing a less frequent pair affected the frequency of the more frequent pairs. This explains why pair removal similar to ones mentioned in this paper cannot improve compression over the single character removal method.
4b. Lossy Variants
There is a lossy variant of the hybrid algorithm in which we prescan the input file and change all upper-case characters to lower-case after performing the BWT. We can only apply this modification to natural language text, in which case the conversion does not affect readability. As table 4-4 shows, the number of characters that are changed to lower-case is close to the improvement in compression. Unfortunately, the increase in compression is not large enough to justify the number of changes to the file. We can also apply this variant to single character elimination, but the gains are less pronounced, as seen in table 4-5.
|
Lower Case |
Book1 |
Book2 |
Bib |
Geo |
|
File size |
366028 |
253980 |
64933 |
80868 |
|
Improvement |
14806 |
10813 |
N/A |
1694 |
|
# of instances |
16336 |
12847 |
15247 |
17564 |
Table 4-4 Pair lower-case compression results
|
File |
Book1 |
Book2 |
Bib |
Geo |
|
Size |
262914 |
186120 |
30164 |
63950 |
|
Improvement |
7316 |
5865 |
2016 |
1811 |
|
# Changed to Lowercase |
16336 |
12847 |
15247 |
17564 |
Table 4-5 Single lower-case compression results
The last lossy variant involves manipulating the distances to improve compression. We found that simply having a cutoff point for distances was ineffective. If we removed a distance, we would have to add its value to the next distance to maintain integrity. This of course would be above the cutoff limit and would thus remove all following distances.
5. Performance of
Implementation
Table
5-1 shows the average time needed to compress the specified
|
Time (ms) |
Book1 |
Bib |
Pic |
Geo |
Book2 |
ProgC |
Paper1 |
Paper2 |
|
Encode |
5723 |
770 |
2253 |
1077 |
4777 |
244 |
510 |
623 |
|
Decode |
9283 |
1267 |
8717 |
2617 |
7857 |
400 |
847 |
1043 |
Table 5-1 Average run time of algorithm, excluding BWT and arithmetic encoder.
Table 5-2 lists the files used from the Calgary Corpus with their uncompressed size, their compressed size, and their bit to byte ratio. We obtained these values using a BWT with a block size of 200,000 characters. Burrow’s and Wheeler’s paper on the BWT1 shows that the larger block size, when used in conjunction with algorithms such as the one presented in this paper, will result in better compression. Using the Calgary Corpus achieves a weighted bit ratio of 2.89:8. This algorithm reduces a file, on average, to 31.31% of its original size.
|
File |
Size (bytes) |
Presented Algorithm |
GZIP |
BZIP2 |
UNIX Compress |
|
Bib |
111261 |
2.31 |
2.52 |
1.97 |
3.35 |
|
Book1 |
768771 |
2.81 |
3.26 |
2.42 |
3.30 |
|
Book2 |
610856 |
2.51 |
2.71 |
2.06 |
3.29 |
|
Geo |
102400 |
5.14 |
5.35 |
4.45 |
6.01 |
|
Paper1 |
53161 |
2.83 |
2.80 |
2.49 |
3.77 |
|
Paper2 |
82199 |
2.74 |
2.90 |
2.44 |
3.52 |
|
Paper3 |
46526 |
3.05 |
3.11 |
2.72 |
3.81 |
|
Paper4 |
13286 |
3.53 |
3.33 |
3.12 |
4.19 |
|
Paper5 |
11954 |
3.65 |
3.34 |
3.24 |
4.40 |
|
Paper6 |
38105 |
2.93 |
2.78 |
2.58 |
3.92 |
|
Progl |
71646 |
2.02 |
1.82 |
1.74 |
3.03 |
|
ProgC |
39611 |
2.91 |
2.68 |
2.53 |
3.87 |
|
Progp |
49379 |
2.05 |
1.82 |
1.74 |
3.11 |
|
News |
377109 |
3.01 |
3.07 |
2.52 |
3.90 |
|
Obj1 |
21504 |
4.82 |
3.84 |
4.01 |
5.23 |
|
Obj2 |
246814 |
3.06 |
2.65 |
2.48 |
4.17 |
|
Trans |
93695 |
1.84 |
1.62 |
1.53 |
3.27 |
|
Pic |
513216 |
0.82 |
0.88 |
0.78 |
0.97 |
|
Total |
3251493 |
2.50 |
2.62 |
2.13 |
3.30 |
Table 5-2 File sizes
and Bits per Byte ratios for
6. Related
Algorithms
7. Further Plans
Although we have presented data on the effectiveness of the algorithm, there are still many avenues to explore which could improve its performance. We could achieve further compression by using a BWT with a block size equal to the size of the file. Burrows’ and Wheeler’s paper illustrates that increasing the block yields significant improvements in compression up to a block size of 64M. Specifically, an increase in block size from 256K to 750K for Book1 resulted in an additional 2.4% compression.
Furthermore, if we could find a better mechanism with smaller overhead, then the one presented in this paper, for dealing with values greater than 255, this would result in further compression. The results of the reversal of the algorithm, presented in the variant section, illustrated that the current large integer scheme has a large amount of overhead, such as to offset any gains from encoding just the number of instances for the most frequent character.
Lastly, we currently do not discriminate between two characters that have the same frequency. It may be possible to see an improvement in compression by eliminating the character with the smaller combined total of distances. This would result in smaller distance values for the other character that may improve compression.
We have presented an algorithm that works well on files in which the there is a non-uniform distribution of the characters, such as text. This algorithm achieves more compression than some publicly available compression programs. However, a superior compression ratio is not the algorithm’s primary asset. Its ability to encode and decode on the fly makes it a potent tool for Internet transfers, thus giving it a practical utility over and above other lossless data compression techniques.
[1] M. Burrows and D.J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. SRC Research Report. May 1994.
[2]
[3] Mark Nelson. BWT and Arithmetic encoder source code. http://web2.airmail.net/markn.
[4] P. Subrahmanya. An Asymptotically Optimal Data Compression Based on an Inverted Index (Abstract). Data Compression Conference Proceedings. 1999.
[5] P. Subrahmanya. An Asymptotically Optimal Data Compression Based on an Inverted Index. Personal Contact.
I would like to thank all my friends who helped me prepare this paper.