A Character Elimination Algorithm for

Lossless Data Compression

 

 

Presented by Mark Hosang

NSF Grant# CCR0081952

 

 

 

November 5, 2001


1.  Introduction

 

            In natural languages, there is a certain amount of dependency between characters.  Knowing the previous characters, one could make an educated guess about what the next character could be.  We refer to this dependency as “correlation” and one goal of any good compression algorithm is to eliminate all dependency between characters, thus reducing the amount of data needed to convey the same information.  With the increasing size of files and the low bandwidth of modem connections, compression allows one to transmit the same amount of data in a shorter span of time.

This paper presents a detailed description of a lossless compression algorithm intended for use on files with non-uniform character distributions.  This algorithm takes advantage of the relatively small distances between character occurrences once we remove the less frequent characters.  This allows the algorithm to create a compressed version of the file that, when decompressed, is an exact copy of the file that was compressed.

This algorithm takes a file as input and scans it to create a character frequency model for the compression phase.  Compression begins by encoding the least frequent characters first, moving up to the most frequent.  To encode a character, we record the character, the number of instances and the distance between each instance.  Then, we remove this character from the file.  Subsequent deletions reduce the distance between the more frequent characters until there is only one character left to encode.  Since the file now is totally comprised of this character, all that is necessary is the character and the number of instances.  The decompression algorithm works in a very similar manner but begins with the most frequent character and works to the least frequent using the distances to reconstruct the original file.

 

2.  Algorithm

           

            The preprocessing stage of the algorithm consists of scanning the input file, counting the number of occurrences of each character, and then sorting the characters by their frequency.  The algorithm then proceeds to the actual compression phase.  Starting with the most infrequent character, it traverses the file, removing that character while storing the distance (the number of characters from the last removal.)  For example, if “p” was to be encoded and the file had the string “peep”, the number of characters separating the first “p” and next “p” would be two.  When the algorithm reaches the most frequent character, the file is now comprised solely of this character. Thus, it is only necessary to record the number of instances. 

 

Example of character elimination (table 2-1): 

In the following example, we will now illustrate how we perform the character elimination.  Each line is the removed character followed by the recorded distances and the resulting string after deletion. 

 

Character Removed

Distances

Resulting String

Starting string

-

The zoologist didn’t like the zephyre.

18

The zoologist didnt like the zephyre.

.

36

The zoologist didnt like the zephyre

T

0

he zoologist didnt like the zephyre

g

8

he zooloist didnt like the zephyre

k

20

he zooloist didnt lie the zephyre

n

15

he zooloist didt lie the zephyre

p

27

he zooloist didt lie the zehyre

r

29

he zooloist didt lie the zehye

s

9

he zooloit didt lie the zehye

y

27

he zooloit didt lie the zehe

d

11, 1

he zooloit it lie the zehe

l

6, 7

he zoooit it ie the zehe

z

3, 16

he oooit it ie the ehe

h

0, 15, 2

e oooit it ie te ee

i

5, 2, 2

e ooot t e te ee

o

2, 0, 0

e t t e te ee

t

2, 1, 3

e   e e ee

<space>

1, 0, 0, 1, 1

eeeee

e

NA

<empty string>

Table 2-1 an example of character elimination.

 

When creating the compressed file, we write the character, then the number of distances minus one, and lastly, the distances for that character.  Since there will always be at least one distance value associated with a character, in the compressed file we can adjust the number of instances by subtracting 1 from its value such that the lowest value is now ‘0’ instead of ‘1’.  This allows fewer bits to represent the same number of distance values.  Furthermore, since the characters are stored in a cascading occurrence fashion, we can denote the number of instances for a character by the difference from the previous character.  We repeat the character elimination until we have recorded all characters.  The encoder writes the compressed file in reverse order, writing the data for the most frequent character first and then proceeding to the least frequent characters.  This removal method is reversible, which means that performing these operations in the reverse order will reconstruct the file.  We added the size of the uncompressed file to simplify the decompression phase, but it is not necessary for successful decompression.

 

Example of compressed file format:

<Uncompressed file size>

<Most frequent character><# of instances>

<2nd most frequent character><# of instances><d1, d2, …>

<3rd most frequent character><# of instances><d1, d2, …>

                                             .

                                             .

                                             .


             

3.  Implementation

           

In practice, there is a significant improvement in compression if we perform a Burrows-Wheeler Transformation (BWT) on the file about to be compressed.  Table 3-1 shows an average reduction of 40% in the size of the compressed file if we perform a BWT first.  BWT rearranges the file such that the resulting file has long runs of characters which are easy to compress.

 

Files

Book1

Book2

Geo

Bib

Pic

Paper1

Paper2

ProgC

Orig. Size

768711

610856

102400

111261

513216

53161

82199

39611

Compressed

441063

363324

77893

75838

67891

34947

50001

27380

After BWT

768823

610908

102413

111274

513255

53174

82212

39624

Compressed

270230

191985

65761

32180

52200

18804

28158

14403

Table 3-1 Effect on compression after we performed a BWT,

without usual source-coding phase, on the input file.

 

The straightforward approach for character removal is to do a single sweep of the original file for each individual character.  During each sweep, a character is removed. Thus, after each sweep, the original file requires modification.  This results in a time complexity of O(S*n) where S is the number of characters to be encoded and n is the length of the file.  Furthermore, this approach uses a memory space of 2n, as we must create a separate modified version after each sweep.

In our approach, we use an array of size S that holds a counter for each character in the same order as the character removal.  The encoder scans the file one character at a time and searches downward in the list until it finds the match to this character.  On the way down the array, each counter it encounters is incremented.  When it reaches the match, the counter value is stored as the next distance and we reset the counter to zero.  Because the distance counters for other characters are not incremented, it is figuratively “blind” to all characters that would have been removed before its own removal.  This optimized removal scheme allows us to gather all the distance information in one pass of the file and changes the memory complexity to O(n).  If a frequency model was pre-established, for instance, if the file to be compressed was English text, it is then theoretically possible to accomplish streaming during the compression phase because we do not need to do any preprocessing.  The structure of the array allows for an efficient hardware implementation using S processors in a cascading parallel configuration.

To deal with the issue of bit encoding, we write every number as a byte or sequence of bytes to the compressed file and run an arithmetic encoder after the file has been compiled.  To deal with values larger than 254, we have a flag value, 255, that states that the next string of bytes is what we refer to as a “compound” number.  Let us denote N as some compound number.  To write this value we divide and output 255, and then we output N  mod 255.  Then we set N to .  We keep repeating this process until N is smaller than 255, at which point we output N.  When the decoder reads the sequence, it knows to stop when an odd position of the sequence is not 255, specifying the end of the sequence.

 

Example of compound number encoding:

N = 65875                   output flag value 255, output 85 (N mod 255)

N = 258 ()  output flag value 255, output 3

N = 1                           output 1

 

Compound number stored as: 255, 85, 255, 3, 1

 

4.  Variants

 

            The first variant we will discuss utilizes a minimum offset value for the distances.  The encoder scans the list of distances for each character, records the minimum value, and adjusts all the distances by subtracting this value.  If there is only one distance for the character, we do not calculate its offset.  We observed that the offset values for the first 50-75% characters are zero, thus an offset value can be specified, called the zero offset, for when the first nonzero offset occurs.  It may be possible to start from the last character and work backwards, denoting when the first zero offset occurs.  It would miss some nonzero offset values, but it would ensure no zero offsets.  Table 4-1 illustrates the implementation of the offset variant, where the zero ratio is the number of zeroes and the number of total offsets.

 

Files

Book1

Book2

Geo

Paper1

Pic

Obj1

Size

270234

191988

65837

18804

52221

12974

Compressed Size Difference

-4

-3

-76

0

-21

-5

Zero offset

73

87

103

90

48

117

Zero ratio

76/81

95/98

183/255

91/93

62/131

185/254

Table 4-1 Results of minimal offset variant. Neg. values denote loss in compression.

 

            The next variant reverses the order of character elimination, removing the most frequent characters first.  Though the advantage of just encoding the number of instances for the most frequent character is lost, the number of large distance values, greater than 255, are reduced by 50%.  Thus, we use less flag bytes to encode the file improving overall compression by .5%.  All files except bib and pic benefited from the reversal of the character elimination, while geo had the largest increase in compression of 9%.

 

4a.  Pair Variants

            We can modify the algorithm further to deal with pairs of characters instead of individual characters.  We refer to any combination of two characters as a pair.  The following sub-variants are variations of different pair removal schemes.

            In one adaptation of the pair removal scheme, we interleave the pair and single character removal.  Beginning with the least frequent characters/pairs for the removal, we follow the standard algorithm.  When a pair and single character have an equal number of occurrences, we give preference to single character removal to shrink the size of the available alphabet and increase the speed of pair removal.  After each removal, we recalculate the frequency data for both pairs and individual characters.  We observed that if the pair removal was done from least to most frequent while recounting, it was almost guaranteed that their existed a pair that occurred only once.  This resulted in a compressed file consisting of pair removals that had only one instance, which in turn resulted in data expansion.

When we remove the most frequent pairs first, this results in a worse compression ratio than the single character algorithm.  Instead of calculating the pair frequency once, this variant requires that we recalculate the pair frequencies after each removal.  We adjust any newly created pairs after a pair deletion with a rescan of the pair frequencies.  The pair removal continues until we compress the entire file.  We made no special considerations when dealing with file sizes that were not even.  Interestingly enough, performing a BWT on geo followed by this pair removal resulted in an even poorer compression ratio than if the BWT had not been performed.  Table 4-2 shows the compressed file size using the pair method illustrated above and how much larger these files were from the single character elimination set.

 

File

Bib

Book1

Book2

Paper1

Geo

Geo

(No BWT)

Size

52509

382720

276848

35137

92909

89058

Larger by

10733

59920

41429

13783

23516

19665

Table 4-2 Results of Pair removal variant

 

            A variant to the above stated approach is to have a hybrid of both the recounting pair method and the single character elimination method.  In the hybrid approach, we specify a cut-off point to stop doing pair removals, and process the remaining file with the single character elimination approach.  It was observed that the pair counts had an exponential distribution curve, where most counts were five or below.  With a cutoff value of one, the hybrid method showed an increase in compression over the pair method by 1k in most files.  The factor of reduction with a cutoff of one is doubled at five and tripled at twenty.  Table 4-3 illustrates typical hybrid elimination with a cut-off point of one. As the table reveals, the pair <space><space> does not exist until we removed other pairs showing the need to recalculate pair frequencies. 

 

Character Removed

Distances

Resulting String

Starting string

-

the hook he found at heather’s house

he

1, 6, 9, 2

t hook  used at atr’s house

ho

2, 18

t ok  used at atr’s use

at

11, 1

t ok  used  r’s use

<space><space>

4, 4

t okusedr’s use

us

4, 6

t okedr’s e

e

4, 5

t okdr’s

<space>

1, 6

tokdr’s

d

3

tokr’s

k

2

tor’s

o

1

tr’s

r

1

t’s

s

2

t’

t

0

0

N/A

Table 4-3 an example of pair hybrid elimination where single Character removal

begins when the number of pair occurrences is no longer greater than 1.

 

            In our last approach to pair removal, we utilize one-pass pair elimination.   We remove a pair from the list of possible pairs after we remove it from the file. When the list of pairs is empty, we then proceed to single character removal.  The list is resorted after each removal and we remove the least frequent pair.  We observed that the pairs were not independent of each other.  Thus, removing a less frequent pair affected the frequency of the more frequent pairs.  This explains why pair removal similar to ones mentioned in this paper cannot improve compression over the single character removal method.

 

4b.  Lossy Variants

            There is a lossy variant of the hybrid algorithm in which we prescan the input file and change all upper-case characters to lower-case after performing the BWT.  We can only apply this modification to natural language text, in which case the conversion does not affect readability.  As table 4-4 shows, the number of characters that are changed to lower-case is close to the improvement in compression.  Unfortunately, the increase in compression is not large enough to justify the number of changes to the file.  We can also apply this variant to single character elimination, but the gains are less pronounced, as seen in table 4-5.

 

Lower Case

Book1

Book2

Bib

Geo

File size

366028

253980

64933

80868

Improvement

14806

10813

N/A

1694

# of instances

16336

12847

15247

17564

Table 4-4 Pair lower-case compression results

 

File

Book1

Book2

Bib

Geo

Size

262914

186120

30164

63950

Improvement

7316

5865

2016

1811

# Changed to Lowercase

16336

12847

15247

17564

Table 4-5 Single lower-case compression results

 

            The last lossy variant involves manipulating the distances to improve compression.  We found that simply having a cutoff point for distances was ineffective.  If we removed a distance, we would have to add its value to the next distance to maintain integrity.  This of course would be above the cutoff limit and would thus remove all following distances.


5.  Performance of Implementation

 

            Table 5-1 shows the average time needed to compress the specified Calgary corpus file.  This time is representative of the encoding processes only and excludes the time needed to perform the BWT as well as the arithmetic encoding at the backend.  These times were compiled using a Pentium III EB 933Mhz on an ATA100 7200RPM Hard Drive, and represent the computation as well as disk I/O to write the processed file.  We could improve the speed of this algorithm, coded and compiled in JAVA, if we coded it in a faster language such as C/C++.

 

Time (ms)

Book1

Bib

Pic

Geo

Book2

ProgC

Paper1

Paper2

Encode

5723

770

2253

1077

4777

244

510

623

Decode

9283

1267

8717

2617

7857

400

847

1043

Table 5-1 Average run time of algorithm, excluding BWT and arithmetic encoder.

           

Table 5-2 lists the files used from the Calgary Corpus with their uncompressed size, their compressed size, and their bit to byte ratio.  We obtained these values using a BWT with a block size of 200,000 characters.  Burrow’s and Wheeler’s paper on the BWT1 shows that the larger block size, when used in conjunction with algorithms such as the one presented in this paper, will result in better compression.  Using the Calgary Corpus achieves a weighted bit ratio of 2.89:8.  This algorithm reduces a file, on average, to 31.31% of its original size.

 

File

Size (bytes)

Presented Algorithm

GZIP

BZIP2

UNIX

Compress

Bib

111261

2.31

2.52

1.97

3.35

Book1

768771

2.81

3.26

2.42

3.30

Book2

610856

2.51

2.71

2.06

3.29

Geo

102400

5.14

5.35

4.45

6.01

Paper1

53161

2.83

2.80

2.49

3.77

Paper2

82199

2.74

2.90

2.44

3.52

Paper3

46526

3.05

3.11

2.72

3.81

Paper4

13286

3.53

3.33

3.12

4.19

Paper5

11954

3.65

3.34

3.24

4.40

Paper6

38105

2.93

2.78

2.58

3.92

Progl

71646

2.02

1.82

1.74

3.03

ProgC

39611

2.91

2.68

2.53

3.87

Progp

49379

2.05

1.82

1.74

3.11

News

377109

3.01

3.07

2.52

3.90

Obj1

21504

4.82

3.84

4.01

5.23

Obj2

246814

3.06

2.65

2.48

4.17

Trans

93695

1.84

1.62

1.53

3.27

Pic

513216

0.82

0.88

0.78

0.97

Total

3251493

2.50

2.62

2.13

3.30

Table 5-2 File sizes and Bits per Byte ratios for Calgary corpus

 

6.  Related Algorithms

 

            In P. Subrahmanya’s paper on An Asymptotically Optimal Data Compression Based on an Inverted Index, a compression technique for use as an inverted file was presented with a similar approach as the algorithm presented in this paper.  Subrahmanya begins by creating a prefix dictionary to parse the file.  He then uses this dictionary to scan through the file creating a list of locations for every dictionary entry and then proceeds to run-length encode the list of offsets.  Table 6-1 illustrates an example of the resulting list.

 

Table 6-1 (Figures from Subrahmanya’s paper)

 

            The difference between Subrahmanya’s approach and the one presented in this is paper is as follows: first, Subrahmanya uses a dictionary, which adds the functionality as an inverted file, whereas we use single characters to encode.  The effect of run-length encoding the locations results in an offset from the last match.  Though this appears similar to the single character elimination method, it must be mentioned that using a dictionary with the algorithm presented in this paper would denote how many dictionary matches there were from the last match, not the number of characters.  Furthermore, since we are not concerned with creating a searchable compressed file, we are free to reorganize the data with a BWT to increase performance.  Lastly, Subrahmanya does not utilize the theoretical removal of less frequent dictionary entries to produce smaller location values.  This may destroy the searchability of his compressed file, since the file would have to be decompressed to point to the appropriate position in the uncompressed file.  If we make certain assumptions about the dictionary entries, it may be possible to use the aforementioned method at some cost to the accuracy; the compressed file would point to the general area of the dictionary entry in the uncompressed file, instead of the exact location.

 

7.  Further Plans

 

Although we have presented data on the effectiveness of the algorithm, there are still many avenues to explore which could improve its performance.  We could achieve further compression by using a BWT with a block size equal to the size of the file.  Burrows’ and Wheeler’s paper illustrates that increasing the block yields significant improvements in compression up to a block size of 64M.  Specifically, an increase in block size from 256K to 750K for Book1 resulted in an additional 2.4% compression.     

Furthermore, if we could find a better mechanism with smaller overhead, then the one presented in this paper, for dealing with values greater than 255, this would result in further compression.  The results of the reversal of the algorithm, presented in the variant section, illustrated that the current large integer scheme has a large amount of overhead, such as to offset any gains from encoding just the number of instances for the most frequent character.

Lastly, we currently do not discriminate between two characters that have the same frequency.  It may be possible to see an improvement in compression by eliminating the character with the smaller combined total of distances.  This would result in smaller distance values for the other character that may improve compression.

 

8.  Conclusion

           

            We have presented an algorithm that works well on files in which the there is a non-uniform distribution of the characters, such as text.  This algorithm achieves more compression than some publicly available compression programs.  However, a superior compression ratio is not the algorithm’s primary asset.  Its ability to encode and decode on the fly makes it a potent tool for Internet transfers, thus giving it a practical utility over and above other lossless data compression techniques.

 

References

 

[1]        M. Burrows and D.J. Wheeler.  A Block-sorting Lossless Data Compression Algorithm.  SRC Research Report. May 1994.

 

[2]        Calgary Corpus.  http://compression.ca/act-calgary.html

 

[3]        Mark Nelson.  BWT and Arithmetic encoder source code.  http://web2.airmail.net/markn.

 

[4]        P. Subrahmanya.  An Asymptotically Optimal Data Compression Based on an Inverted Index (Abstract).  Data Compression Conference Proceedings. 1999.

 

[5]        P. Subrahmanya.  An Asymptotically Optimal Data Compression Based on an Inverted Index.  Personal Contact.

 

 

I would like to thank all my friends who helped me prepare this paper.