pouët.net

LZW encoding ideas

category: code [glöplog]
 
I know someone has done LZW compression before.
I want to know if anyone has any thoughts about this.

Instead of the well-known 256 initial entries, only two entries are known at the start: 0 and 1, i.e. plain bits.
I know the dictionary size can be 12 bits (4096 entries), for example.
Would it therefore not be useful to build a bit-based dictionary like this?

Of course the entries will fill up very fast for bigger data, but let's say we are restricted to compressing small files, like tinytros for example.

Code:
.--------------------------------------------------------------------
|data             |cur: |in dict?|encode: |dict:
|-----------------|-----|--------|--------|--------
|0                |0    |yes     |--      |--
|01               |01   |no      |01      |01 (2)
|011              |1    |yes     |--      |--
|0111             |11   |no      |11      |11 (3)
|01111            |1    |yes     |--      |--
|011110           |10   |no      |10      |10 (4)
|0111101          |1    |yes     |--      |--
|01111011         |11   |yes     |--      |--
|011110110        |110  |no      |110     |110 (5)
|0111101100       |0    |yes     |--      |--
|01111011001      |01   |yes     |--      |--
|011110110011     |011  |no      |011     |011 (6)
|0111101100111    |1    |yes     |--      |--
|01111011001111   |11   |yes     |--      |--
|011110110011110  |110  |yes     |--      |--
|0111101100111101 |1101 |no      |1101    |1101 (7)
`--------------------------------------------------------------------
raw stream:     0111101100111101
encoded stream: 01 + 11 + 10 + 110 + 011 + 1101 = 0111101100111101
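The trace above can be reproduced with a small sketch (Python, purely for illustration; the phrase list and leftover tail correspond to the table's "encode" and "cur" columns):

```python
def bit_lzw_trace(bits):
    """Parse a bit string the way the table does: grow `cur` bit by bit;
    as soon as `cur` is not yet in the dictionary, emit it as a phrase,
    add it as a new entry, and start over with an empty `cur`."""
    dictionary = {"0": 0, "1": 1}  # the only two entries known up front
    phrases, cur = [], ""
    for b in bits:
        cur += b
        if cur not in dictionary:
            dictionary[cur] = len(dictionary)
            phrases.append(cur)
            cur = ""
    return phrases, cur, dictionary  # `cur` is the leftover tail, if any

phrases, tail, d = bit_lzw_trace("0111101100111101")
print(phrases)      # ['01', '11', '10', '110', '011', '1101'] -> identical to the raw stream
print(repr(tail))   # '' -> the last phrase was new, nothing left over
```

With the shortened input 011110110011110 the same function leaves tail = "110", which is already in the dictionary as entry (5).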

The raw and encoded streams are identical!? (So no compression is done.)
If the last bits read from the raw-data stream are already in the dictionary, we can output
the dictionary entry instead and the encoded stream will be smaller.

Let's remove the last line in that table.
So line 15 is now the last line:
011110110011110 | 110 | yes | -- | --

Since we know these were the last bits, we append entry (5), 110, which was already in the dictionary, to the end of the encoded stream.
But this requires that the index is smaller than what it replaces in the encoded stream.
The entries (0) and (1), which we already know, and (2),(3),...,(7) 'should' each use three bits in the example above.
For larger bitstreams this number will change. But for smaller bitstreams, with a fixed dictionary size
that does not contain all possible combinations of entries, compression is possible by analysis beforehand and
by searching the space of possible bits with reversible cellular automaton rules.
In an unknown number of cases the last entry is already in the dictionary and more compressible.
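For the truncated table, the leftover 110 sits in the dictionary as entry (5). A minimal sketch of writing it as a fixed-width index (assuming 3-bit indices, as suggested above; `tail_code` is a hypothetical helper, not part of any existing library):

```python
# Dictionary state after the truncated trace, taken from the table above.
dictionary = {"0": 0, "1": 1, "01": 2, "11": 3, "10": 4, "110": 5, "011": 6}

def tail_code(tail, dictionary, index_bits=3):
    """Encode the leftover tail as a fixed-width dictionary index.
    This only saves bits when the tail is longer than index_bits."""
    code = format(dictionary[tail], f"0{index_bits}b")
    saved = len(tail) - index_bits
    return code, saved

code, saved = tail_code("110", dictionary)
print(code, saved)  # '101' 0 -> entry 5 in three bits; no bits saved in this case
```

As the post says, this alone does not compress much: a 3-bit tail replaced by a 3-bit index breaks even, and only longer in-dictionary tails actually gain anything.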

So in a very few cases compression is possible even for smaller files,
like tinytros.

One should at least test cellular automaton rules with periodic boundary conditions, like for example rule 85,
to transform the bits and see if the result gets more compressible. Rule 85 acts like a rotate left plus inverting the bits.
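A quick sketch of rule 85 on a bit string with periodic boundary, to check the "rotate left and invert" description (Python, purely illustrative; rule 85 makes each cell the complement of its right neighbour):

```python
def rule85(bits):
    """Elementary CA rule 85 with periodic boundary conditions:
    each cell becomes the complement of its right neighbour,
    i.e. rotate the bit string left by one, then invert every bit."""
    rotated = bits[1:] + bits[0]  # rotate left by one (periodic boundary)
    return "".join("1" if b == "0" else "0" for b in rotated)

print(rule85("0111101100111101"))  # '0000100110000101'
```

Because rotation and inversion both undo themselves, the rule is reversible (invert, then rotate right), so no information is lost before handing the transformed stream to the compressor.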

If anyone has tried something like this, I'd like to know.
Thanks.
added on the 2015-10-23 03:00:28 by rudi rudi
I tried something similar to what you are describing here (if I understood correctly; your description is a bit messy): converting bytes to eight 0/1 bytes, i.e. compressing a bit stream instead of a byte stream to see if that compresses better, but I used deflate instead of LZW.

It did not create smaller packed output; the difference was on average around 5-10% worse. To add insult to injury, the decompressed stream then still needs to be processed...

Maybe it would work better in LZW, who knows. Although I suspect it will produce similar results; like you said, the dictionary grows (too) fast. For deflate it does not matter much if the symbols 2-255 are unused; the symbols from 256 upwards are present and needed to do the actual compression.
added on the 2015-10-23 07:39:06 by ts ts
I think his idea is to consider the input as a bit stream, not as a stream of octets, so every input bit goes through the compression separately.
added on the 2015-10-23 14:35:35 by sol_hsa sol_hsa
If I was being a bit vague, it was a bit late last night when I wrote this.
Yes, the encoder does not know the dictionary bit-length at compile time; instead it builds the dictionary at run time from the input bits (0,1), and the dictionary grows rapidly.

There's some chance (maybe 50/50) that the last bits that are encoded are _not_ in the dict and will therefore not be compressible, because of the nature of the algorithm.
The last phrase may or may not be a repetition (that is, it may be a new dictionary entry or an entry already in the dict). If it is already in the dict, it is compressible based on the last entry's bit-length. This in itself does not compress the data very much, but the idea is for it to be an iterative algorithm, such that doing transforms on the data will actually compress it until it cannot be compressed further. That was the idea.
added on the 2015-10-23 15:46:23 by rudi rudi
Also, I don't want to confuse terms here: LZW, afaik, traditionally has fixed-length codes at startup. The algorithm that I mention in this post, and would like to explore further, does not have that. So the dictionary lookups have to be created dynamically during encoding. It may be that the dictionary entries need their own format, for all I know yet.
added on the 2015-10-23 15:58:24 by rudi rudi
Too bad that you can't @blueberry here, because he knows a lot about compression :)
added on the 2015-10-23 16:07:08 by emoon emoon
I don't think going down to bit level is going to improve anything. 256 symbols come in really useful here, as in both x86 and HLSL/GLSL code many of these symbols are completely unused (e.g. non-ASCII codes in shaders) and many others can be quite accurately predicted (e.g. after a dot in a shader you are very likely to have 0-9 or x/y/z/w).

While most of that can be done at bit level, you'll get some obscuring noise in the prediction routine because you lose byte sync - some perfect predictions can be destroyed by other codes which appear the same after bit shifting.
added on the 2015-10-23 16:52:16 by KK KK
