Mandriva

LZMA Compression

A few months back I proposed the use of lzma compression over bzip2, but as that was right before the release of 2007 Spring, it was just not the right time...

Now OTOH cooker is fully active again and it's the appropriate time for making decisions and changes regarding this. :)

I proposed the change on the cooker & maintainers lists yesterday and people seem to be mostly positive; everyone seems to agree that it's at least better than the current bzip2 compression.

The reason is that lzma is faster to decompress, achieves better compression and can also result in lower memory usage.

At Gustavo's request I made some comparisons on my old & slow Blade 100 UltraSPARC IIi 500 MHz to make the difference more obvious:

(bzip2 -9 is standard for man pages)
[peroyvind@blade100 SPECS]$ MANPAGER='true' time man ./bash.1.bz2
4.43user 0.20system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+6454minor)pagefaults 0swaps

(lzma -5 is used to get slightly lower memory usage than bzip2 while still compressing better; the compression level shouldn't affect decompression time for lzma, though)
[peroyvind@blade100 SPECS]$ MANPAGER='true' time man ./bash.1.lzma
1.66user 0.15system 0:01.92elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (12major+4923minor)pagefaults 0swaps

[peroyvind@blade100 SPECS]$ MANPAGER='true' time man ./zgrep.1.bz2
0.54user 0.11system 0:00.66elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+5436minor)pagefaults 0swaps

[peroyvind@blade100 SPECS]$ MANPAGER='true' time man ./zgrep.1.lzma
0.46user 0.11system 0:00.58elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+4786minor)pagefaults 0swaps

Gustavo also did some further investigation, which included a size comparison:

-rw-r--r-- 1 spuk spuk  31315707 Jun 7 16:25 manpages.gz (gzip -9)
-rw-r--r-- 1 spuk spuk  17808514 Jun 7 16:33 manpages.lzma (lzma -5)
-rw-r--r-- 1 spuk spuk  22764006 Jun 7 16:35 manpages.bz2 (bzip2 -9)
-rw-r--r-- 1 spuk spuk 115609592 Jun 7 16:37 manpages
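
To put those byte counts in proportion, here is a small Python sketch (not part of the original measurements; only the four sizes are taken from the listing above, the names are made up for illustration) that works out each file's size relative to the uncompressed original:

```python
# Sizes in bytes, copied from the directory listing above.
sizes = {
    "gzip -9": 31315707,
    "lzma -5": 17808514,
    "bzip2 -9": 22764006,
}
uncompressed = 115609592


def ratio(compressed, original=uncompressed):
    """Compressed size as a fraction of the original size."""
    return compressed / original


# Smallest file first.
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name:9s} {size:>10d} bytes  {ratio(size):6.1%} of original")

# How much smaller the lzma file is than the bzip2 one.
saving_vs_bzip2 = 1 - sizes["lzma -5"] / sizes["bzip2 -9"]
print(f"lzma -5 is {saving_vs_bzip2:.1%} smaller than bzip2 -9")
```

By this arithmetic the lzma file is a bit over a fifth smaller than the bzip2 one.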

$ find /usr/share/man -name '*.bz2' | xargs cat | wc -c
40741331

(Compressing everything in a single file gives better compression, of course.)
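
Why a single stream wins can be seen in a rough Python sketch. Everything here is an assumption for illustration: the pages are synthetic troff-like text rather than real man pages, and Python's bz2 and lzma modules stand in for the command-line tools, so only the direction of the result is meaningful, not the numbers.

```python
import bz2
import lzma
import random

# Hypothetical stand-in data: small pages that share boilerplate, roughly the
# way real man pages share macros and section headings.
BOILERPLATE = (
    '.TH EXAMPLE 1 "June 2007" "demo" "User Commands"\n'
    ".SH NAME\n.SH SYNOPSIS\n.SH DESCRIPTION\n.SH OPTIONS\n"
    ".SH SEE ALSO\n.SH AUTHOR\n.SH COPYRIGHT\n"
)
WORDS = ("file option output input print display the a of to and "
         "specified default directory pattern command line exit status").split()


def make_pages(count=50, words_per_page=300, seed=0):
    """Build `count` small synthetic pages as bytes."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    pages = []
    for _ in range(count):
        body = " ".join(rng.choice(WORDS) for _ in range(words_per_page))
        pages.append((BOILERPLATE + body + "\n").encode("ascii"))
    return pages


def compare(pages):
    """Return {codec: (sum of per-file sizes, size of one combined stream)}."""
    codecs = {
        "bzip2 -9": lambda data: bz2.compress(data, compresslevel=9),
        # FORMAT_ALONE is the legacy .lzma container; preset=5 mirrors "lzma -5".
        "lzma -5": lambda data: lzma.compress(
            data, format=lzma.FORMAT_ALONE, preset=5),
    }
    combined = b"".join(pages)
    return {
        name: (sum(len(fn(page)) for page in pages), len(fn(combined)))
        for name, fn in codecs.items()
    }


for name, (separate, single) in compare(make_pages()).items():
    print(f"{name}: {separate} bytes as separate files, "
          f"{single} bytes as one stream")
```

Each separately compressed file pays for its own header and has to relearn the shared vocabulary from scratch, which is why the per-file total comes out larger.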

$ time lzmadec /dev/null
3.76user 0.06system 0:04.41elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+772minor)pagefaults 0swaps

$ time bzcat /dev/null
14.78user 0.12system 0:17.67elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1074minor)pagefaults 0swaps

$ time zcat /dev/null
1.35user 0.06system 0:02.52elapsed 55%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+588minor)pagefaults 0swaps

So it seems to be a very nice improvement: not that much for size compared to bzip2, but a lot when it comes to time, where it's almost as fast as gzip while compressing way better (although some people seem to care so little that they're obsessed with not wanting any compression at all ;p)!
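
Anyone wanting to repeat this kind of comparison on their own machine could start from a sketch like the following. It is only an approximation of the tests above: it uses Python's gzip, bz2 and lzma modules instead of zcat, bzcat and lzmadec, and the sample text is a made-up stand-in for the concatenated man pages, so absolute timings will differ from the ones quoted.

```python
import bz2
import gzip
import lzma
import time


def make_sample(repeats=2000):
    """Hypothetical stand-in for concatenated man pages."""
    lines = [
        f".SH SECTION {i}\nOption number {i % 26} sets value {i * 7919 % 1000}.\n"
        for i in range(repeats)
    ]
    return "".join(lines).encode("ascii")


# Each entry: (compress function, decompress function).
CODECS = {
    "gzip -9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
    "bzip2 -9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
    # FORMAT_ALONE is the legacy .lzma container; preset=5 mirrors "lzma -5".
    "lzma -5": (lambda d: lzma.compress(d, format=lzma.FORMAT_ALONE, preset=5),
                lzma.decompress),
}


def benchmark(data, rounds=5):
    """Return {codec: (compressed size, best decompression time in seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            unpacked = decompress(packed)
            best = min(best, time.perf_counter() - start)
        if unpacked != data:
            raise ValueError(f"{name}: round trip changed the data")
        results[name] = (len(packed), best)
    return results


for name, (size, seconds) in benchmark(make_sample()).items():
    print(f"{name:9s} {size:>8d} bytes  {seconds * 1000:8.3f} ms to decompress")
```

Taking the best of several rounds reduces noise from whatever else the machine is doing; on a small sample the differences may still be too small to be meaningful.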

To make it work, though, support needs to be implemented. I implemented it in lesspipe earlier, which used to take care of handling compressed formats for man, but now it seems that man takes care of this by itself. Not a big problem though: implementing support for it was quite simple (it was also sent to and accepted upstream :). The same goes for info (from texinfo), where I also implemented support yesterday. I'm not entirely certain about the correctness of my patch for install-info, but then again I didn't quite get how it's used there, nor does it seem like support for it there is something we actually use or need..
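
The general idea behind such support, picking a decompressor from the file suffix, can be illustrated with a short Python sketch. This is not the actual patch (man and texinfo are C programs and their code is not shown here); the function names and the suffix table are assumptions made up for the example.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Map a file suffix to a function that opens the file for reading bytes.
OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".lzma": lzma.open,  # when reading, lzma.open auto-detects the container
}


def read_page(path):
    """Return the decompressed bytes of a page, chosen by file suffix."""
    suffix = os.path.splitext(path)[1]
    opener = OPENERS.get(suffix, open)  # unknown suffix: treat as uncompressed
    with opener(path, "rb") as handle:
        return handle.read()


def demo():
    """Write one hypothetical page in each format and read them all back."""
    text = b".TH DEMO 1\n.SH NAME\ndemo \\- hypothetical page\n"
    writers = {
        ".gz": gzip.open,
        ".bz2": bz2.open,
        # Write the legacy .lzma container rather than the default .xz one.
        ".lzma": lambda p, m: lzma.open(p, m, format=lzma.FORMAT_ALONE),
        "": open,
    }
    pages = {}
    with tempfile.TemporaryDirectory() as tmp:
        for suffix, writer in writers.items():
            path = os.path.join(tmp, "demo.1" + suffix)
            with writer(path, "wb") as handle:
                handle.write(text)
            pages[suffix] = read_page(path)
    return text, pages
```

Adding a new format then only means adding one entry to the table, which matches the point that the change itself is small.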

Not big news for everyone, but fun for those of us who like to squeeze the very best out of even the tiniest things, even if most others don't care. I know at least that this includes both me and Austin. :o)


dvalin - LZMASupport
Version 1.12 last modified by Arkub on 10/06/2007 at 10:32

Comments (5)

awilliamson | 08.06.2007 at 06:29 PM
There's also Spuk's second test, which shows that in a more realistic scenario (compressing single manpages into separate archives, rather than a big clump of them in a single archive), gzip and lzma perform almost identically. Given this, I think it's more sensible to just use gzip, for reasons given in the thread.

dexter11 | 09.06.2007 at 12:45 PM
Can this LZMA be used together with XAR (http://code.google.com/p/xar/), and used in packaging? That would be a more sensible way to use a better archiver than to compress man pages IMHO.

proyvind Karlsen | 10.06.2007 at 06:59 AM
No, spuk didn't do such a test. Andreas OTOH did, but he compared bzip2 vs lzma, not gzip.

Also, the current lzma format isn't optimal, especially not wrt small text files. I've recently got involved with the project, and when the next release is done and the format finalized, it will compress text files much better as well as being ported to C (lzma being written in C++, leading to additional dependencies, is the only valid complaint so far IMO). I think at that time lzma will be better in all the ways that matter. :) While lzma might not be mature enough yet for compression of man & info pages or rpm payload, I think it's at least a sensible replacement for bzip2 wrt compression of source.

I'll supply comparison data for lzma vs gzip on a per-file basis soon..

dexter11: archiver != compression, i.e. archiving is done when you want to include several files in an archive; on single files there's no sense in using an archiver, just compressing the file is sufficient. Using lzma with xar should be fully possible. I'm not sure whether support has been added yet, but I'll investigate this and look into adding it if it hasn't :)


proyvind Karlsen | 10.06.2007 at 07:13 AM
Just checked out xar. Currently it'll be difficult to implement lzma support, as xar uses libraries for gzip & bzip2 to compress rather than invoking the binaries themselves. An lzma library for this is not really ready yet, so it might be better to just wait for the new lzma release before implementing support.

But you should be able to just archive with xar first, then compress/decompress it with lzma later. I don't really have any experience with xar, but looking at the source I cannot seem to find any option to output the archive to stdout; with one, you could do something like 'xar - |lzma ' to archive while compressing at the same time..
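
The archive-first, compress-later approach can be sketched in Python. Note the substitution: Python has no xar module in its standard library, so tar stands in for xar here, and the function names are invented for the example. It only illustrates the two separate steps, not how xar itself would do it.

```python
import io
import lzma
import tarfile


def archive_then_compress(files):
    """files: {name: bytes}. Return an lzma-compressed tar archive as bytes."""
    buffer = io.BytesIO()
    # Step 1: archive only, no compression (tar standing in for xar).
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    # Step 2: compress the finished archive as one stream (.lzma container).
    return lzma.compress(buffer.getvalue(), format=lzma.FORMAT_ALONE)


def decompress_then_extract(blob):
    """Reverse of archive_then_compress: return {name: bytes}."""
    raw = lzma.decompress(blob)
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
        }
```

Because the compressor only ever sees a byte stream, it needs no knowledge of the archive format, which is why the two tools can be combined without either supporting the other.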


proyvind Karlsen | 10.06.2007 at 07:49 AM
You were correct, I didn't read spuk's second test thoroughly..

Still, current lzma compression comes out a bit better, and I'm still pretty convinced that persistent space usage is of bigger importance than resource usage, since the latter doesn't matter when both are so fast (and much faster than slow bzip2) that you won't notice much difference unless you're on some really old hardware..

I'm pretty sure that the next release will make it the obvious best option :)


 


Creator: proyvind Karlsen on 2007/06/08 15:05
(c) Mandriva 2007