arbtt capture logs' compressibility

Gwern Branwen gwern at gwern.net
Fri Jan 2 22:11:11 CET 2015


I was wondering how well the append-logs work for removing redundancy
and how compressible capture logs still are, so I measured how several
compressors (zip, gzip, bzip2, xz) perform on my 404MB (423,485,623
bytes) of logs, captured since 2012 with varying interval settings.

It looks like even with the weakest compression (gzip -1), my logs
compress to about 1/5th the size (0.221 of the original), and with the
strongest reasonably available compression (xz -9 --extreme; barring
exotic novelties like ZPAQ), down to about 1/7th the size (0.147).

Table of results:

    Method   Setting  Size (bytes)  Ratio  Time (mm:ss)
    -------- -------- ------------  -----  ------------
    (none)               423485623  1.000  00:01.80
    gzip     min          93571261  0.221  00:06.76
    gzip     default      83031289  0.196  00:14.22
    gzip     max          80882789  0.191  01:31.65
    zip      min          93571387  0.221  00:07.41
    zip      default      83031415  0.196  00:15.16
    zip      max          80882915  0.191  01:37.52
    bzip2    min          75065976  0.177  00:49.55
    bzip2    max          71075812  0.168  00:54.65
    xz       min          66347932  0.157  05:09.73
    xz       default      63307572  0.150  08:01.83
    xz       max          62339916  0.147  10:10.62
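
The fourth column of the table is just the compressed size divided by
the uncompressed size (423485623 bytes). As a minimal sketch of that
arithmetic, assuming awk is available; the `ratio` helper is a name
made up for this example and is not part of arbtt or any of the
compressors:

```shell
# ratio SIZE ORIGINAL: print SIZE / ORIGINAL rounded to three decimals.
# (Hypothetical helper for illustration only.)
ratio() {
    awk -v size="$1" -v orig="$2" 'BEGIN { printf "%.3f\n", size / orig }'
}

ratio 93571261 423485623    # gzip -1
ratio 62339916 423485623    # xz -9 --extreme
```

The two calls reproduce the weakest and strongest rows of the table.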

Shell commands:

    # 404M total
    $ du -c *.log
    255692    2012-2013.log
    47572    2013-2014.log
    108148    2014.log
    2168    capture.log
    413580    total

    # time to read off SSD:
    $ cat *.log | time wc --bytes
    423485623
    0.00user 0.12system 0:01.80elapsed 7%CPU (0avgtext+0avgdata 1844maxresident)k
    0inputs+0outputs (0major+86minor)pagefaults 0swaps

    ## GZIP
    # min
    $ cat *.log | gzip -1 --stdout - | time wc --bytes
    93571261
    0.01user 0.04system 0:06.76elapsed 0%CPU (0avgtext+0avgdata 1928maxresident)k
    0inputs+0outputs (0major+87minor)pagefaults 0swaps
    # default
    $ cat *.log | gzip -6 --stdout - | time wc --bytes
    83031289
    0.00user 0.04system 0:14.22elapsed 0%CPU (0avgtext+0avgdata 1908maxresident)k
    0inputs+0outputs (0major+85minor)pagefaults 0swaps
    # max
    $ cat *.log | gzip -9 --stdout - | time wc --bytes
    80882789
    0.00user 0.06system 1:31.65elapsed 0%CPU (0avgtext+0avgdata 1796maxresident)k
    0inputs+0outputs (0major+82minor)pagefaults 0swaps

    ## ZIP
    # min
    $ cat *.log | zip -1 - - | time wc --bytes
      adding: - (deflated 78%)
    93571387
    0.01user 0.03system 0:07.41elapsed 0%CPU (0avgtext+0avgdata 1864maxresident)k
    0inputs+0outputs (0major+84minor)pagefaults 0swaps
    # default
    $ cat *.log | zip -6 - - | time wc --bytes
      adding: - (deflated 80%)
    83031415
    0.00user 0.05system 0:15.16elapsed 0%CPU (0avgtext+0avgdata 1860maxresident)k
    0inputs+0outputs (0major+83minor)pagefaults 0swaps
    # max
    $ cat *.log | zip -9 - - | time wc --bytes
      adding: - (deflated 81%)
    80882915
    0.00user 0.05system 1:37.52elapsed 0%CPU (0avgtext+0avgdata 1920maxresident)k
    0inputs+0outputs (0major+85minor)pagefaults 0swaps

    ## BZIP2
    # min
    $ cat *.log | bzip2 -1 --stdout --compress --quiet | time wc --bytes
    75065976
    0.03user 0.06system 0:49.55elapsed 0%CPU (0avgtext+0avgdata 1864maxresident)k
    0inputs+0outputs (0major+84minor)pagefaults 0swaps
    # max (default)
    $ cat *.log | bzip2 -9 --stdout --compress --quiet | time wc --bytes
    71075812
    0.00user 0.05system 0:54.65elapsed 0%CPU (0avgtext+0avgdata 1932maxresident)k
    8inputs+0outputs (1major+87minor)pagefaults 0swaps

    ## XZ
    # min
    $ cat *.log | xz -0           --stdout | time wc --bytes
    66347932
    0.00user 0.03system 5:09.73elapsed 0%CPU (0avgtext+0avgdata 1924maxresident)k
    0inputs+0outputs (0major+87minor)pagefaults 0swaps
    # default
    $ cat *.log | xz -6           --stdout | time wc --bytes
    63307572
    0.00user 0.04system 8:01.83elapsed 0%CPU (0avgtext+0avgdata 1800maxresident)k
    # max
    $ cat *.log | xz -9 --extreme --stdout | time wc --bytes
    62339916
    0.00user 0.03system 10:10.62elapsed 0%CPU (0avgtext+0avgdata 1840maxresident)k
    0inputs+0outputs (0major+83minor)pagefaults 0swaps
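
For anyone who wants to repeat this on their own logs, the individual
commands above could be rolled into one loop. This is only a rough
sketch, not what I actually ran: `bench_sizes` is a made-up name, it
assumes the logs are `*.log` files in a single directory, and it times
each run with `date +%s` (whole seconds only, and not guaranteed by
POSIX, though GNU and BSD date both support it) instead of `time`. zip
is left out because in the results above it only adds 126 bytes of
archive overhead to the gzip sizes.

```shell
# bench_sizes DIR: compress DIR/*.log with each compressor and level,
# printing one tab-separated line per run: command, bytes, seconds.
# (Hypothetical helper; the name and output format are assumptions.)
bench_sizes() {
    dir=$1
    for cmd in "gzip -1" "gzip -6" "gzip -9" \
               "bzip2 -1" "bzip2 -9" \
               "xz -0" "xz -6" "xz -9 --extreme"; do
        # Skip compressors that are not installed.
        command -v "${cmd%% *}" >/dev/null 2>&1 || continue
        start=$(date +%s)
        # $cmd is left unquoted on purpose so that it splits into the
        # program name and its flags (sh/bash behaviour).
        bytes=$(cat "$dir"/*.log | $cmd --stdout | wc -c | tr -d ' ')
        end=$(date +%s)
        printf '%s\t%s\t%s\n' "$cmd" "$bytes" "$((end - start))"
    done
}
```

Running `bench_sizes .` in the directory holding the logs prints one
line per compressor setting; dividing the byte counts by the
uncompressed total gives the ratios in the table.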

-- 
gwern
http://www.gwern.net



