Listing samples which are not matched by any tags?

Wed Sep 3 21:32:09 CEST 2014

On Sat, Jul 5, 2014 at 12:44 PM, Joachim Breitner
<mail at joachim-breitner.de> wrote:
> No, but the manual is quite short on “how do I do X”. Would you be
> interesting in contributing here? I think that _not_ being a developer
> is an advantage when writing good documentation.

Maybe. I may not be an arbtt developer, but I'm still not a regular
user. Regardless, I think some of the tricks and observations I made
while working with arbtt are worth including in the manual; the manual
currently gives one little idea of how one would actually go about
using arbtt effectively. I wrote up some thoughts in Markdown (below)
- I have never used that XML stuff you are using and would probably
muck it up, so hopefully you can convert the Markdown version to XML.

If other arbtt users could mention various roadblocks and solutions
they came up with, that'd be helpful.

----

The idea is that this would be placed after 'Configuring the arbtt
categorizer (arbtt-stats)'
http://arbtt.nomeata.de/doc/users_guide/configuration.html#idp20408 -

# Effective Use of Arbtt

Now that the syntax has been described & the toolbox laid out, how
does one practically go about using and configuring arbtt?

## Enabling data collection

After installing arbtt, one needs to configure it to run. There are
many ways one can run the `arbtt-capture` daemon, but a standard way
on Unix systems would be to add it as a
[`cron`](https://en.wikipedia.org/wiki/Cron) job: for example, one
could edit one's crontab file (`crontab -e`) and add a line like this:

    DISPLAY=:0
    @reboot arbtt-capture --logfile=/home/username/doc/arbtt/capture.log

At boot, `arbtt-capture` will be run in the background and will
capture a snapshot of the X metadata for active windows every 60
seconds (the default). If one wanted more fine-grained time data at
the expense of doubling storage use per day, one could sample twice as
often with an option like `--sample-rate=30`. To be resilient
to any errors or segfaults, one could also wrap it in an infinite loop
to restart the daemon should it ever crash, with a command like

    DISPLAY=:0
    @reboot while true; do arbtt-capture --sample-rate=30; sleep 1m; done

## Checking data availability

arbtt tracks X properties like window title, class, and running
program, and one writes regexp rules to classify the strings as one
wishes; but this assumes that the necessary data is present in those
properties.

For some programs, this is the case. For example, web browsers like
Firefox typically set the X title to the `<title>` of the web page in
the currently-focused tab, which is enough for classification.

Some programs do not set titles or class, and all arbtt sees is empty
strings like ""; or they may set the title/class to a constant like
"Liferea", which may be acceptable if that program is used for only
one purpose, but if it is used for many purposes, then one cannot
write a rule matching it without producing highly-misleading time
analyses. (For example, a web browser may be used for countless
purposes, ranging from work to research to music to writing to
programming; but if the web browser's title/class were always just
"Web browser", how would one classify 5 hours spent using the web
browser? If the 5 hours are classified as any or all of those
purposes, then the results will be misleading garbage - one probably
didn't spend 5 hours just listening to music, but a mixture of those
purposes, which changes from day to day.)

One should check for such problematic programs when first starting to use
arbtt. It would be unfortunate if one were to log for a few months, go
back for a detailed report for some reason, and discover that the
necessary data was never actually available for arbtt to log!

Such programs can sometimes be customized internally; failing that,
one can file a bug report with the maintainers, or set their titles
externally with [`wmctrl`](https://en.wikipedia.org/wiki/Wmctrl) or
[`xprop`](http://jonisalonen.com/2014/setting-x11-window-properties-with-xprop/).

### `xprop`

One can check the X properties of a running window by running the
command [`xprop`](http://www.xfree86.org/current/xprop.1.html) and
clicking on the window; `xprop` will print out all the relevant X
information. For example, the output for Emacs might look like this:

    $ xprop | tail -5
    WM_CLASS(STRING) = "emacs", "Emacs"
    WM_ICON_NAME(STRING) = "emacs@elan"
    _NET_WM_ICON_NAME(UTF8_STRING) = "emacs@elan"
    WM_NAME(STRING) = "emacs@elan"
    _NET_WM_NAME(UTF8_STRING) = "emacs@elan"

This is not very helpful: it does not tell us the filename being
edited, the mode being used, or anything. One could classify time
spent in Emacs as "programming" or "writing", but this would be
imperfect, especially if one does both activities regularly. However,
Emacs can be customized by editing `~/.emacs`, and after some
searching with queries like "setting Emacs window title", the [Emacs
wiki](http://www.emacswiki.org/emacs-en/FrameTitle) &
[manual](https://www.gnu.org/software/emacs/manual/html_node/efaq/Displaying-the-current-file-name-in-the-titlebar.html)
advise us to put something like this Elisp in our `.emacs` file:

    (setq frame-title-format "%f")

Now the output looks different:

    $ xprop | tail -5
    WM_CLASS(STRING) = "emacs", "Emacs"
    WM_ICON_NAME(STRING) = "/home/gwern/arbtt.page"
    _NET_WM_ICON_NAME(UTF8_STRING) = "/home/gwern/arbtt.page"
    WM_NAME(STRING) = "/home/gwern/arbtt.page"
    _NET_WM_NAME(UTF8_STRING) = "/home/gwern/arbtt.page"

With this, we can usefully classify all such time samples as being "writing".
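
When checking many windows, it can help to pull just the title out of
the `xprop` output. The following is a minimal sketch, not part of
arbtt or `xprop`: `window_title` is a hypothetical helper name, and it
assumes the `_NET_WM_NAME(UTF8_STRING) = "..."` line layout shown in
the examples above.

```shell
# window_title: hypothetical helper (not part of arbtt or xprop).
# Reads xprop output on stdin and prints the value of _NET_WM_NAME,
# assuming the 'NAME(TYPE) = "value"' layout shown above.
window_title() {
    sed -n 's/^_NET_WM_NAME(UTF8_STRING) = "\(.*\)"$/\1/p'
}
```

One would then run `xprop | window_title` and click on the window of
interest.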

Another common gap is terminals/shells: they often do not include
information in the title like the current working directory or last
shell command. For example, urxvt/Bash:

    WM_COMMAND(STRING) = { "urxvt" }
    _NET_WM_ICON_NAME(UTF8_STRING) = "urxvt"
    WM_ICON_NAME(STRING) = "urxvt"
    _NET_WM_NAME(UTF8_STRING) = "urxvt"
    WM_NAME(STRING) = "urxvt"

Programmers may spend many hours in the shell doing a variety of
things (like Emacs), so this is a problem. Fortunately, this is also
solvable by customizing one's `.bashrc` to emit, before each command,
an escape code interpreted by the terminal (baroque, but it works).
The following will include the working directory, a timestamp (which
`history` prints only when `HISTTIMEFORMAT` is set), and the last
command:

    trap 'echo -ne "\033]2;$(pwd); $(history 1 | sed "s/^[ ]*[0-9]*[ ]*//g")\007"' DEBUG

Now the urxvt samples are useful:

    _NET_WM_NAME(UTF8_STRING) = "/home/gwern/wiki; 2014-09-03 13:39:32 arbtt-stats --help"

A rule could classify based on the directory one is working in, the
command one ran, or both. Other shells like zsh can be fixed this way
too but the exact command may differ; you will need to research &
experiment.
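
Since the one-liner is dense, here is a sketch that splits it into its
two parts: stripping the history number, and wrapping text in the
terminal's "set window title" escape sequence. The function names
`strip_history_number` and `title_escape` are hypothetical, and the
final `trap` line is an untested variant of the command above, left
commented out.

```shell
# strip_history_number: hypothetical helper. `history 1` prints the entry
# number before the command; this removes the leading blanks and digits.
strip_history_number() {
    sed 's/^[ ]*[0-9]*[ ]*//'
}

# title_escape: hypothetical helper. Wraps its argument in the escape
# sequence ESC ] 2 ; <text> BEL, which xterm-compatible terminals read
# as "set the window title to <text>".
title_escape() {
    printf '\033]2;%s\007' "$1"
}

# In ~/.bashrc one might then write (assumption: behaves like the
# one-liner above; commented out so that this sketch has no side effects):
# trap 'title_escape "$(pwd); $(history 1 | strip_history_number)"' DEBUG
```

Keeping the two steps separate makes it easier to adapt just one of
them, for example when porting the idea to another shell.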

Some programs can be tricky to set. The [X image viewer
feh](http://feh.finalrewind.org/) has a `--title` option but it cannot
be set in the configuration file, `.config/feh/themes`, because it
needs to be specified dynamically; so one needs to set up a shell
alias or script to wrap the command like `feh --title "$(pwd) / %f /
%n"`.
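
Such a wrapper could look like the sketch below. The names
`feh_title_format` and `fehw` are hypothetical; the title string is the
one from the example above, with `%f` and `%n` left for feh to expand.

```shell
# feh_title_format: hypothetical helper that builds the --title argument.
# Only the working directory is filled in here; %f and %n are passed
# through literally for feh to expand.
feh_title_format() {
    printf '%s / %%f / %%n' "$(pwd)"
}

# fehw: hypothetical wrapper name; call it in place of feh.
fehw() {
    feh --title "$(feh_title_format)" "$@"
}
```

Putting both functions in one's `.bashrc` and calling `fehw` would then
give every feh window a title that records where it was started.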

### Raw samples

`xprop` can be tedious to use on every running window and one may not
think to check rarer programs. A better approach is to use
`arbtt-stats`'s `--dump-samples` option: this option will print out
the collected data for specified time periods, allowing one to examine
the X properties en masse. Combined with the `-x`/`--exclude=`
option, it can also print only the *samples not matched by existing
rules*, which is indispensable for improving coverage and suggesting
ideas for new rules. A good way to
figure out what customizations to make is to run arbtt as a daemon for
a day or so, and then begin examining the raw samples for problems.

An example: suppose I create a simple category file named `foo` with
just the line

    $idle > 30 ==> tag inactive

I can then dump all my arbtt samples for the past day with a command like this:

    arbtt-stats --categorizefile=foo --m=0 --filter='$sampleage <24:00' --dump-samples

Because there are so many open windows, this produces a large amount
(26586 lines) of hard-to-read output:

    ...
    ( ) Navigator:      /r/Touhou's Favorite Arranges! Part 71: Retribution for the Eternal Night ~ Imperishable Night : touhou - Iceweasel
    ( ) Navigator:      Configuring the arbtt categorizer (arbtt-stats) - Iceweasel
    ( ) evince:         ATTACHMENT02
    ( ) evince:         2009-geisler.pdf — Heart rate variability predicts self-control in goal pursuit
    ( ) urxvt:          /home/gwern; arbtt-stats --categorizefile=foo --m=0 --filter='$sampleage <24:00' --dump-samples
    ( ) mnemosyne:      Mnemosyne
    ( ) urxvt:          /home/gwern; 2014-09-03 13:11:45 xprop
    ( ) urxvt:          /home/gwern; 2014-09-03 13:42:17 history 1 | cut --delimiter=' ' --fields=5-
    ( ) urxvt:          /home/gwern; 2014-09-03 13:12:21 git log -p .emacs
    (*) emacs:          emacs@elan
    ( ) urxvt:          /home/gwern; 2014-09-01 14:50:30 while true; do cd ~/ && getmail_fetch --ssl pop.gmail.com gwern0 'ugaozoumbhwcijxb' ./mail/; done
    ( ) urxvt:          /home/gwern/blackmarket-mirrors/silkroad2-forums; 2014-08-31 23:20:10 mv /home/gwern/cookies.txt ./; http_proxy="localhost:8118" wget...
    ( ) urxvt:          /home/gwern/blackmarket-mirrors/agora; 2014-08-31 23:15:50 mv /home/gwern/cookies.txt ./; http_proxy="localhost:8118" wget --mirror ...
    ( ) urxvt:          /home/gwern/blackmarket-mirrors/evolution-forums; 2014-08-31 23:04:10 mv ~/cookies.txt ./; http_proxy="localhost:8118" wget --mirror ...
    ( ) puddletag:      puddletag: /home/gwern/music

Active windows are denoted by an asterisk, so I can focus & simplify
by adding a pipe like `| fgrep '(*)'`, producing more manageable
output like

    (*) urxvt:          irssi
    (*) urxvt:          irssi
    (*) urxvt:          irssi
    (*) Navigator:      Pyramid of Technology - NextNature.net - Iceweasel
    (*) Navigator:      Search results - gwern0 at gmail.com - Gmail - Iceweasel
    (*) Navigator:      [New comment] The Wrong Path - gwern0 at gmail.com - Gmail - Iceweasel
    (*) Navigator:      Iceweasel
    (*) Navigator:      Litecoin Exchange Rate - $4.83 USD - litecoinexchangerate.org - Iceweasel
    (*) Navigator:      PredictionBook: LiteCoin will trade at >=10 USD per ltc in 2 years, - Iceweasel
    (*) urxvt:          irssi
    (*) Navigator:      Bug#691547 closed by Mikhail Gusarov <dottedmag at dottedmag.net> (Re: s3cmd: Man page: --default-mime-type documentation incomplete...)
    (*) Navigator:      Bug#691547 closed by Mikhail Gusarov <dottedmag at dottedmag.net> (Re: s3cmd: Man page: --default-mime-type documentation incomplete...)
    (*) Navigator:      Bug#691547 closed by Mikhail Gusarov <dottedmag at dottedmag.net> (Re: s3cmd: Man page: --default-mime-type documentation incomplete...)
    (*) urxvt:          /home/gwern; 2014-09-02 14:25:17 man s3cmd
    (*) evince:         bayesiancausality.pdf
    (*) evince:         bayesiancausality.pdf
    (*) puddletag:      puddletag: /home/gwern/music
    (*) puddletag:      puddletag: /home/gwern/music
    (*) evince:         bayesiancausality.pdf
    (*) Navigator:      ▶ Umineko no Naku Koro ni Music Box 4 - オルガン小曲 第2億番 ハ短調 - YouTube - Iceweasel
    ...

This is better. We can see a few things: the windows all now produce
enough information to be usefully classified (Gmail can be classified
under email, irssi can be classified as IRC, the urxvt usage can
clearly be classified as programming, the PDF being read is
statistics, etc) in part because of customizations to bash/urxvt. The
duplication still impedes focus, and we don't know what's most common.
We can use another pipeline to sort, count duplicates, and sort by
number of duplicates (`| sort | uniq --count | sort
--general-numeric-sort`), yielding:

     ...
     14     (*) Navigator:      A Bluer Shade of White Chapter 4, a frozen fanfic | FanFiction - Iceweasel
     14     (*) Navigator:      Iceweasel
     15     (*) evince:         2009-geisler.pdf — Heart rate variability predicts self-control in goal pursuit
     15     (*) Navigator:      Tool use by animals - Wikipedia, the free encyclopedia - Iceweasel
     16     (*) Navigator:      Hacker News | Add Comment - Iceweasel
     17     (*) evince:         bayesiancausality.pdf
     17     (*) Navigator:      Comments - Less Wrong Discussion - Iceweasel
     17     (*) Navigator:      Keith Gessen · Why not kill them all?: In Donetsk · LRB 11 September 2014 - Iceweasel
     17     (*) Navigator:      Notes on the Celebrity Data Theft | Hacker News - Iceweasel
     18     (*) Navigator:      A Bluer Shade of White Chapter 1, a frozen fanfic | FanFiction - Iceweasel
     19     (*) gl:             mplayer2
     19     (*) Navigator:      Neural networks and deep learning - Iceweasel
     20     (*) Navigator:      Harry Potter and the Philosopher's Zombie, a harry potter fanfic | FanFiction - Iceweasel
     20     (*) Navigator:      [OBNYC] Time tracking app - gwern0 at gmail.com - Gmail - Iceweasel
     25     (*) evince:         ps2007.pdf — untitled
     35     (*) emacs:          /home/gwern/arbtt.page
     43     (*) Navigator:      CCC comments on The Octopus, the Dolphin and Us: a Great Filter tale - Less Wrong - Iceweasel
     62     (*) evince:         The physics of information processing superobjects - Anders Sandberg - 1999.pdf — Brains2
     69     (*) liferea:        Liferea
     82     (*) evince:         BMS_raftery.pdf — untitled
     84     (*) emacs:          emacs@elan
     87     (*) Navigator:      overview for gwern - Iceweasel
    109     (*) puddletag:      puddletag: /home/gwern/music
    150     (*) urxvt:          irssi
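
The filtering and counting steps above can be bundled into one small
function so they are easy to reuse. This is only a sketch: `top_active`
is a hypothetical name, and it uses the short POSIX spellings
(`grep -F`, `uniq -c`, `sort -n`) in place of the long GNU options.

```shell
# top_active: hypothetical helper. Reads `--dump-samples` output on stdin,
# keeps only lines containing the active-window marker '(*)', and prints
# each distinct line with its count, most frequent last.
top_active() {
    grep -F '(*)' | sort | uniq -c | sort -n
}
```

It would be used as `arbtt-stats ... --dump-samples | top_active`, with
the `arbtt-stats` options being those of the command shown earlier.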

Put this way, we can see what rules we should write: we could sort
the activities here into a few categories like "recreation",
"statistics", "music", "email", "IRC", "research", & "writing", and
add some rules like these to `categorize.cfg`:

    $idle > 30 ==> tag inactive,

    current window $title =~ [/.*Hacker News.*/, /.*Less Wrong.*/, /.*overview for gwern.*/, /.*[fF]an[fF]ic.*/, /.* LRB .*/]
      || current window $program == "liferea" ==> tag Recreation,
    current window $title =~ [/.*puddletag.*/, /.*mplayer2.*/] ==> tag Music,
    current window $title =~ [/.*[bB]ayesian.*/, /.*[nN]eural [nN]etworks.*/, /.*ps2007.pdf.*/, /.*[Rr]aftery.*/] ==> tag Statistics,
    current window $title =~ [/.*Wikipedia.*/, /.*Heart rate variability.*/, /.*Anders Sandberg.*/] ==> tag Research,
    current window $title =~ [/.*Gmail.*/] ==> tag Email,
    current window $title =~ [/.*arbtt.*/] ==> tag Writing,
    current window $title == "irssi" ==> tag IRC,

If we reran the command, we'd see the same output, so we need to
leverage our new rules and *exclude* any samples matching our current
tags, so now we run a command like:

    arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples \
                --exclude=Recreation --exclude=Music --exclude=Statistics \
                --exclude=Research --exclude=Email --exclude=Writing --exclude=IRC |
        fgrep '(*)' | sort | uniq --count | sort --general-numeric-sort

Now the previous samples disappear, leaving us with a fresh batch of
unclassified samples to work with:

      9     (*) Navigator:      New Web Order > Nik Cubrilovic - - Notes on the Celebrity Data Theft - Iceweasel
      9     ( ) urxvt:          /home/gwern; arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples | fgrep '(*)' | less
     10     (*) evince:         ATTACHMENT02
     10     (*) Navigator:      These Giant Copper Orbs Show Just How Much Metal Comes From a Mine | Design | WIRED - Iceweasel
     12     (*) evince:         [Jon_Elster]_Alchemies_of_the_Mind_Rationality_an(BookFi.org).pdf — Alchemies of the mind
     12     (*) Navigator:      Morality Quiz/Test your Morals, Values & Ethics - YourMorals.Org - Iceweasel
     33     ( ) urxvt:          /home/gwern; arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples | fgrep '(*)'...

We can add rules categorizing these as 'Recreation', 'Writing',
'Research', 'Recreation', 'Research', 'Writing', and 'Writing'
respectively; and we might decide at this point that 'Writing' is
starting to become overloaded, so we'll split it into two tags,
'Writing' and 'Programming'. And then after tossing another
`--exclude=Programming` into our command, we can repeat the process.
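
As the list of tags grows, typing one `--exclude=` per tag gets
tedious. A small helper can generate the options from a list of tag
names. This is a sketch: `exclude_flags` is a hypothetical name, and it
assumes tag names contain no whitespace, since its output is meant to
be word-split by the shell.

```shell
# exclude_flags: hypothetical helper. Prints one --exclude=TAG option per
# argument, one per line; prints nothing when called without arguments.
exclude_flags() {
    for tag in "$@"; do
        printf '%s\n' "--exclude=$tag"
    done
}
```

The unquoted substitution `$(exclude_flags Recreation Music Statistics)`
can then stand in for the corresponding `--exclude=` options on the
`arbtt-stats` command line.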

As we refine our rules, we will quickly spot instances where the
title/class/program are insufficient to allow accurate classification,
and we will figure out the best collection of tags for our particular
purposes. A few iterations are usually enough.

## Categorizing advice

When building up rules, a few rules of thumb should be kept in mind:

1. categorize by purpose, not by program

    Categorizing by program leads to misleading time reports. Avoid, for
    example, lumping all web browser time into a single category named
    'Internet'; this is more misleading than helpful. Good categories
    describe an activity or goal, such as 'Work' or 'Recreation', not a
    tool, like 'Emacs' or 'Vim'.
2. when in doubt, write narrow rules and generalize later

    Regexps are tricky and it can be easy to write rules far broader
    than one intended. Because `--exclude` hides every sample that
    already matches a rule, one will never see samples which were
    matched accidentally. If one is in doubt, it can be helpful to
    take a specific sample one wants to match and several similar
    strings and look at how well one's regexp rule works in Emacs's
    [regexp-builder](http://www.emacswiki.org/emacs/ReBuilder) or
    online regexp-testers like [regexpal](http://regexpal.com/).
3. don't try to classify everything

    You will never classify 100% of samples because sometimes programs
do not include useful X properties & cannot be fixed, you have samples
from before you fixed them, or they are too transient (like popups and
dialogues) to be worth fixing. It is not necessary to classify 100% of
your time, since as long as the most common programs and, say,
[80%](https://en.wikipedia.org/wiki/Pareto_principle) of your time is
classified, then you have most of the value. It is easy to waste more
time tweaking arbtt than one gains from increased accuracy or more
finely-grained tags.
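
For the second rule of thumb, a regexp can also be tried against a
handful of real titles from the command line. The sketch below uses
`grep -E`, whose regexp dialect may differ in details from the one
arbtt uses, so treat it as a first check only; `try_regexp` is a
hypothetical name.

```shell
# try_regexp: hypothetical helper. Takes one extended regexp as $1, reads
# window titles on stdin (one per line), and labels each title as
# matching or not, so overly broad patterns are easy to spot.
try_regexp() {
    pattern=$1
    while IFS= read -r title; do
        if printf '%s\n' "$title" | grep -Eq -- "$pattern"; then
            printf 'MATCH  %s\n' "$title"
        else
            printf 'miss   %s\n' "$title"
        fi
    done
}
```

Feeding it both titles that should match and near-misses that should
not (for example via a here-document) shows quickly whether a pattern
is too greedy.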

## Long-term storage

Each halving of the sampling interval doubles the number of samples
taken and hence the storage requirement; intervals below 20s are
probably wasteful. But even the default 60s can accumulate into a
nontrivial amount of data over a year. A constantly-changing binary
file can interact poorly with backup systems and may make arbtt
analyses slower; and if one's system occasionally crashes or
experiences other problems, the log can become corrupted, with the
nuisance of having to run `arbtt-recover`.
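
The doubling is simple arithmetic, sketched below under the assumption
that the daemon runs around the clock; `samples_per_day` is a
hypothetical helper, and its argument is the interval in seconds that
one would pass to `--sample-rate`.

```shell
# samples_per_day: hypothetical helper. $1 is the sampling interval in
# seconds; a day has 86400 seconds, so this is the number of samples
# taken per day if capture never stops.
samples_per_day() {
    echo $(( 86400 / $1 ))
}

samples_per_day 60   # the default interval
samples_per_day 30   # half the interval
```

This prints 1440 and 2880: halving the interval doubles the samples
recorded per day.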

Thus it may be a good idea to archive one's `capture.log` on an annual
basis. If one needs to query the historical data, the particular log
file can be specified as an option like
`--logfile=/home/gwern/doc/arbtt/2013-2014.log`.

## Advanced queries

arbtt supports CSV export of time by category in various levels of
granularity in a 'long' format (multiple rows for each day, with each
of the _n_ rows giving one category's value for that day). These CSV
exports can
be imported into statistical programs like R or Excel and manipulated
as desired.

R users may prefer to have their time data in a 'wide' format (each
row is 1 day, with _n_ columns, one for each possible category); this
can be done with the `reshape` function that ships with base R. After
reading in the CSV, the time-intervals can be converted to counts of
seconds and the data to a wide data-frame with R code like the
following:

    # Read the long-format export (columns used below: Day, Tag, Time).
    arbtt <- read.csv("arbtt.csv")

    # Convert a duration string ("45 s" or "H:MM:SS") to a number of seconds.
    interval <- function(x) {
        if (is.na(x)) return(NA)
        if (grepl(" s", x)) return(as.integer(sub(" s", "", x)))
        y <- unlist(strsplit(x, ":"))
        as.integer(y[[1]])*3600 + as.integer(y[[2]])*60 + as.integer(y[[3]])
    }
    arbtt$Time <- sapply(as.character(arbtt$Time), interval)

    # reshape() is part of base R (package stats); no extra library is needed.
    arbtt <- reshape(arbtt, v.names="Time", timevar="Tag", idvar="Day", direction="wide")

-----

-- 
gwern
http://www.gwern.net