arbtt: use of database like sqlite3?

Joachim Breitner mail at joachim-breitner.de
Sun Dec 14 23:59:23 CET 2014


Hi Gwern,


Am Sonntag, den 14.12.2014, 15:39 -0500 schrieb Gwern Branwen:
> So I was waiting on arbtt to --dump-samples for the past 100 hours to
> write a rule classifying a web serial I read as recreational, and I
> began wondering: what is arbtt doing that it takes so long?

Wait, it is taking 100 hours for one run of "arbtt-stats
--dump-samples"?

> Is it because of the log structure that it has to read through, parse,
> and classify my full 85M arbtt log just to get the last 100 hours of
> data? I know from working with an 18GB sqlite3 db for Mnemosyne that
> date range queries in databases can be *extremely* fast, and
> arbtt-capture dumping into a db would probably be more reliable and
> durable (ACID rather than arbtt-recover), and sqlite3 has had multiple
> Haskell bindings for half a decade now.
> 
> Would switching to sqlite3 be an improvement?

Possibly. My goal with the current system is to make arbtt-capture as
cheap as possible, by doing nothing more than appending a few bytes to
a file. I might have been over-optimizing here, but it is certainly a
good idea to pay attention to something that is going to run constantly,
even when on battery.

But that does not mean that the benefits of sqlite (such as date range
queries) or of an otherwise improved data format would not outweigh this.
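To illustrate the date-range point: a minimal sketch in Python, using a hypothetical `samples` schema (not arbtt's actual format). With an index on the timestamp, sqlite can seek directly to the requested range instead of scanning the whole table:

```python
import sqlite3

# Hypothetical schema for illustration -- not arbtt's actual data model.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        timestamp INTEGER NOT NULL,  -- seconds since epoch
        title     TEXT NOT NULL      -- active window title
    )
""")
# The index is what makes date-range queries fast: sqlite jumps straight
# to the first matching row rather than reading every sample.
conn.execute("CREATE INDEX idx_samples_ts ON samples(timestamp)")

conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(t, "window %d" % t) for t in range(0, 1000, 60)])

# Fetch only the samples within a given time span.
rows = conn.execute(
    "SELECT timestamp, title FROM samples WHERE timestamp BETWEEN ? AND ?",
    (300, 600)).fetchall()
print(len(rows))  # prints: 6
```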

Or maybe an alternative route could be taken: arbtt-capture still writes
to a binary append-only log, but regularly (e.g. once a day) this log is
imported into a format more amenable to searching, and then flushed.
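The hybrid scheme above could look roughly like this sketch: capture appends cheap binary records to a log file, and a periodic job imports them into sqlite and truncates the log. The record layout (8-byte timestamp, 4-byte length, title bytes) is hypothetical, purely for illustration:

```python
import os
import sqlite3
import struct
import tempfile

log_path = tempfile.mktemp()  # stand-in for the capture log

def capture(timestamp, title):
    """Cheap capture path: a single append of a small binary record."""
    data = title.encode()
    with open(log_path, "ab") as f:
        f.write(struct.pack("<QI", timestamp, len(data)) + data)

def import_and_flush(conn):
    """Periodic job: parse the log into sqlite, then empty the log."""
    with open(log_path, "rb") as f:
        blob = f.read()
    off = 0
    while off < len(blob):
        ts, n = struct.unpack_from("<QI", blob, off)
        off += 12
        conn.execute("INSERT INTO samples VALUES (?, ?)",
                     (ts, blob[off:off + n].decode()))
        off += n
    open(log_path, "wb").close()  # flush the already-imported records

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (timestamp INTEGER, title TEXT)")
capture(1, "editor")
capture(2, "browser")
import_and_flush(conn)
count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(count, os.path.getsize(log_path))  # prints: 2 0
```

This keeps the capture path as cheap as today while giving the stats side an indexed store to query.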

But note that there is another possibly important feature of the current
log format: it shares strings between a sample and its previous sample.
Otherwise, every window title would be stored again with every sample,
resulting in considerably larger files and higher arbtt-stats memory
consumption. I doubt that this is easily possible with sqlite.

Maybe it is also sufficient to create an index file for the log file,
with the offsets of, say, the first sample of each day. This would allow
arbtt-stats to categorize a specific time span faster. But this is also
tricky given the string-sharing format.
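A sketch of such an index, assuming self-contained samples (which is exactly where the string-sharing format makes this tricky: after a seek, back-references could point before the seek point). The record layout here is again hypothetical:

```python
import io
import struct

log = io.BytesIO()  # stand-in for the log file
index = {}          # day -> byte offset of that day's first sample

def append(day, title):
    if day not in index:
        index[day] = log.tell()  # remember where each new day starts
    data = title.encode()
    log.write(struct.pack("<QI", day, len(data)) + data)

for day, title in [(1, "a"), (1, "b"), (2, "c"), (3, "d")]:
    append(day, title)

# Read day 2 onwards by seeking straight to its recorded offset,
# instead of parsing the log from the beginning.
log.seek(index[2])
ts, n = struct.unpack("<QI", log.read(12))
first_title = log.read(n).decode()
print(ts, first_title)  # prints: 2 c
```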

Maybe it is also sufficient to keep the log format, but split it into
smaller files, e.g. one per day. Again, date-range queries would only
have to read the relevant files, plus it would be backup-friendlier and
make it easy to delete certain dates manually.
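The per-day-file variant is simple to sketch: the capture side derives the file name from the sample's date, so a date query opens only the matching files and deleting a day is a plain file removal. File names and layout here are made up for illustration:

```python
import datetime
import os
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for the arbtt data directory

def log_file_for(ts):
    # Hypothetical naming scheme: one log file per UTC day.
    day = datetime.datetime.fromtimestamp(
        ts, datetime.timezone.utc).date().isoformat()
    return os.path.join(log_dir, "arbtt-%s.log" % day)

def capture(ts, title):
    with open(log_file_for(ts), "a") as f:
        f.write("%d\t%s\n" % (ts, title))

capture(0, "epoch")         # lands in arbtt-1970-01-01.log
capture(90000, "next day")  # lands in arbtt-1970-01-02.log
files = sorted(os.listdir(log_dir))
print(files)

# Deleting one day's data is just deleting one file.
os.remove(log_file_for(0))
remaining = os.listdir(log_dir)
```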

So, all in all, quite a few options and no clear best way forward. What
do you think?

Greetings,
Joachim

-- 
Joachim “nomeata” Breitner
  mail at joachim-breitner.de • http://www.joachim-breitner.de/
  Jabber: nomeata at joachim-breitner.de  • GPG-Key: 0xF0FBF51F
  Debian Developer: nomeata at debian.org

