arbtt: use of database like sqlite3?

Gwern Branwen gwern at gwern.net
Mon Dec 15 01:40:16 CET 2014


On Sun, Dec 14, 2014 at 5:59 PM, Joachim Breitner
<mail at joachim-breitner.de> wrote:
> wait, it is taking 100 hours for one run of "arbtt-stats
> --dump-samples"?

No, I meant that the query was restricted to the last 100 hours
(which was the period I was looking at).

> Possibly. My goal with the current system is to make arbtt-capture as
> cheap as possible, by doing nothing than simply appending a few bytes to
> a file. I might have been over-optimizing here, but it is certainly a
> good idea to pay attention to something that is going to run constantly,
> even when on battery.

I don't know how much more expensive a sqlite3 INSERT is than a plain
append, but I suspect that even a single run through a logfile will
use up more joules than whatever additional overhead sqlite3 incurs
on INSERTs. (Since sqlite3 is often used in embedded and
resource-constrained applications, and in applications that write
heavily to the database like Firefox, it can't be *that* bad.)
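
If we wanted numbers rather than intuition, something like the
following crude micro-benchmark would give a rough idea; this is just
a sketch using the sqlite-simple package, measuring wall-clock time
as a proxy for cost, and none of it is arbtt code:

    {-# LANGUAGE OverloadedStrings #-}
    -- Hypothetical benchmark sketch, not arbtt code.
    import qualified Data.ByteString.Char8 as BS
    import Control.Monad (forM_)
    import Data.Time.Clock (getCurrentTime, diffUTCTime)
    import Database.SQLite.Simple

    main :: IO ()
    main = do
      let n = 10000 :: Int
      -- plain file appends, one small record at a time
      t0 <- getCurrentTime
      forM_ [1 .. n] $ \i ->
        BS.appendFile "bench.log" (BS.pack (show i ++ "\n"))
      t1 <- getCurrentTime
      -- sqlite3 INSERTs, one implicit transaction per row
      conn <- open "bench.db"
      execute_ conn "CREATE TABLE IF NOT EXISTS samples (v INTEGER)"
      t2 <- getCurrentTime
      forM_ [1 .. n] $ \i ->
        execute conn "INSERT INTO samples (v) VALUES (?)" (Only i)
      t3 <- getCurrentTime
      putStrLn ("append: " ++ show (diffUTCTime t1 t0))
      putStrLn ("insert: " ++ show (diffUTCTime t3 t2))

Note that the loop above deliberately uses one implicit transaction
per INSERT, which is the worst case for sqlite3; wrapping the inserts
in a single transaction would make that side much faster still.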

> Or maybe an alternative route could be taken: arbtt-capture still writes
> to a binary append-only log, but regularly (i.e. once a day), this log
> is imported into a format more amenable to searching and flushed.

I think that would probably give us the worst of both worlds; having
two architectures means a lot more can go wrong.

> But note that there is another possibly important feature of the current
> log format: It shares strings between a sample and its previous sample.
> Otherwise, upon every sample, every window title would be stored again
> and again, resulting in considerably larger files and greater
> arbtt-stats memory consumption. I doubt that this is easily possible
> with sqlite.

That's a good point: sqlite3 has pretty minimal support for
compression. There's a plugin or two for adding per-entry
compression/decompression (such as
https://github.com/salviati/sqlite3-lz4 ), but that wouldn't save
much space, since what we want is between-entry compression. There
are also two official proprietary extensions, but besides being
proprietary, they are intended for read-only databases.

We could bite the bullet and accept a larger file size; I wouldn't
mind doubling space usage if it made queries near-instant. Another
option is to structure the database to explicitly exploit the
duplication; I saw this suggestion in
http://stackoverflow.com/a/1829601 :

> Compression is all about removing duplication, and in a log file most of the duplication is between entries rather than within each entry so compressing each entry individually is not going to be a huge win.
>
> This is off the top of my head so feel free to shoot it down in flames, but I would consider breaking the table into a set of smaller tables holding the individual parts of the entry. A log entry would then mostly consist of a timestamp (as DATE type rather than a string) plus a set of indexes into other tables (e.g. requesting IP, request type, requested URL, browser type etc.)

That is, in more Haskelly terms, each arbtt-capture sample is a
(timestamp, [String]); each string is assigned a unique ID and stored
in a hashmap if not already present, and the IDs are stored with the
timestamp. So a minute or so of samples would look something like

(1418603655,[1,54,20,333])
(1418603678,[1,53,333])
(1418603693,[1,53,333])
(1418603702,[1,53,333])
(1418603712,[53,333,801])

avoiding the worst of the between-row redundancy; another table would
store the definitions of strings #1, #53, #801, etc., to be looked up
whenever needed. This would probably compress even better than a log
format which only looks one entry back for redundancy, since it
extends the lookback to the entire database history, and indexes
presumably keep the queries just as fast (since sqlite3 knows where
the referenced rows live in the other table).
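
To make that concrete, here is a rough sketch of what such a schema
and the string-interning step could look like, using the
sqlite-simple package; the table and column names are my own
invention for illustration, not anything arbtt actually does:

    {-# LANGUAGE OverloadedStrings #-}
    -- Hypothetical sketch: table names, columns, and the use of
    -- sqlite-simple are invented for illustration, not arbtt code.
    import Database.SQLite.Simple
    import Data.Int (Int64)
    import Data.Text (Text)

    setupSchema :: Connection -> IO ()
    setupSchema conn = do
      -- one row per distinct string (window title, program name, ...)
      execute_ conn "CREATE TABLE IF NOT EXISTS strings \
                    \(id INTEGER PRIMARY KEY, value TEXT UNIQUE NOT NULL)"
      -- one row per (sample, string) pair
      execute_ conn "CREATE TABLE IF NOT EXISTS sample_strings \
                    \(timestamp INTEGER NOT NULL, string_id INTEGER NOT NULL)"
      execute_ conn "CREATE INDEX IF NOT EXISTS sample_ts \
                    \ON sample_strings (timestamp)"

    -- Return the ID of a string, inserting it into the strings table
    -- if it has not been seen before.
    internString :: Connection -> Text -> IO Int64
    internString conn s = do
      rows <- query conn "SELECT id FROM strings WHERE value = ?" (Only s)
      case rows of
        (Only i : _) -> return i
        []           -> do
          execute conn "INSERT INTO strings (value) VALUES (?)" (Only s)
          lastInsertRowId conn

    -- Store one sample as its timestamp plus the IDs of its strings,
    -- i.e. the (1418603655,[1,54,20,333]) shape from above.
    insertSample :: Connection -> Int64 -> [Text] -> IO ()
    insertSample conn ts titles = do
      ids <- mapM (internString conn) titles
      mapM_ (\i -> execute conn
               "INSERT INTO sample_strings (timestamp, string_id) VALUES (?,?)"
               (ts, i))
            ids

The UNIQUE constraint on strings.value is what enforces the "assign
each string an ID only once" behaviour, and the index on
sample_strings.timestamp is what should keep a time-restricted query
like the 100-hour one above fast.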

-- 
gwern
http://www.gwern.net



