MetaFS::IndexDB

THIS IS A PRELIMINARY DOCUMENT

1. Introduction

MetaFS::IndexDB, or IndexDB for short, is a hybrid embeddable and client/server NoSQL (schema-free) database which indexes all document content by default, designed to fulfill the needs of MetaFS.

A document is referenced through 3 layers:

MetaFS::IndexDB Database/Collection/Document

The main aims of IndexDB are:
  1. all keys/values are indexed recursively by default, providing fast search/find for values (equality, inequality/range)
  2. a low memory footprint is maintained, so it scales with disk space instead of memory usage

2. CRUD

MetaFS::IndexDB System Overview
CRUD+FLC+SSM (Create+Read+Update+Delete + Find+List+Count + Size+Stats+Meta) are implemented as

3. Data Structure

Let's explain the overall data structure of IndexDB with an example where the database name would be 'myset', and the collection name 'items', having 3 small documents:

myset.items:

{
   "name" : "AA.txt",
   "size" : 12,
   "uid" : "61b078f21f16641567a84f1343f04956-551783dc-cfcdef"
},

{
   "name" : "BB.txt",
   "size" : 182,
   "uid" : "b32184727775c4c8ed457fa535a86a99-554c7e46-fdc210"
},

{
   "name" : "CC.txt",
   "size" : 23,
   "uid" : "94fd0c3fdd3ed0cf4bbcbb3a5f2cd773-55177848-cf2ed4"
}

which results in the following two inverted indexes:

name index (sorted alphanumerically):
key     value
AA.txt  61b078f21f16641567a84f1343f04956-551783dc-cfcdef
BB.txt  b32184727775c4c8ed457fa535a86a99-554c7e46-fdc210
CC.txt  94fd0c3fdd3ed0cf4bbcbb3a5f2cd773-55177848-cf2ed4

size index (sorted numerically):
key  value
12   61b078f21f16641567a84f1343f04956-551783dc-cfcdef
23   94fd0c3fdd3ed0cf4bbcbb3a5f2cd773-55177848-cf2ed4
182  b32184727775c4c8ed457fa535a86a99-554c7e46-fdc210
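
The two indexes above can be reproduced with a minimal sketch in plain Perl. This is an illustration only: in-memory hashes stand in for the index backends, only top-level keys are handled, and the helper names build_index() and range_lt() are hypothetical, not part of the IndexDB API.

```perl
use strict;
use warnings;

# Illustration only: plain hashes stand in for IndexDB's index backends.
my @docs = (
   { name => "AA.txt", size => 12,  uid => "61b078f21f16641567a84f1343f04956-551783dc-cfcdef" },
   { name => "BB.txt", size => 182, uid => "b32184727775c4c8ed457fa535a86a99-554c7e46-fdc210" },
   { name => "CC.txt", size => 23,  uid => "94fd0c3fdd3ed0cf4bbcbb3a5f2cd773-55177848-cf2ed4" },
);

# build_index(\@docs) -> { key => { value => [ uid, ... ] } }   (hypothetical helper)
sub build_index {
   my($docs) = @_;
   my %ix;
   for my $d (@$docs) {
      for my $k (grep { $_ ne 'uid' } keys %$d) {
         push @{ $ix{$k}{ $d->{$k} } }, $d->{uid};
      }
   }
   return \%ix;
}

# range_lt($ix,$key,$max) -> uids whose numeric value of $key is below $max,
# in ascending order of that value   (hypothetical helper)
sub range_lt {
   my($ix,$key,$max) = @_;
   my @vals = sort { $a <=> $b } grep { $_ < $max } keys %{ $ix->{$key} };
   return map { @{ $ix->{$key}{$_} } } @vals;
}

my $ix = build_index(\@docs);
print join(" ", sort keys %{ $ix->{name} }), "\n";                # name index, sorted as strings
print join(" ", sort { $a <=> $b } keys %{ $ix->{size} }), "\n";  # size index, sorted as numbers
print "$_\n" for range_lt($ix, "size", 100);                      # uids of documents with size < 100
```

Because each index maps a value back to the uid of its document, equality and range lookups only touch the index of the key in question, not the documents themselves.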

In real-world applications, a wide variety of documents easily leads to 2000+ keys to be indexed.

4. API

4.1. Init

$ixdb = new IndexDB({ .. });

  • host: server (default: none = direct access)
  • port: port of remote access (default: 9138)
  • autoConnect: try remote access first (default: 1); if that fails, fall back gracefully to local access
  • autoType: auto type keys based on first use (number vs string), (default: 1)
  • root: root of db (default: /var/lib/indexdb)
  • maxKeyLength: max length of a key (default: 512)
  • maxIndexDepth: max depth of a key (default: 32)
  • maxIndexArrayLength: max array length to index (default: 1024)
  • syncTimeOut: sync after x seconds (default: 30)
  • ixStore: index backend ('', bx (default), uq, so, ro, lm, lv)
  • docStore: document backend (undef, '' or flat (default), bk, pg, so)
  • docType: serializing (json (default), frth)
  • docCompress: document compress ('' (default), sn)
  • sync: 0 = async (find, list, stats), 1 = sync (one command at a time)
Example:
 my $ixdb = new IndexDB({ 
    host => '192.168.1.2', 
    ...
 });

Abbreviations:

  • docStore:
    • flat (default)
    • bk
    • pg
    • so
If IndexDB runs as a server or locally, the backends can be defined; when it runs as a client, they cannot be set.

4.1.1. Backends

Several document backends (docStore) are available:

name | state | functionality | comments | rating
flat (default) | mature | CRUD | + reliable, easy to recover | ★★★☆☆
bk | mature | CRUD | - easy to corrupt, expensive to recover | ★☆☆☆☆
pg | infant | CRUD | + reliable, but indexing limits queries (implies ixStore: pg) | ★★☆☆☆
so | infant | CRUD | + fast; - memory intensive M(n) | ★☆☆☆☆

Several index backends (ixStore) are available:

name | state | functionality | comments | rating
bk | mature | CRUD, find(match, regex, inequality, sort, skip, limit) | + low memory usage; - slow delete of dups (do not use in production, only as reference) | ★★☆☆☆
bx (default) | mature | CRUD, find(match, regex, inequality, sort, skip, limit) | + low memory usage; + fast delete of dups | ★★★★☆
pg | infant | CRUD, find(match) | + metadata & index together; - in-place update not yet (9.5 perhaps); - not certain if it will be continued, as pg backend optionally is partially also in metafs itself | ★★☆☆☆
uq | infant | CRUD, find(match, regex) | - memory usage significant (surprise) | ★☆☆☆☆
so | moderate | CRUD, find(match, regex) | - no dups natively supported (adding trailer to keys), fast delete of dups then; - not so stable yet | ★★★★☆
ro | infant | CRUD, find(match, regex) | + low memory usage; - no dups natively supported (adding trailer to keys); - dedicated value sorting does not work yet | ★★☆☆☆
lv | infant | CRUD | - too slow, requires more fine-tuning | ★☆☆☆☆
sq | infant | | - coming soon |
lm | infant | | - coming soon |

4.2. Create

$s = $ixdb->create($db,$c,$d)

  • $db = database (e.g. "myset")
  • $c = collection (e.g. "items")
  • $d = document, if $d->{uid} is not set, one is created
  • $s != 0 => error
Example:
$ixdb->create("myset","items",{
   name => "AA.txt",
   size => 12
});

which will create a document like this:

{
   "name" : "AA.txt",
   "size" : 12,
   "uid" : "61b078f21f16641567a84f1343f04956-551783dc-cfcdef"
}

4.3. Read

$e = $ixdb->read($db,$c,$d)
  • $db = database
  • $c = collection
  • $d = document, must have $d->{uid} set
  • $e = JSON object of the document

4.4. Update

$s = $ixdb->update($db,$c,$d,$opts)
  • $db = database
  • $c = collection
  • $d = document, must have $d->{uid} set, along with the other keys which are updated/set
  • $opts = options (optional)
    • clear: 1, delete existing & set new
    • set: 1, set data
    • merge: 1 (default), merge all keys recursively
  • $s != 0 => error

4.4.1. Update Methods

3 methods are available for updating: merge (default), set and clear.

The 3 methods can be ranked by how destructive they are to existing data:

  • merge (default): non-destructive, merges strictly and overwrites existing keys if necessary
  • set: partially destructive, makes sure other keys not defined in the set update are discarded
  • clear: highly destructive, everything is discarded and only the new update is stored
To pull or delete individual keys, see delete in the next section.
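
The difference between the least and the most destructive method can be sketched on plain Perl hashes. This is an illustration under assumptions, not the IndexDB implementation: merge_doc() and clear_doc() are hypothetical helpers, the nested image key is made up for the example, and clear is assumed to keep the uid, since the document stays addressed by it. The exact behaviour of set is not sketched here.

```perl
use strict;
use warnings;

# merge_doc($old,$new): recursive merge, values in $new overwrite those in $old
# (hypothetical helper, mirrors "merge all keys recursively")
sub merge_doc {
   my($old,$new) = @_;
   my %r = %$old;
   for my $k (keys %$new) {
      if(ref($r{$k}) eq 'HASH' && ref($new->{$k}) eq 'HASH') {
         $r{$k} = merge_doc($r{$k}, $new->{$k});
      } else {
         $r{$k} = $new->{$k};
      }
   }
   return \%r;
}

# clear_doc($old,$new): existing data is dropped, only the update is stored
# (assumption: the uid of the existing document is kept)
sub clear_doc {
   my($old,$new) = @_;
   return +{ %$new, uid => $old->{uid} };
}

my $old = { uid => "u1", name => "AA.txt", image => { width => 10, height => 20 } };
my $upd = { uid => "u1", image => { width => 99 } };

my $m = merge_doc($old, $upd);   # name and image.height survive, image.width is overwritten
my $c = clear_doc($old, $upd);   # only uid and image.width remain

print join(",", sort keys %$m), "\n";
print join(",", sort keys %$c), "\n";
```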

Example:

my(@e) = $ixdb->find("myset","items",{name=>"AA.txt"},{limit=>1});
my $doc = $e[0];

$ixdb->update("myset","items",{
   uid => $doc->{uid},              # -- uid must be set
   name => "BB.txt"
});

4.5. Delete

$s = $ixdb->delete($db,$c,$d)
  • $db = database
  • $c = collection (optional)
  • $d = document (optional), if set it must have $d->{uid} set too
    • a) if only $db is present, delete all its collections
    • b) if $db & $c are present, delete all items in the collection
    • c) if $db & $c & $d are present: if only uid is set, delete the entire entry, otherwise delete the individual keys given
  • $s != 0 => error
Example:
$ixdb->delete("myset","items",{uid=>$id});              # -- delete entire item
$ixdb->delete("myset","items",{uid=>$id,a=>1,b=>1});    # -- delete keys a & b of item referenced by $id

4.6. Find

@r = $ixdb->find($db,$c,$q,$opts,$f)
$cu = $ixdb->find($db,$c,$q,{cursor=>1})
  • $db = database
  • $c = collection
  • $q = query object
    • MongoDB-like query object:
      • exact match: { 'name': 'AA1.txt' }
      • regex: { key: { '$regex': 'AA', '$options': 'i' } }
      • exists: { key: { '$exists': 1 } }
      • distinct: { key: { '$distinct': 1 } }
      • (in)equalities: { key: { '$lt': 200 } }
        • $lt $lte $gt $gte $eq $ne
  • $opts = options (optional)
    • uidOnly: 1, do not read entire metadata, only uid (and matching key)
    • limit: n, limit n results (disregarded when $f is set)
    • skip: n, skip n results (disregarded when $f is set)
    • sort: { key => dir }, where dir is -1 (descending) or 1 (ascending, default); disregarded when $f is set. The key must be the same single key which is searched for; sorting on multiple-key matches (e.g. AND) is not yet available
    • OR: 1, consider all keys in query object logical OR (otherwise logically AND)
    • cursor: 1, request a cursor for findNext()
  • $f = function to be called (optional); if $f is not present, all results are returned in @r, where each item is an object with the matching key/value plus uid, e.g. {'name':'AA1.txt','uid':'.....'}
Examples:

Retrieve results in one go:

 my(@e) = $ixdb->find("myset","items",{name=>{'$exists'=>1}},{skip=>10,limit=>100});
Walk through results individually:
 $ixdb->find("myset","items",{name=>{'$exists'=>1}},{skip=>10,limit=>100},sub {
    my($e) = @_;
    ...
 });
Request a cursor:
 my $c = $ixdb->find("myset","items",{name=>{'$exists'=>1}},
    {skip=>10,limit=>100,cursor=>1});
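
To make the semantics of the query object more concrete, here is a minimal sketch which evaluates such a query against in-memory documents. It is an illustration only: IndexDB answers queries from its indexes, whereas this sketch scans documents one by one. match_key() and match_doc() are hypothetical helpers, inequalities are compared numerically, and $distinct is not covered.

```perl
use strict;
use warnings;

# (in)equality operators of the query object, compared numerically in this sketch
my %OPS = (
   '$lt'  => sub { $_[0] <  $_[1] },
   '$lte' => sub { $_[0] <= $_[1] },
   '$gt'  => sub { $_[0] >  $_[1] },
   '$gte' => sub { $_[0] >= $_[1] },
   '$eq'  => sub { $_[0] == $_[1] },
   '$ne'  => sub { $_[0] != $_[1] },
);

# match_key($doc,$key,$cond) -> 1 if $doc satisfies the condition on $key, else 0
sub match_key {
   my($doc,$key,$cond) = @_;
   my $v = $doc->{$key};
   if(ref($cond) ne 'HASH') {                       # exact match
      return (defined $v && $v eq $cond) ? 1 : 0;
   }
   if(exists $cond->{'$exists'}) {                  # key present (or absent)
      my $want = $cond->{'$exists'} ? 1 : 0;
      return ((exists $doc->{$key} ? 1 : 0) == $want) ? 1 : 0;
   }
   return 0 unless defined $v;
   if(exists $cond->{'$regex'}) {                   # regex, 'i' option = ignore case
      my $pat = $cond->{'$regex'};
      my $re = (($cond->{'$options'} // '') =~ /i/) ? qr/$pat/i : qr/$pat/;
      return ($v =~ $re) ? 1 : 0;
   }
   for my $op (keys %$cond) {                       # $lt $lte $gt $gte $eq $ne
      return 0 unless $OPS{$op} && $OPS{$op}->($v, $cond->{$op});
   }
   return 1;
}

# match_doc($doc,$q,$opts): logical AND over all keys, logical OR if $opts->{OR} is set
sub match_doc {
   my($doc,$q,$opts) = @_;
   my @hits = map { match_key($doc, $_, $q->{$_}) } keys %$q;
   my $n = grep { $_ } @hits;
   return ($opts && $opts->{OR}) ? ($n > 0 ? 1 : 0) : ($n == @hits ? 1 : 0);
}

my @docs = (
   { name => "AA.txt", size => 12 },
   { name => "BB.txt", size => 182 },
   { name => "CC.txt", size => 23 },
);
my @small = grep { match_doc($_, { size => { '$lt' => 100 } }) } @docs;
print join(" ", map { $_->{name} } @small), "\n";
```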

Note: Prefer a cursor for results which could be huge (and would otherwise use up server memory), and fetch the results with findNext() as presented next:

4.7. FindNext

$e = $ixdb->findNext($cu);

findNext() is used in conjunction with find() where a cursor is requested:

Example:

my $c = $ixdb->find("myset","items",{name=>"AA.txt"},{cursor=>1});
while(my $e = $ixdb->findNext($c)) {
   ...
}

4.8. List

@r = $ixdb->list($db,$c,$k,$f)
  • $db = database (optional)
  • $c = collection (optional)
  • $k = key (optional)
  • $f = function to be called (optional)
    • a) if no argument is present, list all databases
    • b) if $db present, list collections
    • c) if $db & $c present, list items { ... }, preferably use $f for callback
    • d) if $db & $c & $k present, list keys { key: '...', 'uid': '.....' }

4.9. Count

$n = $ixdb->count($db,$c,$k) 
  • $db = database (optional)
  • $c = collection (optional)
  • $k = key (optional)
    • a) if no argument is present, count of all databases
    • b) if $db present, count collections of database
    • c) if $db & $c present, count items of that collection
    • d) if $db & $c & $k present, count different values of that key
  • $n reports count

4.10. Size

$n = $ixdb->size($db,$c,$k) 
  • $db = database (optional)
  • $c = collection (optional)
  • $k = key (optional)
    • a) if no argument is present, size of all databases
    • b) if $db present, size of all collections of database
    • c) if $db & $c present, size of all items of that collection
    • d) if $db & $c & $k present, size of index of that key
  • $n reports in bytes

4.11. Keys

$k = $ixdb->keys($db,$c)
  • $db = database
  • $c = collection
  • $k reference to an array listing all keys in dot-notation
{
   [ 
      "atime", "author", "ctime", "hash", "image.average.a", "image.average.h", ...
      "title", "type", "uid", "utime" 
   ]
}
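
How nested keys map to dot-notation can be sketched as follows. This is an illustration only: flat_keys() is a hypothetical helper, the sample document is made up, and arrays are simply treated as leaf values here, which may differ from how IndexDB handles them.

```perl
use strict;
use warnings;

# flat_keys($doc,$prefix) -> sorted list of key paths in dot-notation (hypothetical helper)
sub flat_keys {
   my($d,$prefix) = @_;
   my @out;
   for my $k (sort keys %$d) {
      my $path = defined $prefix ? "$prefix.$k" : $k;
      if(ref($d->{$k}) eq 'HASH') {
         push @out, flat_keys($d->{$k}, $path);    # descend into sub-objects
      } else {
         push @out, $path;                         # leaf value: one index key
      }
   }
   return @out;
}

my $doc = {
   title => "x",
   image => { average => { a => 0.5, h => 0.1 } },
};
print join(", ", flat_keys($doc)), "\n";
```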

4.12. Stats

$i = $ixdb->stats($db,$c,$k) 
  • $db = database (optional)
  • $c = collection (optional)
  • $k = key (optional)
    • a) if no argument is present, stats of all databases
    • b) if $db present, stats of all collections of database
    • c) if $db & $c present, stats of all items of that collection
    • d) if $db & $c & $k present, stats of index of that key
  • $i reports a structure like:
{
   "conf" : {
      "autoConnect" : 1,
      "backend" : {
         "bk" : {
            "cache" : 20000000,
            "levels" : 5
         }
      },
      "backendIX" : "bk",
      "backendMD" : "flat",
      "backendSZ" : "json",
      "index" : 1,
      "maxIndexArrayLength" : 1024,
      "maxIndexDepth" : 32,
      "maxKeyLength" : 512,
      "me" : "local",
      "port" : 9138,
      "root" : "/var/lib/indexdb",
      "sync" : 1,
      "syncTimeOut" : 30
   },
   "db" : {
      "metafs_alpha" : {
         "items" : {
            "count" : 1618,
            "diskUsed" : 322494464,
            "ix" : {
               ...
            }
         }
      }
   },
   "diskFree" : 72957542400,
   "diskTotal" : 234813100032,
   "diskUsed" : 506613760,
   "pid" : 27528
}

4.13. Meta

$i = $ixdb->meta($db,$c,$m) 
  • $db = database
  • $c = collection
  • $m = meta (optional)
    • types: object with key/value defining the types
    • indexing: object with key/value pairs which prioritize indexing
  • $i reports meta structure types & indexing

4.13.1. Types

By default all keys/values are autotyped: the first create or insert into a collection determines the type of a value. To make sure a value is properly typed and thereby properly indexed (alphanumerical vs numerical), optionally define types:

  • string: value indexed alphanumerical
  • number: value indexed numerical
    • date, time, percent
Note: changing an existing key from one type to another will cause complete re-indexing and slow down overall operations.

Example:

$ixdb->meta("myset","items",{
   types => {
      size => "number",
      uid => "string",
   },
   indexing => { .. }
});
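
The effect of typing on the sort order of an index can be sketched in plain Perl. This is an illustration under assumptions, not the IndexDB implementation: type_of() and sorted_values() are hypothetical helpers, the type is taken from the first value seen for a key (as with autoType), and preset entries in %types play the role of explicitly defined types.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# key => "number" | "string"; preset entries stand for explicitly defined types
my %types = (uid => "string");

# type_of($key,$value): the first use determines the type, later values do not change it
sub type_of {
   my($k,$v) = @_;
   $types{$k} //= looks_like_number($v) ? "number" : "string";
   return $types{$k};
}

# sorted_values($key,@values): numerical or alphanumerical order, depending on the type
sub sorted_values {
   my($k,@v) = @_;
   my $numeric = ($types{$k} // "string") eq "number";
   return $numeric ? (sort { $a <=> $b } @v) : (sort { $a cmp $b } @v);
}

type_of("size", 12);          # first use: size is typed as number
type_of("name", "AA.txt");    # first use: name is typed as string

print join(" ", sorted_values("size", 182, 12, 23)), "\n";   # numerical order
print join(" ", sorted_values("name", 182, 12, 23)), "\n";   # the same values in alphanumerical order
```

The second line of output shows why a numeric key which was accidentally typed as string gives surprising range and sort results.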

4.13.2. Indexing

By default all keys/values are indexed. To omit a key from indexing or to index it specially, define indexing with the priority (or level) of the key and optionally the index-type:

  • by default all keys are indexed
  • define priority as 0 to skip indexing the key; any positive integer indicates the priority
  • optionally define the index-type(s)
priority + [ ':' + index-type1 + [ ',' + index-type2 ... ] ]

Examples:

0
1
1:i
1:i,e
1:loc
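
The format above is simple enough to be taken apart with one regular expression. The following is a minimal sketch; parse_indexing() is a hypothetical helper for illustration, not part of the IndexDB API, and it assumes index-types consist of lowercase letters only.

```perl
use strict;
use warnings;

# parse_indexing("1:i,e") -> { priority => 1, types => ["i","e"] }, undef if malformed
sub parse_indexing {
   my($spec) = @_;
   return undef unless defined $spec;
   return undef unless $spec =~ /^(\d+)(?::([a-z]+(?:,[a-z]+)*))?$/;
   my($prio,$types) = ($1,$2);
   return +{
      priority => $prio + 0,
      types    => [ defined $types ? split(/,/, $types) : () ],
   };
}

for my $s ("0", "1", "1:i", "1:i,e", "1:loc") {
   my $p = parse_indexing($s);
   my $t = join(",", @{ $p->{types} }) || "-";
   printf "%-6s priority=%d types=%s\n", $s, $p->{priority}, $t;
}
```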

Index Types

Note: This part is highly experimental, and might change soon.

  • i: case-insensitive, disregards case in the index; be aware: keys() will return the keys in all lowercase
  • e: tunes for regular expression queries (regex) using an additional trigram index; the size of the index is linear in the length of the value, O(size(v)), e.g. indexing filenames of 5-20 chars will create a 20x larger index and also increase the amount of update writes 20x
  • loc: (coming soon) geohashes the key, which should have lat and long as sub-fields, e.g. location: { lat: .., long: .. }
Note: i and e are combinable, though e implies i functionality.

Example:

$ixdb->meta("myset","items",{
   types => { .. },
   indexing => {
      name => "1:i,e",           # -- case insensitive & regular expression optimized
      tags => "1:i",             # -- case insensitive
      keywords => "1:i",         #       "        "
      image => {
         histocube => 0,         # -- omit indexing this one 
         histogram => {
            h => 0,              #      "             "
            s => 0,              #      etc.
            l => 0,
            a => 0
         }
      }
   }
});
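
Why index-type e grows the index with the length of the value becomes visible when a value is split into trigrams. The following is a minimal sketch, assuming the trigram index stores every lowercase 3-character substring of a value; trigrams() is a hypothetical helper and the actual layout of the index may differ.

```perl
use strict;
use warnings;

# trigrams($value) -> distinct lowercase 3-character substrings, in order of appearance
sub trigrams {
   my($v) = @_;
   $v = lc $v;                      # e implies i: case is disregarded
   my %seen;
   return grep { !$seen{$_}++ } map { substr($v, $_, 3) } 0 .. length($v) - 3;
}

my @t = trigrams("AA.txt");
print scalar(@t), " trigrams: @t\n";
```

A value of n characters yields up to n-2 trigrams, each of which becomes an entry in the additional index, so both index size and update writes grow with the length of the indexed values.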

5. Tuning

5.1. ulimit/limit

All indexes are kept open, so you need to increase the limit of open files from the usual 1024 to at least 10000, or even higher.

Increase the per process file descriptor/open-files limit:

% sudo su
ulimit -n 100000
and in /etc/security/limits.conf:
* soft nofile 10000
* hard nofile 10000

which takes effect on the next login, then

bash:

% ulimit -n 10000
% indexdb server &
csh:
% limit descriptors 10000
% indexdb server &

or change the limits of the running indexdb process:

% ps aux | grep indexdb
kiwi     28199  1.9  0.1  91340 27324 pts/95   S+   Aug24  24:39 /usr/bin/perl ./indexdb server

% sudo prlimit --pid=28199 --nofile=10000:10000

6. Updates

Significant updates of this document: Authors