
Blog

Describing significant features and improvements of MetaFS:

MetaFS Query & Aggregation Pipeline: Core & GUI

posted 2016/11/26 by rkm

Functionality compatible with MongoDB Query and the MongoDB Aggregation Pipeline has been implemented; see the detailed documentation for the current state of support.
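
As a rough sketch of what such a pipeline looks like (the --pipeline option and the mime key are assumptions for illustration; the stage operators are standard MongoDB Aggregation Pipeline syntax):

% mfind --pipeline '[
    { "$match": { "image.theme.blue": { "$gt": 0.5 } } },
    { "$group": { "_id": "$mime", "count": { "$sum": 1 } } },
    { "$sort":  { "count": -1 } }
  ]'
 ... count of mostly-blue items per MIME type, most frequent first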

MetaFS::Cloud on GPU via Docker

posted 2016/09/09 by rkm

The semantic backend handlers caption (NeuralTalk2) and densecap (DenseCap) are based on complex setups (comprehensive dependencies); running them, and also imagetags (Darknet with the ImageNet/Darknet model), on NVIDIA's CUDA GPU platform via nvidia-docker brings a speed-up of 10-90x compared to the CPU-only version:

Using an NVIDIA GTX 1070 with 8GB GDDR5, all three services are implemented as REST services, loading their model once and evaluating images on request via the REST API.
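
A minimal sketch of how one of these services could be brought up and queried (the container image name, port, endpoint path and form field are assumptions, not the actual MetaFS::Cloud configuration):

% nvidia-docker run -d -p 5000:5000 metafs/densecap-rest
% curl -s -F "image=@beach.jpg" http://localhost:5000/densecap
 ... returns regions & captions for the uploaded image as JSON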

MetaFS::Cloud & Semantics Extended

posted 2016/05/01 by rkm

MetaFS::Cloud in enterprise application
In order to support complex and large deployments of MetaFS, some computationally intensive and complex services can be distributed via MetaFS::Cloud.

Along with this, more heavyweight semantic analysis has been preliminarily integrated:

The existing semantic analysis can now also be distributed within a private cloud infrastructure.

Semantics Extended

posted 2016/04/05 by rkm

New semantics handlers:

WebGUI close-up: topics & entities
Both are documented at Semantics.
Entities example:
% mmeta --entities "Amateur Photographer - 2 April 2016.pdf"
  entities: [ Nikon, Canon, Olympus, Panasonic, Sony, BP, Facebook, SanDisk, Twitter, Ricoh, Bayer, Hoya, "Jenny Lewis", "Muhammad Ali", Visa, "Ansel Adams", DCC, Eaton, Empire, Google, "Konica Minolta", "Leonardo da Vinci", "Neil Armstrong" ]

Music query examples:

% mfind topics=jazz
 ... anything related to jazz: text, image, audio etc
% mfind audio.music.rhythm.bpm=~120
 ... just music with about 120bpm (beats per minute)
% mfind semantics.music.gender=male
 ... music with male voice
% mfind semantics.music.genres=disco
 ... just disco for The Martian

Partial Rewrite of the Core

posted 2016/02/28 by rkm

After an extended "alpha" phase, the step to "beta" has led to a rewrite of the core functionality to reach production quality of the software. The "alpha" phase was a pure implementation of the MetaFS.org concept, whereas the "beta" will trade some of the features for the sake of stability. More details will be posted later.

MetaFS Paper Published (Preview)

posted 2015/12/05 by rkm

MetaFS Paper (PDF) published.

The paper briefly describes MetaFS and summarizes what the About features, the Handbook, and the Cookbook describe in more detail.

IndexDB & Semantic Handlers

posted 2015/10/30 by rkm

Updating Web-Site

posted 2015/09/12 by rkm

Even though development continues non-publicly, I am updating the web-site and extending the documentation, e.g. the Programming Guide.

Working on Stable

posted 2015/08/04 by rkm

A comprehensive refactoring of MetaFS has started: decoupling the formerly MongoDB-centric code into a more abstract NoSQL layer, which then allows multiple backends to be used instead of just MongoDB. For the time being the development of MetaFS beta is non-public; it will eventually replace the former alpha (proof of concept).

MetaFS beta is expected early 2016.

MetaFS::IndexDB

posted 2015/07/18 by rkm

MetaFS::IndexDB development has started and the first steps of integration into MetaFS have begun. It will eventually be released in 2016 with MetaFS beta; the preliminary API has been published.

IndexDB lifts some of the limitations of the other backends:

File Browser & Image Query Screencast

posted 2015/05/11 by rkm

Two brief screencasts show the early state of the web-frontend used to test the MetaFS backend:

Note: the Location experiment with "Berlin" consists of two steps: 1) look up the GPS coordinates of "Berlin", and 2) search with those coordinates in the MetaFS volume for GPS-tagged items like photos.
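
Step 2 expressed on the CLI would roughly look like the following range query (the gps.latitude/gps.longitude key names and the bounding box are assumptions):

% mfind 'gps.latitude=52.3..52.7' 'gps.longitude=13.0..13.8'
 ... GPS-tagged items within a rough bounding box around Berlin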

Backend: MetaFS-0.4.10 with MongoDB-3.0.2 WiredTiger on an Intel Core i7-4710HQ CPU @ 2.50GHz (4 cores) with 16GB RAM, running Ubuntu 14.10 64 bit.

MongoDB is working "OK", though it could be better: some queries with additional post-sorting cause immense memory usage and slow-downs. So for casual testing MongoDB is "OK", but MetaFS::IndexDB or PostgreSQL with better JSON support are still preferred once they are mature enough.

General Query

posted 2015/05/10 by rkm

Some early tests of a 2D Web-GUI to cover the complexity of querying items:

Query "Berlin" with query settings
Query "Berlin" results
Hierarchical-iconic view

I queried "Berlin" as an example so that Location would provide results as well: "Berlin" is a known location, and some photos have GPS locations, so those are found too.

The volume was mounted on an HDD, not an SSD; the lookup time on an SSD is about 5-10% of that on an HDD.

Image Query

posted 2015/04/29 by rkm

In order to test image indexing (image-handler), a rudimentary interface is used to query a set of 280,000 images (a few of these expressed as mfind queries follow the list):

much white
much black
red & black
yellow, green & violet
partially transparent
low brightness
high brightness
low saturation (grayish)
high saturation
low color diversity
high color diversity
black & white
grayscale
violet & portrait
~10MPixels
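
A few of these expressed as mfind queries (a sketch: the image.theme.* keys appear elsewhere in this blog, whereas image.saturation, image.width/image.height and the thresholds are assumptions):

% mfind 'image.theme.white>60%'
 ... much white
% mfind 'image.theme.red>20%' 'image.theme.black>20%'
 ... red & black
% mfind 'image.saturation<10%'
 ... low saturation (grayish)
% mfind 'image.width>=3500' 'image.height>=2800'
 ... ~10MPixels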

The web interface will be released later along with a general "data-browser" UI.

Changelog 0.4.9

posted 2015/04/26 by rkm

MongoDB vs Others

posted 2015/04/24 by rkm

The default backend of MetaFS is MongoDB, a quasi-standard NoSQL database. Yet while testing the "proof of concept" some significant shortcomings showed up:

otherwise MongoDB is fast, easy to use, and supports JSON natively. The alternatives with their (dis-)advantages:

                                    MongoDB-3.0.2    TokuMX-2.0.1     PostgreSQL-9.4.1   MetaFS::IndexDB-0.1.7
NoSQL state:                        mature           mature           immature[1]        experimental
ACID:                               no               yes              yes                no[2]
internal format:                    BSON             BSON             JSONB              JSON + binary
inequality functions:               full             full             none[3]            full
memory usage:                       heavy[4]         heavy            light              light
control memory usage:               no[5]            yes              yes                no
max indexes:                        64               64               unlimited          unlimited
insertion speed:[6]                 fast (2000/s)    fast (2000/s)    moderate (200/s)   moderate (200/s)
lookup speed:                       slow - fast[7]   slow - fast[8]   fast               fast
sorting results with another key:   no / yes[9]      no / yes[10]     yes                no

  1. PostgreSQL-9.4.1 doesn't support in-place key/value updates yet
  2. IndexDB-0.1.7 doesn't support transactions yet, but is planned
  3. Postgres-9.4.1 with GIN index supports no inequality comparisons
  4. MongoDB-3.0.2 virtual memory usage can be huge (hundreds of GBs), resident memory use 10-20% of physical RAM
  5. MongoDB-3.0.2 lets the OS control the overall memory usage
  6. tested on the same machine (4 core, 16GB RAM, 256 GB SSD)
  7. MongoDB-3.0.2, lookup speed depends whether key is indexed
  8. TokuMX, like MongoDB
  9. MongoDB-3.0.2: only when key is indexed
  10. TokuMX-2.0.1: only when key is indexed

Efforts to imitate MongoDB's NoSQL functionality with a PostgreSQL backend:

A few important aspects for the MetaFS backend DB, listed by priority:
  1. low memory usage: it's a "rich" filesystem, yet it can't or shouldn't use significant portions of memory by itself
  2. fast lookup: once the data is inserted into the filesystem (and indexed) it should be fast to look up, hence the fully indexed approach:
  3. fully indexed: preferably all keys are indexed[1]
  4. fast insertion: decent insertion speed is still important but not the 1st priority; insertion has several steps:
    1. insertion of data & metadata,
    2. indexing of the metadata,
    3. extraction of metadata from the data by handlers (delayed/queued execution)
  1. currently, as of 0.4.9, keys can be prioritized for indexing; depending on the backend, either all keys or only certain prioritized keys are indexed

I will update this blog post as other DBs are considered and benchmarked.

Fully Indexed DB Tests

posted 2015/04/12 by rkm

updated 2015/04/22 by rkm: Adding Postgres-9.4.1 with GIN index benchmark.

Background

The "proof of concept" of MetaFS (0.4.8) with MongoDB-2.6.1 (or 3.0.2) has a limit with 64 indexes per collection (which is the equivalent of a MetaFS volume) which is too limiting for real life usage. To be specific, 64 keys per item (and being indexed) might sufficient, yet, there are different kind of items with different keys so 64 keys for a wide-range of items are reached quickly:

16 + (6..15) + (19..27) + 7 + 6 + 4 + (3..10) = 22..43 keys per item, total 85 different keys.

Setup

The machine has an Intel Core i7-4710HQ CPU 2.5GHz (4 cores), 16GB RAM, a 256GB LITEON IT L8T-256L9G (H881202) SSD, running Ubuntu 14.10 64 bit (kernel 3.16.0-33-generic), with a CPU load of 2.0 as base line; in other words, an already busy system is used to simulate a more real-world scenario.

Two test setups are benchmarked with 1 million JSON documents where all keys are fully indexed:

Full Indexing Average (linear)
Full Indexing Average (log)
Full Indexing (5s Average, linear)

MetaFS::IndexDB-0.1.7, a custom database in development for MetaFS using the BerkeleyDB 5.3 B-Tree as key/value index storage and saving JSON as flat files, starts fast at 600+ inserts/s and then steadily declines to 230 inserts/s after 1 million entries.

Postgres-9.4.1 with dedicated per-key indexes performs quite well for the first 100K entries and then declines steeply to 95 inserts/s; the benchmark was aborted after 400K entries. Since 9.4, NoSQL features are included; see Linux Journal: PostgreSQL, the NoSQL Database (2015/01).

CREATE TABLE items (data JSONB)
INSERT INTO items (data) values (?)
INSERT INTO items (data) values (?)
INSERT INTO items (data) values (?)
...
and the data structure is JSON with 30-40 keys; all keys are recorded and indexes are requested if not yet created:
CREATE INDEX ON items ((data->>'atime'))
CREATE INDEX ON items ((data->>'ctime'))
CREATE INDEX ON items ((data->>'mtime'))
CREATE INDEX ON items ((data->'image'->>'width'))
CREATE INDEX ON items ((data->'image'->>'height'))
CREATE INDEX ON items ((data->'image'->'theme'->>'black'))
...
until all keys of all items are indexed.

Postgres-9.4.1 with a GIN (Generalized Inverted Index) index covers all JSON keys by default, so there is no need to track the keys, and it performs quite well, similar to IndexDB; no surprise, since both use B-Trees.

CREATE TABLE items (data JSONB)
CREATE INDEX ON items USING GIN (data jsonb_path_ops)

Conclusion

The "5s Average" reveals more details: MetaFS::IndexDB inserts remain fast alike Postgres with GIN, yet, the syncing to disk every 30 secs (can be altered) with IndexDB gets longer and CPU & IO intensive and causing the overall average to go down, whereas Postgres with dedicated keys falls quicker and likely syncing immediately to disk (has 5 processes handling the DB), and is less demanding on the IO as the system remains more responsive.

IndexDB-0.1.7 supports the inequality operators < <= > >= != and ranges, whereas Postgres-9.4.1 with jsonb_path_ops can only query (SELECT) on value equality or existence of keys, but not on inequalities yet.
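
For illustration, the kind of lookup that jsonb_path_ops does cover, a containment query via psql (the database name metafs and the mime key are assumptions; the items table is from the setup above):

% psql metafs <<'SQL'
SELECT count(*) FROM items WHERE data @> '{"mime": "image/jpeg"}';
SQL
 ... the @> containment operator is what the jsonb_path_ops GIN index accelerates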

Outlook

With every additionally indexed key, the amount of data to be written increases: O(n_keys). Additionally, the BerkeleyDB documentation says regarding the B-Tree access method:

Insertions, deletions, and lookups take the same amount of time: the time taken is of the order of O(log_B N), where B equals the number of records per page, and N equals the total number of records.
which the benchmark pretty much confirms.
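
To put rough numbers on it (assuming, say, B = 100 records per page): going from N = 100K to N = 1 million entries raises the cost only from about log_100(100,000) = 2.5 to log_100(1,000,000) = 3 page accesses per operation, which fits the slow, steady decline of the indexed insert rate rather than a collapse.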

If delayed indexing were considered for MetaFS::IndexDB, the indexing time would still depend on the total number of items in a volume, but inserts would be faster at the cost of delayed search capability on the full data. So, depending on the use case or choice, either data becomes searchable quickly, or insertion speed stays constant.

IndexDB might scale a bit better past 10 million items, as Postgres+GIN shows a slightly steeper decline in performance - but that has to be confirmed with actual benchmarks.

Both DBs without any indexing, for comparison:

No Indexing (linear)
No Indexing (log)

Postgres-9.4.1 stays around 410 inserts/s; MetaFS::IndexDB-0.1.7 starts slow (creating the required directories, approx. 256 x 256), then increases in speed and drops to ~700 inserts/s after 200K. Both DBs stay constant after 100K inserts, MetaFS::IndexDB (1.5 - 1.7x) a bit faster than Postgres (1x). The 5s averages (not shown) fluctuate between 200 - 1200 inserts/s for MetaFS::IndexDB, and 150 - 700 inserts/s for Postgres.

Image (Color) Theme

posted 2015/03/27 by rkm

The image-handler got a major improvement; it's actually not much code that I added, but the little code is quite impressive in its effect, for example:

You can search images based on their color theme, i.e. the main colors:

all parts, if visibly present[1] in the image, are listed in image.theme.* and add up to 100% or 1.0

Sunflowers (Nakae@Flickr)
theme: { 
   black: 2.60%
   blue: 30.88%
   gray: 12.40%
   green: 2.31%
   orange: 21.85%
   white: 20.17%
   yellow: 9.18%
}

So, searching for ocean scenery with the sun present, a lot of blue, and sufficient yellow:

% mfind 'image.theme.blue>50%' 'image.theme.yellow>10%'
which, in my case, results in one single image:

Let's try with a tiny range of yellow:

% mfind 'image.theme.blue>50%' 'image.theme.yellow=1..3%'

Some yellow matches are obvious; the last image matches because of the palm trunks and partially the palm leaves: brown, which is a lower saturated yellow. So, you query with the saturated color names, but the match includes less saturated and brighter aspects of that color - keep this in mind in this context.

Although I've got the ocean (beach) scenery, there is no sun - perhaps orange would be more accurate:

% mfind 'image.theme.blue>50%' 'image.theme.orange>10%'

You get the idea of how searching with colors works; I deliberately included my moderately successful results with a very small image library of 800+ images.

So, for now only the mfind command line interface (CLI) is available, but I am working on a web demo which will focus on image-handler features, providing a full-featured image search facility.

More Image Metadata

There are many more image.* metadata to query:

see Cookbook section on Images.

PS: I am going to update this blog post and rerun the lookups with a larger image library (280K images).

  1. very dark colors become black, very bright colors become white, and low saturated colors become gray

MetaFS Archive (marc)

posted 2015/03/26 by rkm

Well, there are so many new additions since the last update back in November 2014; let's focus on marc, the tool and archive format which allows you to create, add to, and extract items from an archive:

% marc cv ../my.marc .
creates the archive my.marc of the current directory, with verbosity enabled, and saves it one directory above. All items are saved with full metadata, so one can transfer MetaFS volumes easily.

Transfer a volume from machine to another via ssh:

% marc cvz - . | (ssh a01.remote.com "cd alpha/; marc cv -")
Note: it's not a typo that the 2nd marc has no z, as the stream identifies itself as being compressed. It also means that when using the z flag you still use the .marc extension; there is no need to add .gz or the like.

For the full documentation on marc, consult the Handbook.

Changelog 0.4.7

posted 2015/03/25 by rkm

First post covering version 0.4.7; for the past 3 months I haven't updated the github repository as so many changes and updates occurred:

I also started to develop MetaFS::IndexDB, which will eventually (around 0.6.x) replace MongoDB/TokuMX: they were OK to develop the "proof of concept", but now that it's getting more serious, MongoDB is just too memory consuming for this use case (see the FAQ too). Therefore MetaFS::IndexDB is focused on covering the required functionality while having a very small memory footprint.

Authors