Describing significant features and improvements of MetaFS:
MongoDB Query and MongoDB Aggregation Pipeline compatible functionality has been implemented
The semantic backend handlers caption (NeuralTalk2) and densecap (DenseCap) are based on complex setups (comprehensive dependencies); these, together with imagetags (Darknet with the ImageNet/Darknet model), run on NVIDIA's GPU CUDA platform via nvidia-docker, which brings a speed-up of 10-90x compared to the CPU-only version:
- caption: 200-300ms per evaluation (CPU 1.5-2 secs, 10x)
- densecap: 500-600ms per evaluation (CPU 16-24 secs, 30x)
- imagetags: 3-4ms per evaluation (CPU 0.2-0.3 secs, 90x)
With that, more heavyweight semantic analysis has also been integrated, in a preliminary state:
- faces: recognize faces (and train new faces) in images
- imagetags: image tagging, recognize various elements of the image
- caption: image caption, general impression of the overall image
- densecap: detailed or deep image caption
- sentiments
- topics
- entities
- music
New semantics handlers:
- entities: recognizes individuals, organizations and companies (~20,000 entities) based on DBpedia.org, Wikidata.org and hand-picked sources.
- music: derives rhythm (bpm), tonality, genres, vocal/instrumental and moods using Essentia.
% mmeta --entities "Amateur Photographer - 2 April 2016.pdf"
entities: [ Nikon, Canon, Olympus, Panasonic, Sony, BP, Facebook, SanDisk, Twitter, Ricoh, Bayer, Hoya, "Jenny Lewis", "Muhammad Ali", Visa, "Ansel Adams", DCC, Eaton, Empire, Google, "Konica Minolta", "Leonardo da Vinci", "Neil Armstrong" ]
Music query examples:
% mfind topics=jazz
... anything related to jazz: text, image, audio etc
% mfind audio.music.rhythm.bpm=~120
... just music with about 120bpm (beats per minute)
% mfind semantics.music.gender=male
... music with male voice
% mfind semantics.music.genres=disco
... just disco for The Martian
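The query expressions above follow a recognizable pattern: a dotted key, an operator, and a value, with =~ for approximate matches and a..b for ranges. As a purely illustrative sketch (not the actual mfind implementation), such expressions could be parsed like this:

```python
import re

# Hypothetical parser for mfind-style expressions; the operator set
# (=, <, >, <=, >=, =~, a..b ranges) is inferred from the examples above.
_EXPR = re.compile(r'^(?P<key>[\w.]+)(?P<op>=~|>=|<=|>|<|=)(?P<value>.+)$')

def parse_query(expr):
    """Split an expression like 'image.theme.blue>50%' into (key, op, value)."""
    m = _EXPR.match(expr)
    if not m:
        raise ValueError("unparsable expression: %r" % expr)
    key, op, value = m.group('key'), m.group('op'), m.group('value')
    if op == '=' and '..' in value:      # range query, e.g. =1..3%
        lo, hi = value.split('..', 1)
        return (key, 'range', (lo, hi))
    if op == '=~':                        # approximate match, e.g. =~120
        return (key, 'approx', value)
    return (key, op, value)

print(parse_query('audio.music.rhythm.bpm=~120'))
# -> ('audio.music.rhythm.bpm', 'approx', '120')
print(parse_query('semantics.music.genres=disco'))
# -> ('semantics.music.genres', '=', 'disco')
```

The separation into key path, operator, and value is what lets a backend translate such expressions into whatever query language it speaks.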
After an extended "alpha" phase, the "beta" state has now led to a rewrite of the core functionality.
posted 2015/12/05 by rkm
MetaFS Paper (PDF) published.
The paper describes MetaFS briefly and summarizes what the About page, the Handbook and the Cookbook describe in more detail.
posted 2015/10/30 by rkm
- set: supported, required for atomic updates in MetaFS::Item
- topics: basic text topicalizer started, heading toward integrating larger ontologies (WordNet, DBpedia, YAGO)
- ocr: image OCR (Optical Character Recognition) to convert scanned pages into text
- html: improved to cover general XML parsing as well
- pdf: supports the explode event type, which extracts all pages as images into sub-nodes; very useful in combination with the ocr semantic handler
- docx: (new) preliminary OpenXML (Microsoft) support, tags, and text.topics
posted 2015/09/12 by rkm
Even though the development continues non-public, I update the web-site and extend the documentation, e.g. Programming Guide.
posted 2015/08/04 by rkm
Comprehensive refactoring of MetaFS has started, de-coupling the former MongoDB-centric code into a more abstract NoSQL layer, which then allows multiple backends to be used instead of MongoDB only. For the time being the development of beta is non-public; it will eventually replace the former alpha (proof of concept). beta is expected early 2016.
MetaFS::IndexDB development has started and first steps toward integration into MetaFS have begun. Eventually it will be released in 2016 with beta; the preliminary API has been published.
IndexDB lifts some of the limitations of the other backends:
Two brief screencasts to show the early state of web-frontend to test the backend of MetaFS:
Note: the Location experiment with "Berlin" consists of two steps: 1) look up the GPS coordinates of "Berlin" and 2) search with those coordinates in the MetaFS volume for GPS-tagged items like photos.
Backend: MetaFS-0.4.10 with MongoDB-3.0.2 WiredTiger on an Intel Core i7-4710HQ CPU @ 2.50GHz (4 cores) with 16GB RAM, running Ubuntu 14.10 64 bit.
MongoDB is working "OK", though it could be better: some queries with additional post-sorting cause immense memory usage and slow-downs. So for casual testing MongoDB is "OK", but MetaFS::IndexDB or PostgreSQL with better JSON support are still preferred once they are mature enough.
Some early tests of a 2D Web-GUI to cover the complexity of querying items:
I queried "Berlin" as example so the Location would provide results as well, "Berlin" is a known location, and some photos have GPS location and those are found too.
The volume was mounted on a HDD not SSD, the lookup time on SSD is about 5-10% of a HDD.
In order to test image indexing (image-handler), a rudimentary interface is used to query a set of 280,000 images:
Web interface will be released later along with a general "data-browser" UI.
metafs.conf and other *.conf files can contain an indexing section where keys are listed with their indexing priority:
indexing sections are currently only considered by the MongoDB (mongo) backend (default); priorities 1-3 are covered, totaling ~59 keys (64 indexes max)
The pg backend supports JSONB (jsonb_path_ops) GIN indexing, which indexes all keys by default; it doesn't support inequality queries yet (an essential missing feature), and mfind is not yet supported:
% metafs --backend=pg alpha Alpha/
% cd Alpha/
% ls
but (for now) you are required to manually create the SQL database for the volume (in this example alpha) in PostgreSQL beforehand:
% sudo -u postgres psql
postgres=# create database metafs_alpha;
postgres=# \q
The expose.indexDirs feature: on or off (default); when on, indexes are listed as directories one can dive into (for now without mls and other metabusy-tools support), with read access (no write yet) under @/:
% metafs --expose.indexDirs=on alpha Alpha/
% cd Alpha
% ls
@/ BB DIR/ open-source-logo.png* Untitled.odg* zero.bin
20130914_140844.jpg* bitcoin.pdf fables_01_01_aesop_64kb.mp3 shakespeare-midsummer-16.txt Untitled.ods*
AA.txt CC Metadata.odt* timings.txt violet_sunset.jpg*
% cd @
% ls
atime/ author/ hash/ keywords/ mime/ name/ parent/ tags/ title/ uid/ video/
audio/ ctime/ image/ location/ mtime/ otime/ size/ text/ type/ utime/
% cd image
% ls
average/ color/ height/ orient/ pixels/ theme/ type/ variance/ width/
% cd orient
% ls
landscape/ portrait/ square/
% cd square/
% ls
2dbaeffc27f5d7afc38b6a31a9ca9307-553be760-0d0a9b#open-source-logo.png
...
% metafs --mongo.port=27020 alpha Alpha/
% cd Alpha/
% ls
for using MongoDB running at e.g. port 27020 (default 27017); this simplifies running multiple MongoDBs (MongoDB 2.6.1, MongoDB 3.0.2 MMAPv1 or WiredTiger) or MongoDB-compatible DBs (TokuMX, or ToroDB once it is more mature) without requiring a dedicated metafs.conf for each volume.
The default backend of MetaFS is MongoDB, a quasi-standard NoSQL database. Yet during testing of the "proof of concept" some significant shortcomings showed up:
|                                   | MongoDB-3.0.2  | TokuMX-2.0.1   | PostgreSQL-9.4.1 | MetaFS::IndexDB-0.1.7 |
| NoSQL state:                      | mature         | mature         | immature[1]      | experimental          |
| ACID:                             | no             | yes            | yes              | no[2]                 |
| internal format:                  | BSON           | BSON           | JSONB            | JSON + binary         |
| inequality functions:             | full           | full           | none[3]          | full                  |
| memory usage:                     | heavy[4]       | heavy          | light            | light                 |
| control memory usage:             | no[5]          | yes            | yes              | no                    |
| max indexes:                      | 64             | 64             | unlimited        | unlimited             |
| insertion speed:[6]               | fast (2000/s)  | fast (2000/s)  | moderate (200/s) | moderate (200/s)      |
| lookup speed:                     | slow - fast[7] | slow - fast[8] | fast             | fast                  |
| sorting results with another key: | no / yes[9]    | no / yes[10]   | yes              | no                    |
Efforts to imitate MongoDB's NoSQL functionality with a PostgreSQL backend:
I will update this blog post when other DBs are considered and benchmarked.
updated 2015/04/22 by rkm: Adding Postgres-9.4.1 with GIN index benchmark.
Background
The "proof of concept" of MetaFS (0.4.8) with MongoDB-2.6.1 (or 3.0.2) has a limit of 64 indexes per collection (which is the equivalent of a volume), which is too limiting for real-life usage. To be specific, 64 keys per item (all being indexed) might be sufficient, yet there are different kinds of items with different keys, so 64 keys for a wide range of items are reached quickly.
Setup
The machine has an Intel Core i7-4710HQ CPU 2.5GHz (4 cores), 16GB RAM and a 256GB LITEON IT L8T-256L9G (H881202) SSD, running Ubuntu 14.10 64 bit (kernel 3.16.0-33-generic), with a CPU load of 2.0 as baseline; in other words, an already busy system is used to simulate a more real-world scenario.
Two test setups are benchmarked with 1 million JSON documents where all keys are fully indexed:
MetaFS::IndexDB-0.1.7, a custom database in development for MetaFS, using a BerkeleyDB 5.3 B-Tree as key/value index storage and saving JSON as flat files, starts fast with 600+ inserts/s and then steadily declines to 230 inserts/s after 1 million entries.
Postgres-9.4.1 with dedicated key indexes performs quite well for the first 100K entries and then steeply declines to 95 inserts/s; after 400K entries the benchmark was aborted. Since 9.4 NoSQL features are included, see Linux Journal: PostgreSQL, the NoSQL Database (2015/01).
CREATE TABLE items (data JSONB)
INSERT INTO items (data) values (?)
INSERT INTO items (data) values (?)
INSERT INTO items (data) values (?)
...
and the data structure is JSON with 30-40 keys, all keys are recorded and indexes requested if not yet created:
CREATE INDEX ON items ((data->>'atime'))
CREATE INDEX ON items ((data->>'ctime'))
CREATE INDEX ON items ((data->>'mtime'))
CREATE INDEX ON items ((data->'image'->>'width'))
CREATE INDEX ON items ((data->'image'->>'height'))
CREATE INDEX ON items ((data->'image'->'theme'->>'black'))
...
until all keys of all items are indexed.
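The bookkeeping described above ("all keys are recorded and indexes requested if not yet created") can be sketched in a few lines of Python. This is an illustration of the idea, not the actual benchmark code; the table and column names match the statements above:

```python
def index_statements(doc, seen, path=()):
    """Walk a JSON document and yield a CREATE INDEX statement
    for every key path not already in `seen`."""
    for key, value in doc.items():
        cur = path + (key,)
        if isinstance(value, dict):
            yield from index_statements(value, seen, cur)
        elif cur not in seen:
            seen.add(cur)
            # e.g. ('image', 'width') -> data->'image'->>'width'
            expr = "data" + "".join("->'%s'" % p for p in cur[:-1]) \
                 + "->>'%s'" % cur[-1]
            yield "CREATE INDEX ON items ((%s))" % expr

seen = set()
doc = {"atime": 1, "image": {"width": 640, "height": 480}}
for stmt in index_statements(doc, seen):
    print(stmt)
# -> CREATE INDEX ON items ((data->>'atime'))
# -> CREATE INDEX ON items ((data->'image'->>'width'))
# -> CREATE INDEX ON items ((data->'image'->>'height'))
```

With 30-40 keys per document, this quickly accumulates dozens of dedicated indexes, each of which Postgres must update on every insert.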
Postgres-9.4.1 with a GIN (Generalized Inverted Index) index indexes all JSON keys by default, with no need to track the keys, and it performs quite well, similar to IndexDB; no surprise, since both use B-Trees.
CREATE TABLE items (data JSONB)
CREATE INDEX ON items USING GIN (data jsonb_path_ops)
Conclusion
The "5s Average" reveals more details: MetaFS::IndexDB inserts remain fast, like Postgres with GIN, yet the syncing to disk every 30 secs (can be altered) takes longer and longer with IndexDB and becomes CPU & IO intensive, causing the overall average to go down; whereas Postgres with dedicated key indexes falls off quicker, likely syncing immediately to disk (it has 5 processes handling the DB), and is less demanding on IO, as the system remains more responsive.
IndexDB-0.1.7 supports the inequality operators <, <=, >, >=, != and ranges, whereas Postgres-9.4.1 with jsonb_path_ops can only query (SELECT) on value or existence of keys, but not on inequalities yet.
Outlook
With every additionally indexed key the amount of data to be written increases: O(n_keys). Additionally, the BerkeleyDB documentation says about the B-Tree access method:
"Insertions, deletions, and lookups take the same amount of time: the time taken is of the order of O(log_B N), where B equals the number of records per page, and N equals the total number of records."
which the benchmark pretty much confirms.
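To make the O(log_B N) claim concrete, here is a small back-of-the-envelope model in Python (the records-per-page value is illustrative; the real B depends on page and record sizes):

```python
def btree_depth(n_records, records_per_page):
    """Approximate number of B-Tree levels needed for n_records,
    assuming records_per_page entries fit on each page."""
    depth, capacity = 1, records_per_page
    while capacity < n_records:
        capacity *= records_per_page
        depth += 1
    return depth

# With e.g. 100 records per page, going from 1 to 10 million records
# adds only one more level to the tree:
print(btree_depth(1_000_000, 100))    # -> 3
print(btree_depth(10_000_000, 100))   # -> 4
```

This logarithmic growth is why insert rates decline slowly rather than collapsing as the volume grows.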
If delayed indexing were considered for MetaFS::IndexDB, indexing time would still depend on the actual number of items in a volume, but inserts would be faster at the cost of delayed search capability on the full data. So depending on the use case, either data becomes quickly searchable, or insertion speed stays constant.
IndexDB might scale a bit better past 10 million items, as Postgres+GIN shows a slightly steeper decline in performance - but that has to be confirmed with actual benchmarks.
Both DBs without indexing at all as comparison:
Postgres-9.4.1 stays around 410 inserts/s; MetaFS::IndexDB-0.1.7 starts slow (creating the required directories, approx. 256 x 256), then increases in speed and settles at ~700 inserts/s after 200K. Both DBs stay constant after 100K inserts, MetaFS::IndexDB (1.5 - 1.7x) being a bit faster than Postgres (1x). The 5s average (not shown) fluctuates between 200 - 1200 inserts/s for MetaFS::IndexDB, and 150 - 700 inserts/s for Postgres.
The image-handler got a major improvement; actually it's not much code that I added, but the little code is quite impressive in its effect, for example:
You can search images based on their color theme; the main colors are stored under image.theme.* and add up to 100% or 1.0:
theme: {
black: 2.60%
blue: 30.88%
gray: 12.40%
green: 2.31%
orange: 21.85%
white: 20.17%
yellow: 9.18%
}
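A theme like the one above can be produced by classifying every pixel into a named color bucket and normalizing the counts so they sum to 100%. The following Python sketch uses a deliberately crude nearest-color palette (not the actual image-handler algorithm) to show the principle:

```python
# Hypothetical palette; the real image-handler likely uses a refined mapping.
PALETTE = {
    "black":  (0, 0, 0),       "white":  (255, 255, 255),
    "gray":   (128, 128, 128), "blue":   (0, 0, 255),
    "green":  (0, 255, 0),     "orange": (255, 165, 0),
    "yellow": (255, 255, 0),
}

def nearest(rgb):
    """Name of the palette color closest to rgb (squared distance)."""
    return min(PALETTE,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[n])))

def color_theme(pixels):
    """Fraction per named color; the values add up to 1.0 (i.e. 100%)."""
    counts = {}
    for px in pixels:
        name = nearest(px)
        counts[name] = counts.get(name, 0) + 1
    total = len(pixels)
    return {name: n / total for name, n in counts.items()}

# Two bluish and two yellowish pixels -> half blue, half yellow:
theme = color_theme([(10, 10, 200), (20, 30, 220), (250, 240, 10), (240, 250, 30)])
print(theme)  # -> {'blue': 0.5, 'yellow': 0.5}
```

Since the fractions always sum to 1.0, queries like image.theme.blue>50% simply compare against these normalized values.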
So, searching for ocean scenery with the sun present, a lot of blue, and sufficient yellow:
% mfind 'image.theme.blue>50%' 'image.theme.yellow>10%'
which results, in my case, in one single image:
Let's try with a tiny range of yellow:
% mfind 'image.theme.blue>50%' 'image.theme.yellow=1..3%'
Although I've got the ocean (beach) scenery, there is no sun - perhaps orange would be more accurate:
% mfind 'image.theme.blue>50%' 'image.theme.orange>10%'
You get the idea of how searching with colors works; I deliberately included my moderately successful results with a very small image library of 800+ images.
So, for now only the mfind command line interface (CLI) is available, but I'm working on a web demo which will focus on image-handler features, providing a full-featured image search facility.
More Image Metadata
There is much more image.* metadata to query:
- type (icon, illustration, photo etc)
- size.ratio (4/3, 16/9)
- pixels (5M)
- color.type (bw, grayscale, limited, full)
PS: I am going to update this blog post and rerun the lookups with a larger image library (280K images).
Well, there are so many new additions since the last update back in November 2014; let's focus on marc, the tool and archive format which allows you to create, add to, and extract items from an archive:
% marc cv ../my.marc .
creates the archive my.marc
of the current directory, with verbosity enabled, and saves it one directory above.
All items are saved with full metadata, so one can transfer volumes easily.
Transfer a volume from machine to another via ssh
:
% marc cvz - . | (ssh a01.remote.com "cd alpha/; marc cv -")
Note: it's not a typo that the 2nd marc has no z, as the stream identifies itself as being compressed. It also means that when using the z flag you still use the .marc extension; there is no need to add .gz or the like.
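A stream "identifying itself" as compressed typically works via magic bytes; gzip streams, for example, start with 0x1f 0x8b. Here is a minimal Python sketch of this kind of detection (marc's actual stream handling may differ):

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def read_maybe_compressed(stream):
    """Peek at the first two bytes and transparently decompress gzip input."""
    head = stream.read(2)
    rest = head + stream.read()
    if head == GZIP_MAGIC:
        return gzip.decompress(rest)
    return rest

raw = b"archive payload"
compressed = io.BytesIO(gzip.compress(raw))
plain = io.BytesIO(raw)
print(read_maybe_compressed(compressed))  # -> b'archive payload'
print(read_maybe_compressed(plain))       # -> b'archive payload'
```

Because the detection happens on the receiving side, the sender never needs to signal compression explicitly - which is exactly why the second marc needs no z flag.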
For the full documentation on marc, consult the Handbook.
First post, covering version 0.4.7; for the past 3 months I haven't updated the github repository, as so many changes and updates occurred:
- marc: works like tar, including optional built-in compression, to store items with full metadata on any media