Luca Garulli - The Zion City (OrientDB)

Tuesday, October 19, 2010

When someone is criticizing something you care...

When someone criticizes something you care about, don't defend it to the death; ask what their idea is to improve it! You will both grow. Lvc@

Thursday, October 14, 2010

A standard for GraphDB

Until recently, GraphDBs covered a niche market. Lately things have changed, and they are becoming more popular than ever. This is due to:

The ever-growing need for strongly interconnected data. What better than a GraphDB for that?
Applications are much more "social" than before, and social means relationships among subjects, which map perfectly onto the Graph model.
Performance on large datasets. You can live without a pure GraphDB, and keep using a Relational DBMS, until you need really fast access to your data.
The NoSQL movement has done a lot to open developers' minds to DBMSs that differ (sometimes greatly) from the Relational model.

The GraphDB has few but strong concepts. Graphs are composed of:

Vertex or Node, the linked entity. Vertices can have properties.
Edge or Arc, the link between Vertices. Edges can have properties and can be unidirectional or bidirectional.
Property, a value assigned to Vertices and Edges. A property has a name and a value.
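
The three concepts above can be sketched in a few lines of Java. This is only a minimal in-memory illustration: the class and method names (PropertyGraphSketch, Element, link) are invented for this example and are neither the OrientDB API nor the Blueprints interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names for illustration; this is not the OrientDB or Blueprints API.
final class PropertyGraphSketch {

    /** Shared behaviour: both vertices and edges carry named properties. */
    static class Element {
        private final Map<String, Object> properties = new HashMap<>();

        void setProperty(String name, Object value) { properties.put(name, value); }
        Object getProperty(String name) { return properties.get(name); }
    }

    /** A vertex (node) knows the edges that leave it and the edges that reach it. */
    static final class Vertex extends Element {
        final List<Edge> outEdges = new ArrayList<>();
        final List<Edge> inEdges = new ArrayList<>();
    }

    /** An edge (arc) links two vertices and has a label. */
    static final class Edge extends Element {
        final Vertex from;
        final Vertex to;
        final String label;

        Edge(Vertex from, Vertex to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }
    }

    /** Creates a unidirectional edge and registers it on both endpoints. */
    static Edge link(Vertex from, Vertex to, String label) {
        Edge edge = new Edge(from, to, label);
        from.outEdges.add(edge);
        to.inEdges.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        Vertex alice = new Vertex();
        alice.setProperty("name", "Alice");
        Vertex bob = new Vertex();
        bob.setProperty("name", "Bob");

        Edge knows = link(alice, bob, "knows");
        knows.setProperty("weight", 1);

        // Traverse from a vertex to its neighbours by following outgoing edges.
        for (Edge e : alice.outEdges) {
            System.out.println(alice.getProperty("name") + " -" + e.label + "-> " + e.to.getProperty("name"));
        }
    }
}
```

In this sketch a bidirectional relationship can be read by walking inEdges as well as outEdges, or modelled as two opposite edges; real products make their own choice here.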

Each GraphDB product is based on the above concepts, but features can differ widely among them. Furthermore, there is no standard for them, so an application can't migrate from one product to another at zero or low cost.

This is why I want to introduce TinkerPop Blueprints. This Open Source project aims to provide a shared set of basic interfaces that abstract the concepts of Graph, Vertex, Edge and Property. In this way applications can easily migrate to any other available implementation of the specification.

Furthermore, TinkerPop has created a complete stack of tools and frameworks to work with Graphs. The main ones are:

Gremlin is a Turing-complete, graph-based programming language designed for key/value-pair multi-relational graphs. Gremlin makes use of an XPath-like syntax to support complex graph traversals. This language has application in the areas of graph query, analysis, and manipulation.
Rexster is a RESTful graph shell that exposes any Blueprints graph as a standalone server. Extensions support standard traversal goals such as search, score, rank, and, in concert, recommendation. Rexster makes extensive use of Blueprints, Pipes, and Gremlin. In this way it's possible to run Rexster over various graph systems.
Pipes is a graph-based data flow framework for Java 1.6+. A process graph is composed of a set of process vertices connected to one another by a set of communication edges. Pipes supports the splitting, merging, and transformation of data from input to output.

(1) What do you think about this effort to standardize GraphDBs?

(2) What is missing today in the TinkerPop Blueprints API?

(3) Does a .NET port make sense?

Tuesday, September 28, 2010

Working on large datasets in OrientDB

I have been really busy over the last 5 days because I was working on a major rewrite of the RB+Tree algorithm behind the OrientDB index system. It all started some days ago when, in the OrientDB Group, the user Mark Harwood shared his first experiences importing the entire Internet Movie Database (IMDb) into an OrientDB instance. We're talking about 13 million Vertices and about 8 million Edges.

Well, performance seemed good, especially in comparison with Neo4J, but the insertion time was not linear, and after a few million records the engine slowed down a lot. Furthermore, a nasty bug in the management of multiple segments broke the import job after a few minutes on his machine, while on my PC everything went fine.

This is why I stopped developing the Cluster feature in favor of a thorough review of the RB+Tree algorithm. Indexing has always been fast, but Mark Harwood's case stressed the engine as no one had until now. The reason lies in the import process itself, which executes random reads against the index, while the RB+Tree seemed optimized for sequential insertion and retrieval.

I'm still fixing the pending issues, but the first benchmarks show an indexing speed improvement between 100% and 800%! Furthermore, times are now really linear: inserting the first million Vertices takes the same time as inserting the last million. This is a great result for the already super-fast OrientDB engine.
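
One way to check whether insertion time stays flat is to time fixed-size batches and compare the first ones with the last ones. The sketch below is only a rough illustration of that measurement: java.util.TreeMap (a red-black tree) stands in for the index, the keys are random to mimic out-of-order access, and the class and method names are invented. It is not the OrientDB index and it does not reproduce the figures above.

```java
import java.util.Random;
import java.util.TreeMap;

// A rough sketch: java.util.TreeMap (a red-black tree) stands in for the index.
// It is NOT the OrientDB index and says nothing about OrientDB's real numbers.
final class InsertTimingSketch {

    /** Inserts batches of random keys and returns the elapsed nanoseconds of each batch. */
    static long[] timeBatches(TreeMap<Long, Long> index, int batches, int batchSize, long seed) {
        Random random = new Random(seed);
        long[] elapsed = new long[batches];
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
                long key = random.nextLong(); // random keys, so the tree is hit out of order
                index.put(key, key);
            }
            elapsed[b] = System.nanoTime() - start;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> index = new TreeMap<>();
        long[] elapsed = timeBatches(index, 10, 100_000, 42L);
        for (int b = 0; b < elapsed.length; b++) {
            System.out.printf("batch %d: %d ms%n", b + 1, elapsed[b] / 1_000_000);
        }
        // If insertion scales well, the last batches take roughly as long as the first ones.
    }
}
```

The first batch usually includes JVM warm-up, so it is safer to compare the later batches with each other than with the very first one.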

Once the test phase is finished (with new unit tests, of course), I'll commit all the changes to SVN so the community can review them before the next release.

Thursday, September 16, 2010

Started the development of OrientDB clustering

Today is a new day, the day after the official release of OrientDB version 0.9.22. I apologize to everyone who requested issues that were planned for this release but are not yet finished and have been postponed to 0.9.24. What is the reason?

---> Replication, Clustering, Fault-Tolerance <---

It seems that the most missed feature in OrientDB is support for clustering, and therefore the high scalability, high availability and high transaction volumes that a single node can't handle. Over the last months I have studied the different clustering architectures of other NoSQL solutions.

Today is a new day because I'll start the development of clustering for OrientDB with the following features:

Master-Slaves topology, with only one Master and N Slaves. If the Master crashes, a Slave is elected as the new Master
IP multicast to discover cluster nodes
Configuration of nodes using TCP/IP, useful for Clouds that don't allow IP multicast
Two sync modes: full, where the whole database is compressed and sent over the network, and partial, where only the changes made since the last sync are sent
A new database, handled by the Master OrientDB Server instance, to store all the pending records up to a configurable threshold. Above this threshold the logs are deleted and the node needs a full sync on startup
New console commands to display nodes, listen to clustering messages and elect the master manually
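
The interplay between the two sync modes and the threshold can be sketched as follows. This is only my reading of the plan above turned into a toy model: the names (SyncSketch, PendingLog, modeOnStartup) are hypothetical and nothing here is actual OrientDB code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the planned behaviour; class and method names are invented
// for illustration and are not OrientDB code.
final class SyncSketch {

    enum SyncMode { FULL, PARTIAL }

    /** Keeps the changes a disconnected node has not received yet, up to a threshold. */
    static final class PendingLog {
        private final int threshold;
        private final Deque<String> changes = new ArrayDeque<>();
        private boolean overflowed = false;

        PendingLog(int threshold) { this.threshold = threshold; }

        /** Records one change; once the threshold is exceeded the log is dropped. */
        void append(String change) {
            if (overflowed) {
                return; // the log was already discarded, nothing to keep
            }
            if (changes.size() >= threshold) {
                changes.clear();
                overflowed = true;
                return;
            }
            changes.add(change);
        }

        /** A node can catch up with only the changes if the log is still intact. */
        SyncMode modeOnStartup() {
            return overflowed ? SyncMode.FULL : SyncMode.PARTIAL;
        }

        int size() { return changes.size(); }
    }

    public static void main(String[] args) {
        PendingLog log = new PendingLog(3);
        log.append("update #1");
        log.append("update #2");
        System.out.println(log.modeOnStartup()); // PARTIAL: the pending changes fit in the threshold

        log.append("update #3");
        log.append("update #4"); // exceeds the threshold of 3, so the log is dropped
        System.out.println(log.modeOnStartup()); // FULL
    }
}
```

The trade-off is the usual one: a higher threshold lets a node stay offline longer and still catch up with a partial sync, at the cost of more storage on the Master.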

Release 0.9.23 is planned for October 15th, 2010. Stay tuned and contribute with comments, ideas or critiques.

Wednesday, September 15, 2010

Released OrientDB 0.9.22

Hi all,

Many thanks to all the contributors for this new piece of software. A new version of OrientDB has been released after more than a month of work (and holidays).

Starting from this version you can use OrientDB as a pure GraphDB through the custom API (http://code.google.com/p/orient/wiki/GraphDatabase), through the TinkerPop stack (Blueprints, Gremlin, Pipes, etc.), or from the Erlang language with the Cali project (http://github.com/dmitriid/cali).

Direct URL to download: http://orient.googlecode.com/files/orient-database-0.9.22.zip

New features:

- Issue 21: Full text index against schema fields

- Issue 91: SQL UPDATE against collections and maps

- Issue 94: Support for GREMLIN graph language

- Issue 108: Regular expression support in WHERE clause

- Issue 109: Support for memory clusters inside regular persistent database

Bugs:

- Issue 86: Cannot insert record from console

- Issue 87: Cannot query records from OrientDB Studio Command-page

- Issue 100: Transaction does not work as expected

- Issue 101: Error in using orient queries having ' character

- Issue 104: Bulk record creation in cluster with pre-created indices causes ClassCastException

- Issue 105: Error in using UPDATE query

- Fixed other minor bugs

The next release is planned for the end of September 2010 with the first version of Clustering support. Here are the details: http://code.google.com/p/orient/issues/list?q=label:v0.9.23

To test the OrientDB Studio go here: http://www.moobilis.com:2480. Users: reader/reader and writer/writer. To manage the server: root/root.

If you'd like to suggest a new feature for the next releases, please open a New Issue as "New feature": http://code.google.com/p/orient/issues/entry?template=New%20feature

Saturday, September 11, 2010

Erlang interface for OrientDB and micro-benchmark between NoSQL DBMSs

Dmitrii 'Mamut' Dimandt ran a micro-benchmark against Neo4J, Redis, TinkerGraph (in memory only) and OrientDB using the Erlang language. These are the results: http://github.com/dmitriid/cali/wiki/Benchmarks-of-sorts

OrientDB is the fastest implementation (only TinkerGraph is faster, but it's not a DBMS since it works only in memory). It's nice to see that OrientDB's performance is very close to that of the in-memory-only implementation.

Thursday, September 09, 2010

OrientDB the fastest GraphDB available today?

Two days ago I finished integrating the last part of Blueprints: the Index. Now OrientDB can be used, like Neo4J, with the entire TinkerPop stack. This includes the Gremlin language as well.

The first tests show that OrientDB outperforms Neo4J, the GraphDB market leader, in all the tests except iteration (and therefore counting). This is due to the Blueprints implementation, which needs to create a new wrapper object to contain each of OrientDB's OGraphVertex and OGraphEdge objects. I have some ideas to improve it, but I need more time. Maybe in the next weeks, or earlier if some users need it. However, using the native OrientDB Graph APIs removes this overhead.
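
To make the wrapper overhead concrete, here is a small sketch of an adapter iterator that allocates one wrapper per element. NativeVertex and WrappedVertex are hypothetical stand-ins for an engine object and its adapter; they are not the real OGraphVertex class or the Blueprints implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins: NativeVertex plays the role of an engine object and
// WrappedVertex the role of an adapter exposing a standard interface.
final class WrapperIterationSketch {

    static final class NativeVertex {
        final String id;
        NativeVertex(String id) { this.id = id; }
    }

    static final class WrappedVertex {
        private final NativeVertex delegate;
        WrappedVertex(NativeVertex delegate) { this.delegate = delegate; }
        String getId() { return delegate.id; }
    }

    /** Adapts an iterator of native objects; every next() allocates a new wrapper. */
    static Iterator<WrappedVertex> wrap(Iterator<NativeVertex> source) {
        return new Iterator<WrappedVertex>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public WrappedVertex next() { return new WrappedVertex(source.next()); }
        };
    }

    /** Counting through the adapter pays one allocation per element. */
    static int count(Iterator<WrappedVertex> vertices) {
        int n = 0;
        while (vertices.hasNext()) {
            vertices.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<NativeVertex> stored = Arrays.asList(new NativeVertex("a"), new NativeVertex("b"));
        System.out.println(count(wrap(stored.iterator()))); // prints 2
    }
}
```

Even a simple count has to create and discard a wrapper for every element here, which is the kind of cost that iterating over the native objects directly avoids.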

These are the results:

Test name	Description	OrientDB 0.9.22 (ms, less is better)	Neo4J (ms, less is better)	OrientDB vs Neo4J (+ faster, - slower)
testVertexEquality	1 vertex added and retrieved in	4.74	2.33	-203.4%
testRemoveVertexNullId	1000 vertices added in	216.73	2,070.74	+955.4%
	1000 vertices deleted in	1,093.71	1,910.02	+174.6%
testVertexIterator	5000 vertices added in	476.62	8,314.04	+1,744.4%
	5000 vertices counted in	86.94	1.60	-5,433.8% *
testAddManyVertexProperties	750 vertex properties added (with vertices being added too) in	72.34	43,437.29	+60,046.0%
testAddEdges	6 elements added and checked in	0.79	30.70	+3,886.1%
testAddManyEdges	3000 elements added in	2,314.44	8,031.12	+347.0%
	1000 edges counted in	8.45	12.54	+148.4%
	2000 vertices counted in	31.62	0.51	-6,200.0% *
	2000 vertices checked in	98.27	14.05	-699.4% *
testGetEdges	3 edges retrieved in	0.45	0.12	-375.0%
testRemoveManyEdges	200 vertices counted in	167.28	0.14	-119,485.7% *
	100 edges counted in	34.92	0.79	-4,420.3% *
	100 edges removed and graph checked in	20,555.44	332.54	-6,181.3% *
testStringRepresentation	1 graph string representation generated in	0.01	0.01	100.0%
testClear	75 elements added in	45.89	152.87	+333.1%
	75 elements deleted in	30.76	422.05	+1,372.1%
testRemovingEdges	500 vertices added in	133.62	974.37	+729.2%
	1000 edges added in	1,130.64	4,521.90	+399.9%
	1000 edges deleted (with size check on each delete) in	36,773.66	4,411.69	-833.6% *
testRemovingVertices	500 vertices added in	10.48	1,110.03	+10,591.9%
	250 edges added in	132.32	1,089.45	+823.3%
	500 vertices deleted (with size check on each delete) in	70,675.74	2,140.07	-3,302.5% *
testTreeConnectivity	1464 vertices added in a tree structure in	1,506.82	5,832.02	+387.0%
	1464 vertices iterated in	427.67	0.75	-57,022.7% *
	1463 edges iterated in	9.89	5.39	-183.5% *
testTinkerGraphEdges	graph-example-1 loaded in	43.75	541.15	+1,236.9%
testTinkerGraphVertices	graph-example-1 loaded in	6.81	520.29	+7,640.1%
testTinkerGraphSoftwareVertices	graph-example-1 loaded in	5.31	543.02	+10,226.4%
testTinkerGraphVertexAndEdges	graph-example-1 loaded in	5.01	544.17	+10,861.7%

* marks the tests that involve iteration
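
A note on reading the last column: judging from the rows, it appears to be the ratio between the slower and the faster time, expressed as a percentage and signed according to who wins. For the "1000 vertices added" row, 2,070.74 / 216.73 is about 9.55, so the figure means roughly 9.5 times faster rather than a 955% improvement. The sketch below reproduces that arithmetic; the formula is inferred from the table, not taken from the benchmark code, and the class and method names are invented.

```java
import java.util.Locale;

// The formula is inferred from the rows of the table, not taken from the benchmark code.
final class SpeedRatioSketch {

    /**
     * Returns the ratio between the slower and the faster time as a percentage,
     * positive when OrientDB is faster (or equal) and negative when it is slower.
     */
    static double ratioPercent(double orientMs, double neoMs) {
        if (orientMs <= neoMs) {
            return neoMs / orientMs * 100.0;
        }
        return -(orientMs / neoMs * 100.0);
    }

    /** Formats the ratio with an explicit sign and one decimal digit. */
    static String format(double orientMs, double neoMs) {
        return String.format(Locale.ROOT, "%+.1f%%", ratioPercent(orientMs, neoMs));
    }

    public static void main(String[] args) {
        System.out.println(format(216.73, 2070.74)); // "1000 vertices added" row: prints +955.4%
        System.out.println(format(4.74, 2.33));      // testVertexEquality row: prints -203.4%
    }
}
```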

Thursday, August 12, 2010

Yesterday Mike Tallent congratulated me, via Twitter, on my recent "Db4o community award".

Is it a joke? The reason seems to lie in the TevereFlow homepage, which claims to support db4o. In some private messages with a db4o guy, I explained that TevereFlow supports db4o only because it uses DataNucleus as its OR-Mapping layer, and DataNucleus supports db4o. Therefore TevereFlow can run on top of db4o, but the support is indirect. All clear.

But after some days I found my name in the "DB4O MOST VALUES PROFESSIONAL" list:

Thanks a lot, but I have never written a single line of code for/with db4o!

Furthermore I'm the author of OrientDB, a competitor of db4o.

So is it a joke? :-)

Saturday, June 12, 2010

Published the Roma presentation for the Open Source Italian Conference in Cagliari, Italy

Check out this SlideShare Presentation:

Roma introduction and concepts

Friday, June 11, 2010

Published the OrientDB presentation for the Open Source Italian Conference in Cagliari, Italy