Thursday, November 01, 2012

OrientDB: huge improvement in performance (+9,000%) in many use cases. Thanks, Raspberry Pi!



Hi all,
today I have a good story to tell you. A couple of days ago Fabrizio Fortino sent me an email with some metrics and screenshots from the profiling of an in-production instance of OrientDB. A lot of time was spent opening and closing the database.

That was issue 1145 (http://code.google.com/p/orient/issues/detail?id=1145), but I had assigned it a low priority because it was an improvement, not a real bug...

Well, today I was hacking with cheap Raspberry Pi hardware and OrientDB to see if it could be used in production for some limited use cases. On this kind of hardware everything is much, much slower! "Yeah, it's normal: I have $35 hardware, Java is not so optimized yet on this ARM platform, etc." These were my first thoughts about the initial results.

But after some profiling I arrived at the same conclusion as Fabrizio, so I decided to spend 2 hours of my life investigating in depth.

Well, I've just committed a small patch (r7134) that avoids opening the database every time it is re-used from the pool. In fact this is quite a costly operation, especially if you do many small atomic operations, where more of the cost is in the open/close than in the operation itself!
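
To make the idea concrete, here is a minimal sketch in Python (it is not OrientDB code). `FakeDatabase`, `ReusingPool` and the `open_calls` counter are hypothetical names invented for this illustration; the only point is that a handle released to the pool stays open, so the next acquire skips the costly open step.

```python
class FakeDatabase:
    """Hypothetical stand-in for a database handle (not the OrientDB API)."""

    def __init__(self, url):
        self.url = url
        self.is_open = False

    def open(self):
        # In a real database this is the costly part.
        self.is_open = True


class ReusingPool:
    """Keeps released handles open so that re-use does not re-open them."""

    def __init__(self, url):
        self.url = url
        self.idle = []
        self.open_calls = 0  # how many times the costly open() ran

    def acquire(self):
        if self.idle:
            # Re-use: the handle is already open, no open() call needed.
            return self.idle.pop()
        db = FakeDatabase(self.url)
        db.open()
        self.open_calls += 1
        return db

    def release(self, db):
        # Do not close: park the open handle for the next caller.
        self.idle.append(db)


pool = ReusingPool("demo")
for _ in range(1000):
    db = pool.acquire()  # e.g. one HTTP request
    pool.release(db)
print(pool.open_calls)  # prints 1
```

With many small operations, 1,000 acquire/release cycles pay for a single open instead of 1,000.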

This fix greatly improves these scenarios:
  • Usage via HTTP/REST, because a new connection is acquired from the pool at every operation
  • Java web applications that use the database pool on the server side
The improvement will be minor in this case:
  • You wrote a Java app that creates a new database instance every time. If this is your case, I strongly encourage you to use the database pool, which is now much faster
All these are PROS, what about CONS?
  1. If metadata changes (schema, security, functions), you need to invoke reload() to get the changes
This is a simple load of a tiny document from a database on my PC, first before the patch and then after it:

$ ab -n1000 -A admin:admin -k -c10 http://localhost:2480/document/demo/71:1
...
Requests per second:    52.56 [#/sec] (mean)

$ ab -n1000 -A admin:admin -k -c10 http://localhost:2480/document/demo/71:1
...
Requests per second:    4694.57 [#/sec] (mean)

This is about 90x faster, namely roughly 9,000%, namely a huge improvement!
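
For those who want to check the arithmetic, this small Python snippet recomputes the factor from the two "Requests per second" values reported by ab above. The variable names are mine.

```python
before = 52.56    # requests/sec reported by the first ab run
after = 4694.57   # requests/sec reported by the second ab run

speedup = after / before
print(f"{speedup:.1f}x, i.e. about {speedup * 100:.0f}% of the original throughput")
```

The exact factor is a bit more than 89x, which I round to 90x.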

The funny thing is that OrientDB on the Raspberry Pi, with the new patch, now runs at a speed quite close to that of the PC I use every day for work before this patch!

Saturday, May 05, 2012

GraphDB market share

Last week a market analysis agency contacted me to ask some questions about OrientDB, saying that, according to its research, OrientDB holds the second position in the worldwide GraphDB market, right after Neo4J. Awesome!

But who are the main players of the GraphDB market?
Since each vendor claims, more or less, to be the market leader, what is the real user base? It seems quite hard to gather real data about users and customers directly from vendors.

So I thought that one of the best ways is to look at the public groups and forums, because sooner or later users subscribe to them: they are the first-hand source of information, help and tricks. They can't lie! This document contains some metrics extracted from public sources. Click on the source to see for yourself what I used.

According to this data Neo4J is, without any doubt, the GraphDB market leader, followed by the rapidly growing OrientDB and, a long way behind, InfiniteGraph and DEX. From its web site InfiniteGraph seems to have some real customers, but they all seem related to the previous product ObjectivityDB (an ODBMS born more than a decade ago).

The metrics are below:


Products                               Neo4J        OrientDB     InfiniteGraph    DEX
Updated on                             05/05/2012   05/05/2012   05/05/2012       05/05/2012
Source created on                      April 2011   April 2010   September 2011   May 2011
members                                926          620          75               ?
threads since the beginning            1,240        1,449        36               33
posts since the beginning              6,752        7,918        233              87
posts in the last month (April 2012)   1,107        439          19               0
posts 2 months ago (March 2012)        1,310        519          13               7 (but 100% announcements)

Monday, February 20, 2012

Why I hate Maven

Yes, I admit that Maven has improved the life of Java programmers, given the tons of dependencies each project brings.

So why do I hate it so much? Well, because of the thousands (really thousands!) of network calls to the remote server to:

  • check versions
  • check md5
  • download pom.xml files
  • download jars

But why was Maven built the way we know it? All the logic is on the client side. This means that each Maven user pays an absurd latency cost for each network call! The solution? Git shows the way.

Why not build a tree of the requested JARs, send it to the Maven server and download the resulting zipped archive containing all the stuff to install in one shot?

In this way daily updates would take milliseconds or just a few seconds, depending on the size of the updates and the network bandwidth, and no longer on the network latency.
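
A back-of-the-envelope model in Python shows why latency dominates. All the numbers below (2,000 calls, 100 ms of latency, 50 MB of artifacts, 5 MB/s of bandwidth) are assumptions picked for the example, not measurements, and the model ignores parallel downloads and caching.

```python
def sequential_seconds(requests, latency_ms, payload_mb, bandwidth_mb_s):
    """One round trip per request, plus the time to transfer the payload."""
    return requests * latency_ms / 1000 + payload_mb / bandwidth_mb_s


def batched_seconds(latency_ms, payload_mb, bandwidth_mb_s):
    """Same payload, but requested and downloaded in a single round trip."""
    return sequential_seconds(1, latency_ms, payload_mb, bandwidth_mb_s)


many_calls = sequential_seconds(2000, 100, 50, 5)
one_shot = batched_seconds(100, 50, 5)
print(many_calls, one_shot)
```

Under these assumptions the many-calls approach spends 200 of its 210 seconds just waiting on round trips, while the one-shot download takes about 10 seconds.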

Thursday, April 28, 2011

GraphDB benchmark part II

Some months after my last post about OrientDB (sorry, but I prefer micro-blogging to blogging...) I'm back with some news about the OrientDB engine.

The hard work has been the optimization at many levels:
  • minimize the wasted space created by set/delete operations (HOLES)
  • minimize marshalling/unmarshalling operations, especially on the LinkSet type responsible for relationships between vertices and edges
  • fine tuning of Transactions
The main difficulty has been reducing the disk I/O when you update a record. In fact, when you execute an update, the serialized content will rarely have the same size as the original. In this case you have a new HOLE, namely a free space marked for reuse.

The problems I found with the HOLES were that small spaces weren't reused at all and that there was heavy fragmentation. This caused an overall slowdown and the growth of the database on disk (in some cases to many times the original size). After 2 weeks of work I've published to SVN and Maven the new version of the OrientDB storage with:
  • In-line defrag: something like what some file systems already do by joining small holes together. In-line defrag works while the database is online and in use
  • Improved management of small changes to records
  • 2 configurable strategies for finding the best hole to join during the defrag process
  • A configurable hole distance to decide when to join multiple holes together
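
To give an idea of what joining holes by distance means, here is a simplified sketch in Python. It is not the OrientDB implementation: the function name, the `(offset, size)` representation and the merge rule are my assumptions for the example, and the strategies for choosing the best hole are not modeled at all.

```python
def coalesce_holes(holes, max_distance):
    """Join holes whose gap is at most max_distance bytes.

    holes is a list of (offset, size) tuples. When two holes are joined,
    the live data sitting in the gap is assumed to be shifted to the front,
    so the joined hole ends where the second hole ended.
    """
    merged = []
    for offset, size in sorted(holes):
        if merged:
            prev_offset, prev_size = merged[-1]
            gap = offset - (prev_offset + prev_size)
            if 0 <= gap <= max_distance:
                # The gap data moves to prev_offset; the hole follows it.
                merged[-1] = (prev_offset + gap, prev_size + size)
                continue
        merged.append((offset, size))
    return merged


print(coalesce_holes([(0, 10), (15, 10), (100, 8)], max_distance=5))
# prints [(5, 20), (100, 8)]
```

A larger distance joins more holes but moves more data, which is why it makes sense to keep it configurable.
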
The gain in overall speed has been noticed by a lot of users. Since there are no benchmarks for GraphDBs yet, I've re-run the TinkerPop Blueprints Test Cases (see my previous post about this).

This test suite is part of the TinkerPop Blueprints project and is NOT a BENCHMARK, just a lot of tests against GraphDB implementations to check their compliance level. So please don't flame about this. It's not an official benchmark, just a way to test how OrientDB performs in comparison with other GraphDBs and with previous releases of OrientDB itself. It also shows how OrientDB performs on different platforms. The comparison in this case is only with Neo4J (the market leader?) since DEX supports only a few Test Cases and it wouldn't be fair. Note that both GraphDBs run with default settings.

These are the results on a cheap Linux server (Linux CentOS, Intel Atom Dual Core 330 1.6 GHz, 1GB DDR2 RAM, 7200rpm U-ATA HD):

Times in ms. Less is better = faster

Test name                    OrientDB 1.0rc1 snapshot    Neo4J 1.3     + faster, - slower
VertexTestSuite                             11,190.80    48,354.85     +432.1%
EdgeTestSuite                                6,421.55    24,361.78     +379.4%
GraphTestSuite                              12,642.21    36,932.11     +292.1%
IndexableGraphTestSuite                      1,173.55     2,618.00     +223.1%
IndexTestSuite                                 462.70     1,052.32     +227.4%
AutomaticIndexTestSuite                      1,719.91     4,912.81     +285.6%
TransactionGraphTestSuite                    1,603.59     4,237.17     +264.2%
GraphMLReaderTestSuite                       1,291.93     2,365.85     +183.1%
Total                                       36,506.24   124,834.89     +342.0%

OrientDB completes all the tests in less than a third of the time of Neo4J: Neo4J takes 342% of the OrientDB time, so OrientDB is about 3.4 times faster!
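
The last column of the table is the Neo4J time expressed as a percentage of the OrientDB time. This short Python check reproduces it from the Linux figures above; the function name is mine.

```python
def percent_column(orientdb_ms, neo4j_ms):
    """Neo4J time as a percentage of the OrientDB time."""
    return 100 * neo4j_ms / orientdb_ms


vertex = percent_column(11190.80, 48354.85)   # VertexTestSuite row
total = percent_column(36506.24, 124834.89)   # Total row
print(f"+{vertex:.1f}% +{total:.1f}%")
```

The inverse of the total ratio is about 0.29, which is where "less than a third of the time" comes from.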

The difference is smaller when running the same tests on a MacBook Pro (OS X 10.6.7 64-bit, 4GB RAM, Intel Core 2 Duo 2.4 GHz CPU, 5400rpm HD):

Times in ms. Less is better = faster

Test name                    OrientDB 1.0rc1 snapshot    Neo4J 1.3     + faster, - slower
VertexTestSuite                             13,380.67    23,647.54     +176.7%
EdgeTestSuite                                7,508.78    14,139.71     +188.3%
GraphTestSuite                               9,514.58    18,664.99     +196.2%
IndexableGraphTestSuite                        523.92       969.06     +185.0%
IndexTestSuite                                 263.58       640.80     +243.1%
AutomaticIndexTestSuite                      1,672.92     2,943.86     +176.0%
TransactionGraphTestSuite                    1,378.59     3,047.77     +221.1%
GraphMLReaderTestSuite                       1,418.21     1,805.27     +127.3%
Total                                       35,661.25    65,859.00     +184.7%

In this case OrientDB completes all the tests in about half the time: Neo4J takes 184.7% of the OrientDB time. I was not able to complete all the tests on a Windows 7 machine (Intel i7 720q, 4GB RAM, 7200rpm HD) since Neo4J gives errors and breaks the test suite:

Times in ms. Less is better = faster

Test name                    OrientDB 1.0rc1 snapshot    Neo4J 1.3     + faster, - slower
VertexTestSuite                              7,762.61    51,474.76     +663.1%
EdgeTestSuite                                6,737.59    71,479.28     +1,060.9%
GraphTestSuite                               7,280.43    75,695.77     +1,039.7%
IndexableGraphTestSuite                      2,766.07    Error         n.a.
IndexTestSuite                               1,064.49    Error         n.a.
AutomaticIndexTestSuite                      2,796.08    Error         n.a.
TransactionGraphTestSuite                    3,471.80    Error         n.a.
GraphMLReaderTestSuite                       3,067.81    Error         n.a.
Total                                       34,946.88   198,649.81     +921.2%

However, looking at the 3 available tests, OrientDB outperforms Neo4J on Windows machines too: +921.2%, which is the average of the three per-suite figures!
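
Since only three suites completed on Windows, it is worth seeing how the total figure is obtained. The Python snippet below recomputes it from the three rows above in two ways: as the average of the per-suite percentages, and as the ratio of the summed times over the same three suites. The variable names are mine.

```python
orientdb = {"VertexTestSuite": 7762.61, "EdgeTestSuite": 6737.59, "GraphTestSuite": 7280.43}
neo4j = {"VertexTestSuite": 51474.76, "EdgeTestSuite": 71479.28, "GraphTestSuite": 75695.77}

# Per-suite percentage: Neo4J time relative to the OrientDB time.
ratios = [100 * neo4j[name] / orientdb[name] for name in orientdb]

mean_of_ratios = sum(ratios) / len(ratios)
ratio_of_totals = 100 * sum(neo4j.values()) / sum(orientdb.values())

print(f"mean of ratios: {mean_of_ratios:.1f}%")
print(f"ratio of totals: {ratio_of_totals:.1f}%")
```

The mean of the ratios gives the 921.2% in the table; the ratio of the summed times is slightly lower, at about 912%. Either way the gap is roughly 9x.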

Another interesting point is about the platforms. Linux CentOS performs very well even if the underlying HW is cheaper & older than the other 2 machines.

I'm pretty satisfied with these results compared to the previous ones, where Neo4J performed better in some circumstances. Thank you to all the OrientDB contributors and users who have made this possible!

It's not time to rest or sleep, because the work is not finished yet: all efforts now go into the 1.0 release.

How can I run these tests on my PC?

To re-run the same tests on your machine just install Java 6, Git and Apache Maven. Then execute these commands in a shell (or command prompt if you're on MS Windows):

> git clone git://github.com/tinkerpop/blueprints.git
> cd blueprints
> mvn install

Now wait until all the software is compiled and all the tests have run. Now you have your results. Please share them with the OrientDB Group, attaching your HW/SW configuration!

Tuesday, November 23, 2010

Codemotion 2011, the most important Italian conference about programming



Save the date: March 5th, 2011! The most important conference about programming will take place in Rome, Italy. The event is totally FREE. About 1,500 attendees. Sessions in English and Italian.

For information: http://www.codemotion.it


Tuesday, October 19, 2010

When someone is criticizing something you care about...

When someone is criticizing something you care about, don't defend it to the death, but ask them what their idea is to improve it! You will both grow. Lvc@

Thursday, October 14, 2010

A standard for GraphDB

Until recently GraphDBs covered a niche market. Lately things have changed and they are becoming more popular than ever. This is due to:
  • The ever-growing need for strongly interconnected data. What better than a GraphDB for it?
  • Applications are much more "social" than before. And social means relationships among subjects, which map perfectly to the Graph model
  • Performance on large datasets. You can live without a pure GraphDB, and therefore continue to use a Relational DBMS, until you need really fast access to your data
  • The NoSQL movement has contributed a lot to changing the mindset of developers, drawing their attention to DBMSs different (and sometimes very different) from the Relational model.

The GraphDB has few but strong concepts. Graphs are composed of:
  • Vertex or Node, the linked entity. Vertices can have properties.
  • Edge or Arc, the link between Vertices. Edges can have properties and can be unidirectional or bidirectional
  • Property, a value assigned to Vertices and Edges. A property has a name and a value
Each GraphDB product is based upon the above concepts, but features can differ a lot among them. Furthermore there isn't a standard for them, and an application can't migrate from one to another at zero or low cost.

This is the reason why I want to introduce Tinkerpop Blueprints. This Open Source project aims to provide a shared set of basic interfaces that abstract the concepts of Graph, Vertex, Edge and Property. In this way applications can easily migrate to another available implementation of the specification.
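
The idea of coding against a shared interface can be sketched in a few lines of Python. This is NOT the Blueprints API (which is a Java library): `Graph`, `InMemoryGraph`, `neighbours` and every method name below are hypothetical, made up only to show how an application that depends on the abstract interface can switch implementation.

```python
from abc import ABC, abstractmethod


class Graph(ABC):
    """Hypothetical minimal interface: vertices and edges carry properties."""

    @abstractmethod
    def add_vertex(self, **properties): ...

    @abstractmethod
    def add_edge(self, out_vertex, in_vertex, label, **properties): ...

    @abstractmethod
    def out_vertices(self, vertex, label): ...


class InMemoryGraph(Graph):
    """Toy implementation; a real product would sit behind the same interface."""

    def __init__(self):
        self.vertices = {}  # vertex id -> properties
        self.edges = []     # (out id, in id, label, properties)

    def add_vertex(self, **properties):
        vertex_id = len(self.vertices)
        self.vertices[vertex_id] = dict(properties)
        return vertex_id

    def add_edge(self, out_vertex, in_vertex, label, **properties):
        self.edges.append((out_vertex, in_vertex, label, dict(properties)))

    def out_vertices(self, vertex, label):
        return [i for (o, i, l, _) in self.edges if o == vertex and l == label]


def neighbours(graph: Graph, vertex, label):
    """Application code: it only knows the abstract Graph interface."""
    return [graph.vertices[v]["name"] for v in graph.out_vertices(vertex, label)]


g = InMemoryGraph()
luca = g.add_vertex(name="Luca")
mark = g.add_vertex(name="Mark")
g.add_edge(luca, mark, "knows", since=2010)
print(neighbours(g, luca, "knows"))  # prints ['Mark']
```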

Furthermore Tinkerpop has created a complete stack of tools and frameworks to work with Graphs. The main ones are below:
  • Gremlin is a Turing-complete, graph-based programming language designed for key/value-pair multi-relational graphs. Gremlin makes use of an XPath-like syntax to support complex graph traversals. This language has application in the areas of graph query, analysis, and manipulation.
  • Rexster is a RESTful graph shell that exposes any Blueprints graph as a standalone server. Extensions support standard traversal goals such as search, score, rank, and, in concert, recommendation. Rexster makes extensive use of Blueprints, Pipes, and Gremlin. In this way it's possible to run Rexster over various graph systems.
  • Pipes is a graph-based data flow framework for Java 1.6+. A process graph is composed of a set of process vertices connected to one another by a set of communication edges. Pipes supports the splitting, merging, and transformation of data from input to output.

(1) What do you think about this effort to standardize GraphDB?

(2) What is missing today in the Tinkerpop Blueprints API?

(3) Does a .NET port make sense?

Tuesday, September 28, 2010

Working on large datasets in OrientDB

I was really busy over the last 5 days because I was working on a huge rewrite of the RB+Tree algorithm that is behind the OrientDB index system. It all started some days ago when, in the OrientDB Group, the user Mark Harwood shared his first experiences importing the entire Internet Movie Database (IMDB) into an OrientDB instance. We're talking about 13 million Vertices and about 8 million Edges.

Well, performance seemed good, especially in comparison with Neo4J, but the insertion time was not linear and after some millions of records the engine slowed down a lot. Furthermore, a nasty bug in the management of multiple segments broke the import job after some minutes on his machine, while on my PC everything went correctly.

This is the reason why I stopped the development of the Cluster feature in favor of a thorough review of the RB+Tree algorithm. Indexing has always been fast, but Mark Harwood's case stressed the engine as no one had until now. The reason is in the import process itself, which executes random reads on the index, while the RB+Tree seemed optimized for sequential insertion and retrieval.

I'm still fixing all the pending issues, but the first benchmarks show a speed improvement in indexing of between 100% and 800%! Furthermore, times are now really linear: inserting the first million Vertices takes the same time as the last million. This is a big result for the already super-fast OrientDB engine.

Once the test phase is finished (with new unit tests, of course) I'll commit all the changes to SVN to be reviewed by the community before the next release.

Thursday, September 16, 2010

Started the development of OrientDB clustering

Today is a new day, the day after the official release of OrientDB version 0.9.22. I apologize to everyone who requested issues that were planned for this release but are not yet finished and have been postponed to 0.9.24. What is the reason?

---> Replication, Clustering, Fault-Tolerance <---

It seems that the most missed feature in OrientDB is support for clustering, and therefore the high scalability, high availability and high volume of transactions that a single node can't handle. In the last months I studied the different architectures that other NoSQL solutions use for clustering.

Today is a new day because I'll start the development of clustering for OrientDB with the following features:
  • Master-Slaves type, where there can be only one Master and N Slaves. If the Master crashes a Slave is elected to be the new Master
  • IP multicast to discover cluster nodes
  • Configuration of nodes using TCP/IP, useful for Clouds that don't allow IP multicast
  • Two sync modes: full, where the whole database is compressed and sent over the network, and partial, which sends only the changes that happened since the last sync
  • A new database handled by the Master OrientDB Server instance to store all the pending records up to a configurable threshold. Above this threshold the logs are deleted and the node needs a full sync on startup
  • New console commands to display nodes, listen to clustering messages and elect the master manually
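
The interplay between the pending records, the threshold and the two sync modes can be sketched like this in Python. It is only an illustration of the plan above, not the actual design: `PendingLog` and its methods are hypothetical names, and the development has not even started yet.

```python
class PendingLog:
    """Hypothetical log of changes a disconnected node still has to receive."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.changes = []
        self.overflowed = False

    def record(self, change):
        if self.overflowed:
            return  # the log is already useless: a full sync is needed anyway
        self.changes.append(change)
        if len(self.changes) > self.threshold:
            # Above the threshold the log is dropped to save space.
            self.changes.clear()
            self.overflowed = True

    def sync_mode(self):
        """'partial' sends only the logged changes, 'full' the whole database."""
        return "full" if self.overflowed else "partial"


log = PendingLog(threshold=3)
for change in ["a", "b", "c"]:
    log.record(change)
print(log.sync_mode())  # prints partial
log.record("d")
print(log.sync_mode())  # prints full
```
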
Release 0.9.23 is planned for October 15th, 2010. Stay tuned and contribute with comments, ideas or critiques.

Wednesday, September 15, 2010

Released OrientDB 0.9.22

Hi all,
many thanks to all the contributors for this new piece of software. A new version of OrientDB has been released after more than 1 month of work (and holidays).

Starting from this version you can use OrientDB as a pure GraphDB through the custom API (http://code.google.com/p/orient/wiki/GraphDatabase), through the Tinkerpop stack (Blueprints, Gremlin, Pipes, etc.) or by using the Erlang language with the Cali project (http://github.com/dmitriid/cali).

Direct URL to download: http://orient.googlecode.com/files/orient-database-0.9.22.zip

New features:
- Issue 21: Full text index against schema fields
- Issue 91: SQL UPDATE against collections and maps
- Issue 94: Support for GREMLIN graph language
- Issue 108: Regular expression support in WHERE clause
- Issue 109: Support for memory clusters inside regular persistent database

Bugs:
- Issue 86: Cannot insert record from console
- Issue 87: Cannot query records from OrientDB Studio Command-page
- Issue 100: Transaction does not work as expected
- Issue 101: Error in using orient queries having ' character
- Issue 104: Bulk record creation in cluster with pre-created indices causes ClassCastException
- Issue 105: Error in using UPDATE query
- Fixed other minor bugs

The next release is planned for the end of September 2010 with the first version of Clustering support. Here are the details: http://code.google.com/p/orient/issues/list?q=label:v0.9.23

To test the OrientDB Studio go here: http://www.moobilis.com:2480. Users: reader/reader and writer/writer. To manage the server: root/root.

If you'd like to suggest some new feature for the next releases please open a New Issue as "New feature": http://code.google.com/p/orient/issues/entry?template=New%20feature