Graph Databases

Graph databases are databases that store information in a graph structure. The implementation of graph databases are discussed elsewhere.

What Graph Databases Are Good For

(From The NoSQL Movement – Graph Databases, by Shannon Kempe.

Graph data stores are non-SQL stores suitable for finding relationships within massive amounts of data at the fastest possible speed. They see wide use in social networking applications as well as in high-end analytics.

In graph database architectures, the objects known as Nodes and Edges serve the roles of Entities and Relationships from standard SQL architecture. Nodes also contain properties which describe the actual data contained within each object. A diagram of a Graph database looks similar to the object diagrams used in object-oriented programming.

Due to the similarity between graph and object diagrams, Graph databases interface nicely when mapped to an object structure within an application, facilitating or even mitigating the use of an ORM framework.

The biggest advantage of Graph databases is obviously their speed for certain types of transactions; in particular, those involving relationships, since processing-intensive joins are not required. The fact that the design of Graph databases is less dependent on complex schema than their relational counterparts also lends itself to easier modifications and migrations to in-production systems.

Graph Databases Shouldn't be Implemented in SQL

(From Graphs: A Better Database Abstraction

The essence of the concept of an abstraction is a framework that simplifies how you think about and work in a given domain. Abstractions can be (and often are) argued against by suggesting that you don't really need them. In computer programming, we didn't need C because we had assembly language. We didn't need C++ because we had C. We didn't need Java because we had C++. To me these arguments (which people really made) were silly. I abandoned assembly language in the 90's.

The point is none of our existing abstractions are *needed*. But our human brains can only manage a certain amount of complexity at a time. Complexity is fine but only in bite size chunks. Abstractions are, in essence, a really generalized form of user interface. I think the reason I have written so much about user interface is that I think it is so important to figure out how to map complex things into representational models in such a way that more people can access them. Abstractions allow us to do just that with any process we are engaged in. Abstractions allow us to encapsulate complexity so that we don't have to think about it and we can achieve greater and greater levels of complexity in an efficient way allowing us to keep more of the model of a given system in our heads.

And so the anti-abstraction argument rears its head in the RDBMS vs graph database debate. One of the arguments that the pro RDBMS folks make for why there is no need for the graph database model is that you can do everything that can be expressed in a graph database in a relational database. And there is some truth to this.

But there are two problems with this argument. The first is that this is only true in theory. It is not possible to build a graph database of scale using pure SQL – at least with the SQL tools that we currently have to choose from.

One reason for the scale problem is the only way to do it is to do what are called self-joins. This is when you join a table to itself. Conceptually seem just fine. But the problem is that it is impossible for the database engine to do anything other than brute force un-optimized traversals of the graph when confronted with a chain of self-joins. In other words, using this technique will not yield a useful database that is query-able at any kind of scale. Handling certain aspects of providing a graph database model requires some very specific and different kind of thinking and optimizations from those that go into designing an SQL database.

Another problem is that one giant table using self joins for traversal means a huge write bottleneck. Yes, you can avoid that with sharding depending on your design, but it is definitely not part of the SQL model, and so you can't say SQL is helping you here.

The second and I believe more important argument against implementing a graph data model using SQL is that even if SQL could do a good job of representing a graph model, building your graph system in SQL is not a very good abstraction. The truth is that most of the kinds of things we want to do in app development look more like graph than relational structures. Graphs are elemental to computer science because most interesting algorithms and in fact real world data models can be very naturally thought of as a graph. Graph theory is (if things are as they were when I was in school) the first thing you learn when you begin studying computer science, and there is very good reason for this. The fact that Facebook was able to anchor the idea of what they were building as a "social graph" is an incredible testament to the innately natural characteristics of the graph concept.

So if you are representing a graph, you really want an API that reflects the unique and useful characteristics of a graph. In other words, you want an abstraction that reflects how you really think about the data and not some jury-rigged representational model continuously intruding itself into your thought process. And so, having a data store that allows us to express our data in a way that is much more similar to how we actually think is enormously helpful.

And such is the case with attempting to implement a graph database using SQL. You can do it, but it is unlikely to work very well, and because you don't have the benefit of the abstraction, it actually adds to the complexity of the design instead of simplifying it.

The bottom line is that graphs are a better representational model when the structure of your system will change frequently. Relational is a better model when the structure will be static. Today, I think most of us are not building applications that are ideally structurally static.

Because most applications today have a much more dynamic nature, graphs are, for most people, under most circumstances, a far better abstraction. And to me, there is little in this world more powerful and satisfying than a great abstraction.

Graph Database Implementations

If a graph is supposed to come with tools for changing and querying it (member functions if implemented as a class), a graph database must orchestrate changes (also defining and implementing a messaging protocol for changes) and provide a means of storage and possibly transmission over a network (serialization).

It seems reasonable that a graph database be based on a graph class or classes. Early on, we would add a serialize(ostream&) method...

RDF Databases*

Native Graph Storage in Neo4j

Written in Java, Neo4j is an open source graph database with a variety of available license options suitable for prototyping as well as commercial deployment. Neo Technology, developers of Neo4j, provides support for the paid Neo4j licenses.

Neo4j stores graph data in a number of different store files. There are separate files for nodes, relationships, labels (node qualifiers or roles), and properties (arc weights). The separation of graph structure from property data speeds up graph traversals.

Neo4j sports massive scalability. The database handles billions of nodes on only one server, and can easily scale across multiple machines. It also features a disk-based persistence model written in native code.

The database allows for deployment flexibility – either as a full server, or a slim version contained within a 750K jar file.

Neo4j Installation*

Neo4j Cypher Query Language

Cypher is Neo4j's open graph query language. Cypher's syntax provides a familiar way to match patterns of nodes and relationships in the graph.

Here is a simple example of a Cypher query (cast of movies starting with T)

MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title STARTS WITH "T"
RETURN movie.title AS title, collect(actor.name) AS cast
ORDER BY title ASC LIMIT 10;

If you're be able to read and understand it without any training, we achieved our goal.

Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax. It allows us to state what we want to select, insert, update or delete from our graph data without requiring us to describe exactly how to do it.

Cipher Nodes

Cypher uses ASCII-Art to represent patterns. We surround nodes with parentheses which look like circles, e.g. (node). If we later want to refer to the node, we'll give it a variable like (p) for person or (t) for thing. In real-world queries, we'll probably use longer, more expressive variable names like (person) or (thing). If the node is not relevant to your question, you can also use empty parentheses ().

Usually, the relevant labels of the node are provided to distinguish between entities and optimize execution, like (p:Person).

We might use a pattern like (person:Person)-->(thing:Thing) so we can refer to them later, for example, to access properties like person.name and thing.quality.

The more general structure is:


MATCH (node:Label) RETURN node.property

MATCH (node1:Label1)-->(node2:Label2)
WHERE node1.propertyA = {value}
RETURN node2.propertyA, node2.propertyB

Cipher Relationships*


              


              


              

Titan Graph Database

Titan is a distributed, real-time, transactional graph database that can use either Cassandra or HBase as its distributed data store. It's Apache 2.0 licensed, and it's the first native Blueprints implementation, so it integrates with the entire tinkerpop, which is BSD.

Titan is capable of supporting tens of thousands of concurrent users interacting with a single massive-scale graph represented over a cluster of machines, You can use any programming language to connect to via the new binary RexPro Protocol.

Flock DB, Simple, Distributed

Flock DB is an open source graph database for tinkerers. FlockDB was created by a technical team at Twitter who wanted a distributed, fault-tolerant database that was simple to use. The FlockDB source code is available for download at GitHub.