Wednesday, April 25, 2018

Introduction to Graph Databases


Graph Database:

Formally, a graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the relationships that connect them.

A labeled property graph has the following characteristics:
  •  It contains nodes and relationships.
  •  Nodes contain properties (key-value pairs).
  •  Nodes can be labeled with one or more labels.
  •  Relationships are named and directed, and always have a start and end node.
  •  Relationships can also contain properties.
A graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases are generally built for use with transactional (OLTP) systems.

There are two properties of graph databases we should consider when investigating graph database technologies:
  • The underlying storage: Some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage, however. Some serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store.
  • The processing engine: Some definitions require that a graph database use index-free adjacency, meaning that connected nodes physically “point” to each other in the database.
    • A variety of graph compute engines also exist. The most notable are in-memory, single-machine engines such as Cassovary and distributed engines such as Pegasus and Giraph. Most distributed graph compute engines are based on the Pregel white paper, authored by Google, which describes the graph compute engine Google uses to rank pages.
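To make index-free adjacency more concrete, here is a minimal Python sketch, not how any particular graph database implements it. Each node holds direct references to its neighbours, so a traversal follows those references instead of consulting a global index. The `Node` class and the `friends_of_friends` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Direct references to neighbouring Node objects (no index lookup needed).
    neighbours: list["Node"] = field(default_factory=list)


def friends_of_friends(start: Node) -> set[str]:
    """Follow direct references two hops out from `start`."""
    found = set()
    for friend in start.neighbours:
        for fof in friend.neighbours:
            if fof is not start:
                found.add(fof.name)
    return found


# Illustrative data: Jim points to Emil and Ian, and Ian points to Emil.
jim, ian, emil = Node("Jim"), Node("Ian"), Node("Emil")
jim.neighbours.extend([emil, ian])
ian.neighbours.append(emil)

print(friends_of_friends(jim))  # {'Emil'}
```

The cost of the traversal depends only on how many neighbours are visited, not on the total number of nodes stored.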

The Power of Graph Databases:
  1. Performance: One compelling reason for choosing a graph database is the sheer performance increase when dealing with connected data, compared with relational databases and NoSQL stores. In relational databases, join-intensive query performance deteriorates as the dataset gets bigger. With a graph database, performance tends to remain relatively constant even as the dataset grows, because queries are localized to a portion of the graph.
  2. Flexibility: As developers and data architects, we want to connect data as the domain dictates, thereby allowing structure and schema to emerge in tandem with our growing understanding of the problem space, rather than being imposed upfront, when we know least about the real shape and intricacies of the data. Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality.
  3. Agility: We want to be able to evolve our data model in step with the rest of our application, using a technology aligned with today’s incremental and iterative software delivery practices. Modern graph databases equip us to perform frictionless development and graceful systems maintenance. In particular, the schema-free nature of the graph data model, coupled with the testable nature of a graph database’s application programming interface (API) and query language, empowers us to evolve an application in a controlled manner.
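The additive property mentioned under Flexibility can be shown with a small Python sketch. It assumes relationships are stored as plain (start, type, end) triples. The `WORKS_AT` relationship type and the `Acme` node are hypothetical additions made up for this example.

```python
# Each relationship is a (start, type, end) triple.
relationships = {
    ("Jim", "KNOWS", "Ian"),
    ("Jim", "KNOWS", "Emil"),
}


def knows(person, rels):
    """Existing query: the names `person` is connected to by KNOWS."""
    return sorted(
        end for start, rel_type, end in rels
        if start == person and rel_type == "KNOWS"
    )


before = knows("Jim", relationships)

# Add a new kind of relationship and a new node (both hypothetical).
relationships.add(("Jim", "WORKS_AT", "Acme"))

after = knows("Jim", relationships)
print(before == after)  # True
```

The existing query only looks at the relationship type it cares about, so the new data does not change its result.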

One of the most popular structures for representing geospatial coordinates is called an R-Tree. An R-Tree is a graph-like index that describes bounding boxes around geographies. Using such a structure we can describe overlapping hierarchies of locations. For example, we can represent the fact that London is in the UK, and that the postal code SW11 1BD is in Battersea, which is a district in London, which is in southeastern England, which, in turn, is in Great Britain. And because UK postal codes are fine-grained, we can use that boundary to target people with somewhat similar tastes.

Such pattern-matching queries are extremely difficult to write in SQL, and laborious to write against aggregate stores, and in both cases they tend to perform very poorly. Graph databases, on the other hand, are optimized for precisely these types of traversals and pattern-matching queries, providing in many cases millisecond responses.
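The location hierarchy from the postal code example can be sketched in Python as a chain of "is in" links. This is not an R-Tree (there are no bounding boxes here), only the containment hierarchy the example describes. The `IS_IN` mapping and both helper functions are hypothetical names.

```python
# child -> parent "is in" links, taken from the example in the text.
IS_IN = {
    "SW11 1BD": "Battersea",
    "Battersea": "London",
    "London": "southeastern England",
    "southeastern England": "Great Britain",
}


def enclosing_regions(place):
    """Walk the "is in" links upward and collect every enclosing region."""
    regions = []
    while place in IS_IN:
        place = IS_IN[place]
        regions.append(place)
    return regions


def is_within(place, region):
    """True if `region` encloses `place` at any level of the hierarchy."""
    return region in enclosing_regions(place)


print(enclosing_regions("SW11 1BD"))
print(is_within("Battersea", "Great Britain"))  # True
```

Answering "which regions contain this postal code?" is a traversal along one relationship type, which is the kind of query a graph database is built for.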

The Labeled Property Graph Model

A labeled property graph is made up of nodes, relationships, properties, and labels.
  •  Nodes contain properties. Think of nodes as documents that store properties in the form of arbitrary key-value pairs. In Neo4j, the keys are strings and the values are the Java string and primitive data types, plus arrays of these types.
  •  Nodes can be tagged with one or more labels. Labels group nodes together, and indicate the roles they play within the dataset. 
  •  Relationships connect nodes and structure the graph. A relationship always has a direction, a single name, and a start node and an end node—there are no dangling relationships. Together, a relationship’s direction and name add semantic clarity to the structuring of nodes.
  •  Like nodes, relationships can also have properties. The ability to add properties to relationships is particularly useful for providing additional metadata for graph algorithms, adding additional semantics to relationships (including quality and weight), and for constraining queries at runtime.
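The four building blocks above map naturally onto two small data types. The following Python sketch is one possible in-memory representation, not Neo4j's actual storage format. The class names and the `weight` property value are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    # Labels group nodes and indicate their roles in the dataset.
    labels: set[str] = field(default_factory=set)
    # Properties are arbitrary key-value pairs.
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    # A relationship always has a start node, a single name, and an end node.
    start: Node
    name: str
    end: Node
    properties: dict[str, Any] = field(default_factory=dict)


jim = Node({"Person"}, {"name": "Jim"})
ian = Node({"Person"}, {"name": "Ian"})

# The weight value is illustrative only.
knows = Relationship(jim, "KNOWS", ian, {"weight": 0.8})

print(knows.start.properties["name"], knows.name, knows.end.properties["name"])
# Jim KNOWS Ian
```

Because `start` and `end` are required fields, a relationship cannot be constructed without both endpoints, which mirrors the "no dangling relationships" rule.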

Query Languages for Graph Databases
  1. Cypher (most popular)
  2. SPARQL
  3. Gremlin


Cypher Philosophy:
Cypher is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in accord with the way we intuitively describe graphs using diagrams.

Consider a pattern that describes three mutual friends. Here is that pattern as ASCII art in Cypher:
           (emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)

The previous Cypher pattern describes a simple graph structure, but it doesn’t yet refer to any particular data in the database. To bind the pattern to specific nodes and relationships in an existing dataset, we must specify some property values and node labels that help locate the relevant elements in the dataset. For example:

(emil:Person {name:'Emil'})
 <-[:KNOWS]-(jim:Person {name:'Jim'})
 -[:KNOWS]->(ian:Person {name:'Ian'})
 -[:KNOWS]->(emil)

Like most query languages, Cypher is composed of clauses. The simplest queries consist of a MATCH clause followed by a RETURN clause (the other clauses you can use in a Cypher query are described below). Here’s an example of a Cypher query that uses these two clauses to find the mutual friends of a user named Jim:

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
(a)-[:KNOWS]->(c)
RETURN b, c
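To see what this query asks for, here is a rough Python equivalent over an in-memory set of KNOWS relationships. It is a sketch of the pattern's meaning, not of how Cypher evaluates it. It assumes a name uniquely identifies a person, and the data mirrors the three-friend pattern shown earlier.

```python
# Directed KNOWS relationships as (start, end) pairs.
KNOWS = {
    ("Jim", "Emil"),
    ("Jim", "Ian"),
    ("Ian", "Emil"),
}


def mutual_friends(a):
    """Return (b, c) pairs where a KNOWS b, b KNOWS c, and a KNOWS c."""
    results = []
    for x, b in KNOWS:
        if x != a:
            continue
        for y, c in KNOWS:
            # Second hop from b, plus the closing a -> c relationship.
            if y == b and (a, c) in KNOWS:
                results.append((b, c))
    return sorted(results)


print(mutual_friends("Jim"))  # [('Ian', 'Emil')]
```

Both parts of the MATCH clause must hold at once: the two-hop path from a to c through b, and the direct relationship from a to c.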


Cypher Clauses:
  1. MATCH: The MATCH clause is at the heart of most Cypher queries. We draw nodes with parentheses, and relationships using pairs of dashes with greater-than or less-than signs (--> and <--). The < and > signs indicate relationship direction. Between the dashes, set off by square brackets and prefixed by a colon, we put the relationship name. Node labels are similarly prefixed by a colon. Node (and relationship) property key-value pairs are then specified within curly braces (much like a JavaScript object).
  2. RETURN: Specifies which nodes, relationships, and properties in the matched data should be returned to the client.
  3. WHERE: Provides criteria for filtering pattern-matching results.
  4. CREATE and CREATE UNIQUE: Create nodes and relationships.
  5. MERGE: Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.
  6. DELETE: Removes nodes and relationships. (Properties and labels are removed with the REMOVE clause.)
  7. SET: Sets property values.
  8. FOREACH: Performs an updating action for each element in a list.
  9. UNION: Merges results from two or more queries.
  10. WITH: Chains subsequent query parts and forwards results from one to the next, similar to piping commands in Unix.
  11. START: Specifies one or more explicit starting points (nodes or relationships) in the graph. (START is deprecated in favor of specifying anchor points in a MATCH clause.)
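The "reuse or create" behavior of MERGE is the least obvious item in this list, so here is a small Python sketch of the idea. It handles only a single node, whereas the real clause matches whole patterns including relationships. The `merge_node` function and the dictionary layout are assumptions made for illustration.

```python
def merge_node(nodes, label, properties):
    """Return a node with this label and these properties, creating it if absent."""
    for node in nodes:
        same_label = label in node["labels"]
        same_props = all(
            node["properties"].get(key) == value
            for key, value in properties.items()
        )
        if same_label and same_props:
            return node  # Reuse the existing match.
    node = {"labels": {label}, "properties": dict(properties)}
    nodes.append(node)  # Nothing matched, so create it.
    return node


nodes = []
first = merge_node(nodes, "Person", {"name": "Jim"})
second = merge_node(nodes, "Person", {"name": "Jim"})

print(first is second, len(nodes))  # True 1
```

Calling it twice with the same arguments leaves one node in the store, whereas a plain create would add a duplicate each time.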






Reference Book:
  1. Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem (O'Reilly)



