Work Notes: April 2018

Data Modeling:

The entities and relationships that we’ve surfaced in analyzing the user story quickly translate into a simple data model, as shown in Figure 4-1.

Figure 4-1. Data model for the book reviews user story

Because this data model directly encodes the question presented by the user story, it lends itself to being queried in a way that similarly reflects the structure of the question we want to ask of the data. Since Alice likes Dune, find books that others who like Dune have enjoyed:

MATCH (:Reader {name:'Alice'})-[:LIKES]->(:Book {title:'Dune'})
<-[:LIKES]-(:Reader)-[:LIKES]->(books:Book)
RETURN books.title
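
To see what the query above is doing, here is a rough Python sketch of the same traversal over an in-memory set of pairs. The readers and books other than Alice and Dune are made-up sample data, and the function name recommend is only for illustration. Like the Cypher pattern, it does not walk the same LIKES relationship twice, so Alice and Dune themselves are left out of the result.

```python
# Hypothetical sample data: each pair stands in for (reader)-[:LIKES]->(book).
likes = {
    ("Alice", "Dune"),
    ("Bob", "Dune"),
    ("Bob", "Neuromancer"),
    ("Carol", "Dune"),
    ("Carol", "Foundation"),
    ("Dave", "Emma"),
}


def recommend(reader, title):
    """Books liked by other readers who also like `title`."""
    if (reader, title) not in likes:
        return set()
    # Step back along LIKES from the book to the other readers.
    fans = {r for r, b in likes if b == title and r != reader}
    # Step forward along LIKES from those readers to their other books.
    return {b for r, b in likes if r in fans and b != title}


recommendations = recommend("Alice", "Dune")
print(sorted(recommendations))
```

Dave's book is not returned because Dave has no LIKES relationship to Dune.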

Nodes for Things, Relationships for Structure:

Though not applicable in every situation, these general guidelines will help us choose
when to use nodes, and when to use relationships:

Use nodes to represent entities, that is, the things in our domain that are of interest to us, and which can be labeled and grouped.
Use relationships both to express the connections between entities and to establish semantic context for each entity, thereby structuring the domain.
Use relationship direction to further clarify relationship semantics. Many relationships are asymmetrical, which is why relationships in a property graph are always directed. For bidirectional relationships, we should make our queries ignore direction, rather than using two relationships.
Use node properties to represent entity attributes, plus any necessary entity metadata, such as timestamps, version numbers, etc.
Use relationship properties to express the strength, weight, or quality of a relationship, plus any necessary relationship metadata, such as timestamps, version numbers, etc.
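
The point about bidirectional relationships can be sketched in a few lines of Python. This is only an illustration: the names (MANAGES, PARTNER_OF, the people) and the neighbours helper are assumptions, not taken from the book. The symmetrical relationship is stored once, and the lookup chooses whether to ignore direction.

```python
# Hypothetical data: each tuple is (start node, relationship name, end node).
relationships = [
    ("Alice", "MANAGES", "Bob"),       # asymmetrical: direction matters
    ("Alice", "PARTNER_OF", "Carol"),  # symmetrical: stored once, in one direction
]


def neighbours(node, name, ignore_direction=False):
    """Nodes reached from `node` over relationships called `name`."""
    found = {end for start, rel, end in relationships
             if rel == name and start == node}
    if ignore_direction:
        # Follow the relationship backwards too, instead of storing a second copy.
        found |= {start for start, rel, end in relationships
                  if rel == name and end == node}
    return found


carols_partners = neighbours("Carol", "PARTNER_OF", ignore_direction=True)
bobs_reports = neighbours("Bob", "MANAGES")
```

Carol finds Alice even though the relationship was stored from Alice to Carol, while Bob has no reports because MANAGES is followed in its stored direction only.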

Fine-Grained versus Generic Relationships:

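When designing relationships, we trade off fine-grained relationship names against generic relationships qualified with properties. It’s the difference between using DELIVERY_ADDRESS and HOME_ADDRESS versus ADDRESS {type:'delivery'} and ADDRESS {type:'home'}.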

Addresses are a good example. Following the closed-set principle (use fine-grained relationship names when the set of names is small and fixed), we might choose to create HOME_ADDRESS, WORK_ADDRESS, and DELIVERY_ADDRESS relationships. This allows us to follow specific kinds of address relationships (DELIVERY_ADDRESS, for example) while ignoring all the rest. But what do we do if we want to find all addresses for a user? There are a couple of options here. First, we can encode knowledge of all the different relationship types in our queries: e.g., MATCH (user)-[:HOME_ADDRESS|WORK_ADDRESS|DELIVERY_ADDRESS]->(address). This, however, quickly becomes unwieldy when there are lots of different kinds of relationships. Alternatively, we can add a more generic ADDRESS relationship to our model, in addition to the fine-grained relationships. Every node representing an address is then connected to a user using two relationships: a fine-grained relationship (e.g., DELIVERY_ADDRESS) and the more generic ADDRESS {type:'delivery'} relationship.
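
A small Python sketch of the two options, using made-up addresses and a hypothetical follow helper. Each address is connected twice, once with a fine-grained name and once with the generic ADDRESS name, so "all addresses" can be answered either by listing every fine-grained name or by following the one generic name.

```python
# Hypothetical data: (user, relationship name, relationship properties, address).
relationships = [
    ("alice", "HOME_ADDRESS", {}, "1 Home St"),
    ("alice", "ADDRESS", {"type": "home"}, "1 Home St"),
    ("alice", "DELIVERY_ADDRESS", {}, "2 Depot Rd"),
    ("alice", "ADDRESS", {"type": "delivery"}, "2 Depot Rd"),
]


def follow(user, *names):
    """Addresses reached from `user` over any of the given relationship names."""
    return [address for start, name, _props, address in relationships
            if start == user and name in names]


# Fine-grained name: follow one kind of address and ignore the rest.
delivery_only = follow("alice", "DELIVERY_ADDRESS")

# Option 1: the query has to know every fine-grained name.
all_by_enumeration = follow("alice", "HOME_ADDRESS", "WORK_ADDRESS", "DELIVERY_ADDRESS")

# Option 2: the generic relationship answers the same question with one name.
all_by_generic = follow("alice", "ADDRESS")
```

The cost of option 2 is the extra relationship per address, which is the trade-off described above.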

Iterative and Incremental Development:

Graph databases provide for the smooth evolution of our data model. Migrations and denormalization are rarely an issue. New facts and new compositions become new nodes and relationships, while optimizing for performance-critical access patterns typically involves introducing a direct relationship between two nodes that would otherwise be connected only by way of intermediaries.

We will quickly see how different relationships can sit side-by-side with one another, catering to different needs without distorting the model in favor of any one particular need. Addresses help illustrate the point here. Imagine, for example, that we are developing a retail application. While developing a fulfillment story, we add the ability to dispatch a parcel to a customer’s delivery address, which we find using the following query:

MATCH (user:User {id:$userId})
MATCH (user)-[:DELIVERY_ADDRESS]->(address:Address)
RETURN address

Later on, when adding some billing functionality, we introduce a BILLING_ADDRESS relationship. Later still, we add the ability for customers to manage all their addresses. This last feature requires us to find all addresses—whether delivery, billing, or some other address. To facilitate this, we introduce a general ADDRESS relationship:

MATCH (user:User {id:$userId})
MATCH (user)-[:ADDRESS]->(address:Address)
RETURN address

By this time, our data model looks something like the one shown in Figure 4-8. DELIVERY_ADDRESS specializes the data on behalf of the application’s fulfillment needs; BILLING_ADDRESS specializes the data on behalf of the application’s billing needs; and
ADDRESS specializes the data on behalf of the application’s customer management needs.

Just because we can add new relationships to meet new application goals doesn’t mean we always have to do this. We’ll invariably identify opportunities for refactoring the model as we go. There’ll be plenty of times, for example, where an existing relationship will suffice for a new query, or where renaming an existing relationship will allow it to be used for two different needs. When these opportunities arise, we should take them.

Graph Database:

Formally, a graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the relationships that connect them.

A labeled property graph has the following characteristics:

It contains nodes and relationships.
Nodes contain properties (key-value pairs).
Nodes can be labeled with one or more labels.
Relationships are named and directed, and always have a start and end node.
Relationships can also contain properties.

A graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases are generally built for use with transactional (OLTP) systems.

There are two properties of graph databases we should consider when investigating graph database technologies:

The underlying storage: Some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage, however. Some serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store.
The processing engine: Some definitions require that a graph database use index-free adjacency, meaning that connected nodes physically “point” to each other in the database.
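
A rough way to picture index-free adjacency in Python, under the assumption that object references stand in for the physical pointers. The Node class, the sample people, and both helper functions are illustrative only and say nothing about how any particular database lays out its storage.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct references to adjacent Node objects


def connect(start, end):
    start.neighbours.append(end)


alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
connect(alice, bob)
connect(bob, carol)


def friends_of_friends(node):
    """Index-free adjacency: each hop follows references held by the node itself."""
    return {fof.name for friend in node.neighbours for fof in friend.neighbours}


# For contrast: a global edge table that has to be scanned (or indexed) per hop.
edge_table = [("Alice", "Bob"), ("Bob", "Carol")]


def friends_of_friends_via_table(name):
    friends = {end for start, end in edge_table if start == name}
    return {end for start, end in edge_table if start in friends}


via_pointers = friends_of_friends(alice)
via_table = friends_of_friends_via_table("Alice")
```

Both give the same answer. The difference is that the first touches only the nodes it visits, while the second goes back to a structure whose size depends on the whole dataset.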

Several types of graph compute engines exist. Most notably, there are in-memory/single-machine graph compute engines like Cassovary and distributed graph compute engines like Pegasus or Giraph. Most distributed graph compute engines are based on the Pregel white paper, authored by Google, which describes the graph compute engine Google uses to rank pages.

The Power of Graph Databases:

Performance: One compelling reason for choosing a graph database is the sheer performance increase when dealing with connected data, compared with relational databases and NoSQL stores. In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, graph database performance tends to remain relatively constant, even as the dataset grows.
Flexibility: As developers and data architects, we want to connect data as the domain dictates, thereby allowing structure and schema to emerge in tandem with our growing understanding of the problem space, rather than being imposed upfront, when we know least about the real shape and intricacies of the data. Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality.
Agility: We want to be able to evolve our data model in step with the rest of our application, using a technology aligned with today’s incremental and iterative software delivery practices. Modern graph databases equip us to perform frictionless development and graceful systems maintenance. In particular, the schema-free nature of the graph data model, coupled with the testable nature of a graph database’s application programming interface (API) and query language, empowers us to evolve an application in a controlled manner.

One of the most popular structures for representing geospatial coordinates is called an R-Tree. An R-Tree is a graph-like index that describes bounding boxes around geographies. Using such a structure we can describe overlapping hierarchies of locations. For example, we can represent the fact that London is in the UK, and that the postal code SW11 1BD is in Battersea, which is a district in London, which is in southeastern England, which, in turn, is in Great Britain. And because UK postal codes are fine-grained, we can use that boundary to target people with somewhat similar tastes.
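
The nesting idea can be sketched in Python as boxes inside boxes. This is a simplified illustration, not a real R-Tree: there is no balancing or node splitting, the coordinates are invented, and the Region class and locate function are names made up for this note. It also returns only the first matching child, whereas a real R-Tree would have to check every overlapping box.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    box: tuple  # (min_x, min_y, max_x, max_y) in made-up units
    children: list = field(default_factory=list)

    def contains(self, x, y):
        min_x, min_y, max_x, max_y = self.box
        return min_x <= x <= max_x and min_y <= y <= max_y


def locate(region, x, y):
    """Names of the nested regions containing the point, outermost first."""
    if not region.contains(x, y):
        return []
    for child in region.children:
        inner = locate(child, x, y)
        if inner:
            return [region.name] + inner
    return [region.name]


battersea = Region("Battersea", (4, 4, 6, 6))
london = Region("London", (2, 2, 8, 8), [battersea])
uk = Region("UK", (0, 0, 20, 30), [london])

path = locate(uk, 5, 5)
```

A point inside the Battersea box resolves to every enclosing region, which is the "is in" chain described above.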

Such pattern-matching queries are extremely difficult to write in SQL, and laborious to write against aggregate stores, and in both cases they tend to perform very poorly. Graph databases, on the other hand, are optimized for precisely these types of traversals and
pattern-matching queries, providing in many cases millisecond responses.

The Labeled Property Graph Model:

A labeled property graph is made up of nodes, relationships, properties, and labels.

Nodes contain properties. Think of nodes as documents that store properties in the form of arbitrary key-value pairs. In Neo4j, the keys are strings and the values are the Java string and primitive data types, plus arrays of these types.
Nodes can be tagged with one or more labels. Labels group nodes together, and indicate the roles they play within the dataset.
Relationships connect nodes and structure the graph. A relationship always has a direction, a single name, and a start node and an end node—there are no dangling relationships. Together, a relationship’s direction and name add semantic clarity to the structuring of nodes.
Like nodes, relationships can also have properties. The ability to add properties to relationships is particularly useful for providing additional metadata for graph algorithms, adding additional semantics to relationships (including quality and weight), and for constraining queries at runtime.
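
To make the four building blocks concrete, here is a minimal Python sketch of a labeled property graph. It is an assumption-laden toy, not how Neo4j is implemented: the Graph, Node, and Relationship classes and the rating property are invented for illustration. The one rule it enforces is the "no dangling relationships" rule from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    name: str    # a single name
    start: Node  # direction runs from start to end
    end: Node
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, *labels, **properties):
        node = Node(set(labels), properties)
        self.nodes.append(node)
        return node

    def add_relationship(self, start, name, end, **properties):
        # No dangling relationships: both ends must already be in the graph.
        for endpoint in (start, end):
            if not any(endpoint is node for node in self.nodes):
                raise ValueError("start and end nodes must belong to the graph")
        relationship = Relationship(name, start, end, properties)
        self.relationships.append(relationship)
        return relationship

    def by_label(self, label):
        return [node for node in self.nodes if label in node.labels]


graph = Graph()
alice = graph.add_node("Person", "Reader", name="Alice")
dune = graph.add_node("Book", title="Dune")
likes = graph.add_relationship(alice, "LIKES", dune, rating=5)
```

Alice carries two labels, so she shows up under both Person and Reader, and the LIKES relationship carries its own rating property.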

Query Languages for Graph Databases:

Cypher (most popular)
SPARQL
Gremlin

Cypher Philosophy:

Cypher is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in
accord with the way we intuitively describe graphs using diagrams.

A diagram of three mutual friends, for example, has the following equivalent ASCII art representation in Cypher:
(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)

The previous Cypher pattern describes a simple graph structure, but it doesn’t yet refer to any particular data in the database. To bind the pattern to specific nodes and relationships in an existing dataset, we must specify some property values and node labels that help locate the relevant elements in the dataset. For example:

(emil:Person {name:'Emil'})
<-[:KNOWS]-(jim:Person {name:'Jim'})
-[:KNOWS]->(ian:Person {name:'Ian'})
-[:KNOWS]->(emil)

Like most query languages, Cypher is composed of clauses. The simplest queries consist of a MATCH clause followed by a RETURN clause (we’ll describe the other clauses you can use in a Cypher query later in this chapter). Here’s an example of a Cypher query that uses these two clauses to find the mutual friends of a user named Jim:

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
(a)-[:KNOWS]->(c)
RETURN b, c
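
The same triangle pattern can be spelled out in plain Python to show what the two comma-separated paths require. The KNOWS pairs below are sample data (Ann is made up), and mutual_friends is a hypothetical helper rather than anything Cypher provides.

```python
# Hypothetical sample data: each pair stands in for (x)-[:KNOWS]->(y).
knows = {
    ("Jim", "Ian"),
    ("Ian", "Emil"),
    ("Jim", "Emil"),
    ("Emil", "Ann"),
}


def mutual_friends(a):
    """All (b, c) such that a KNOWS b, b KNOWS c, and a KNOWS c."""
    return sorted(
        (b, c)
        for start, b in knows if start == a    # (a)-[:KNOWS]->(b)
        for middle, c in knows if middle == b  # (b)-[:KNOWS]->(c)
        if (a, c) in knows                     # (a)-[:KNOWS]->(c)
    )


result = mutual_friends("Jim")
print(result)
```

Ann is excluded because Jim does not know her directly, so the second path in the pattern fails for that candidate.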

Cypher Clauses:

MATCH: The MATCH clause is at the heart of most Cypher queries. We draw nodes with parentheses, and relationships using pairs of dashes with greater-than or less-than signs (--> and <--). The < and > signs indicate relationship direction. Between the dashes, set off by square brackets and prefixed by a colon, we put the relationship name. Node labels are similarly prefixed by a colon. Node (and relationship) property key-value pairs are then specified within curly braces (much like a JavaScript object).
RETURN: This clause specifies which nodes, relationships, and properties in the matched data should be returned to the client.
WHERE: Provides criteria for filtering pattern matching results.
CREATE and CREATE UNIQUE: Create nodes and relationships.
MERGE: Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.
DELETE: Removes nodes, relationships, and properties.
SET: Sets property values.
FOREACH: Performs an updating action for each element in a list.
UNION: Merges results from two or more queries.
WITH: Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in Unix.
START: Specifies one or more explicit starting points (nodes or relationships) in the graph. (START is deprecated in favor of specifying anchor points in a MATCH clause.)
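
The difference between CREATE and MERGE is easiest to see as "always add" versus "reuse or add". The Python below is only a sketch of that idea for single nodes, assuming that a node matches when it has the given label and at least the given property values. The create and merge functions are stand-ins written for this note, not a Neo4j driver API.

```python
nodes = []  # each node is a dict with a "label" and a "properties" dict


def create(label, **properties):
    """Like CREATE: always adds a new node."""
    node = {"label": label, "properties": properties}
    nodes.append(node)
    return node


def merge(label, **properties):
    """Like MERGE: reuse a node matching the predicates, else create one."""
    for node in nodes:
        same_label = node["label"] == label
        same_values = all(node["properties"].get(key) == value
                          for key, value in properties.items())
        if same_label and same_values:
            return node
    return create(label, **properties)


first = merge("Person", name="Jim")   # nothing matches yet, so a node is created
second = merge("Person", name="Jim")  # the existing node is reused
count_after_merges = len(nodes)

create("Person", name="Jim")          # CREATE adds a duplicate regardless
count_after_create = len(nodes)
```

Running merge twice leaves one node, while the extra create leaves two, which is why MERGE is the usual choice when the same data may be loaded more than once.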

Reference Book:

O’Reilly Graph Databases book
