TopBraid Data Platform

  

The TopBraid Data Platform is a high-availability solution for TopBraid EDG servers. It enables continuous operation of business functions by replicating data across a cluster of EDG servers. Each EDG server is kept up to date, so any server can respond to a client or application request. Together with a load balancer to direct requests, user and application access to the EDG data remains available even if some of the servers are offline.

This availability is achieved by having a cluster of EDG servers communicating with a data coordinator (DC) server that receives changes from any clustered server whenever it has updates. The DC server propagates the changes to all other EDG servers, keeping them all up-to-date. Additionally, whenever a new EDG server joins the cluster, it is first brought up-to-date before it starts servicing user/application requests. Each EDG server keeps a local, persistent cache of application data so that start-up only involves applying changes missed by that server.

Each EDG or EDG Explorer server has its own copy of a database of all RDF graphs managed by the Data Platform. When changes to the data are made, the changes are recorded locally as RDF Patches.

At the end of the operation (the end of the HTTP request), any changes are sent to the Data Coordinator and made safe. When the patches are confirmed to be safe, the EDG server triple store declares the request has committed and responds to the HTTP request. It is the state of the RDF Patch Log that determines the state of the EDG server for Data Platform backed graphs.

When an operation arrives at an EDG server, the server checks that its local database is up to date. If it is not, the server fetches the missing patches from the Data Coordinator and applies them so that it is at the latest version.
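The flow described above can be modelled in a few lines. This is a minimal sketch and not the RDF Delta implementation: `PatchLog` and `Replica` are hypothetical names, a patch is simplified to a list of (operation, triple) pairs, and the integer version counter is an assumption standing in for the state of the patch log. Concurrent updates are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class PatchLog:
    """Toy stand-in for the Data Coordinator: an append-only list of patches."""
    patches: list = field(default_factory=list)

    def append(self, patch):
        # Store the patch and return the new log version (1-based).
        self.patches.append(patch)
        return len(self.patches)

    def fetch_since(self, version):
        # Return the patches that a replica at `version` has not yet applied.
        return self.patches[version:]


@dataclass
class Replica:
    """Toy stand-in for an EDG server holding a local copy of the data."""
    log: PatchLog
    triples: set = field(default_factory=set)
    version: int = 0

    def sync(self):
        # Catch up with the log before serving a request.
        for patch in self.log.fetch_since(self.version):
            self._apply(patch)
            self.version += 1

    def update(self, patch):
        # Catch up, send the patch to the log, then apply it locally.
        self.sync()
        self.version = self.log.append(patch)
        self._apply(patch)

    def _apply(self, patch):
        for op, triple in patch:
            if op == "A":      # add a triple
                self.triples.add(triple)
            elif op == "D":    # delete a triple
                self.triples.discard(triple)


log = PatchLog()
server_a, server_b = Replica(log), Replica(log)
server_a.update([("A", ("ont:SubClass", "rdf:type", "owl:Class"))])
server_b.sync()  # server_b now holds the triple written through server_a
```

The point of the sketch is that a replica never needs a full copy from another replica: it only needs the patches it has not yet seen, which is why start-up after an outage is limited to applying missed changes.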

The Data Coordinator can be a single machine with file-backed persistence, or a cluster of servers with storage in a cloud object store. There are tradeoffs in complexity of operation, as discussed below.

The TopBraid Data Platform is based on the open source (Apache License) RDF Delta system.

Obtaining the Software

The Data Coordinator (the patch log server) can be downloaded from:

https://download.topquadrant.com/dp/

EDG server and EDG Explorer already include the Data Platform client software, so no extra software is required for EDG. You must purchase a license from TopQuadrant to run Data Platform. You will then be sent the appropriate number of licenses for your EDG installations.

Use the Data Coordinator version that corresponds to your EDG version, as given in the table on the download page.

Data Coordinator Requirements

The DC server is a Java webserver process that coordinates changes across a cluster of EDG servers. For production use, it should be run on a machine separated from all of the clustered EDG machines. For development and experimentation, it can be run on a shared machine.

The DC server stores patches on disk. The disk storage must be backed up, which can be done by file backup on a live server. The robustness of the system is determined by the robustness of the file storage, so choosing local disk in the DC server is limiting. The DC server is primarily a server for patch files and is not computationally demanding.

 

EDG Failover Configuration

In a high-availability configuration, two or more EDG servers provide the service. Each has a complete copy of the replicated graphs.

Typically, a deployment will put a load balancer in front of the EDG servers so that the clients (web browsers or other applications) use the same URL to access any of them. It also allows a deployment to move EDG servers between physical hardware for maintenance and system upgrade.

There are two choices for Patch Log Server: either run with a file-backed single server or run with replicated servers for continuous operation.

The file-based patch log server can use any OS-supported filesystem, whether local storage or a remote disk array; the reliability and durability of files written are determined by the choice of filesystem. If the single-server style is used, operation is interrupted whenever the patch log server is unavailable: the EDG servers keep running but cannot update data until the patch log server restarts. Patch log server startup is very fast, and there is no need to restart the EDG servers.

The advantage of a single server is simplicity of operation, so small deployments face a tradeoff between simplicity and continuous operation.

For continuous operation, three or more patch log servers run on separate hardware; the system uses Apache ZooKeeper to manage a system-wide database of the patch state of the deployment. Patches are stored in an object store, Amazon S3. There are publicly available adapters that provide the Amazon S3 API over other storage choices such as Apache Cassandra.

 

Example RDF Patch

H id <uuid:6e0288f5-ed79-4ab7-b4cc-8386ee32fadf> .
H prev <uuid:3e346531-777e-4169-abcd-284a7a5b22ac> .
TX .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/shacl#NodeShape> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.w3.org/2002/07/owl#Thing> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#label> "SubClass" .
TC .
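To make the structure of the patch above concrete, here is a minimal sketch that summarizes an RDF Patch by its line codes: `H` for a header, `TX` and `TC` for transaction begin and commit, `A` for an added triple and `D` for a deleted one. The function name `summarize_patch` is hypothetical, and the code only inspects the first token of each line; it is not a full RDF Patch parser.

```python
def summarize_patch(text):
    """Count the operations in an RDF Patch and collect its header fields."""
    summary = {"headers": {}, "adds": 0, "deletes": 0, "committed": False}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first token is the line code; the rest is its payload.
        code, _, rest = line.partition(" ")
        rest = rest.rstrip(" .").strip()  # drop the terminating " ."
        if code == "H":
            key, _, value = rest.partition(" ")
            summary["headers"][key] = value
        elif code == "A":
            summary["adds"] += 1
        elif code == "D":
            summary["deletes"] += 1
        elif code == "TC":
            summary["committed"] = True
    return summary


# A shortened copy of the example patch, kept to two added triples.
PATCH = """\
H id <uuid:6e0288f5-ed79-4ab7-b4cc-8386ee32fadf> .
H prev <uuid:3e346531-777e-4169-abcd-284a7a5b22ac> .
TX .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#label> "SubClass" .
TC .
"""

result = summarize_patch(PATCH)
```

The `prev` header is what links one patch to its predecessor, so the sequence of `id` and `prev` values forms the chain that makes up a patch log.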

 

Example Data Platform connector file

 

# baseURI: urn:x-evn-master:ontconfig
# imports: http://topbraid.org/sparqlmotionlib
@prefix dp: <http://org.topbraid/dataplatform#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sm: <http://topbraid.org/sparqlmotion#> .
@prefix sml: <http://topbraid.org/sparqlmotionlib#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> . 

<urn:x-evn-master:ont> a dp:Graph ;
    dp:coordinator "http://localhost:1066/" ;
    dp:datasource "ont" ;
    dp:localData "ont/data" ;
    dp:localDataLocation "/home/afs/TQ/EDG/TBS/build/workspaces/TBCME/Zone/ont/data" ;
    sml:baseURI "urn:x-evn-master:ont" ;
    sml:graphName "urn:x-evn-master:ont" ;
    sml:imports () ;
    . 

<urn:x-evn-master:ontconfig> a owl:Ontology ;
    owl:imports <http://topbraid.org/sparqlmotionlib> ;
    .

 

Create an asset collection in DP

If Data Platform is not enabled as the default data store, or if you are running a version of EDG prior to 6.3, the EDG UI shows an additional checkbox that places the graph into the Data Platform. If Data Platform is the default data store, this option does not appear.

Changes to the ontology will be recorded in the patch log. Patches are handled by the system and not displayed to the user but here is an example patch creating a new subclass in an ontology:

H id <uuid:6e0288f5-ed79-4ab7-b4cc-8386ee32fadf> .
H prev <uuid:3e346531-777e-4169-abcd-284a7a5b22ac> .
TX .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/shacl#NodeShape> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.w3.org/2002/07/owl#Thing> .
A <http://example.org/ontologies/ont#SubClass> <http://www.w3.org/2000/01/rdf-schema#label> "SubClass" .
TC .

 

Verify Log Operation

To illustrate that the EDG server is indeed running with the Data Platform, stop TopBraid Composer, restart with a clean workspace, and restart EDG “personal edition”. If running EDG in Tomcat, stop Tomcat, delete the following directories and files, or completely delete the EDG workspace as if reinstalled:

<Workspace>/Zone/ont/
<Workspace>/Zone/ont.tch/
<Workspace>/Repositories/ont.dpc
<Workspace>/Repositories/ont.tch.dpc

When the EDG server starts up, the log will show the ontology (called “ont” below) being restored from the patch log server.

[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: New graph: [id:969bb7, ont.tch, <urn:x-evn-master:ont.tch>]
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Create DP Graph [id:969bb7] : datasource=ont.tch, baseURI=urn:x-evn-master:ont.tch, coordinator=http://localhost:1066/
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Attach graph ont.tch in 0.166s
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: New graph: [id:84d7ff, ont, <urn:x-evn-master:ont>]
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Create DP Graph [id:84d7ff] : datasource=ont, baseURI=urn:x-evn-master:ont, coordinator=http://localhost:1066/
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Attach graph ont in 0.261s

 

The graph has been created and rebuilt from the patch log.
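If you want to check this from a script rather than by eye, the log lines shown above can be scanned for the "Attach graph" entries. This is a minimal sketch that assumes the log format matches the sample exactly; the function name `attached_graphs` is hypothetical and the regular expression is derived only from the lines in this document.

```python
import re

# Matches entries such as "DataPlatform: Attach graph ont in 0.261s".
ATTACH = re.compile(r"DataPlatform: Attach graph (\S+) in ([0-9.]+)s")


def attached_graphs(log_text):
    """Return {datasource name: attach time in seconds} found in the log text."""
    return {m.group(1): float(m.group(2)) for m in ATTACH.finditer(log_text)}


SAMPLE = """\
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Attach graph ont.tch in 0.166s
[TopBraid Info] 2019-05-08T10:56:27+0100 - DataPlatform: Attach graph ont in 0.261s
"""

graphs = attached_graphs(SAMPLE)
```

After deleting the local state and restarting, every Data Platform graph you expect should appear as a key in the returned dictionary.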

Data Platform Setup 

Data Platform can be used as the default data store for all of the collections in the workspace or as the storage for nominated asset collections, with an option selected when the asset collection is created. The default data store option was introduced with EDG 6.3. To use the Data Platform default data store, please follow the steps outlined below.  If Data Platform default data store is not used then the system graphs that contain governance and user data will not be synced between servers. 

Any projects uploaded to EDG that were created in TBCME or given to you by TopQuadrant professional services for customizations or sample data will not be synced between servers. These are uploaded individually to each server. EDG configuration files will also not be stored via Data Platform. Changes to configurations will need to be replicated on each EDG node. 

This section gives a short walkthrough for setting up an EDG system with the Data Platform. It covers the use of the TopBraid Data Platform with file-backed storage of patches. The Data Coordinator is the RDF Delta patch log server.

File-based persistent patch storage

The simplest configuration of a patch log server uses the server's file system for patch storage. This can be any OS-supported filesystem, whether local storage or a remote disk array; the reliability and durability of files written are determined by the choice of filesystem.

Data Platform default data store steps

Be sure to follow the steps below in the order given. Details for the individual steps are given after the list. If you would like a sample docker image, please contact TopQuadrant support.

  1. Download and run the patch log server.
  2. Enable logging for the patch log server.
  3. Install the first instance of EDG using the database options below. We recommend the file method of installation so that the file will be copied to new nodes.
    1. Database: Data Platform
    2. DP server URL: http://localhost:1066/ 
    3. Is primary node: true
  4. You can verify that your setup was successful by creating any collection. If this succeeds, you can check the Base URI Management page in the Server Administration area. Your Repositories folder collections should have an extension of .dpc.
  5. Now you can clone this EDG instance to set up multiple EDG nodes. Before doing so, apply any other configuration settings that you want on all the EDG servers; otherwise you will have to repeat them on each node, because EDG configuration files are not synced via Data Platform. Be sure the EDG workspaces are in different locations: you cannot use the same workspace on multiple EDG nodes. Clone the entire directory containing the workspace, vault, and setup file, then replace the license file. TopQuadrant support will provide a unique license for each EDG node.
  6. Update the additional nodes to have Is primary node: false. 
  7. Last and most important, set up a backup mechanism and, at minimum, server monitoring for memory and disk space. Save regular backups of the EDG workspace as well as of the Data Coordinator. Optionally, set up the EDG Failover Configuration (instructions above).

*If you do not want to use Data Platform as the default data store, choose another default data store for the graphs that are not stored and synced via Data Platform. The rest of the steps can still be followed, but you will need to use the check box on the create screen of each collection to indicate which collections are synced and stored.

*If you would like to migrate to the Data Platform default data store, you will need to follow the steps above with a new workspace and migrate your data over using export/import. Your change history will not be retained in this method.

 

Run the patch log server

 

In production, the patch log server should be run on a separate machine to the server running the TopBraid server, not just a separate virtual machine. 

For development and testing, and this tutorial, it can run on the localhost for convenience.

After obtaining the software, run the patch log server with:

    java -jar rdf-delta-server-VERSION.jar --file --base DIRECTORY

where VERSION is the version number of the RDF Delta software corresponding to the version of the EDG server in use and DIRECTORY is an initially empty, existing directory where the patch log server stores the patch logs in files.

Example patch server output:

[2019-05-08 10:46:59] Delta INFO Provider: file
[2019-05-08 10:46:59] Config INFO Delta Server port=1066
[2019-05-08 10:46:59] Delta INFO RDF Delta 0.7.0 2019-03-12T14:54:40+0000
[2019-05-08 10:46:59] Delta INFO No data sources
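Once the server reports its port, a quick way to confirm that it is reachable from an EDG machine is a plain TCP connection test. The sketch below uses only the Python standard library and checks that something is listening on the port; it does not call any RDF Delta API and cannot tell whether the listener is actually the patch log server. The function name `is_reachable` is hypothetical.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the tutorial setup, `is_reachable("localhost", 1066)` would be the call to make, where 1066 is the port shown in the example output above. In production, replace `localhost` with the DNS name or IP address of the patch log server machine.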

 

Configure an EDG server

The EDG server needs to be set up to use the Data Platform. In “Server Administration” -> “EDG Configuration Parameters”, press “Edit” and, in the “Data Platform Configuration Parameters” section, set “Enable Data Platform feature” to true and set the “Data Platform server URL”.

Restart the EDG server.

These settings are recorded in the file <workspace>/server.topbraidlive.org/dynamic/config.ttl. If preferred, this file can be edited directly when the EDG server is not running.

:configuration a cfg:ServerConfiguration ;
    . . .
    cfg:dpFeature true ;
    cfg:dpServerURL "http://HOST:1066/" ;
    cfg:dpZone "Zone" ;
    . . .

 

Property          Type      Value
cfg:dpFeature     Boolean   true to enable the Data Platform.
cfg:dpServerURL   String    The URL of the patch log server.
cfg:dpZone        String    The directory location of the server's local Data Platform state.

The feature setting is set to true (no quotes) to enable the Data Platform.

HOST is the DNS name or IP address of the patch log server machine. This should be a separate machine to the server running the TopBraid server.

The “Zone” is the place within the server workspace where the Data Platform software keeps the local state information. This does not usually need to be changed.
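When the file is edited by hand, it can be useful to confirm the three settings before restarting EDG. The following is a minimal sketch that reads them from the text of config.ttl with regular expressions. It is a text-level check and not a Turtle parser, it assumes the settings are written in the same form as the fragment above, and the function name `read_dp_settings` is hypothetical.

```python
import re


def read_dp_settings(config_text):
    """Extract the Data Platform settings from the text of a config.ttl file."""
    feature = re.search(r"cfg:dpFeature\s+(true|false)\b", config_text)
    url = re.search(r'cfg:dpServerURL\s+"([^"]*)"', config_text)
    zone = re.search(r'cfg:dpZone\s+"([^"]*)"', config_text)
    return {
        # Enabled only when the value is the unquoted boolean true.
        "enabled": bool(feature) and feature.group(1) == "true",
        "server_url": url.group(1) if url else None,
        "zone": zone.group(1) if zone else None,
    }


SAMPLE = '''
:configuration a cfg:ServerConfiguration ;
    cfg:dpFeature true ;
    cfg:dpServerURL "http://HOST:1066/" ;
    cfg:dpZone "Zone" ;
    .
'''

settings = read_dp_settings(SAMPLE)
```

A quoted value such as `cfg:dpFeature "true"` is deliberately reported as not enabled, matching the note that the feature setting takes an unquoted true.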

The log file (catalina.log) should show the Data Platform is running including an entry such as:

DataPlatform: Data Platform : server http://localhost:1066/

Creating Shared (DP-managed) Graphs

 

When the DP feature is enabled for an EDG server and the data coordinator server is running, creation of new asset collections will show a Data Platform checkbox option. Selecting this will cause the new asset collection to be shared among all DC-connected EDG servers.

Create New Asset Collection

This creates a new Asset Collection with yourself as the manager.

There may be a few-second delay in propagating collection changes.

If the data platform is the current default application storage choice, the Data Platform checkbox option does not appear and the asset collection is automatically placed under Data Platform control. An error occurs if the Data Coordinator server is not contactable.

If the data platform is enabled but not the current default application storage choice, the Data Platform checkbox option appears when the Data Coordinator server is contactable.

Asset collections created without the DP option will exist only on the original server.

Data Coordinator Outages

If the Data Coordinator is not running then:

  • shared collections can be read (viewed, queried)
  • updates to shared collections are not possible
  • new, shared collections cannot be created.

Restarting the Data Coordinator re-enables the update and sharing features. Restarting EDG is not necessary.

If the default storage is set to be “data platform” then the Data Coordinator server must be running and contactable when an EDG server is initially configured.

 

Backup & Restore

Overview

There are two areas to consider when backing up a Data Platform enabled EDG environment:

  • Data Coordinator
    • Filesystem snapshot of this server, taken while the Data Coordinator is shut down
  • EDG
    • The workspace, web.xml and Secure Storage files.

Backup

The first step in backing up a Data Platform environment is to shut down the Data Coordinator. Once it is shut down, the EDG instance(s) enter read-only mode, preventing any changes to the system.

Backup of the Data Coordinator can be done by snapshot of the system, or rsync of the files serving the patch store.  These should only be performed while the Data Coordinator is shut down.

The EDG server can be backed up the same way: while the Data Coordinator is shut down and EDG is in read-only mode, take a snapshot of the system or rsync the workspace, web.xml and secure storage files.
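As an illustration of the file-copy step, the sketch below writes a timestamped archive of a directory using the Python standard library. It is an alternative to the snapshot or rsync approaches named above, not a TopQuadrant-provided tool. It assumes the Data Coordinator has already been shut down, and the function name `archive_directory` and any paths passed to it are hypothetical.

```python
import shutil
import time
from pathlib import Path


def archive_directory(source_dir, backup_dir):
    """Write a timestamped .tar.gz of source_dir into backup_dir and return its path."""
    source = Path(source_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Name the archive after the source directory plus the local time.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    base_name = target / f"{source.name}-{stamp}"
    # make_archive appends the ".tar.gz" suffix and returns the final file name.
    return shutil.make_archive(str(base_name), "gztar", root_dir=str(source))
```

The function would be called once for the patch store directory of the Data Coordinator and once for the EDG workspace. Keep `backup_dir` outside `source_dir` so that the archive does not include itself.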

Restore

To restore a Data Platform enabled EDG environment, first restore the Data Coordinator files and bring the server online.

Next, restore the backed-up EDG files and start that container. If EDG can communicate with the Data Coordinator server, the system should resume operation.