Corpora Asset Collection
Licensing and Enablement
The availability of any asset collection is determined by what is (a) licensed and (b) configured under Server Administration. To install a license or to view the currently licensed features, see Setup > Product Registration. To configure which licensed collection types are currently enabled or disabled, see Setup - EDG Configuration Parameters.
For general licensing information and available asset collections and packages, see the TopQuadrant website.
Overview of Corpora
A Corpus is a collection of textual assets, such as documents, excerpts, and web pages. The original items are typically imported from external sources, such as content management systems or websites, and are typically neither created nor edited within EDG. The textual content of Corpus assets provides the foundation for manual or automated tagging and annotation using Content Tag Sets. Corpora serve as the content graphs for Content Tag Sets.
Selecting the Corpora link in the left-navigation pane of TopBraid EDG lists all of the Corpus collections currently available to the user and allows authorized users to create new ones.
When working with Corpus collections, users have access to the same functionality that is available for other asset collections, e.g., search, import, and export. If a Corpus is created without specifying a connector to a content repository, users can use the EDG editor to create and edit documents in the Corpus. Otherwise, the documents come from an external repository and users cannot modify them. See the Asset Collection Guide for the general features of asset collections, such as import/export, editing, user permissions, reports, and settings.
Create New Corpus
When a new Corpus is created, EDG requires the user to select its data source. If the Corpus is configured to connect to an external source, EDG harvests content from that source and stores it in the project graph. Harvesting can be repeated later on demand, at which point changes to the external source’s documents are picked up. A Corpus does not synchronize with its external source automatically: harvesting must be triggered manually in the Corpora management UI.
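The on-demand behavior described above can be pictured with a small Python sketch. It is an illustration only and does not describe how EDG detects changes internally: the `harvest` function, the dictionary-based source, and the use of SHA-256 digests for change detection are all assumptions made for the example. Documents removed from the source are not handled.

```python
import hashlib


def harvest(source_docs, stored_hashes):
    """Compare freshly fetched documents with the previous harvest.

    source_docs   -- hypothetical mapping of document URL -> current text
    stored_hashes -- mapping of URL -> SHA-256 digest from the last harvest;
                     updated in place (assumed change-detection scheme)
    Returns URLs grouped under 'added', 'changed' and 'unchanged'.
    """
    result = {"added": [], "changed": [], "unchanged": []}
    for url, text in source_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        previous = stored_hashes.get(url)
        if previous is None:
            result["added"].append(url)
        elif previous != digest:
            result["changed"].append(url)
        else:
            result["unchanged"].append(url)
        stored_hashes[url] = digest
    return result


if __name__ == "__main__":
    store = {}
    # Nothing happens until a harvest is explicitly requested.
    print(harvest({"doc-a": "first version", "doc-b": "stable text"}, store))
    # A later, manually triggered harvest picks up the edit to doc-a.
    print(harvest({"doc-a": "second version", "doc-b": "stable text"}, store))
```

The point of the sketch is that nothing is compared or stored until `harvest` is called, which mirrors the need to trigger harvesting manually.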
Six connector types are currently offered. Each has its own creation wizard page, whose form depends on the parameters that must be provided for the connector to operate. These parameters can be adjusted later on by accessing the Manage Tab -> Corpus-Connector-Type Configuration.
- No connector: If content documents are already available as RDF, they can be imported into the Corpus with the usual RDF import function. Similarly, raw documents can be imported one at a time from local files, as described in Import Single Document. No external source is configured with this connector type. Users can also use the editor to create new documents.
- sitemap.xml: If a website supports the sitemaps protocol, a configured sitemap.xml connector will harvest its content accordingly.
- URL list: This connector simply fetches content from all of the URLs listed in its configuration. Note: a site must not block crawlers, or its URLs will be skipped.
- CMIS: If a website is an interface to a Content Management System and offers a Content Management Interoperability Services (CMIS) service endpoint as defined by the standard, a configured CMIS connector will harvest its content accordingly.
- Amazon S3: Creates a new Corpus and imports documents from one or more S3 buckets.
- Local directory: Creates a new Corpus from a directory of files in the local file system. This connector type can only be created by system administrators.
Depending on the connector type, you will be asked to provide different settings as required to connect to the source.
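As an illustration of what a sitemap-based source exposes, the following Python sketch lists the page URLs declared in a sitemap.xml document, using the XML namespace defined by the sitemaps protocol. It is not EDG's connector implementation: the function name `list_sitemap_urls` and the sample data are hypothetical, and a real harvester would also need to fetch the pages and handle sitemap index files.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(sitemap_xml):
    """Return (loc, lastmod) pairs for every <url> entry in a sitemap.

    lastmod is None when the entry does not declare it.
    """
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


# Hypothetical sample sitemap used only for this example.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/docs/intro.html</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.org/docs/guide.html</loc>
  </url>
</urlset>
"""

if __name__ == "__main__":
    for loc, lastmod in list_sitemap_urls(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The optional `lastmod` value is returned alongside each URL because it is the kind of hint a repeated harvest could use to decide whether a page has changed.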
Using the Manage tab of a Corpus, you can later modify the connector parameters specified during creation. This is useful because external data sources may be on remote networks that are not under the creator’s control, and the connector must reflect any changes to them. The connector-specific configuration panel is not available for Corpora configured with No Connector. While the connector’s parameters can be edited after creation, the type of data source (one of the six options above) cannot be changed.
If you will be using this Corpus for Auto Classification, keep the Store copy of all documents in EDG box checked. The documents are stored in a “Corpora” folder in the EDG workspace. Be sure your server has enough disk space for this storage.
Once you have finished configuring your new Corpus, it will appear on the page listing all Corpora. This is the page displayed when you click on the Corpora link in the blue left-navigation pane in EDG. Clicking on an individual Corpus displays its content in the Corpus editor. The first tab of a Corpus, which gives you access to the editor, is called Documents.
Importing for a Corpus
Import Single Document
This allows manually importing an external file into the Corpus, rather than going through a connector or importing an existing RDF representation of the Corpus. It shows a screen where the Browse… button opens a dialog for picking a source file. The file’s text and metadata are parsed by the Apache Tika content analysis toolkit, which handles the supported file formats. The Show Imported data button on the next screen allows you to review the retrieved information. Most supported file formats will present three sections:
- common Metadata Properties such as file name, media type, title, creator;
- Content, which is the actual document’s text (where applicable);
- Other Properties, which are properties that the importer was unable to label and that are therefore referred to by their URIs.
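The three sections above can be sketched as a simple grouping step over the values an extractor returns. This is a hedged illustration, not the importer's actual logic: the function `group_import_data`, the `content` key, and the property keys in `COMMON_LABELS` are assumptions chosen for the example, and the real importer may recognize a different set of properties.

```python
# Hypothetical mapping from extracted property keys to the friendly labels
# shown for common metadata. These key names are assumptions for this sketch.
COMMON_LABELS = {
    "resourceName": "file name",
    "Content-Type": "media type",
    "dc:title": "title",
    "dc:creator": "creator",
}


def group_import_data(extracted):
    """Split extracted values into the three sections of the review screen."""
    sections = {"Metadata Properties": {}, "Content": "", "Other Properties": {}}
    for key, value in extracted.items():
        if key == "content":
            # The document's text, where the format has any.
            sections["Content"] = value
        elif key in COMMON_LABELS:
            sections["Metadata Properties"][COMMON_LABELS[key]] = value
        else:
            # Unlabeled properties keep their original identifier.
            sections["Other Properties"][key] = value
    return sections


if __name__ == "__main__":
    sample = {
        "resourceName": "report.pdf",
        "Content-Type": "application/pdf",
        "dc:title": "Annual Report",
        "http://example.org/prop/pageCount": "12",
        "content": "Annual Report text...",
    }
    for section, values in group_import_data(sample).items():
        print(section, values)
```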
Corpus Manage Tab
In addition to the settings available with other collection types, a Corpus collection has a few additional options:
Note that these options are not available for Corpora configured with No Connector or Local directory. To pick up new or changed files for a Local directory Corpus after the initial creation, clear the Corpus on the Manage tab.
Corpus contents Report
In addition to the normal reports for an asset collection, for any Corpus with a connected data source (production or working copy), the Reports tab > Corpus contents action lists all documents that were either manually imported or retrieved from a remote location with a connector.
Each row in the table represents a single document and shows the URL of the original document (whether a web page or a downloadable file), its media type, the date it was last downloaded from its remote location to the EDG cache, and a hyperlink, shown as a page icon, for downloading the cached copy.
Note that this report is not available for Corpora configured with No Connector. The Corpus editor can be used instead.