RWS and Pantopix held a joint webinar on July 12, 2022, to discuss taxonomies and knowledge graphs, one of the top trends in applied artificial intelligence today. It covered the benefits of applied AI in technical documentation and the ways in which semantic AI is used to develop taxonomies and classify content. The speakers were Karsten Schrempp, Founder and MD at Pantopix, and Jorg Schmidt, Senior Solution Consultant at RWS, each with 20+ years of experience in the industry.
Pantopix offers intelligent solutions and optimal information processes in technical communication. RWS offers IP services, regulatory services, language services, and language & content technology (Tridion Docs for technical documentation).
Karsten briefed the audience on knowledge graphs and taxonomies, starting from the basics:
- Introduction to knowledge graphs
- Building taxonomies
- Case Study: Pool Party & Tridion Docs
- Building knowledge graphs
Basics of Knowledge Graphs and Taxonomies
Knowledge graphs and taxonomies are complementary components of an enterprise data ecosystem. A knowledge graph delivers an overall structure to an enterprise knowledge domain by interlinking data points, while taxonomy provides a hierarchical structure to key lists and terms attached to the underlying data/documents.
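The difference between the two structures can be illustrated with a minimal sketch. The terms, relation names, and product names below are hypothetical and were not shown in the webinar. The taxonomy is modeled as a hierarchy of broader terms, while the knowledge graph is modeled as a list of typed links between data points.

```python
# Hypothetical example data; none of these names come from the webinar.

# A taxonomy is a hierarchy: each term points to its broader (parent) term.
broader = {
    "Repair manual": "Manual",
    "User guide": "Manual",
    "Manual": "Technical documentation",
}

def ancestors(term):
    """Walk up the hierarchy and return all broader terms, nearest first."""
    path = []
    while term in broader:
        term = broader[term]
        path.append(term)
    return path

# A knowledge graph interlinks data points with typed relations
# (subject, relation, object), not only parent/child links.
triples = [
    ("Pump X100 repair manual", "has_type", "Repair manual"),
    ("Pump X100 repair manual", "describes", "Pump X100"),
    ("Seal kit S-7", "is_spare_part_of", "Pump X100"),
]

def related(entity):
    """Return every triple in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]

print(ancestors("Repair manual"))
print(related("Pump X100"))
```

The taxonomy answers "where does this term sit in the hierarchy?", while the graph answers "what is this thing connected to, and how?".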
Karsten recommends a CCMS (component content management system) and XML as the starting point for building knowledge graphs and taxonomies, as these simplify the creation of technical documentation. Another advantage of starting with a CCMS is that it promotes modularization (adding structure and guidelines) and translation management of content in a collaborative environment.
One can start using taxonomies to drive efficiencies around:
- Conditional content/filtering (an intelligent set of rules and conditions to hide or display specific content in search results)
- Simplified folder and storage structure
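As a rough illustration of conditional filtering, the sketch below hides or shows topics depending on the taxonomy terms attached to them. The topic titles, tag names, and the `filter_topics` helper are assumptions made for this example, not part of any product API.

```python
# Hypothetical topics tagged with taxonomy terms (illustrative data only).
topics = [
    {"title": "Replace the filter", "tags": {"Model A", "Maintenance"}},
    {"title": "Replace the filter (Model B)", "tags": {"Model B", "Maintenance"}},
    {"title": "Safety notes", "tags": {"Model A", "Model B", "Safety"}},
]

def filter_topics(topics, required_tags):
    """Keep only topics that carry every required taxonomy term."""
    required = set(required_tags)
    return [t["title"] for t in topics if required <= t["tags"]]

print(filter_topics(topics, ["Model A"]))
```

A real system would typically express such rules as conditions on the content itself, but the principle is the same: the taxonomy supplies the controlled terms that the rules test against.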
Many systems sit between the CCMS and the content delivery platform (CDP). They hold a large amount of data that needs to be connected and fed into an application, e.g., to generate catalogs, which is a lot of work to do manually. Often the content is not even accessible because of access restrictions. Hence there is a need for a platform or plane (depicted in the figure below) that can connect all these data points from different systems.
This plane is called a “knowledge graph”. Essentially, a knowledge graph is a model for a certain knowledge domain: a network of semantically linked concepts, entities, relationships, and content objects. A knowledge graph connects various data based on semantic metadata and semantic relations.
Metadata is used to assign descriptive data (file size, author name, author gender, date created, etc.) to a particular file or page. Semantic metadata is a more intelligent form of metadata because it interconnects the alternative names used for the same data point. For example, one system may use ‘sex’ as a metadata field while another uses ‘gender’; semantic metadata can connect the two and return results from both.
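The ‘sex’ versus ‘gender’ example can be sketched in a few lines. The alias table and the `normalize` helper are assumptions for illustration; a real semantic layer would hold these equivalences in the taxonomy or knowledge graph instead of a hard-coded dictionary.

```python
# Assumed mapping from system-specific field names to one shared concept.
FIELD_ALIASES = {
    "sex": "gender",
    "gender": "gender",
    "creator": "author",
    "author": "author",
}

def normalize(record):
    """Rename each metadata key to its shared concept name, where one is known."""
    return {
        FIELD_ALIASES.get(key.lower(), key.lower()): value
        for key, value in record.items()
    }

# The same document described by two systems with different field names.
system_a = {"Author": "J. Doe", "Sex": "F"}
system_b = {"creator": "J. Doe", "gender": "F"}

print(normalize(system_a) == normalize(system_b))
```

Once both records use the shared concept names, a single query on `gender` finds content from either system.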
Taxonomy provides the vocabulary to assign metadata. Taxonomy in documentation is a collection of “Document Types”. A Document Type is the definition of a ‘logical type of document’ that is used by different business processes. Simple examples of Document Types are Invoices, Medical Records, IRS Forms W-2, Contracts, etc. In technical documentation, the Document Types can be product manuals, repair manuals, user guides, API documentation, etc.
Thus, taxonomy and semantic metadata work together to provide an easy way for users to find content.
Karsten explained how to build a taxonomy:
- Define a goal to build the taxonomy (scheme of classification)
- Analyze what is needed to reach the goal (website, technical documentation, web shop, Product Information Management)
- Extract a long list of metadata and hierarchies (manually and/or using a tool)
- Store it in a transparent central system
- Harmonize (requires maximum effort as usually different departments communicate in different ways about market groups, products, components, etc.)
Building taxonomies with the Pool Party tool in Tridion Docs (covers steps 3-5 above)
Jorg mentioned that RWS has a long-standing partnership with Pool Party and has integrated it into Tridion Docs. The Pool Party suite has different components, highlighted in separate colors in the figure below. The components in blue manage the knowledge graphs, while the others use knowledge graphs to deliver results.
Corpus analysis, in the text mining & NLP section, is the tool used to extract metadata from the underlying content (step 3 of building a taxonomy, described above). Pool Party provides an option to upload documents or simply enter a web URL to crawl, and one can even specify the crawl depth (there are 3 levels).
Once the crawling is complete, the data is extracted. Pool Party then performs the corpus analysis and displays results in four different tabs: Metadata & Statistics (summary/overview), Extracted Concepts, Extracted Terms, and Corpus Documents as depicted in the figure below.
One can then choose which terms to add as candidate terms and which to move to the blacklist to exclude them. Karsten reminded us that there are two ways to build a taxonomy: 1) manually (by looking at the structures that are already there) and 2) with a tool. In this case, a tool (Pool Party) is used to see what the content itself proposes as metadata, which is displayed in the results. Karsten mentioned that manually extracted metadata and the metadata proposed by the tool are often used in tandem, and the results of both methods are then used to build and grow the taxonomy.
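The webinar did not describe how Pool Party extracts terms internally, so the following is only a simplified stand-in for the idea: count single words and short word sequences across a corpus, keep the frequent ones as candidates, and drop anything on a blacklist. The function name, thresholds, and sample sentences are all assumptions.

```python
import re
from collections import Counter

def candidate_terms(documents, max_words=3, min_count=2, blacklist=()):
    """Count 1- to max_words-word phrases and return the frequent ones."""
    blocked = {term.lower() for term in blacklist}
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep phrases that occur often enough and are not blacklisted.
    return {
        term: count
        for term, count in counts.items()
        if count >= min_count and term not in blocked
    }

docs = [
    "Check the hydraulic pump before service.",
    "The hydraulic pump needs a new seal.",
]
terms = candidate_terms(docs, blacklist=["the"])
print(sorted(terms))
```

Plain frequency counting produces noisy candidates such as phrases that begin with an article, which is one reason a review step by subject matter experts is still needed.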
There will be a long list of terms, as visible in the figure above. Experts from each department are then needed to blacklist the terms that are not required. The Pool Party tool looks not only for single words but also for terms made up of two or three words, as demonstrated. Once the terms have been shortlisted, one arrives at the taxonomies depicted inside the red rectangle below:
Leveraging Tridion Docs with semantic AI
With the taxonomy in place, the semantic AI layer works with Tridion Docs in two ways: on the front end, it delivers relevant search results to users looking for content, and on the back end, it helps authors identify content that has already been created around those terms.
Metadata can be applied within Tridion Docs at any level. It can be managed within Tridion Docs or in external systems (the ones highlighted with yellow tags), and knowledge graphs can be used to suggest tags.
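Tag suggestion can be pictured with a deliberately naive sketch: each taxonomy concept has a preferred label and some alternative labels, and a concept is suggested whenever one of its labels occurs in the text. The taxonomy contents and the `suggest_tags` function are hypothetical and do not reflect how Tridion Docs or Pool Party implement smart tagging.

```python
# Hypothetical taxonomy: preferred label -> labels and synonyms to look for.
TAXONOMY = {
    "Hydraulic pump": ["hydraulic pump", "hyd. pump"],
    "Maintenance": ["maintenance", "servicing", "service interval"],
    "Safety": ["safety", "hazard", "warning"],
}

def suggest_tags(text):
    """Suggest every concept whose label or synonym occurs in the text."""
    lowered = text.lower()
    return sorted(
        concept
        for concept, labels in TAXONOMY.items()
        if any(label in lowered for label in labels)
    )

print(suggest_tags("Observe the warning label before servicing the hydraulic pump."))
```

Simple substring matching will also fire inside longer words, so a production tagger would need proper tokenization and linguistic processing; the sketch only shows how a shared vocabulary drives the suggestions.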
Jorg demonstrated the benefits of smart tagging that helps display relevant results to the end user, with options available for both consumers and employees (figure below):
How to build a Knowledge Graph?
The knowledge graph is the next step in the evolution of smart content. A mature knowledge graph acts as an expert in its domain, serving the most relevant results and recommendations. Knowledge graphs are based on a knowledge model, which is, in simple terms, a collection of interlinked descriptions of concepts, entities, relationships, and events. Knowledge graphs add context to data via linking and semantic metadata and thus provide a framework for data integration, unification, analytics, and sharing.
Karsten listed the various steps or questions to build a knowledge graph:
- Which knowledge do you want to provide?
- With which goal?
- For which target group?
- In which way?
- Content
  - Does it exist, and where? (e.g., in the automotive sector, do you have both the spare part information and the service information, so that the two can be connected?)
  - Is it classified? If not, classify it. (Auto-classification systems can speed up the process.)
- Build ontology
  - An ontology is a set of concepts and categories in a domain, together with their properties and the relations between them.
  - It is the semantic base for the knowledge graph (e.g., in the automotive sector, the connections between a topic and a spare part need to be defined).
- Build knowledge graph
- Develop application
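The last three steps can be sketched together, using the automotive example of connecting a service topic to a spare part. Everything here is an assumption made for illustration: the entity types, relation names, sample entities, and the `KnowledgeGraph` class. The ontology states which types a relation may connect, the graph stores only links that respect it, and a small query plays the role of the application.

```python
# Hypothetical ontology: each relation names the entity types it may connect.
ONTOLOGY = {
    "describes": ("Topic", "Component"),
    "has_spare_part": ("Component", "SparePart"),
}

class KnowledgeGraph:
    def __init__(self, ontology):
        self.ontology = ontology
        self.types = {}       # entity name -> entity type
        self.triples = set()  # (subject, relation, object)

    def add_entity(self, name, entity_type):
        self.types[name] = entity_type

    def link(self, subject, relation, obj):
        """Add a triple only if it matches the ontology's allowed types."""
        expected = self.ontology[relation]
        actual = (self.types[subject], self.types[obj])
        if actual != expected:
            raise ValueError(f"{relation} expects {expected}, got {actual}")
        self.triples.add((subject, relation, obj))

    def objects(self, subject, relation):
        """Return all objects linked from the subject by the given relation."""
        return sorted(o for s, r, o in self.triples if s == subject and r == relation)

    def spare_parts_for_topic(self, topic):
        """Follow topic -> component -> spare part."""
        parts = set()
        for component in self.objects(topic, "describes"):
            parts.update(self.objects(component, "has_spare_part"))
        return sorted(parts)

kg = KnowledgeGraph(ONTOLOGY)
kg.add_entity("Replacing the brake pads", "Topic")
kg.add_entity("Front brake", "Component")
kg.add_entity("Brake pad set", "SparePart")
kg.link("Replacing the brake pads", "describes", "Front brake")
kg.link("Front brake", "has_spare_part", "Brake pad set")

print(kg.spare_parts_for_topic("Replacing the brake pads"))
```

In practice the ontology and graph would usually be expressed in a standard such as RDF and managed in a dedicated tool; the in-memory version only makes the division of labor between ontology, graph, and application visible.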
Karsten noted that knowledge graphs, like plants, grow and evolve over time. Hence, at the outset, it is difficult to estimate how much time and effort one will take to build. Karsten rounded off the webinar by listing a few factors that influence the time and effort needed to build a knowledge graph when one starts with a taxonomy for technical documentation:
- Products (how complex are they: how many variants, and how easy are they to configure?)
- Content (content model ready for classification?)
- Metadata (already there at department level or company level?)
- Goal (content delivery or customized documents, etc.?)