How can I search for an exact phrase containing white space using Tridion Docs?

Hello!

I am trying to track down topic objects in our Tridion Docs repository that contain a specific, incorrectly written phrase, "System Designer". When I do a search for this phrase surrounded by quotation marks, I get many results where the only topic text that roughly matches that query is "SystemDesigner", which just happens to be the correctly-written text. I have tried the following queries in hopes I could narrow down the results to the subset of objects that contain the incorrectly written text:

Query                                 Hits in browser (SDL Content Manager)   Hits in Publication Manager
"System Designer"                     678                                     678
System*Designer                       356                                     356
"System Designer" -"SystemDesigner"   0                                       0
"System Designer" -SystemDesigner     0                                       0
"System Designer" -Systemdesigner     165                                     165

There are some things that are inconvenient or strange about these results:

  • White space does not appear to be respected in exact phrase searches. "System Designer" returns results such as "SystemDesigner". 
  • Some objects returned do not contain "System Designer". Oftentimes results will only contain text such as "system design". This looks like stemming to me, but the behavior is inconvenient from within an exact phrase search.
  • According to documentation, search is case insensitive, but as is demonstrated in the final two rows of the table, this does not appear to be the case.

Questions:

  1. Is there a way to get Tridion Docs to respect white space in exact phrase searches?
  2. Does anyone have recommendations on forming a search query to return exact hits?
  3. Is search only case insensitive some of the time? Are there undocumented rules for this?

Thank you in advance for your help!

Note on software:

  • SDL Tridion Docs - Content Manager - Build 13.0.4115.1
  • Publication Manager 13.0.4115.1

    Hi Daniel,

    First of all, thank you for your clear and complete description.

    I assume you found the TD13SP2 documentation topic "Search for content", which explains with examples how text, attributes, attribute values, languages, and XML syntax can be mixed in a query. In essence, stemming is applied, and by default the full-text index is configured as shown below. At query time (that is, when you search), the query is lower-cased on a default system.

    <!-- A text field with defaults appropriate for English: it
             tokenizes with StandardTokenizer, removes English stop words
             (lang/stopwords_en.txt), down cases, protects words from protwords.txt, and
             finally applies Porter's stemming.  The query time analyzer
             also applies synonyms from synonyms.txt. -->
        <!-- English -->
        <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <!-- in this example, we will only use synonyms at query time
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
            -->
            <!-- Case insensitive stop word removal.
              add enablePositionIncrements=true in both the index and query
              analyzers to leave a 'gap' for more accurate phrase queries.
            -->
            <filter class="solr.StopFilterFactory"
                    ignoreCase="true"
                    words="stopwords.txt"
                    enablePositionIncrements="true"
                    />
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory"
                    ignoreCase="true"
                    words="stopwords.txt"
                    enablePositionIncrements="true"
                    />
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
          </analyzer>
        </fieldType>
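
To see why the configuration above can let "System Designer" match both "SystemDesigner" and "system design", here is a rough Python sketch of the analysis chain. It is not Solr: `split_on_case_change` only approximates WordDelimiterFilterFactory with splitOnCaseChange="1", `toy_stem` is a crude stand-in for the Snowball stemmer, and stop words and catenated tokens are ignored. All function names are made up for illustration.

```python
import re


def split_on_case_change(token):
    """Approximate splitOnCaseChange="1": break at lower-to-upper transitions."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def toy_stem(word):
    """Crude stand-in for the Snowball stemmer (assumption, not the real algorithm)."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Whitespace-tokenize, split on case change, lower-case, then stem."""
    tokens = []
    for raw in text.split():
        raw = raw.strip(".,;:!?\"'()")
        for part in split_on_case_change(raw):
            tokens.append(toy_stem(part.lower()))
    return tokens


def phrase_matches(query, document):
    """True if the analyzed query tokens appear consecutively in the document."""
    q, d = analyze(query), analyze(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


print(analyze("SystemDesigner"))    # ['system', 'design']
print(analyze("Systemdesigner"))    # ['systemdesign']
print(phrase_matches("System Designer", "Open SystemDesigner first"))    # True
print(phrase_matches("System Designer", "good system design matters"))   # True
print(phrase_matches("System Designer", "the Systemdesigner tool"))      # False
```

Because the case-change split runs before lower-casing, "SystemDesigner" and "Systemdesigner" end up as different token sequences. That would be consistent with the last two rows of your table behaving differently even though matching itself is lower-cased, although the real filters do more than this sketch.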
    Expecting exact, "human" results from a full-text index is walking the edge. After all, who ever goes to the 10,000th hit on Google?
     
    Another way to look at this is to use the full-text index as your first filter, then retrieve the DITA XML files from that result set and check them with an exact Find.
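
The second step could look something like the minimal sketch below. It assumes the candidate topics from the search results have already been exported to a local folder as `.xml` files. The folder name and the function `files_with_exact_phrase` are hypothetical, and nothing here calls a Tridion Docs API.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def files_with_exact_phrase(folder, phrase):
    """Return the XML files whose text contains `phrase` exactly (case-sensitive)."""
    hits = []
    for path in sorted(Path(folder).rglob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            # Skip files the parser cannot read (for example, undefined entities).
            continue
        # Join all text nodes and collapse whitespace so that line breaks
        # or inline markup inside the phrase do not hide a match.
        text = " ".join("".join(root.itertext()).split())
        if phrase in text:
            hits.append(path)
    return hits


if __name__ == "__main__":
    # "exported_topics" is a placeholder for wherever the files were saved.
    for hit in files_with_exact_phrase("exported_topics", "System Designer"):
        print(hit)
```

The check is a plain substring test, so "SystemDesigner" and "system design" are not reported. Only the literal phrase with the space is.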
     
    Best wishes,
    Dave