How to get word count for all topics in SDL in a particular period of time?

Hi Developers,


Is there any way to get the word count of all topics in the SDL repository for a particular period of time, for example the last year?

Thanks

Roopesh N

  • Hi Roopesh,

    As always, it helps to know which product version you are asking about, as that could open up more options. I will assume the latest fielded release, Knowledge Center 2016SP1 Content Manager (12.0.1).

    Typically there is an IWrite* plugin configured, which you can see in the Web Client > Settings > Write Plugin XML configuration. It is called ISHSYSWORDCOUNT and is responsible for filling the language-level FISHWORDCOUNT field, which is present on every content object (Maps, Topics, Libraries, ...).

    The implementation of the plugin is a balance between performance and accuracy. It doesn't do full language segmentation. It simply knows how to treat XML and counts space characters as word delimiters, so it is a simple word count. (KC2016SP1 - IWriteMetadataAndBlobPlugin - BlobWordCount)
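
To give a feel for what a space-based count does, here is a minimal Python sketch. It is not the plugin's actual code, only an illustration of the idea: drop the XML markup, then count whitespace-separated tokens. The function name `simple_word_count` and the sample topic are made up for this example.

```python
import re


def simple_word_count(xml_text: str) -> int:
    """Approximate word count: drop markup, then count whitespace-separated tokens."""
    # Replace every tag with a space so text from adjacent elements does not merge.
    text = re.sub(r"<[^>]+>", " ", xml_text)
    # str.split() with no argument splits on any run of whitespace.
    return len(text.split())


sample = (
    "<topic><title>Install the client</title>"
    "<p>Run the <b>setup</b> wizard.</p></topic>"
)
print(simple_word_count(sample))
```

A count like this has the same weakness mentioned above: it says little for languages that do not separate words with spaces, and inline markup inside a word would be counted as a word break.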

    So now you know whether and where we store a word count on every content object. Next is deciding what the time period means. I would suggest using the language-level field FISHLASTMODIFIEDON, which is updated every time a new blob is submitted into the repository.

    Something worth thinking about is whether you want to do this only on the latest version or on all versions per object, and whether you want to do it for one language (e.g. the source language) or for all languages. As said, the word count is space-based, so it is not valuable for all languages.

    Those fields can be added in some Client Tools list views, but this probably wouldn't meet your aggregation expectations. Also adding them to Web Client CSV reports still leaves a gap with your suggested "Last one year" as these are typically scoped to per-publication-export.

    Then we have the API

    • A very simplistic approach could be to use DocumentObj25.Find and provide "<ishfield name='FISHLASTMODIFIEDON' level='lng' ishoperator='greaterthan'>01/01/2016</ishfield>". This one is at your own risk: depending on the size of your database, it could return a lot of information, resulting in heavy queries, load, memory usage, timeouts, and so on.
    • It is more advisable to apply this 'FISHLASTMODIFIEDON' filter to smaller groups rather than to your entire repository, for example by iterating over your folder structure and applying the date filter in DocumentObj25.RetrieveMetadata.
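
If you build the metadata filter in code rather than by hand, a small helper avoids quoting and escaping mistakes. The sketch below only constructs the filter XML with Python's standard library. It does not call the API, and the helper name `build_modified_since_filter` is hypothetical. The `<ishfields>` wrapper is an assumption, since the snippet above shows only the inner `<ishfield>` element.

```python
import xml.etree.ElementTree as ET


def build_modified_since_filter(date_text: str) -> str:
    """Return an <ishfields> filter matching objects modified after date_text."""
    root = ET.Element("ishfields")
    field = ET.SubElement(
        root,
        "ishfield",
        {
            "name": "FISHLASTMODIFIEDON",
            "level": "lng",
            "ishoperator": "greaterthan",
        },
    )
    # The date is passed through as text; its format must match what the server expects.
    field.text = date_text
    return ET.tostring(root, encoding="unicode")


print(build_modified_since_filter("01/01/2016"))
```

The resulting string is what you would pass as the metadata filter argument of whichever call you end up using.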

    I hope this gets you started.

    Best wishes,

    Dave

  • Hi Dave,

    Thank you for the good suggestion. I am using 2016SP1. I am now able to get the word count for topics and have implemented it at the publication level. My issue now is that I have no way to find out how many publications/topics are available in the repository. What I am trying to do is get all publication GUIDs, publish the word count object in Publication Manager, and get the total count. Is there any way to get all GUIDs available in the SDL repository?
    Thanks
    Roopesh
  • Hi Roopesh,
    As to the question of how many publications are available in the repository, if you run Baseline2.5GetList with no filter attached, that will give you a list of all baselines in the system. Iterate over each of those identifiers calling PublicationOutput2.5UsingBaseline and you can aggregate a list of all publications in the repository.

    That still would not give you a list of all topic identifiers / versions in the system. I guess you could recursively iterate over the repository folder tree for that.
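
A rough Python sketch of that folder walk, with the repository access left out on purpose. The two callables `list_subfolders` and `word_counts_in_folder` are hypothetical placeholders for whatever API calls you use to list child folders and to read FISHWORDCOUNT for the objects in one folder. The demo data is an in-memory stand-in, not real repository content. An explicit stack is used instead of recursion so deep folder trees cannot hit Python's recursion limit.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def walk_folders(
    root_id: str, list_subfolders: Callable[[str], Iterable[str]]
) -> Iterator[str]:
    """Yield root_id and every folder below it, depth-first."""
    stack = [root_id]
    while stack:
        folder_id = stack.pop()
        yield folder_id
        stack.extend(list_subfolders(folder_id))


def total_word_count(
    root_id: str,
    list_subfolders: Callable[[str], Iterable[str]],
    word_counts_in_folder: Callable[[str], Iterable[int]],
) -> int:
    """Sum the per-object word counts of every folder under root_id."""
    return sum(
        sum(word_counts_in_folder(folder_id))
        for folder_id in walk_folders(root_id, list_subfolders)
    )


# In-memory stand-in for a folder tree and its per-object word counts.
demo_tree: Dict[str, List[str]] = {"root": ["a", "b"], "a": ["a1"], "a1": [], "b": []}
demo_counts: Dict[str, List[int]] = {"root": [], "a": [120, 80], "a1": [40], "b": [10]}

print(total_word_count("root", demo_tree.__getitem__, demo_counts.__getitem__))
```

Working folder by folder keeps each request small, which is the point of avoiding one repository-wide query.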

    At any rate, as far as I understand it would not make sense to have an API call that would return all GUIDs available in the repository, because some objects using GUID may be internal.

    Hope this helps,
    Joakim
  • I agree with Joakim. As initially stated, the repository size is something to take into account when writing your solution. Compare it to listing all files in your C:\Windows\ folder: doable, but it might be slow. Now consider listing all files on all network drives; that probably requires a different approach.

    * Recursively iterating over the repository folder tree could work.
    * Use DocumentObj25.Find, or PublicationOutput25.Find for that matter, and provide "<ishfields><ishfield name='MODIFIED-ON' level='lng' ishoperator='greaterthanorequal'>01/01/2016</ishfield><ishfield name='MODIFIED-ON' level='lng' ishoperator='lessthan'>02/01/2016</ishfield></ishfields>". Querying with a datetime range in the filter might keep the result sets workable, and you can concatenate the results.
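
The date-range idea can be sketched in Python as follows: split the period into month-sized windows and build one filter per window. This only generates the filter strings; the API call and the merging of results are left to you. The helper names and the dd/MM/yyyy date format are assumptions, so check the format your server expects.

```python
from datetime import date
from typing import Iterator, Tuple
import xml.etree.ElementTree as ET


def month_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (window_start, window_end) pairs covering [start, end) month by month."""
    current = start
    while current < end:
        # First day of the month after `current`.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        window_end = min(next_month, end)
        yield current, window_end
        current = window_end


def build_range_filter(
    field_name: str, start: date, end: date, fmt: str = "%d/%m/%Y"
) -> str:
    """Return an <ishfields> filter for start <= field_name < end."""
    root = ET.Element("ishfields")
    for operator, value in (("greaterthanorequal", start), ("lessthan", end)):
        field = ET.SubElement(
            root,
            "ishfield",
            {"name": field_name, "level": "lng", "ishoperator": operator},
        )
        field.text = value.strftime(fmt)
    return ET.tostring(root, encoding="unicode")


for window_start, window_end in month_windows(date(2016, 1, 1), date(2016, 4, 1)):
    print(build_range_filter("MODIFIED-ON", window_start, window_end))
```

Because each window is half-open (start included, end excluded), no object falls into two windows, so the per-window results can be concatenated without removing duplicates.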
  • Hi
    I tried this and ended up with a mess!

    Is there any way to get this using ISHRemote?

    Thanks in Advance
    Roopesh