How to handle mixed-language HTML source files

Question

Hi,

We've been asked to hook up a new content source with content created across multiple countries in multiple languages. Right now it looks like the source files will be mixed-language HTML documents and I'm wondering how to best handle them.

The only out-of-the-box way I can think of is using an HTML attribute condition to skip unwanted source languages (e.g. only extract if attribute "lang=en"). However, I tried this in a different scenario a while ago and found that, although the functionality is available in the HTML filter, it doesn't actually work. For SDL staff reference: this is my support case 00118003, defect numbers WS-7935 and WS-7934. Has this issue been fixed in a recent WS release? I can't find any mention of it in the release notes (neither under fixed nor under known issues).
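
To make the intent concrete, here is a minimal sketch of the filtering I'm after, written as a standalone Python pre-processing step and not as anything WorldServer itself provides. The names LangTextExtractor and extract_text are hypothetical, and the sketch assumes the source files mark language with "lang" attributes that child elements inherit.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LangTextExtractor(HTMLParser):
    """Hypothetical helper: collects text whose effective language matches."""

    def __init__(self, wanted="en"):
        super().__init__()
        self.wanted = wanted
        self.stack = []      # (tag, effective lang) for each open element
        self.segments = []   # text found under the wanted language

    def _current_lang(self):
        return self.stack[-1][1] if self.stack else None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        # An element without its own lang inherits the parent's language.
        lang = dict(attrs).get("lang") or self._current_lang()
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        lang = self._current_lang()
        # Compare the primary subtag only, so "en-GB" counts as "en".
        if text and lang and lang.lower().split("-")[0] == self.wanted:
            self.segments.append(text)


def extract_text(html, wanted="en"):
    parser = LangTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.segments


sample = (
    '<html><body>'
    '<p lang="en">Hello <b>world</b></p>'
    '<p lang="de">Hallo<br>Welt</p>'
    '<div lang="en-GB">Colour</div>'
    '<p>No language</p>'
    '</body></html>'
)
print(extract_text(sample))
```

Text with no language declared anywhere above it is skipped here; whether that is the right default depends on how consistently the authors tag their content.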

Can you think of any other ways to go about this?
Thanks!
Stephan

Eric Bishop · Answer

Another option, which I haven't tested but which might work, is to use a namespace. It may be that you could count only those elements that are prefixed with a given namespace. You could give it a try.
