How to handle mixed-language HTML source files


We've been asked to hook up a new content source whose content is created across multiple countries and in multiple languages. Right now it looks like the source files will be mixed-language HTML documents, and I'm wondering how best to handle them.

The only out-of-the-box way I can think of is using an HTML attribute condition to skip unwanted source languages (e.g. only extract if the attribute is lang="en"). However, I tried this in a different scenario a while ago and found that, although the functionality is available in the HTML filter, it doesn't actually work. For SDL staff reference: this is my support case 00118003, defect numbers WS-7935 and WS-7934. Has this issue been fixed in a recent WS release? I can't find any mention of it in the release notes (neither under fixed issues nor under known issues).
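
To make the idea concrete, here is a rough sketch of the kind of filtering I mean, written as a pre-processing step outside WS. It is plain Python using only the standard library, not the WS HTML filter, and the names LangTextExtractor and extract_text are made up for illustration. It assumes the markup is well-formed with explicit closing tags, and it treats a region subtag such as "en-GB" as matching "en".

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LangTextExtractor(HTMLParser):
    """Collect text only from elements whose effective lang matches `wanted`.

    Hypothetical helper for illustration; assumes explicit closing tags.
    """

    def __init__(self, wanted="en"):
        super().__init__(convert_charrefs=True)
        self.wanted = wanted.lower()
        self.lang_stack = [None]  # effective language of each open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        lang = dict(attrs).get("lang")
        # An element without its own lang inherits from its parent.
        self.lang_stack.append(lang.lower() if lang else self.lang_stack[-1])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        current = self.lang_stack[-1]
        text = data.strip()
        if not current or not text:
            return
        if current == self.wanted or current.startswith(self.wanted + "-"):
            self.chunks.append(text)


def extract_text(html, wanted="en"):
    """Return the text chunks found inside elements marked with `wanted`."""
    parser = LangTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    '<div lang="en"><p>Hello<br>world</p><p lang="de">Hallo</p></div>'
    '<p lang="fr">Bonjour</p><p lang="en-GB">Colour</p>'
)
print(extract_text(sample))
```

Content with no lang attribute anywhere in its ancestry is skipped in this sketch. Whether that is the right default depends on how consistently the source files are tagged.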

Can you think of any other ways to go about this?