How can I handle tags in QA Checker 3.0->Regular expressions?

Hi everyone!

 

I am an avid user of the QA Checker 3.0->Regular Expressions feature. Lately I have been working with pretranslated machine translations that I have to edit, and they contain many embedded mistakes, so custom regular expressions are extremely useful to me for checks.

 

For example, it's great to have a check that says that, if there are 5 consecutive digits, they should be split in order to use a space as a thousands separator. However, if there are only 4 digits, no thousands separator must be used.
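A minimal Python sketch of that digit check follows. The function name `find_digit_grouping_errors` and both patterns are illustrative assumptions, not something shipped with QA Checker, and Studio uses the .NET regex flavour, so the patterns may need adjusting there.

```python
import re

# Five or more digits in a row with no separator, e.g. "12345".
UNGROUPED = re.compile(r"(?<!\d)\d{5,}(?!\d)")
# A four-digit number split by a space, e.g. "1 234",
# but not the start of a longer grouped number such as "1 234 567".
OVERGROUPED = re.compile(r"(?<!\d)(?<!\d )\d \d{3}(?! ?\d)")

def find_digit_grouping_errors(text):
    """Return the offending number strings in the order they appear."""
    hits = [m for p in (UNGROUPED, OVERGROUPED) for m in p.finditer(text)]
    return [m.group() for m in sorted(hits, key=lambda m: m.start())]

if __name__ == "__main__":
    for sample in ("12345", "12 345", "1234", "1 234", "1 234 567"):
        print(repr(sample), find_digit_grouping_errors(sample))
```

In a QA rule, each of the two patterns would presumably be entered as its own "report if target matches" check.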

 

That said, I find I have a problem with tags. Nothing I use seems to be able to detect or work with them. For example, if I want to check whether there IS some text in the target NEXT TO A TAG (for example, "está a 3<TAG> de distancia"), there is nothing I can do, as far as I know.

 

It's funny that Advanced Display Filter DOES detect tags in its regular expressions. For example, I can filter the target by <.*> and it will show ALL segments that include at least ONE tag (as tags are enclosed in <>). However, that trick does not work in QA->Regular Expressions.

 

Or am I mistaken?

 

And, if there is a limitation here, here's also an idea for a great improvement for the future!!! It cannot be that complicated, and it would take QA Regular Expression checks to the next level of usefulness!

 

Is there any way to mark as error, for example, any target segments that END in a tag?

I have tried this...

And it DOES not work. However, when used in Advanced Display Filter

It does a FANTASTIC job of filtering as expected (that is, segments ENDING in a tag)

 

So... is there a limitation in QA Checker that makes it incapable of handling tags, or am I doing something wrong? Any ideas on how to create a Regular Expression check that handles tags (for example, one that checks whether a target segment ENDS in a tag)?

 

Thanks!

  • Personally I don't believe that the behavior of Advanced Display Filter is intentional... I rather believe that due to its rather 'simple-minded' approach it's not treating tags as tags, but rather as plain text... and that's why your construct with pointy brackets works.
    Your construct with pointy brackets simply SHOULD NOT work, as there aren't any pointy brackets in the segments on your screenshot.

    Regarding the "to be able to search for tags" requirement - there was already some discussion about it and maybe even an idea for Studio enhancement created...
  • Well, first of all, that use is extremely useful to filter (or filter out) segments with tags or otherwise handle segments with tags. I have a translation engine which is ruined by tags, so I filter all segments with tags, change their status to Translation Rejected, then filter those out (or lock them, or whatever) and then I can work ONLY with segments with no tags at all.

    So yes, that use is extremely useful, so I hope it's not unintentional!

    About the second part of your reply, this is not about "searching for tags".. it's about being able to handle tags in QA as happens in Display Filter. I am not particularly interested in Searching for Tags in the use I am thinking of. Being able to handle tags in such a way with Regex Expressions would be... well... a life saver for me!
  • The thing is that in order to be able "to handle tags" (i.e. "to handle some generic tokens") you need to be able "to identify the tags/tokens" in the first place... i.e. to be able to identify somehow that "this is a tag/token".
    And this is not, or should not be, possible currently, simply because there isn't any built-in way to identify a "tag/token" in the regexes... they work just as standard regexes, without any such additional functionality.
    As I said, the fact that the pointy-bracket constructs currently work for you is apparently just a coincidence and not intentional... and actually a BUG, since it prevents filtering for segments that contain actual text enclosed in pointy brackets without polluting the results with segments that contain tags but no text in pointy brackets!

    I do understand that the current functionality is very useful for you and I do understand why. I am just saying "enjoy it while you still can"... but (hopefully!) do NOT expect this apparently incorrect behavior to be carried over to the regex QA checker.
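The ambiguity described above can be shown with a small Python sketch. It assumes, purely for illustration, that the filter sees tags rendered as text in pointy brackets; the sample segments and the helper name `looks_tagged` are hypothetical.

```python
import re

# The filter expression from the original question.
TAG_FILTER = re.compile(r"<.*>")

def looks_tagged(displayed_text):
    """True if the text contains anything enclosed in pointy brackets."""
    return TAG_FILTER.search(displayed_text) is not None

# Assumed rendering of a segment that contains real tags.
segment_with_tags = "Click <b>OK</b> to continue"
# A segment with no tags, only literal pointy brackets in the text.
segment_with_literal = "Press <Enter> to continue"

if __name__ == "__main__":
    # Both segments pass the filter, so the regex alone cannot tell them apart.
    print(looks_tagged(segment_with_tags), looks_tagged(segment_with_literal))
```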

  • Hi Evzen,

    I agree with you that using pointy brackets for identifying tags is not a good idea.

    For example, if I am searching for "foo bar", I will only find it if there are no unexpected tags in the text. A segment with "foo UNEXPECTED TAG bar" will not be found by the so-called "Advanced" display filter.

    But there was an outpouring of relief by Studio users in another thread. They were so happy that they had a way to identify tags. As a result, I limited myself to suggesting that a "Tag" button be added so that you could specify whether or not you wanted to identify tags.

    If you did not want to identify tags, then tags would be deleted before the text was sent for regex processing. If you wanted to identify tags, then tags would be converted to "<tag text>" before sending for regex processing. So "foo TAG bar" would be converted to "foo bar" or "foo <tag> bar" depending on the "Tag" button.
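A rough Python sketch of that "Tag" button idea follows. Everything in it is hypothetical: the `Tag` class, the `prepare_for_regex` function and the representation of a segment as a list of text runs and tags are assumptions made for illustration, not Studio's API.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for an inline tag in a segment."""
    name: str

def prepare_for_regex(items, identify_tags):
    """Build the string that would be handed to the regex engine.

    identify_tags=False drops tags; identify_tags=True renders them as <name>.
    """
    parts = []
    for item in items:
        if isinstance(item, Tag):
            if identify_tags:
                parts.append(f"<{item.name}>")
        else:
            parts.append(item)
    return "".join(parts)

# The example from the original question: a digit directly followed by a tag.
segment = ["está a 3", Tag("TAG"), " de distancia"]
DIGIT_BEFORE_TAG = re.compile(r"\d<[^<>]*>")

if __name__ == "__main__":
    print(prepare_for_regex(segment, identify_tags=False))
    print(prepare_for_regex(segment, identify_tags=True))
```

With the button on, a check such as `DIGIT_BEFORE_TAG` can see the tag; with it off, the same check finds nothing because the tag has been removed from the text.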

    Adding a "Tag" button would also inform Studio users that the "Advanced" display filter was doing something different. The way things are now, unless you read one of these threads there is no way you could possibly know that the "Advanced" display filter is unable to reliably find "foo bar".

    The current flawed solution is popular, so I doubt it will disappear. I think it will very likely be expanded into other areas where regex is used.

    SDL is probably also happy with this solution because they only have to modify the text sent for regex processing, not the actual regex code itself. For example, using "\<" and "\>" to identify the start and end of tags would require modifying the regex code, which could introduce bugs into regex processing -- if it was not done carefully.

    Another possibility is outlined below. I have not thought this through fully, but imagine that we produce two versions of the input string, one without tags and one with tags, using "<" and ">" to delimit the tags.

    Then start matching the string without tags against the regex expression until it fails. If it fails because the regex expression is expecting a tag, then switch to the version of the input string that includes tags and see whether this string satisfies the regex expression.

    If the string with tags does match, then continue checking the match until either the end of the string or the end of the tag. At the end of the tag, switch back to the string without tags and continue matching until the match fails. If it fails because the regex expression is expecting a tag, then ... and so on.

    From my point of view, this has the advantage that "unexpected" tags are simply ignored. After all, if you want to match a tag, then you presumably know where it is going to be and can include it properly in your regex search expression. You cannot predict where any unexpected tags might occur in your text, so there is no way you can include them in a regex expression.

    So if I am looking for "foo bar TAG1", the "Advanced" display filter will currently not find "foo UNEXPECTED TAG bar TAG1".

    However, the solution I outlined above would find a match.

    For example, if we create two versions of the input string, namely "foo bar" and "foo <unexpected tag> bar <tag1>" and then start matching "foo bar" against the regex expression "foo bar <tag1>", then the match will not fail until "<tag1>" is reached. This means that the "<unexpected tag>" section of the second input string is just ignored, since we are processing the first string at that point. When the match fails, we switch to the second string and start matching again. Since the regex expression is looking for "<tag1>" at this point, and that is precisely what is in the second string, the regex expression is considered a match.

    Given the unpredictable location of many tags in the files I work with, an approach like this is the only proper way I can imagine for handling them.

    Of course, someone else might say that they only want to find segments that are exact matches, i.e. they don't want to skip over anything. So they only want the match to succeed on "foo bar TAG1" and fail on "foo UNEXPECTED TAG bar TAG1".

    Finding a solution that handles every conceivable situation is the difficulty here. Do we need another button: "Ignore the unexpected" ??
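The matching idea above, together with an "Ignore the unexpected" switch, can be sketched in Python as follows. This is a deliberately simplified illustration: it handles only literal text and named tags, not full regex syntax, and the `Tag` class, `find` function and `ignore_unexpected` parameter are hypothetical names. Reading text while skipping tags plays the role of the string without tags; comparing a tag directly plays the role of the string with tags.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for an inline tag."""
    name: str

def flatten(items):
    """Split text runs into single characters; keep tags as Tag objects."""
    out = []
    for item in items:
        if isinstance(item, Tag):
            out.append(item)
        else:
            out.extend(item)
    return out

def matches_at(segment, pattern, start, ignore_unexpected):
    pos = start
    for expected in pattern:
        if not isinstance(expected, Tag) and ignore_unexpected:
            # Pattern expects text: read the "tagless" view by skipping tags.
            while pos < len(segment) and isinstance(segment[pos], Tag):
                pos += 1
        # Pattern expects a tag (or strict mode): read the "tagged" view as is.
        if pos >= len(segment) or segment[pos] != expected:
            return False
        pos += 1
    return True

def find(segment_items, pattern_items, ignore_unexpected=True):
    """True if the pattern occurs anywhere in the segment."""
    segment, pattern = flatten(segment_items), flatten(pattern_items)
    return any(matches_at(segment, pattern, i, ignore_unexpected)
               for i in range(len(segment) + 1))

segment = ["foo ", Tag("unexpected"), "bar ", Tag("tag1")]
pattern = ["foo bar ", Tag("tag1")]

if __name__ == "__main__":
    print(find(segment, pattern))                           # unexpected tag skipped
    print(find(segment, pattern, ignore_unexpected=False))  # exact match required
```

Note that in this sketch an unexpected tag sitting directly in front of an expected tag still causes a failure, which is one of the cases a real design would have to decide on.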

    Hopefully SDL will consider questions like these before they go any further with regex tag matching.

    Best regards,
    Bruce Campbell
    ASAP Language Services

  • It's no surprise that it is a useful feature. As things are now, we can only do a few things with tags, namely check their existence, order and spacing. As important as they are, we can't do anything more with them. We cannot check with a regular expression whether they are incorrectly placed, for example, so I, for one, would welcome a way to include them in all relevant Trados features: Advanced Display Filter, searches and regex QA. Advanced Display Filter does not really work well either, as I have NOT been able to find a way to detect, for example, target segments ending in a period and a tag. I don't remember right now the exact need I had, so I can't show you the exact difficulty, but while Advanced Display Filter works well for basic searches, it tends to fail if you try to go further.

    I, for one, think that your tag button solution would work well, as long as "FOO UNEXPECTED TAG1 BAR" can be matched by just a default "FOO BAR" search. I checked, and it's as you say: if there is an unexpected tag, Advanced Display Filter will NOT find the string "FOO BAR".

     

    Imagine the possibilities if we could integrate tags into our Regular Expressions in a successful way! ADF is just flawed, because it adds a very interesting functionality, but wrongly and incompletely.

     

    I now remember the problem I had with ADF. I wanted to find source segments that ended in a tag and, BEFORE that, NOT a period. In other words, segments such as this:

    For that, I used this 

    And it does not work, as it still does match segments such as this one as well.

    So its behaviour is quite erratic, and tag handling in ADF is not really well implemented. But remember, I want to be able to use tag identification BOTH in Advanced Display Filter AND in QA Regular Expression checks!!!
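For reference, here is a minimal Python sketch of the check described above: a segment that ends in a tag with no period immediately before it. It assumes tags are exposed to the regex as text in pointy brackets, which is only an assumption about how ADF presents them, and the pattern and function name are illustrative rather than the expression used in the post.

```python
import re

# One or more tags at the very end of the segment, where the run of tags
# is not immediately preceded by a period. The ">" in the lookbehind stops
# the match from starting in the middle of a run of trailing tags.
ENDS_IN_TAG_NO_PERIOD = re.compile(r"(?<![.>])(?:<[^<>]*>)+$")

def ends_in_tag_without_period(displayed_text):
    return ENDS_IN_TAG_NO_PERIOD.search(displayed_text) is not None

if __name__ == "__main__":
    for sample in ("Click here<b/>", "Click here.<b/>",
                   "Click <b>here</b>", "Click <b>here.</b>"):
        print(repr(sample), ends_in_tag_without_period(sample))
```

Using `[^<>]*` inside the tag instead of `.*` keeps the match from stretching across several tags and the text between them, which is one common reason a pattern of this kind matches more than intended.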

     

  • Hi,

    I don't intend to get drawn into this conversation too much (I need some time to digest it properly as you guys write so much ;-)), but I think you are probably correct with regard to the Advanced Display Filter. It was never intended to find tags in the way you are suggesting. But we do have the Community Advanced Display Filter, where we are adding things all the time, and we have actually added an option to filter on tags. It doesn't search inside the tags (although if this is what people want, we or any other developer could add it, as it is open source), but it allows you to filter on segments with tags. I have not published it yet because Dan Lucas raised an issue with the colour filtering which we are trying to resolve (proving difficult, as the API seems to be inconsistent in how it handles segment information in this regard and we want to avoid using the sdlxliff if we can), but I may publish the new things we have already added anyway in the meantime. Perhaps at the end of the day.

    Regards

    Paul

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Unknown said:
    an issue with the colour filtering which we are trying to resolve (proving difficult as the API seems to be inconsistent in how it handles segment information in this regard and we want to avoid using the sdlxliff if we can)

    So I hope that internal high-priority ticket for immediate fix of that inconsistency has been already logged...

  • Well... once we get to the bottom of why we are seeing this inconsistency we'll be able to log it. Not everything is a bug. Studio is a big program, there are many ways to skin a cat, and we have many teams working on different parts of the application. My team only focuses on apps using the public APIs, since we like the work we do to support external developers' efforts as well; once we can see how this should be handled, we'll know which way to address it. It might just be a knowledge thing... we also don't know everything, and we sometimes come across code written in 2009 or earlier that only becomes an issue when we try to do something that the core application does not need to do.

    Paul Filkin | RWS Group


  • Hmm... this got a little off-topic :) The initial query was whether there is a way to use tags in the way I described (that is, in Regex Expressions in QA Checker 3.0). I only mentioned the Advanced Display Filter as an aside, because it does allow some tinkering with tags, whereas QA Checker Regex Expressions apparently do not...

    Any ideas, Paul?
  • Hi,

    There is no way to check whether a segment ends in a tag in the QA Checker.

    Paul Filkin | RWS Group
