Looking for 'special' numbers in the TM

In 2016, a long-standing client introduced a style guide for numbers/measures used in their manuals. Before then, they were using the AmE number format, i.e. 34,000 for thirty-four thousand and 12.15 for twelve point one five. In 2016 they decided to follow a technical standard under which the thousands separator is a non-breaking space (i.e. 34 000) and the decimal separator remains the point (12.15). They decided that the translated manuals should also stick to this rule, regardless of the local custom -- for example, in my country we use the comma as the decimal separator, but I should stick to the point (no pun intended) for this client.

After 5 years, my TM has a mix of bad sources and bad targets due to these style changes. I would like to fix my TM so that anything I pre-translate follows the new style guide.

 

1) How do I look for numbers such as 34,000 or 16,700 in the TM? I have tried [0-9],[0-9] to no avail. I'd rather avoid exporting the TM to *.tmx for XBench as it is quite large -- I am sure Studio can handle this. But how?

2) How do I implement a QA (in either Studio or XBench) to check that the number formatting in the target matches the source? I.e. if the source reads 12.15 the target should read 12.15 as well, and not 12,15 as it did before.

 

Thanks!

  • Hi  

    Unknown said:
    1) How do I look for numbers such as 34,000 or 16,700 in the TM? I have tried [0-9],[0-9] to no avail.

    Unfortunately you can't search a Studio TM very efficiently.  Regex is not supported so you only have wildcards and these are pretty useless really.

    Unknown said:
    2) How do I implement a QA (in either Studio or XBench) to check that the number formatting in the target matches the source? I.e. if the source reads 12.15 the target should read 12.15 as well, and not 12,15 as it did before.

    This question is sort of linked to the second part of your first question because the solution is the same for both:

    Unknown said:
    I'd rather avoid exporting the TM to *.tmx for XBench as it is quite large -- I am sure Studio can handle this. But how?

    First of all, if you want to QA the TM you need to break it into bite-sized chunks.  The best way to do this is via TMX, and there is a very handy little app on the appstore called SDLTmConvert that can do this for you.  You can find an article here about the exact process and how to use QA with your TM to improve it:

    https://multifarious.filkin.com/2013/03/15/memory_wisdom/

    Then, to add the QA you wanted, you need to use a "Grouped Search Expression - report if source matches but not target".  You can use this to search for the source:

    (?=\d)([0-9.,]+)

    Then this in the target:

    $1

    Keep in mind that this doesn't do any verification on whether the source or the target is correct.  It just finds any pattern containing numbers, commas and periods in the source and checks the target to make sure that it contains exactly the same string.  So it should achieve what you wanted.  I added the lookahead just to avoid finding commas or periods on their own.
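
    For readers who want to see what this check amounts to outside Studio, here is a minimal Python sketch. It is only an approximation under stated assumptions: it assumes the check takes each source match, substitutes it for $1 and looks for that exact string in the target. The function name missing_in_target and the sample segments are made up for illustration, and Studio's own engine may differ in details.

```python
import re

# Paul's source expression: a run of digits, points and commas that starts with a digit.
SOURCE_PATTERN = re.compile(r"(?=\d)([0-9.,]+)")


def missing_in_target(source, target):
    """Return number-like strings from the source that are not found verbatim in the target.

    Assumption: this mimics "report if source matches but not target"
    with $1 as the target expression. It is not Studio's actual code.
    """
    missing = []
    for match in SOURCE_PATTERN.finditer(source):
        token = match.group(1)  # the text that $1 would stand for
        if token not in target:
            missing.append(token)
    return missing


# Hypothetical segment pair: the target still uses the decimal comma.
print(missing_in_target("Torque: 12.15 Nm at 34,000 rpm",
                        "Coppia: 12,15 Nm a 34,000 giri/min"))
# prints ['12.15']
```

    Note that 34,000 is not reported here because it appears unchanged in the target, which is exactly the "no verification of correctness" caveat above.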

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • , thank you for your answer. I have read your post and since I would be losing context, that's not an option. These manuals are 90% repeated as my client adds some paragraphs here and there whenever they issue a new version.

    I have tried exporting the TM (which has 25,000 TUs) to *.tmx and loading it in Xbench. I tried several RegEx expressions but yours returns an error (incomplete expression, it reads). I cannot find a working RegEx string for Xbench. 

    I have tried this RegEx string in the Editor (in Studio, on the bilingual files): [0-9]+[.,]?[0-9]*

    I found it on the internet, but I am not sure it does what I am looking for. It finds some numbers, but of course there is no flagging if the number format in the source does not match the one in the target.

     

    1) In xBench, what RegEx expression should I use? 

    2) Why does a RegEx expression that works in the Studio Editor not work in Xbench?

     

    I really need to have this sorted before continuing my translation. Please help! 

  • Unknown said:

    I have tried this RegEx string in the Editor (in Studio, on the bilingual files): [0-9]+[.,]?[0-9]*

    I found it on the internet, but I am not sure it does what I am looking for. It finds some numbers, but of course there is no flagging if the number format in the source does not match the one in the target. 

    Hi Paola,

    If I can chime in: for an error message to be triggered when the number format in the target doesn't match the source, you need to add the expression suggested by Paul in your Verification settings, not in the Editor.

    So you need to go to Project Settings - Verification - QA Checker - Regular Expressions, enter a description, paste Paul's regex in the Regex source box and $1 in the Regex target box, select the Condition dropdown and choose Grouped search expression - report if source matches but not target, then select Add Item in the Action dropdown, and you should be all set.

    After doing this, whenever you enter the wrong number format in the target and confirm the segment, you will see an error symbol displayed next to the segment status.

  • Thank you both!

      - I was able to QA the files, thank you! 

      - I read your post again and I was able to export my TM from *.sdltm to *.xlf, load it in Studio and run a QA using the RegEx expression you originally provided. It works, but it is a very slow process. I made 6 *.xlf files with 5000 TUs each and it takes time for the automatic LookUp, the fixes, etc. In every file, there are over 1000 segments that fail this RegEx check. I believe acting on the *.tmx file in xBench would be faster, as it is a no-frills software.  I hope there is someone on this forum who could help me fix my TM (paid).

     

    For example, the expression Paul suggested needs to be fixed.

    Currently it flags instances like:

    "For further details see Chapter 5, in Section 2, where the engine faults are explained." It is flagged because the expression recognizes "5," and "2," as numbers. I am only interested in numbers that contain a separator followed by more digits (i.e. 34,000 or 12.15).

    I also realized that the customer is now writing 34 000, so I would also need to catch non-breaking spaces used as a thousands separator.

     

    Maybe we should tackle the issue from the target side, and check that:

    1) the decimal separator is the point

    2) the thousands separator is the non-breaking space, but only if the number has more than 4 integer digits (i.e. 1500 is okay without a separator, but 20 000 and 100 000 need one)
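
    The two target-side rules above can be sketched in Python as follows. This is a rough illustration, not a Studio or Xbench feature: the rule labels, the RULES list and the check_target function are hypothetical, and the patterns will also flag things that are not measurements (for example a long part number), so the results would still need a human review.

```python
import re

# Each rule is (label, pattern). Both are assumptions derived from the style guide
# described in this thread, not an official rule set.
RULES = [
    # Any comma between digits is wrong in the new style,
    # whether it was a decimal comma (12,15) or a thousands comma (34,000).
    ("comma between digits", re.compile(r"\d+,\d+")),
    # Five or more digits in a row in the integer part means the
    # non-breaking space is missing (20000 instead of 20 000).
    # The lookbehind skips digits that follow a decimal point.
    ("5+ digits without separator", re.compile(r"(?<![\d.])\d{5,}")),
]


def check_target(target):
    """Return a list of (rule label, offending text) found in a target segment."""
    problems = []
    for label, pattern in RULES:
        for match in pattern.finditer(target):
            problems.append((label, match.group(0)))
    return problems


# A target that follows the new style (\u00a0 is the non-breaking space).
print(check_target("Max 20\u00a0000 rpm, 12.15 bar, 1500 h"))
# prints []

# A target with both kinds of problem.
print(check_target("Max 20000 rpm, 12,15 bar"))
# prints [('comma between digits', '12,15'), ('5+ digits without separator', '20000')]
```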

  • A suggestion to prevent matching 5 comma, 2 comma. Try this instead of the original expression:

    (?=\d)(\d+[,.]\d+)

    This will match, for example 34,000 and 12.15, but not "5, word".
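
    To make the difference concrete, here is a small Python comparison of the original and the revised expression on a made-up sentence. The helper name found and the sample text are only for illustration.

```python
import re

ORIGINAL = re.compile(r"(?=\d)([0-9.,]+)")
REVISED = re.compile(r"(?=\d)(\d+[,.]\d+)")


def found(pattern, text):
    """Return everything the pattern captures in group 1, in order."""
    return [m.group(1) for m in pattern.finditer(text)]


text = "See Chapter 5, in Section 2, for 34,000 rpm and 12.15 bar."

print(found(ORIGINAL, text))
# prints ['5,', '2,', '34,000', '12.15']

print(found(REVISED, text))
# prints ['34,000', '12.15']
```

    The revised expression requires digits on both sides of the separator, so "5," and "2," drop out. It does not yet cover numbers written with a non-breaking space, such as 34 000.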
  • To check that the decimal separator has been used, assuming you only have two decimal places, you could create a rule with a target check with this expression:

    (?=\d)(\d+\.\d{2}\D)
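
    One thing to watch with this expression, shown in the Python sketch below: the trailing \D needs an actual character after the two decimals, so a number at the very end of a segment is not matched. A possible variation (my assumption, not part of the suggestion above) is to replace \D with a negative lookahead, (?!\d), which also keeps the trailing character out of the match.

```python
import re

SUGGESTED = re.compile(r"(?=\d)(\d+\.\d{2}\D)")
# Hypothetical variation: "not followed by a digit" instead of "followed by a non-digit".
VARIATION = re.compile(r"\d+\.\d{2}(?!\d)")

# Number in the middle of the segment: the suggested pattern matches,
# and the captured text includes the following space.
print(repr(SUGGESTED.search("Pressure 12.15 bar").group(1)))
# prints '12.15 '

# Number at the very end of the segment: no character left for \D to consume.
print(SUGGESTED.search("Pressure 12.15"))
# prints None

# The variation matches in both positions.
print(VARIATION.search("Pressure 12.15").group(0))
# prints 12.15
```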
  • Hi ,

    It's late and my brain is getting tired... but would this work?

    (?!\d,\s)(?=\d)([0-9.,\s]+)
    $1

    Regards

    Paul Filkin | RWS Group


  • Paul, I really don't know how you do it, it's not even 6 pm here and I'm ready to turn off my computer!
  • Me too... all done now :-)

    Paul Filkin | RWS Group


  • Hi Paola,

    I noticed you also posted your questions in the Xbench forum, so I answered them there.

    Regards,

    Josep.

  • Yes, and thank you so much for it. As Xbench has powerful search capabilities but is not an editor, it is possible to search directly in *.sdltm files (no need to convert them to *.tmx!) but any editing should be done in Studio. Josep also made a suggestion on this forum to add this editing capability in the future. Please vote for it if you'd like to be able to edit *.sdltm files more easily.

    So, even if I was finding all the instances that needed fixing, I was not able to run the same search in Studio as I could not use the same RegEx strings in the TM Editor. What I did was as follows:

    • For the thousands separator: I looked in the target for .1 / .2 / .3 / .4 / ... .0 and then ran a find and replace (find the point, replace it with a non-breaking space). So any 34.000 became 34 000.
    • For the decimal separator: I looked in the target for ,1 / ,2 / ,3 / ,4 / ... ,0 and then ran a find and replace (find the comma, replace it with a point). So any 12,15 became 12.15

    Then I used Xbench to see if I had missed something, and manually fixed those few occurrences in Studio TM Editor.
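
    For anyone who would rather script the same two replacements on exported text, here is a minimal Python sketch. It rests on assumptions that should be stated loudly: the targets are still in the old local style (point for thousands, comma for decimals), a point is only treated as a thousands separator when exactly three digits follow it, and the function name apply_new_style is made up. A target that already follows the new style and has three decimals (e.g. 12.150) would be changed wrongly, so the output still needs checking.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def apply_new_style(text):
    """Convert old-style numbers in a target segment to the client's new style.

    The order matters: points must become non-breaking spaces BEFORE commas
    become points, otherwise the freshly created decimal points could be
    mistaken for thousands separators.
    """
    # Step 1: point between a digit and exactly three digits -> non-breaking space.
    text = re.sub(r"(?<=\d)\.(?=\d{3}(?!\d))", NBSP, text)
    # Step 2: comma between two digits -> point.
    text = re.sub(r"(?<=\d),(?=\d)", ".", text)
    return text


print(repr(apply_new_style("34.000 giri/min, 12,15 bar")))
# prints '34\xa0000 giri/min, 12.15 bar'
```

    The comma after "giri/min" is left alone because it is not between two digits.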

     

    So thank you everybody, I am pretty happy with what I was able to do today. I fixed my entire TM, and when the client sends their amended sources (i.e. in 2018 their documents read 34 000), pre-translating will already propose the correct translation (34 000).

     

     

     
