Duplicate TUs in TM (Studio 2014) w/ & w/o values/attributes - How to dedupe?

Hi everyone,
My setup: Studio 2014 SP2 Pro on Windows 7 Pro.

I have a TM with duplicates (same source, same target); the only things that differ are the attributes and field values. In each pair, one TU has no fields or custom values, and the other has some defined. Several people work on this TM (translator, proofreader, and me doing final QA before delivery). Translators do not use the "Add as new translation" function, which is used to keep multiple targets for one source.
Before sending the TM to the client, I import all sdlxliff files from the project, specify the attribute/field values, and choose "Replace" when importing so that only the latest validated TU is kept.

 

 

I want to dedupe and, for each pair of duplicate TUs, keep only the TU that has the values/attributes defined. How can I do that?

Using the "Search in Potential Duplicates Only" option in Studio also seems to return false positives, and I don't know whether it is possible to add a filter that says "target is 100% identical" AND "one TU has no fields defined whereas the other has".
Thanks a lot.

  • ok - I give up!! I have spent half an hour trying to create the situation you have got and I can't. Studio won't allow duplicates if I import TMs, if I manually try to create them in the TM Maintenance View, if I export to TMX and create them in there, or even if I edit the SDLTM with a SQL Editor (actually I corrupted the TM trying that!).

    I never thought it would be this difficult!!

    Can you share your TM with me so I can have a play with the process to remove them?

    Thanks

    Paul

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Using the batch task "Update Main TM" or "Finalize" with default option "Merge translation units" usually leads to unwanted duplicates in my experience.
  • Thanks Raphaël. At the end of a project, I always import all sdlxliff files (with the "Replace" option and values defined), but indeed some users may use the "Update Main TM" batch task during the project...
  • Hi all,

    I cannot reproduce this problem at all.  I do see the issue with TUs not really being 100% duplicate as most of the ones returned in the potential duplicate search are slightly different.

    So I had a little play with a SQLite editor, as I think the only way to do a smart deletion of the true duplicates with null field values would be to query the TM database directly.  Dangerous I think, but once you had the query it would probably be simple enough to run whenever you needed it.  Now, I don't know my way around SQL queries well enough to do this yet, but I think this is an interesting approach.  For example, I can run this script:

    And it returns all the 100% source duplicates based on there being 2 or more of them like this:

    Quite a few duplicates based on source segments only.  But I think if this script was extended so it did these things:

    1. reported on duplicates in source AND target only

    2. filtered these to only list those with empty field values

    3. deleted them

    Then you'd have what you need.  Perhaps someone in the community is already familiar with writing SQL scripts for these sorts of things and can help to develop it further.  Could be interesting to have a bunch of scripts for doing other maintenance operations as well where it would be possible through SQL to do all kinds of complex queries with no limitations imposed by software.

    Maybe not a solution for everyone, and definitely back up your TM before trying this, but I like this idea.
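A minimal sketch of the three steps listed above, not the script referred to in this post. It runs against a toy in-memory SQLite database rather than a real SDLTM; the table and column names (`translation_units`, `string_attributes`, `source_segment`, `target_segment`, `translation_unit_id`) are assumptions for illustration and should be checked against the actual schema in a SQLite editor. A real TM may also keep field values in more than one table, which this sketch does not cover. It only deletes a TU when an identical source/target TU that does carry field values exists, so a pair where neither TU has values is left alone.

```python
import sqlite3

# Toy stand-in for an SDLTM file (a SQLite database). All table and column
# names are assumptions for illustration, not a confirmed SDLTM schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE translation_units (
        id INTEGER PRIMARY KEY,
        source_segment TEXT NOT NULL,
        target_segment TEXT NOT NULL
    );
    CREATE TABLE string_attributes (
        translation_unit_id INTEGER NOT NULL,
        value TEXT NOT NULL
    );
    INSERT INTO translation_units VALUES
        (1, 'Hello', 'Bonjour'),      -- duplicate pair, no field values
        (2, 'Hello', 'Bonjour'),      -- duplicate pair, has a field value
        (3, 'Hello', 'Salut'),        -- same source, different target
        (4, 'Goodbye', 'Au revoir');  -- unique, no field values
    INSERT INTO string_attributes VALUES (2, 'ClientA');
""")

# Steps 1 and 2: find TUs that have no field values themselves but share
# both source AND target with another TU that does have field values.
FIND_EMPTY_DUPLICATES = """
    SELECT tu.id
    FROM translation_units AS tu
    WHERE NOT EXISTS (
        SELECT 1 FROM string_attributes AS a
        WHERE a.translation_unit_id = tu.id
    )
    AND EXISTS (
        SELECT 1
        FROM translation_units AS other
        JOIN string_attributes AS a2 ON a2.translation_unit_id = other.id
        WHERE other.id <> tu.id
          AND other.source_segment = tu.source_segment
          AND other.target_segment = tu.target_segment
    )
    ORDER BY tu.id
"""
to_delete = [row[0] for row in conn.execute(FIND_EMPTY_DUPLICATES)]

# Step 3: delete them inside a transaction (rolled back if anything fails).
with conn:
    conn.executemany(
        "DELETE FROM translation_units WHERE id = ?",
        [(tu_id,) for tu_id in to_delete],
    )

remaining = [
    row[0] for row in conn.execute("SELECT id FROM translation_units ORDER BY id")
]
print("deleted:", to_delete, "remaining:", remaining)
```

Running the SELECT on its own first and reviewing the returned IDs before deleting anything is the safer order, and only ever on a copy of the TM.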

    Regards

    Paul

