How to split .tmx/translation memory by size?

Hi!

I would like to split a large .tmx file by size into smaller files. How can I do that?

  •  

    A TMX is a flat file, so the easiest way is probably to use a text editor. Work out how many files you need based on your size requirements, convert that into the number of lines each split file should contain, and split it up accordingly. Manual, but not difficult.

    Is that enough of an explanation for you? I don't know how comfortable you are with XML files (a TMX is an XML file) or with working in a text editor.
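
The arithmetic described above can be sketched in Python. This is a minimal illustration and not the PowerShell script mentioned later in the thread. It assumes a standard TMX layout (a `header` element followed by a `body` of `tu` elements), counts translation units rather than text lines so that no unit is cut in half, and copies the header into every part so each part stays a valid file on its own. The function names and the sample data are hypothetical, and byte size is only roughly proportional to the number of units.

```python
import copy
import math
import xml.etree.ElementTree as ET


def plan_split(total_bytes, total_units, max_bytes):
    """Return (number_of_parts, units_per_part) for a target part size."""
    parts = max(1, math.ceil(total_bytes / max_bytes))
    units_per_part = max(1, math.ceil(total_units / parts))
    return parts, units_per_part


def split_tmx(tmx_text, units_per_part):
    """Split TMX text into a list of smaller TMX documents (as strings)."""
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    units = root.find("body").findall("tu")
    documents = []
    for start in range(0, len(units), units_per_part):
        # Each part repeats the root attributes and the header.
        part = ET.Element("tmx", root.attrib)
        if header is not None:
            part.append(copy.deepcopy(header))
        body = ET.SubElement(part, "body")
        body.extend(units[start:start + units_per_part])
        documents.append(ET.tostring(part, encoding="unicode"))
    return documents


# Hypothetical five-unit sample standing in for a real TMX export.
SAMPLE = (
    '<tmx version="1.4"><header srclang="en-US"/><body>'
    + "".join(
        f'<tu><tuv xml:lang="en-US"><seg>Source {i}</seg></tuv>'
        f'<tuv xml:lang="de-DE"><seg>Ziel {i}</seg></tuv></tu>'
        for i in range(1, 6)
    )
    + "</body></tmx>"
)

parts, per_part = plan_split(total_bytes=1000, total_units=5, max_bytes=400)
documents = split_tmx(SAMPLE, per_part)
print(parts, per_part, len(documents))  # 3 2 3
```

This version loads the whole file into memory, so it is only meant to show the calculation. A file of several hundred MB would need a streaming approach instead.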

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  •  

    If it helps... I just used ChatGPT to create a PowerShell script to do this. The script is here, along with the sample TMX I used:

    https://github.com/paulfilkin/Powershell_scripts/tree/main/split_TMX

    Also a short video to explain how to use it:

    Paul Filkin | RWS Group

  • I received these errors:

    At C:\Users\shpctac0fffe\SplitTMX.ps1:41 char:60
    + ... script type="application/json" id="client-env">{"locale":"en","featur ...
    +                                                             ~~~~~
    Unexpected token ':"en"' in expression or statement.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:41 char:65
    + ... cript type="application/json" id="client-env">{"locale":"en","feature ...
    +                                                                 ~
    Missing argument in parameter list.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:41 char:716
    + ... tions","custom_inp","remove_child_patch","kb_source_repos"]}</script>
    +                                                                 ~
    The '<' operator is reserved for future use.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:240 char:84
    + ... tion/json" data-target="react-partial.embeddedData">{"props":{"docsUr ...
    +                                                                 ~
    Unexpected token ':' in expression or statement.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:240 char:95
    + ... ":{"docsUrl":"https://docs.github.com/get-started/accessibility/keybo ...
    +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Unexpected token ':"https://docs.github.com/get-started/accessibility/keyboard-shortcuts"' in expression or statement.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:240 char:168
    + ... s.github.com/get-started/accessibility/keyboard-shortcuts"}}</script>
    +                                                                 ~
    The '<' operator is reserved for future use.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:279 char:205
    + ... ink Button--medium Button d-lg-none color-fg-inherit p-1">  <span cla ...
    +                                                                 ~
    The '<' operator is reserved for future use.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:302 char:45
    +             <ul class="list-style-none f5" >
    +                                             ~
    Missing file specification after redirection operator.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:500 char:13
    +       CI/CD & Automation
    +             ~
    The ampersand (&) character is not allowed. The & operator is reserved for future use; wrap an ampersand in double quotation marks ("&") to pass it as part of a string.
    At C:\Users\shpctac0fffe\SplitTMX.ps1:575 char:45
    +             <ul class="list-style-none f5" >
    +                                             ~
    Missing file specification after redirection operator.
    Not all parse errors were reported.  Correct the reported errors and try again.
        + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
        + FullyQualifiedErrorId : UnexpectedToken
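
A note on the log above: every error points at a line inside SplitTMX.ps1 itself, and the quoted lines are HTML (`<script type="application/json" id="client-env">`, `<ul class="list-style-none f5" >`). One possible reading, not confirmed anywhere in this thread, is that the saved .ps1 file is a web page (for example the GitHub page that displays the script) rather than the raw PowerShell source. A minimal Python sketch of a check for that follows. The function name and both sample strings are hypothetical.

```python
def looks_like_html(text):
    """Return True if the text appears to be an HTML page rather than a script."""
    # Ignore a leading BOM and whitespace, then inspect the first 2048 characters.
    head = text.lstrip("\ufeff \t\r\n")[:2048].lower()
    return head.startswith("<!doctype html") or "<html" in head


# Hypothetical file contents used only to exercise the check.
html_page = (
    '<!DOCTYPE html>\n<html lang="en">\n'
    '<script type="application/json" id="client-env">{"locale":"en"}</script>'
)
script = 'param([string]$TmxPath)\nWrite-Host "Splitting $TmxPath"'

print(looks_like_html(html_page))  # True
print(looks_like_html(script))     # False
```

If the check returns True for the saved file, downloading the script again from GitHub's raw file view would be the thing to try.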

  •  

    Most likely due to content that is in your TMX. I did this in a few minutes and only tested it on one 33k TU TMX, which is probably pretty clean content as it came from the EU. The errors might be caused by the way the TMX content is handled or by special characters that are not correctly escaped. Additionally, if the TMX file contains HTML or JavaScript-like content, it might cause parsing issues.

    If you can't fix this yourself then I'd need to have your TMX to resolve it properly.

    Paul Filkin | RWS Group

  •  

    You can zip it and send it to pfilkin at sdl dotcom. If the zipped file is still over 15 MB, please just send me a download link via Dropbox, Google Drive or whatever file-sharing application you have.

    Paul Filkin | RWS Group

  •   

    Thanks, I downloaded it this morning. I don't know if you made a mistake, but you sent me a 2.6 GB SDLTM and not a TMX. Anyway, if you tried to use the script on that, it would cause a problem!

    So I exported the SDLTM to a TMX and tested my script. It didn't error at all, but it failed to extract any segments because it found none. This prompted me to make a few changes to the script, and now it works. I updated it on GitHub, and here's how it works with the TMX created from the SDLTM you provided.
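
The thread does not say why the first version of the script found no segments. As a quick sanity check before splitting, the translation units in a TMX can be counted in a streaming fashion. The following is a minimal Python sketch, assuming units are plain `tu` elements without an XML namespace. The function name and the two-unit sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_translation_units(source):
    """Count <tu> elements in a TMX given a file path or a binary file object."""
    count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "tu":
            count += 1
            # Drop the unit's text and children; the emptied element itself
            # stays attached to its parent.
            element.clear()
    return count


# Hypothetical two-unit sample standing in for a real TMX export.
sample = io.BytesIO(
    b'<tmx version="1.4"><header/><body>'
    b'<tu><tuv xml:lang="en-US"><seg>One</seg></tuv></tu>'
    b'<tu><tuv xml:lang="en-US"><seg>Two</seg></tuv></tu>'
    b'</body></tmx>'
)
print(count_translation_units(sample))  # 2
```

A count of zero on a file that opens fine elsewhere would suggest the units are not where the code expects them, for example because of a different element name or a namespace.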

    Paul Filkin | RWS Group

  •  

    In the meantime, since you seem to be starting with an SDLTM and not a TMX, this application is probably useful for you:

    https://appstore.rws.com/Plugin/111

    I think you will need a paid version of this for the size of the files you are handling, but it's quick and easy to use.  Here's how:

    Paul Filkin | RWS Group

  •   

    Could you confirm whether the .tmx was 626.5 MB after the conversion from SDLTM? I thought files with the same content would be almost the same size (in GB, in my case).

    I tried to run your script in PowerShell installed on Ubuntu Linux and received this error: [Screenshot: a PowerShell error message in a Linux terminal indicating "Unexpected token '"en"' in expression or statement" at line 40.]

    I got the following errors in PowerShell on Windows 11:

    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:40 char:60
    + ... script type="application/json" id="client-env">{"locale":"en","featur ...
    +                                                             ~~~~~
    Unexpected token ':"en"' in expression or statement.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:40 char:65
    + ... cript type="application/json" id="client-env">{"locale":"en","feature ...
    +                                                                 ~
    Missing argument in parameter list.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:40 char:903
    + ... ite_metered_billing_update","ignore_hidden_in_quote_reply"]}</script>
    +                                                                 ~
    The '<' operator is reserved for future use.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:248 char:84
    + ... tion/json" data-target="react-partial.embeddedData">{"props":{"docsUr ...
    +                                                                 ~
    Unexpected token ':' in expression or statement.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:248 char:95
    + ... ":{"docsUrl":"docs.github.com/.../keybo ...
    +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Unexpected token ':"docs.github.com/.../keyboard-shortcuts"' in expression or statement.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:248 char:168
    + ... s.github.com/get-started/accessibility/keyboard-shortcuts"}}</script>
    +                                                                 ~
    The '<' operator is reserved for future use.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:272 char:254
    + ... ink Button--medium Button d-lg-none color-fg-inherit p-1">  <span cla ...
    +                                                                 ~
    The '<' operator is reserved for future use.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:315 char:45
    +             <ul class="list-style-none f5" >
    +                                             ~
    Missing file specification after redirection operator.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:529 char:13
    +       CI/CD &amp; Automation
    +             ~
    The ampersand (&) character is not allowed. The & operator is reserved for future use; wrap an ampersand in double quotation marks ("&") to pass it as part of a string.
    At C:\Users\olavm\Desktop\TMX\SplitTMX.ps1:653 char:45
    +             <ul class="list-style-none f5" >
    +                                             ~
    Missing file specification after redirection operator.
    Not all parse errors were reported.  Correct the reported errors and try again.
        + CategoryInfo          : ParserError: (:) [], ParentContainsErrorRecordException
        + FullyQualifiedErrorId : UnexpectedToken

    Should I put the .tmx in the same folder as the script file? If so, I could not find that instruction under "4. Run the Script:" at https://github.com/paulfilkin/Powershell_scripts/tree/main/split_TMX .

    [edited by: RWS Community AI at 11:09 PM (GMT 1) on 24 Aug 2024]
  •  

    > Could you confirm whether the .tmx was 626.5 MB after the conversion from SDLTM? I thought files with the same content would be almost the same size (in GB, in my case).

    It was indeed 626.5 MB. The sizes definitely won't be the same: an SDLTM is a SQLite database and a TMX is a flat file.
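
Since an SDLTM is a SQLite database and a TMX is XML text, the first bytes of a file are enough to tell them apart. SQLite database files begin with the 16-byte string `SQLite format 3` followed by a NUL byte. The following is a minimal Python sketch; the function name is hypothetical, and a TMX saved as UTF-16 would be reported as "unknown" by this simple check.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"


def sniff_tm_format(first_bytes):
    """Classify a file from its first bytes as 'sqlite', 'xml' or 'unknown'."""
    if first_bytes.startswith(SQLITE_MAGIC):
        return "sqlite"
    # Skip a UTF-8 BOM and leading whitespace before looking for XML markup.
    text = first_bytes.lstrip(b"\xef\xbb\xbf \t\r\n")
    if text.startswith(b"<?xml") or text.startswith(b"<tmx"):
        return "xml"
    return "unknown"


print(sniff_tm_format(b"SQLite format 3\x00\x10\x00"))                   # sqlite
print(sniff_tm_format(b'<?xml version="1.0"?><tmx version="1.4">'))      # xml
```

For a real file, the bytes could be read with `open(path, "rb").read(64)` and passed to the function.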

    Looking at your errors, I'm not sure. They may be related to transferring the script between different operating systems, so check the file encoding and ensure it's saved as UTF-8 without a BOM (byte order mark). In some Unix-like environments, a script with a BOM can cause errors if the BOM is read as part of the shebang (#!) line or precedes other characters: the interpreter might try to read the BOM as part of the script code, leading to syntax errors or unexpected behaviour.
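
The encoding check suggested above can be sketched in Python. A UTF-8 BOM is the three bytes EF BB BF at the very start of a file, available as `codecs.BOM_UTF8`. The function names and the sample bytes are hypothetical.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the bytes start with a UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


# Hypothetical script content saved with a BOM.
with_bom = codecs.BOM_UTF8 + b"param([string]$TmxPath)"
print(has_utf8_bom(with_bom))    # True
print(strip_utf8_bom(with_bom))  # b'param([string]$TmxPath)'
```

To apply this to a file, read it in binary mode, pass the bytes through `strip_utf8_bom`, and write the result back in binary mode.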

    > Should I put the .tmx in the same folder as the script file? If so, I could not find that instruction under "4. Run the Script:" at

    Noted. I enhanced the README. I also added another version of the script, although I'd recommend you don't run it with your file. The performance is awful, and I can't figure out how to speed it up.

    The original script works fine though... put the script in the same folder as the TMX and the split files will also appear in the same folder as the TMX.  As shown in the video I created for you.

    Paul Filkin | RWS Group

  •  

    > Noted. I enhanced the README. I also added another version of the script, although I'd recommend you don't run it with your file. The performance is awful, and I can't figure out how to speed it up.

    I couldn't sleep with that bugging me... I updated it, and now it runs as fast as the original but with a better option for the location of the split files.

    Neither script requires the TMX to be in the same location.

    Paul Filkin | RWS Group
