Segmentation rule - in TMs? What about before creating a TM? Segmenting is being done!

I'm reading the segmentation documentation. I'm missing something - it seems to suggest that segmentation rules are in TMs. But before one ever creates a TM, one begins translating in Studio. And it's breaking the original text into segments.  For literary translation I want the segments to be smaller - much smaller. Not sentences, but phrases. In short, what am I missing? Thanks in advance.

  • Hi John,
    When you begin translating in Studio, you will probably have created your very first TM already. Any single file opened will follow the default rules of the TM that appears at the top of the list under General Options. (File>Options>Language Pairs>All language pairs>Tm). Click the TM at the top of the list (or the only one there), then click Settings>Language Resources>Segmentation Rules>Edit.

    Here, you can select Paragraph based segmentation. Perhaps this is what you're looking for? Studio will start a new segment every time it comes to a hard carriage return.

    After changing any default segmentation rules, you will need to process your file again for the new rules to be applied.
    HTH,
    Emma
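
To make the difference between the two modes concrete, here is a minimal Python sketch. It is only an illustration and not Studio's actual engine: the function names and the sample text are made up, and the sentence rule is assumed to be "break after . ! or ? when whitespace follows".

```python
import re

def paragraph_segments(text):
    """Start a new segment at every hard return (newline)."""
    return [p.strip() for p in text.split("\n") if p.strip()]

def sentence_segments(text):
    """Within each paragraph, break after ., ! or ? when whitespace follows."""
    segments = []
    for paragraph in paragraph_segments(text):
        segments.extend(re.split(r"(?<=[.!?])\s+", paragraph))
    return segments

# Hypothetical sample: two paragraphs separated by one hard return.
sample = "It was late. Nobody spoke; the clock ticked.\nThen the door opened."

print(paragraph_segments(sample))  # two segments, one per paragraph
print(sentence_segments(sample))   # three segments, one per sentence
```

Note that the semicolon in the sample does not start a new segment under either rule.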

  • Hi,

    In addition to the answer from Emma you might find this useful:

    multifarious.filkin.com/.../

    Old article now but the principles have not changed.

    Regards

    Paul

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • I see, that makes sense. FYI, I'm looking for segmentation by the smallest possible unit. The units of translation, functionally, are words and phrases. In truth, I find the basic principle of choosing longer segments surprising. What I mean is: the program specifically defaults to NOT breaking at a semicolon - and yet you should have complete grammatical sentences before and after a semicolon. Why would you want to establish a standard only for this COMBINATION of sentences, and not for each sentence in the combination?
  • Paul, fascinating. I'm reading this and (i) on a certain level it makes no sense to me at all (I'll explain why in a minute), and (ii) on another level, it raises issues that I'd want to resolve.
    It makes no sense because a paragraph-based segment seems incredibly unlikely ever to be useful again, so what's the point of a program trying to remember it? What use would a TM containing such segments be to anyone?
    But I do agree that I often find myself facing a dilemma when working in Studio: for one reason or another, I want to translate a segment in a way that doesn't necessarily limit itself to reproducing what's in the source text, but I don't want to "pollute" my TM (and surely I'll never want to vary the original in the same way again). Ideally, I'd be able to have two entries - the one that'll be expressed in "Save target as" and a (potentially very different) one that would go into the TM.
  • One more question that occurs to me - if I wanted to use paragraph segmentation for literary translation and sentence segmentation for legal translation, is there any easy way to shift between them? What are the options that come to mind? Thanks for advising on this!
  • Unknown said:
    if I wanted to use paragraph segmentation for literary translation and sentence segmentation for legal translation, is there any easy way to shift between them? What are the options that come to mind? 

    Two separate translation memories is the way to go. One with paragraph segmentation and the other with sentence segmentation.

    Then, to organise your work even better, set up two separate projects, one literary, one legal, and simply add new files to those projects as you go.

    If, for some reason you want to, say, refer to your legal TM when you're doing a literary translation, simply add the legal TM as the 2nd in the list in that project. It won't affect text segmentation but you'll be able to use it for reference (concordance).

  • Unknown said:
    I see, that makes sense. FYI, I'm looking for segmentation by the smallest possible unit. The units of translation, functionally, are words and phrases. In truth, I find the basic principle of choosing longer segments surprising. What I mean is: the program specifically defaults to NOT breaking at a semicolon - and yet you should have complete grammatical sentences before and after a semicolon. Why would you want to establish a standard only for this COMBINATION of sentences, and not for each sentence in the combination?

     
    Paragraph segmentation is ideal if you want to restructure the paragraph, which is why it could be useful in literary translation.
    A semicolon doesn't break the sentence into two segments by default because a translator might well want to build the sentence differently, without a semicolon, and want the whole sentence in one segment to work on it. Of course this depends on the language pair, the genre and the author's style. Hence the option to split and merge segments on an individual basis.
     

    What use would a TM containing such (long) segments be to anyone?
    The TM itself may not be much use, although it will give you a bigger view of each chunk, which may help in the future. The whole idea is to be able to restructure the target language paragraph as you translate.
     

     Ideally, I'd be able to have two entries - the one that'll be expressed in "Save target as" and a (potentially very different) one that would go into the TM. 
    No problem: translate the segment and add it to the TM by clicking "confirm but do not move to next segment". Then translate it again and click Ctrl+down arrow to move to the next segment. If you like, you can change the segment status to "translated" by right-clicking it; this doesn't save it to the TM.
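
The semicolon point in this reply can be pictured as a rule whose set of break characters is configurable. The Python sketch below assumes a simplified rule ("break after one of these characters when whitespace follows"); `split_after` and the sample sentence are hypothetical and do not reproduce how Studio stores its rules.

```python
import re

def split_after(text, break_chars):
    """Break after any character in break_chars when whitespace follows."""
    pattern = r"(?<=[" + re.escape(break_chars) + r"])\s+"
    return re.split(pattern, text.strip())

source = "He knocked twice; nobody answered. He left."

# Assumed default: the semicolon is not a break character.
default_rule = split_after(source, ".!?")
# Same rule with the semicolon added as a break character.
with_semicolon = split_after(source, ".!?;")

print(default_rule)    # 2 segments: the whole first sentence stays together
print(with_semicolon)  # 3 segments: the first sentence is cut at the semicolon
```

With the semicolon left out, the translator gets the whole sentence to rebuild freely; adding it yields smaller segments but fixes the cut before translation starts.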
  • Unknown said:
    One more question that occurs to me - if I wanted to use paragraph segmentation for literary translation and sentence segmentation for legal translation, is there any easy way to shift between them?

    Surely you'd use two TMs in this case?  I doubt you'll get much leverage for a literary translation out of a legal TM anyway, and vice versa.  If you don't then perhaps you should.

    And the easiest way to manage this is to use Project Templates... one for legal and one for literary work.  Then creating projects is simple as the templates are already set up to reflect the type of translation it is.

    Paul Filkin | RWS Group


  • Unknown said:
    FYI, I'm looking for segmentation by the smallest possible unit.

    Can you precisely define the term "smallest possible unit" using "TECHNICALLY" RECOGNIZABLE terms like number of characters, kind of characters, sequence of certain characters, etc.?! Simply something TECHNICALLY analyzable/recognizable by the segmentation engine... I doubt it. And that's exactly the reason why it cannot be done the way you want, and why CAT tools segment the way they do.

    What you are asking for is some AI superengine which would actually UNDERSTAND THE MEANING of the text and divide it into LINGUISTICALLY "logical" parts (segments). But then you are in completely the wrong forum... you should perhaps check some Google or MIT labs or similar...
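
As an illustration of what "technically recognizable" means here, the Python sketch below defines a phrase-level rule purely in terms of characters: break after a comma, semicolon, colon, full stop, exclamation mark or question mark when whitespace follows. The rule, the function name and the abbreviation list are all assumptions made for the example, not Studio's actual rules.

```python
import re

# Assumed phrase-level rule: break after , ; : . ! ? when whitespace follows.
BREAK = re.compile(r"(?<=[,;:.!?])\s+")

# Hypothetical exception list: never break right after these tokens.
ABBREVIATIONS = frozenset({"Dr.", "Mr.", "Mrs."})

def phrase_segments(text, exceptions=frozenset()):
    """Split on the character rule, re-joining pieces that end in an exception."""
    if not text.strip():
        return []
    segments, carry = [], ""
    for piece in BREAK.split(text.strip()):
        candidate = f"{carry} {piece}" if carry else piece
        if candidate.split()[-1] in exceptions:
            carry = candidate          # hold this piece and glue it to the next
        else:
            segments.append(candidate)
            carry = ""
    if carry:
        segments.append(carry)
    return segments

sentence = "When she arrived, Dr. Hale was gone; the room was cold."
print(phrase_segments(sentence))                 # "Dr." becomes its own segment
print(phrase_segments(sentence, ABBREVIATIONS))  # "Dr. Hale was gone;" stays whole
```

The rule only sees characters, so it cuts after "Dr." until someone lists that abbreviation as an exception. It has no notion of whether a cut falls at a meaningful phrase boundary.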

  • Emma, Paul, thanks. This is immensely helpful. Two templates it is. (Previously I had one template for each language pair.) Is there a way I can make use of the old templates? Would that way (the only way?) be what Emma suggested - putting it in as a second TM? It does seem a great deal of work to just discard.