How do I set up the Document structure for an .md file?

I actually posted on another thread, but I figured maybe I'd get more help if I started my own post. I'm trying to set up a new file type for an .md file. I'm a veteran Trados user, but new to file types and regex, so forgive my extremely basic question, but how do I configure the Document structure so that Trados knows what is translatable text? I've read Paul's post about the inline tags and I think I might be able to figure those out (I'm sure I'll be back if I can't), but I can't even get Trados to display any text at all if I attempt to process the file. 

  • Hi Beatriz,

    We need to see a sample of the file to answer that one.

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • Hi Beatriz,

    Just to follow up on my short response from my phone. The reason we need to see a sample is because of the following:

    1. Whether something is translatable or not is not related to the document structure (unless this is embedded content?)
    2. How you handle this could depend on whether it's a text file format, xml file, html file etc.

    So if you can provide a small sample it would be very helpful. You can also email me the file if you like?

    Regards

    Paul
    pfilkin@sdl.com

  • Hi Beatriz,

    I have no idea how that would work. Studio segments the lines on the paragraph marks in the file, so each line of code is in a separate segment. The regex rule then only applies within each segment.

    I think what you might be able to do is create structure rules instead of the inline tags and this could prevent any of the code between these characters from being parsed at all.
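The segmentation point above can be illustrated outside Studio. This is a minimal Python sketch, not Studio's parser: the sample text is hypothetical, and the pattern is simply one way to describe a whole fenced block. It shows that such a pattern finds the block in the full text but never matches when each line is tested on its own.

```python
import re

# Hypothetical sample: prose, a fenced code block, more prose.
sample = "Intro text\n```\nprint('hello')\n```\nClosing text"

# Pattern for a whole fenced block; (?s) lets '.' cross line breaks.
fence = re.compile(r"(?s)```.*?```")

# Tested against the whole text, the block is found once.
whole_text_hits = fence.findall(sample)

# Tested line by line (one line per segment), the opening and closing
# fences never sit in the same string, so nothing matches.
per_line_hits = [line for line in sample.splitlines() if fence.search(line)]

print(whole_text_hits)
print(per_line_hits)
```

This is why a rule that has to see both fences needs to run before the text is cut into lines.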

    Regards

    Paul

  • Hi Paul,

    Yeah, I think you're right about that. I removed the rule and it didn't change anything. I probably marked it as non-translatable with a different rule. I noticed it didn't work with another file that had the same structure.

    How would I go about creating structure rules? According to the instructions in Trados, the structure rules are there to mark the text that is translatable. I am guessing I would have to create a rule indicating the text is translatable from start (^) to ``` and then from ``` to end ($)? I'm sure it's not that easy! :) There's probably a more complicated regex for that.
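The intended result (everything from the start to a fence, and from a fence to the end, is translatable) can be sketched in Python. This is only an illustration of the goal, assuming a hypothetical sample and a hypothetical helper name `translatable_chunks`; it says nothing about how Trados structure rules are evaluated.

```python
import re

# A whole fenced block, including both fences; (?s) lets '.' span lines.
FENCE = re.compile(r"(?s)```.*?```")

def translatable_chunks(text):
    """Return the non-empty pieces of text that lie outside fenced blocks."""
    return [chunk for chunk in FENCE.split(text) if chunk.strip()]

# Hypothetical sample with two code blocks and prose before, between, after.
sample = "Intro\n```\ncode\n```\nMiddle\n```\nmore code\n```\nOutro\n"
chunks = translatable_chunks(sample)
print(chunks)
```

Splitting on the block itself covers the start-to-fence, fence-to-fence, and fence-to-end cases in one pass, which is the behaviour the structure rules would need to reproduce.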

    Beatriz
  • Hi Beatriz,

    You've got it. That's exactly how you'd have to tackle it. I was actually just sitting here wishing we had a mechanism for marking up the structure you did not want translated which feels easier!!

    It's going to take a couple of rules, I think, as opposed to just one, so I'll have a play and, if I can't do it, I will ask the filetype developer tomorrow if it's even possible. If it isn't, the first approach I gave you, with a library of rules you continually build on, is going to be the best approach.

    Regards

    Paul

  • Thank you so much! If it's possible, it would be the easiest solution here. I think that's really the biggest thing that needs to be hidden from these files. Everything else is just a matter of tags and placeholders that follow pretty straightforward regex rules.

    Btw, do you think SDL is eventually going to incorporate this file type into its default selections? I spoke to a programmer involved in this project yesterday and he indicated that *.md files are becoming more popular in web development and that we are likely to see more of these in the future.
  • Hi Paul,

    Just wanted to follow up and see if you managed to get any info regarding whether it's possible to rearrange the document structure so that I can omit the ``` code tags and everything in between from my *.md documents? I'd like to know either way, whether it can be done or if it's impossible, because if there's no way around it, I may have to go back and talk to the programmers who designed the original files to see if I can work with them in any way to make this easier to feed into Trados. Thanks again for all your help so far!

    Beatriz
  • Hi Beatriz,

    Sorry for the late reply. I have not had a response from development on this yet which tells me it's not so easy! It may be that the best approach is to use a developer to create a custom filetype specifically for your needs. If you expect to have a lot of these files I think that's worth investigating and maybe not too hard using the API.

    Alternatively, your idea to see if the files themselves can be simplified as you are generating them is probably a very good one!

    Either way, if I get a response I will share it in here.

    Regards

    Paul

  • Paul,

    Thanks so much for following up. I've spent the last few days playing around with different regex for the document structure, with help from some colleagues who are better versed in regex than I am, but this seems to go beyond our scope. Because the text between the code breaks does not always follow the same pattern, and in fact sometimes follows a pattern that looks just like translatable text, it's extremely challenging to write regex rules that will gather them all. I've been able to process the document in a way that will filter out the first instance of the ``` code breaks, but I can't figure out how to get Trados to detect them as a repeating pattern. And I also noticed that when I use the "Multiline" feature in order to get it to pick up line breaks, it messes with the entire document's segmentation, something we don't want. I do appreciate your follow up and indeed, if you do get a response (whether positive or negative), please let me know.
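The difference between catching the first code block and catching every one can be shown in plain Python. This is a sketch under assumptions (hypothetical sample, generic fence pattern) and does not describe how Trados repeats a rule. Note that the second block contains a line that reads like ordinary prose, so only the fences, not the content, identify it as code.

```python
import re

FENCE = re.compile(r"(?s)```.*?```")

# Hypothetical sample: the second block's content looks like a sentence.
sample = (
    "One\n```\na = 1\n```\n"
    "Two\n```\nThis line reads like a sentence.\n```\n"
    "Three\n"
)

# search() reports only the first block.
first_only = FENCE.search(sample).group()

# finditer() walks the whole text and reports every block.
all_blocks = [m.group() for m in FENCE.finditer(sample)]

print(first_only)
print(len(all_blocks))
```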

    Regarding your information about using a developer to create a custom file type... do you mean a developer from SDL? If so, how do we make such a request? I do suspect we will see more of these file types in the future, as Markdown documents are becoming more popular in web development, according to some conversations I've had with a few different programmers.

    Thanks,
    Beatriz
  • Unknown said:
    Regarding your information about using a developer to create a custom file type... do you mean a developer from SDL? If so, how do we make such a request? I do suspect we will see more of these file types in the future, as Markdown documents are becoming more popular in web development, according to some conversations I've had with a few different programmers.

    Hi Beatriz,

    I meant your developer, not one of ours. I could also help you find a developer to work on this if you like, one who has experience with creating filetypes. Alternatively, you could use SDL via our Professional Services, as they could also help you with this. So three options, I guess.

    I think Markdown documents in themselves are ambiguous... in fact I believe there are at least nine different "flavours" already and probably a host of user-defined variants. So creating a "standard" would be tricky. I think your guys seem to be following the original syntax rules, but even here handling the code blocks with hard returns on every line is tricky, and this is where a developer could do a better job.

    Regards

    Paul

  • Paul,

    Got it. Makes sense, thanks! I will go back to our developers with this information and see if they could help, then.

    We had come up with one single regex that actually captures the text we want and does not capture what we don't want: (?s)((.+?)(?:```.*?```)+?)+
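For reference, here is how that expression behaves in Python's `re` module on a small hypothetical sample. Treat it as a sketch: Trados runs on .NET, whose engine keeps every repetition of a group in a Captures collection, while Python keeps only the last one, so the grouping results below are specific to Python.

```python
import re

# The single regex quoted above, unchanged.
pattern = re.compile(r"(?s)((.+?)(?:```.*?```)+?)+")

# Hypothetical sample: prose A, block x, prose B, block y, trailing prose C.
sample = "A\n```x```\nB\n```y```\nC"

m = pattern.search(sample)

matched = m.group(0)              # the full match, code blocks included
last_inner = m.group(2)           # Python keeps only the last repetition
unmatched_tail = sample[m.end():] # text after the final code block

print(repr(matched))
print(repr(last_inner))
print(repr(unmatched_tail))
```

Two things stand out: the overall match still contains the code blocks, and the prose after the last block is never matched, because every repetition must end in a fenced block.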

    Unfortunately, it doesn't work because we need opening and closing patterns, and then Trados marks as translatable everything in between those patterns. I managed to write a pair that takes the beginning and takes out everything in between codeblocks and marks as translatable everything in between, which is perfect... except that it also leaves out the rest of the document that is translatable. All in all, I keep thinking there is a way to write several different rules that might work, but it's beyond my grasp. If we end up stumbling into it, I'll post it here for future reference.

    But I suspect I may have to follow your advice and talk to the developers instead to find a way around it.

    I do appreciate all your help! This has been quite the learning experience. Have a great day!

    Beatriz
  • Hi again, Paul!

    So, I actually found a partial solution to this issue. In the Document structure, I wrote the following regex:

    Opening pattern: ^
    Closing pattern: (?s)(```.*?```)

    This works to filter out the codebreaks, but only as long as the document ENDS in a code break. If this is not the case, then it stops after the last codebreak and does not start again where it left off. Does this make sense? I think I might be able to make this rule work if I can alter the closing pattern regex to take this into account OR if it's possible to write another rule that will account for this. So far, I have not had any luck, but I will continue to investigate.
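The behaviour described here can be reproduced with a toy model in Python. Everything in it is an assumption rather than Studio's real logic: the hypothetical function `kept_spans` keeps the text between an opening match and the next closing match, and drops a region that is opened but never closed. The second call adds an end-of-input alternative to the closing pattern (`\Z` in Python; the closest .NET anchor is `\z`). Whether Studio accepts such a closing pattern is untested here.

```python
import re

def kept_spans(text, opening, closing):
    """Toy model (assumption): keep text between opening and closing matches."""
    open_re = re.compile(opening, re.MULTILINE)
    close_re = re.compile(closing, re.MULTILINE)
    spans, pos = [], 0
    while True:
        o = open_re.search(text, pos)
        if o is None:
            break
        c = close_re.search(text, o.end())
        if c is None:
            break  # region opened but never closed: dropped in this model
        if c.start() > o.end():
            spans.append(text[o.end():c.start()])
        pos = c.end()
        if pos >= len(text):
            break  # nothing left to scan
    return spans

# Hypothetical sample that does NOT end in a code block.
sample = "Intro\n```\ncode\n```\nMiddle\n```\nmore\n```\nOutro\n"

as_posted = kept_spans(sample, r"^", r"(?s)(```.*?```)")
with_end_alt = kept_spans(sample, r"^", r"(?s)(```.*?```|\Z)")

print(as_posted)
print(with_end_alt)
```

In this model the posted pair loses the final "Outro" line, while the variant with the end-of-input alternative keeps it.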

    If this is successful, there is another issue that comes up. It's relatively minor, but I was wondering if you knew a way around it. Because I have to set these rules as Multiline in order to account for line breaks, it seems to mess up my usual segmentation rules of cutting the segments by line breaks. I can have the line breaks show up as tags, which is one solution, but I'd much rather segment them instead. Is there any way to do this that you know of?

    I attempted to add an Inline tag rule that catches line breaks (\n) and then set the Advanced rules to mark them as "Is Word Stop", but the pre-translation failed, which means the rule didn't work. I also tried \r\n (for CR LF, which is how the breaks appear in Notepad++) with the same result. Do you know if this is possible to do, or if there is simply no hope of segmenting by line break if I use Multiline?
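On the `\n` versus `\r\n` question, a short Python sketch shows how the two patterns treat CR LF text. It only illustrates the regex matching on a hypothetical sample; it does not explain why the inline tag rule failed in Trados.

```python
import re

# Hypothetical text with Windows (CR LF) line endings.
crlf_text = "First line\r\nSecond line\r\nThird line"

# Splitting on \n alone leaves a stray carriage return on each piece.
split_on_lf = re.split(r"\n", crlf_text)

# Making the carriage return optional handles both LF and CR LF files.
split_on_either = re.split(r"\r?\n", crlf_text)

print(split_on_lf)
print(split_on_either)
```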

    Thanks again.
    Beatriz