Remove tag from source

Hi,

sorry, this sounds like a silly question, but I am trying to remove a tag (HTML entity) from my source, and replace pricing info with XXX:

[Screenshot: Trados Studio showing a segment with a highlighted HTML entity tag (&pound;) followed by the price 3,010.]

Source file is XML with embedded HTML.

I just updated my Cleanup Task app. The old version had a "remove tag" option, recognized &pound; and would have removed it, but it threw an error, so I updated, and the new version no longer has this option.

Ideally, instead of [POUND]3,010 the source would say GBP X,XXX.

I tried SDLXLIFFToolkit, which can find text in tags but cannot replace it.

I should have dealt with this in the XML source, but alas ...

So my issue is that I can't find a way to handle (remove) a tag in the SDLXLIFF source.

Does anyone have an idea how to tackle this?

Daniel



  • Maybe in the sdlxliff itself?

    Paul Filkin | RWS Group

    ________________________
    Design your own training!

    You've done the courses and still need to go a little further, or still not clear? 
    Tell us what you need in our Community Solutions Hub

  • That is what I wanted to avoid - modifying the sdlxliff directly, e.g. in Npp. I was hoping for a CleanupTask - style process that can be run when needed.

    If this is not possible in the sdlxliff, I think I will make this part of my XML pre-processing rules. This time I updated the source file with one where I had removed the strings in question.

    Is there a reason why the new version of Cleanup Task does not offer removal of placeholder tags anymore?

    Daniel
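
For the XML pre-processing route mentioned above, a minimal Python sketch could look like the following. It assumes the embedded HTML has already been extracted as a plain string and that prices appear as a pound sign (written as `&pound;`, `&#163;` or the literal character) followed by digits. The function name `mask_prices` and the pattern are illustrative only; they are not part of Cleanup Task or Studio.

```python
import re

# Assumption: a price is a pound sign (entity, numeric reference or literal)
# followed by digits with optional thousands separators and decimals.
PRICE = re.compile(r"(?:&pound;|&#163;|£)\s*(\d[\d,]*(?:\.\d+)?)")


def mask_prices(text: str) -> str:
    """Replace the pound sign with 'GBP ' and every digit of the amount with 'X'."""
    def repl(match: re.Match) -> str:
        return "GBP " + re.sub(r"\d", "X", match.group(1))
    return PRICE.sub(repl, text)


print(mask_prices("From &pound;3,010 per person"))  # From GBP X,XXX per person
```

Because separators are kept and only digits are masked, the result matches the "GBP X,XXX" form asked for in the first post.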

  • Is there a reason why the new version of Cleanup Task does not offer removal of placeholder tags anymore?

    I'd need to look back at the behaviour of the old one to answer this question. Assuming you are correct, then no, there is probably no particular reason. We have had to change quite a bit of how this plugin works to fix some of the reported bugs, and in reality we should probably rewrite it altogether because maintenance is quite difficult. So maybe we broke something... I don't know, but I will need to check. I would also need to see your source file to make sure this placeholder tag is one that would have been allowed, as even the old version didn't allow the removal of all types of tags.

    Paul Filkin | RWS Group


  • Ummm, this is an HTML entity, not an XML entity... which means your XML actually has HTML embedded inside... which means you are parsing the HTML with the Embedded Content Processor... right?

    So simply turn on the entity conversion in the ECP and you should be done - the HTML entities will be converted to actual characters when the source is parsed... so there won't be any such entity-tag garbage.

    Am I missing something?

  • Ummm, this is an HTML entity, not an XML entity

    Well, yes, that's what I said in my initial post:

    I am trying to remove a tag (HTML entity) from my source

    you are parsing the HTML with the Embedded Content Processor... right?

    Right.

    So simply turn on the entity conversion in the ECP and you should be done - the HTML entities will be converted to actual characters when the source is parsed... so there won't be any such entity-tag garbage.

    Voilà! It worketh! Some entity conversions were on (default settings, I guess), some were off. Is there any reason NOT to have everything converted by default?

    Daniel

  • This instance is solved: I modified the XML source, and it was rightly pointed out that these entities can be converted and will then be displayed as normal text that can be cleaned up without a problem.

    I already sent you my files yesterday, just in case you want to look into this. Cleanup Task is an extremely useful app, well worth keeping alive IMHO. It could almost become part of the core functionality of Studio.

    Daniel

  • IMO, in normal cases, there is hardly a reason to not convert the entities to real characters.

    In my career I have actually NEVER met a case where the characters were entitized deliberately in the source... the entities were always a result of some clumsy process, incorrectly configured tools, plain inexperience, etc. on the client's side.

    Still, there MAY be cases where the system creating the sources (and consuming the targets) is some old-school one, unable to work with UTF-8 encoded files (remember, UTF-8 is just natural and expected by default in XML, but not mandatory) and requiring ASCII encoded ones...

    And that's where NOT converting the entities used to come in handy... to be able to have entitized characters in the target as well (I know, you can convert them when loading into Studio and convert them back when saving the target... but I think that was not possible, or didn't work like that, or something, back then)

    I may be wrong here, but I think(!) this is the story behind all this entities conversion...
    It actually dates back to Trados times when Unicode and UTF-8 used to appear just in engineer's wet dreams and the reality was dominated by individual charsets, ISO codepages competing with DOS ones, Windows ones, Mac ones, EBCDIC ones... huh...

  • In our case, a CMS dutifully encodes non-ASCII characters in HTML fields, so I get plenty of that in my source files. But if I understand you correctly, I should never have to encode non-ASCII characters in my target files, which would solve a couple of issues for me, so I would like to do that.

    But: How can I convert entities to real characters when parsing but NOT convert back to entities when writing the target file? As far as I can see, the "Entities" settings govern both. I vaguely remember reducing the number of entity conversions I permitted because the CMS would not read all of them correctly.

    At the moment, I avoid encoding as entities for some XML elements by not passing them on to the embedded content processor. The CMS treats some fields as plain text, so entities would be displayed as they are, such as "&Auml;rgernis".

    I usually parse by field name, but in some cases, e.g. the "Title" field, the content will be HTML in some templates but plain text in others. So I guess I will first have to parse all fields that I want to pass on to the embedded content processor, then use some catch-all to check the content for signs of HTML, then parse the plain-text fields...

    Daniel
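
The "catch-all for signs of HTML" idea could be sketched roughly as below, for example in a pre-processing script run before the files go into Studio. This is only a heuristic under stated assumptions: a value counts as HTML if it contains something shaped like a tag or a character reference. The function name `looks_like_html` and both patterns are hypothetical and would need tuning against real CMS exports.

```python
import re

# Something shaped like an opening, closing or self-closing tag.
TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")

# Named, decimal or hexadecimal character reference.
ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def looks_like_html(value: str) -> bool:
    """Heuristic: True if the field value contains a tag or a character reference."""
    return bool(TAG.search(value) or ENTITY.search(value))


# Hypothetical field values, as they might appear under different templates.
for title in ["<b>Neu:</b> &Auml;rgernis", "Ärgernis & mehr", "3 < 5 and 7 > 2"]:
    print(looks_like_html(title), title)
```

A bare ampersand or a comparison such as "3 < 5" does not trigger the check, so plain-text fields of that kind would not be sent through the embedded content processor.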