Remove tag from source

Hi,

sorry, this sounds like a silly question, but I am trying to remove a tag (HTML entity) from my source, and replace pricing info with XXX:

Screenshot of Trados Studio showing a segment of text with a highlighted HTML entity £ and pricing information 3,010.

Source file is XML with embedded HTML.

I just updated my Cleanup Task. The old version had a "remove tag" option, recognized &pound; and would have removed it, but it threw an error. So I updated, and the new version no longer has that option.

Ideally, instead of [POUND]3,010 the source would say GBP X,XXX.

I tried SDLXLIFFToolkit, which can find text in tags but cannot replace it.

I should have dealt with this in the XML source, but alas ...
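
For anyone who does want to handle this in the XML source before creating the project, here is a minimal sketch of the replacement described above. It is plain Python, not a Trados feature; the function name `mask_prices` and the sample string are made up for illustration, and it assumes the pound sign appears either as a literal character, as `&pound;`, or double-escaped as `&amp;pound;` (HTML embedded in XML).

```python
import re

# Pound sign as a literal character, as an HTML entity, or double-escaped
# (HTML embedded in XML), followed by an amount such as 3,010 or 12.50.
PRICE = re.compile(r"(?:£|&pound;|&amp;pound;)\s*(\d[\d,]*(?:\.\d+)?)")


def mask_prices(text: str) -> str:
    """Replace e.g. '&pound;3,010' with 'GBP X,XXX', keeping separators."""
    def repl(match: re.Match) -> str:
        # Swap every digit for X but leave commas and decimal points alone.
        masked = re.sub(r"\d", "X", match.group(1))
        return f"GBP {masked}"
    return PRICE.sub(repl, text)


sample = "Fees from &amp;pound;3,010 per year"  # hypothetical source text
print(mask_prices(sample))
```

Run it on a copy of the source file first; the pattern is deliberately narrow and will not catch other currencies or amounts written without the pound sign.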

So my issue is that I can't find a way to handle (remove) a tag in the SDLXLIFF source.

Does anyone have an idea how to tackle this?

Daniel



[edited by: Trados AI at 8:24 PM (GMT 0) on 28 Feb 2024]
  • Ummm, this is an HTML entity, not XML entity

    Well, yes, that's what I said in my initial post:

    I am trying to remove a tag (HTML entity) from my source

    you are actually parsing the HTML by Embedded Content Processor... right?

    Right.

    So simply turn on the entity conversion in the ECP and you should be done - the HTML entities will be converted to actual characters while the source is parsed in... so there won't be any such entity-tag garbage.

    Voilà! It worketh! Some entity conversion was on (default settings I guess), some was off. Is there any reason NOT to have everything converted by default?

    Daniel
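
To illustrate what entity conversion does to the text, here is a small sketch using Python's standard `html.unescape`. This only shows the effect on a string; it is not how the Embedded Content Processor is implemented, and the helper name `decode_entities` is made up for the example.

```python
import html


def decode_entities(fragment: str) -> str:
    """Turn HTML character references into the characters they stand for."""
    return html.unescape(fragment)


print(decode_entities("&pound;3,010"))
print(decode_entities("&Auml;rgernis"))
```

Note that `html.unescape` also decodes `&lt;` and `&amp;`, so it should only be applied to text content, not to a string that still contains markup you want to keep escaped.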

  • IMO, in normal cases, there is hardly a reason to not convert the entities to real characters.

    In my career I actually NEVER met a case where the characters would be entitized deliberately in the source... the entities were always a result of some clumsy process, incorrectly configured tools, plain inexperience, etc. on the client's side.

    Still, there MAY be cases where the system creating the sources (and consuming the targets) is some old-school one, unable to work with UTF-8 encoded files (remember, UTF-8 is just natural and expected by default in XML, but not mandatory) and requiring ASCII encoded ones...

    And that's where NOT converting the entities used to come in handy... to be able to have entitized characters in the target as well (I know, you can convert them when loading into Studio and convert them back when saving the target... but I think that was not possible, or didn't work like that, or something, back then)

    I may be wrong here, but I think(!) this is the story behind all this entities conversion...
    It actually dates back to Trados times when Unicode and UTF-8 used to appear just in engineers' wet dreams and the reality was dominated by individual charsets, ISO codepages competing with DOS ones, Windows ones, Mac ones, EBCDIC ones... huh...
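
The round trip mentioned above (real characters while translating, entities again for an ASCII-only consumer) can be sketched like this. It is only an illustration in plain Python, not Studio's implementation; `to_ascii_entities` is a made-up helper, and it assumes the consuming system accepts HTML 4 named entities and numeric references for everything else.

```python
import html
from html.entities import codepoint2name


def to_ascii_entities(text: str) -> str:
    """Re-encode every non-ASCII character as an HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &pound;
        else:
            out.append(f"&#{cp};")               # numeric fallback
    return "".join(out)


source = "&Auml;rgernis kostet &pound;3,010"  # hypothetical ASCII-only source
decoded = html.unescape(source)               # what the translator would see
encoded = to_ascii_entities(decoded)          # what the old system would get
encoded.encode("ascii")                       # raises if anything non-ASCII is left
print(decoded)
print(encoded)
```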

  • In our case, a CMS dutifully encodes non-ASCII characters in HTML fields, so I get plenty of that in my source files. But if I understand you correctly, I should never have to encode non-ASCII characters in my target files, which would solve a couple of issues for me, so I would like to do that.

    But: How can I convert entities to real characters when parsing but NOT convert back to entities when writing the target file? As far as I can see, the "Entities" settings govern both. I vaguely remember reducing the number of entity conversions I permitted because the CMS would not read all of them correctly.

    At the moment, I avoid encoding as entities for some XML elements by not passing them on to the embedded content processor. The CMS treats some fields as plain text, so entities would be displayed as they are, such as "&Auml;rgernis".

    I usually parse by field name, but in some cases e.g. the "Title" field will expect (and contain) HTML in some templates but plain text in other templates. So I guess I will have to first parse all fields which I want to pass on to the embedded content processor, then use some catch-all to parse the content for signs of HTML, then parse for plain text fields...

    Daniel
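
The three-step routing described above (known HTML fields first, then a catch-all that looks for signs of HTML, then plain text) could be prototyped outside Studio roughly as follows. This is only a sketch under assumptions: the names `looks_like_html` and `route` and the field sets are hypothetical, and the regex heuristic is deliberately crude; it says nothing about how the parser rules would actually be configured in the file type settings.

```python
import re

# An opening or closing tag such as <p>, </b>, <br/> or <a href="x">.
TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")
# A named or numeric character reference such as &Auml; or &#163;.
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#\d+|#[xX][0-9A-Fa-f]+);")


def looks_like_html(value: str) -> bool:
    """Heuristic: the value contains something shaped like a tag or entity."""
    return bool(TAG.search(value) or ENTITY.search(value))


def route(field_name: str, value: str, html_fields: set, plain_fields: set) -> str:
    """Decide whether a field goes through embedded content processing."""
    if field_name in html_fields:      # step 1: fields known to hold HTML
        return "html"
    if field_name in plain_fields:     # fields known to be plain text
        return "plain"
    # steps 2 and 3: catch-all for ambiguous fields such as "Title"
    return "html" if looks_like_html(value) else "plain"


print(route("Title", "<b>Offer</b> of the week", {"Body"}, {"Summary"}))
print(route("Title", "Fish & Chips < 5", {"Body"}, {"Summary"}))
```

A heuristic like this will misjudge plain-text fields that happen to contain something entity-shaped, so the explicit field lists should take priority wherever the template is known.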