Under the hood: TM technology run locally by CATs, anything new?
Thread poster: Philippe Locquet
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 08:14
English to French
+ ...
Sep 22, 2021

Hi all,
I was chatting with a colleague the other day, and the question of Translation Memory technology improvements came up.
Although TM had been considered earlier, it only really got going when four commercial TM systems appeared on the market in the early 1990s: TranslationManager from IBM, Transit from Star, the Eurolang Optimizer, and Translator’s Workbench from Trados (according to a paper I dug up).
Since then, improvements have been made to the way a TM is searched, leveraged, etc. But have there been significant improvements?
TMX has been widely used for sharing TMs, and it’s quite a good format. But that doesn’t mean that the CAT runs the local TM in that format.
So, I decided to start this thread to map the current landscape: which format or approach each tool uses natively, locally on the machine (not for import/export).
If you know what’s under the hood of what you’re using, please share 😊


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 08:14
English to French
+ ...
TOPIC STARTER
Wordfast Sep 22, 2021

So here are the native formats that are run locally by each CAT:

_Wordfast Pro 6: Solr/Lucene (originally an Apache format. Not a single file: it’s a collection of files with an index.)
_Wordfast Pro 3: txt
_Wordfast Classic: txt
_Wordfast Server: txt
_Wordfast Anywhere: txt


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 09:14
Member (2006)
English to Afrikaans
+ ...
TMX Sep 22, 2021

Philippe Locquet wrote:
TMX has been widely used for sharing TMs, and it’s quite a good format.


TMX is a terrible format. Firstly, it's not very extensible. Secondly, it's an extremely wasteful format.

I took a random TMX file off of my computer and did some math:
- file size: approx. 1.5 million characters
- number of TUs: 3250
- metadata (date, time, user ID, language codes, client codes etc.): 170 000 characters
- actual content: 650 000 characters
- unnecessary codes and stuff: 630 000 characters

The same TM in Wordfast Classic's TXT format (with no data loss): 820 000 characters (45% smaller than the TMX file).
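
For anyone who wants to check the arithmetic, here is a quick sketch in Python that uses only the figures quoted above. The variable names are illustrative, and the 1.5 million total is the rounded figure, so the percentage is approximate.

```python
# Character counts taken from the figures above; names are illustrative.
tmx_total = 1_500_000   # approximate TMX file size
metadata = 170_000      # date, time, user ID, language codes, client codes
content = 650_000       # actual source and target text
overhead = 630_000      # remaining codes and markup

# The three parts add up to roughly the reported file size.
parts_sum = metadata + content + overhead

# Content plus metadata equals the reported Wordfast Classic TXT size.
txt_total = content + metadata

# Saving of TXT over TMX, as a percentage of the TMX size.
saving_pct = (tmx_total - txt_total) / tmx_total * 100

print(parts_sum, txt_total, round(saving_pct, 1))
```

The three parts sum to 1 450 000, which is consistent with "approx. 1.5 million", and content plus metadata gives exactly the 820 000 characters of the TXT version.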

For your list:
* Wordfast Classic: Tab-delimited TXT file with header in first line and thereafter one TU per line. Each column has a specific function (e.g. one column is the date/time, another column is the source language code, another is the source language text, etc.)
* Wordfast Pro 3: The format is practically identical to that of Wordfast Classic.
* Wordfast Pro 6: It's a set of folders and subfolders with various files in them that only Wordfast Pro can read, as far as I can tell. I have not been successful in figuring out how to read a WFP6 TM in any other application.
* Wordfast Anywhere: Unknown (it's an online TM, so from a user's perspective, it doesn't really have a TM format -- only various import and export formats)
* Trados 2007: It's a TMW file, with four additional files MTF, MWF, IIX and MDF. They are all binary files and none of them are zip files.
* Trados 2009+: It's an SDLTM file. Some kind of database, possibly "SQLite format 3".
* MemoQ: It is a mystery where MemoQ stores its TM files.
* OmegaT: TMX file only.
* CafeTran: AFAIK it's a TMX file.
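
To illustrate the tab-delimited layout described for Wordfast Classic (header on the first line, then one TU per line), here is a minimal Python sketch. The column order, the column names and the sample values are hypothetical, chosen only for the example; a real Wordfast TM has its own column layout, which is not reproduced here.

```python
import csv
import io

# Hypothetical sample: a header line, then one TU per line, tab-delimited.
# The column order is an assumption made for this sketch only.
SAMPLE = (
    "header line (skipped by this sketch)\n"
    "2021-09-22T10:15\ten\tHello world\tfr\tBonjour le monde\n"
    "2021-09-22T10:17\ten\tGood morning\tfr\tBonjour\n"
)

# Assumed column names, in the assumed order.
COLUMNS = ["timestamp", "src_lang", "src_text", "tgt_lang", "tgt_text"]


def read_units(stream):
    """Return one dict per TU, skipping the header on the first line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # the first line is the header
    return [dict(zip(COLUMNS, row)) for row in reader if row]


units = read_units(io.StringIO(SAMPLE))
print(len(units), units[0]["src_text"], "->", units[0]["tgt_text"])
```

Because each field sits in its own column, a script like this can work on the source or target text alone without touching the rest of the line.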

[Edited at 2021-09-22 20:45 GMT]


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 08:14
English to French
+ ...
TOPIC STARTER
Yes and no Sep 22, 2021

Thanks! To keep the thread easy to follow, it's better if we stick to the topic: what the CAT is running natively on the machine.

Samuel Murray wrote:
* Trados 2007: It's a TMW file, with four additional files MTF, MWF, IIX and MDF. They are all binary files and none of them are zip files.
* Trados 2009+: It's an SDLTM file. Some kind of database, possibly "SQLite format 3".
* MemoQ: It is a mystery where MemoQ stores its TM files.
* OmegaT: TMX file only.
* CafeTran: AFAIK it's a TMX file.

[Edited at 2021-09-22 20:45 GMT]

Thanks for that. Hopefully someone will know more about MemoQ.
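
The "SQLite format 3" guess quoted above is easy to test: every SQLite 3 database file begins with the 16-byte string `SQLite format 3` followed by a NUL byte. Here is a minimal Python sketch; the function name and the throwaway demo files are made up for illustration, and it has not been run against an actual SDLTM file here.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of any SQLite 3 file


def looks_like_sqlite(path):
    """Return True if the file at `path` starts with the SQLite 3 header."""
    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


# Demo on throwaway files, since no real TM file is assumed here.
with tempfile.TemporaryDirectory() as folder:
    demo = os.path.join(folder, "demo.db")
    connection = sqlite3.connect(demo)
    connection.execute("CREATE TABLE t (x TEXT)")
    connection.commit()
    connection.close()
    is_sqlite = looks_like_sqlite(demo)

    other = os.path.join(folder, "notes.txt")
    with open(other, "w", encoding="utf-8") as handle:
        handle.write("just text")
    is_text_sqlite = looks_like_sqlite(other)

print(is_sqlite, is_text_sqlite)
```

If the check passes on a copy of a TM file, the same `sqlite3` module can open it read-only to list its tables; work on a copy rather than on the live TM.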

Samuel Murray wrote:
* Wordfast Anywhere: Unknown

As my above post states, it runs on txt. It can be accessed via Wf Pro too, in which case it will be read/written in txt.

Samuel Murray wrote:
* Wordfast Pro 6: I have not been successful in figuring out how to read a WFP6 TM in any other application.

Please refer to the above post. The format is Solr/Lucene. There are tools to read the data, but they are not CATs. It also allows context matches to be stored in the data.

Samuel Murray wrote:
an extremely wasteful format.

There have been different generations of TMX. But in my experience, you lose more when converting to txt (I've seen that with creation and modification dates). The issue with txt is that not every tool puts the metadata in the same place (except for the most important data, of course), which means either editing columns or losing data. TMX seems more resilient to this between tools, but it's heavy. Cleaning up TMs for my customers has never been a problem with TMX files up to 1 GB in size. Beyond that, either the TMX editing tool reaches its limits or the computer chokes. This can of course be overcome using other tools or power text editors, but that's leaving the realm of what a vast number of translators do.
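
On the size problem: one way around it, outside of any CAT or TMX editor, is to stream the file instead of loading it whole. Here is a minimal Python sketch using the standard library's `iterparse`. The function name and the tiny sample TMX are made up for illustration, and real files may need more care.

```python
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # expanded xml:lang

SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="1" segtype="sentence"
          o-tmf="demo" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv><tuv xml:lang="fr"><seg>Bonjour</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Thanks</seg></tuv><tuv xml:lang="fr"><seg>Merci</seg></tuv></tu>
  </body>
</tmx>
"""


def iter_units(source):
    """Yield {language: segment text} for each <tu>, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        yield unit
        elem.clear()  # free the children of the unit just processed


units = list(iter_units(io.BytesIO(SAMPLE_TMX)))
print(len(units), units[0])
```

With a real file you would pass an open binary file handle instead of the `BytesIO` sample and process each unit as it arrives, rather than collecting them all in a list.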

Be well


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
CafeTran Sep 23, 2021

Samuel Murray wrote:

* CafeTran: AFAIK it's a TMX file.

[Edited at 2021-09-22 20:45 GMT]


Correct. It’s a valid TMX file, with every TU in a separate paragraph. No line breaks after the individual items of the TU. Looks messy. Saves a little space and thus RAM.
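
To picture that layout: a TU serialized without any pretty-printing ends up on a single line. Here is a rough Python sketch of the idea; the helper name is made up, and this is not CafeTran's actual writer, only the same shape of output.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # expanded xml:lang


def tu_line(src_lang, src_text, tgt_lang, tgt_text):
    """Serialize one translation unit as a single line of TMX markup."""
    tu = ET.Element("tu")
    for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        ET.SubElement(tuv, "seg").text = text
    # No indentation is applied, so the whole TU stays on one line.
    return ET.tostring(tu, encoding="unicode")


line = tu_line("en", "Hello", "fr", "Bonjour")
print(line)
```

Skipping the indentation and line breaks inside each TU is where the small space saving comes from.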


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Waste Sep 23, 2021

Samuel Murray wrote:

TMX is a terrible format. Firstly, it's not very extensible. Secondly, it's an extremely wasteful format.

...

The same TM in Wordfast Classic's TXT format (with no data loss): 820 000 characters (45% smaller than the TMX file).



The waste only occurs when you want to store additional info in properties. When you limit TUs only to the source and target segment content (no formatting etc. stored), the waste isn't that large. BTW: You can also create separate TMX files per co-worker, subject field, document version etc. (kind of the Transit approach), thus making the TMX properties unnecessary.

But, true, Wordfast Classic's TXT format is beautiful. I've even suggested that CafeTran Espresso should get a way to store TMs in such a compact format (which, I guess, would reduce the RAM load significantly). The fact that you can run operations on the source and target columns independently is very handy.

Transit has a different approach: source and target are saved in XML files, a TM is created on the fly and it is binary.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
MSJet? Sep 23, 2021

Philippe Locquet wrote:

Thanks for that. Hopefully someone will know more about MemoQ.



Isn't that, just like with Déjà Vu, an Access database, created with MSJet technology?

If I'm not mistaken, these databases can be opened with MS Access.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 09:14
Member (2006)
English to Afrikaans
+ ...
@Philippe Sep 23, 2021

Philippe Locquet wrote:
Samuel Murray wrote:
* Wordfast Anywhere: Unknown

As my above post states, it runs on txt.

How do you know this?


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 08:14
English to French
+ ...
TOPIC STARTER
Dev Sep 23, 2021

Samuel Murray wrote:

Philippe Locquet wrote:
Samuel Murray wrote:
* Wordfast Anywhere: Unknown

As my above post states, it runs on txt.

How do you know this?


I never reveal my sources 😊
I'm in touch with the devs, so, as good as it gets in terms of reliability...


 
Giovanni Guarnieri MITI, MIL
Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 08:14
Member (2004)
English to Italian
Devs Sep 23, 2021

is that short for "devils"?

 
expressisverbis
expressisverbis
Portugal
Local time: 08:14
Member (2015)
English to Portuguese
+ ...
Yes, Sep 23, 2021

Giovanni Guarnieri MITI, MIL wrote:

is that short for "devils"?


the devil developers!

[Edited at 2021-09-23 12:09 GMT]


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 08:14
English to French
+ ...
TOPIC STARTER
jjj Sep 23, 2021

Giovanni Guarnieri MITI, MIL wrote:

is that short for "devils"?


Looks like there are some nasty programmers out there... XD


 

