Sunday, August 8, 2010

Programming tips - translating docx and xlsx files using google translate

UPDATE: FORGET THIS POST! I wrote an MS Excel macro to do this, which you can add to MS Excel as an add-in. You can find instructions on how to do this and how to obtain the .xlam file here.

(This is basically a note to myself reminding me how to do something that I may need to do again)

For my job I (sometimes, many times) have to do programming. Mostly I use Stata, but I also use Perl and sometimes VB to get tasks done. I often get stuck, and then spend many hours searching Google for how to do something. If I'm still stuck, I'll ask one of the programmers in the programming group in another unit at work.

Sometimes I want to translate a docx or xlsx file using Google Translate while preserving the formatting exactly. Right now Google doesn't support uploading the XML-based docx and xlsx files for translation. I don't understand why, since it is super-easy to do. I spent less than a week writing code in Perl to enable this. Here are the steps:

1. change the docx or xlsx file extension to zip
2. unzip the file. This will create a folder containing several directories and a [Content_Types].xml file.
3. find the document.xml file in the word directory. (for an xlsx file, look in the xl directory instead, where the cell text is normally stored in sharedStrings.xml; for Word documents you may wish to do the same with the footnotes & endnotes files)
4. my code pulls the text out of this file for translation, replacing it with placeholders, and puts the extracted text in a separate file
5. upload this text file to Google Translate (translate.google.com/toolkit)
6. download the translation
7. I also have code to replace the placeholders with the translated text, putting the result in a new file
8. replace the document.xml file with the file from the previous step
9. Zip the files back up. Here is where I was having lots of problems.
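
Steps 4 and 7 can be sketched roughly as below. This is not my Perl code, just a minimal Python illustration of the placeholder idea. It assumes the text to translate sits in <w:t> elements of document.xml, and the @@N@@ placeholder format and the function names extract_text and restore_text are made up for the example.

```python
import re
from xml.sax.saxutils import escape, unescape

# Matches a Word text run: <w:t>...</w:t> or <w:t xml:space="preserve">...</w:t>.
# It deliberately does not match other tags that start with "w:t", such as <w:tab/>.
W_T = re.compile(r'(<w:t(?:\s[^>]*[^/>])?>)(.*?)(</w:t>)', re.DOTALL)

# Extra entities to decode so the translator sees plain quotes.
QUOTES = {"&quot;": '"', "&apos;": "'"}


def extract_text(xml):
    """Return (template, segments).

    template is the XML with each text run replaced by a placeholder
    (hypothetical format: @@0@@, @@1@@, ...); segments holds the text.
    """
    segments = []

    def swap(match):
        segments.append(unescape(match.group(2), QUOTES))
        index = len(segments) - 1
        return f"{match.group(1)}@@{index}@@{match.group(3)}"

    return W_T.sub(swap, xml), segments


def restore_text(template, translated):
    """Put the translated segments back where the placeholders are."""
    def swap(match):
        # Re-escape &, < and > so the result is still valid XML.
        return escape(translated[int(match.group(1))])

    return re.sub(r'@@(\d+)@@', swap, template)
```

The segments list is what would be written out to the separate file and uploaded. After translation, the translated lines go back in through restore_text in the same order, and the result replaces document.xml.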

When I was zipping the files back up, I would zip the containing folder. This is the wrong way to do it: Word/Excel can't open the result. The _rels, docProps, and word directories, as well as the [Content_Types].xml file, have to be in the root of the archive, not inside some containing folder. So I tried instead selecting the _rels, docProps, and word directories and the [Content_Types].xml file directly and zipping them. After changing the extension to docx, Word was able to open the result as a Word document. (Word still had to repair the document, but it was able to. If I use YemuZip to zip the files, then Word doesn't have to repair the document to open it.)
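
The same zipping step can also be scripted. Here is a minimal Python sketch (again, not my Perl code) that zips the contents of the unzipped folder rather than the folder itself, so that [Content_Types].xml ends up at the root of the archive. The function name zip_package and the paths in the usage line are only placeholders.

```python
import os
import zipfile


def zip_package(folder, out_path):
    """Zip the *contents* of folder, not the folder itself.

    Every entry name is stored relative to folder, so [Content_Types].xml,
    _rels, docProps and word end up at the root of the archive.
    out_path should be outside folder so the archive does not include itself.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                # Path inside the archive: relative to folder, with forward slashes.
                arcname = os.path.relpath(full_path, folder).replace(os.sep, "/")
                archive.write(full_path, arcname)
```

For example, zip_package("mydoc_unzipped", "mydoc_translated.docx") would write the new file directly with the docx extension, so there is no need to rename it from zip afterwards.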

The result is a Word or Excel file that has been translated while preserving the original formatting exactly. Email shafique.jamal@gmail.com if you want the code. (It's in Perl.)

2 comments:

Alex said...

There are numerous programs on the Internet, but only one of them could help me, and I found it on a software forum. In my view it might be useful in a similar situation as well - repair docx file.

Shafique Jamal said...

Thanks very much Alex!