Convert Microsoft Word to DocBook XML using Ruby and OpenOffice

August 10, 2006
Tags: XML

The following script shows how to convert Microsoft Word files to DocBook XML using OpenOffice on Windows. It drives OpenOffice through OLE (Object Linking and Embedding) automation and converts every .doc file in the directory it is run from, so PATH in the script must point to that same directory.

The script assumes that OpenOffice is installed. You also need the Ruby programming language; the script was tested with Ruby 1.8.4, the most recent version at the time of writing.

require 'win32ole'

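# File URL of the directory containing the Word files (keep the trailing slash).
# Run the script from this directory: Dir.glob below searches the current directory.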
PATH = "file:///c|/path/to/doc/files/"

# Converts a Word file to DocBook XML.
# The XML file is named after the original file,
# e.g.: ABC.doc -> ABC.xml
def convert_word_to_docbook(file, path)
  serviceManager = WIN32OLE.new("com.sun.star.ServiceManager")
  desktop = serviceManager.createInstance("com.sun.star.frame.Desktop")

  url = path + file
  document = desktop.loadComponentFromURL(url, "_blank", 0, [])
  url_to = path + file.sub(/\.doc\z/i, ".xml") # replace only the trailing extension
  fprops = []
  property = serviceManager.Bridge_GetStruct("com.sun.star.beans.PropertyValue")
  property["Name"] = "FilterName"
  property["Value"] = "DocBook File"  
  fprops << property
  document.storeToURL(url_to, fprops) # export through the "DocBook File" filter
  document.close true
end

# convert all ".doc" files in the current directory to DocBook XML
Dir.glob("*.doc").each do |file|
  print "converting #{file}...\n"
  convert_word_to_docbook file, PATH
end

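The script builds its source and target URLs by plain string concatenation, so it can help to check those strings before starting OpenOffice. The following is only a sketch of such a dry run. The helper names (dir_to_file_url, docbook_name, conversion_plan) and the sample paths are hypothetical and not part of the original script, and it assumes that OpenOffice also accepts the common file:///C:/ form of a file URL in addition to the c| form used above.

```ruby
# Hypothetical helpers for a dry run; none of these are in the original script.

# Escape spaces so the result is usable inside a URL (other characters are left alone).
def escape_spaces(text)
  text.gsub(" ", "%20")
end

# Turn a Windows directory path into a file URL that ends in a slash.
def dir_to_file_url(dir)
  normalized = dir.tr("\\", "/")                   # backslashes -> forward slashes
  normalized << "/" unless normalized =~ /\/\z/    # make sure there is a trailing slash
  "file:///" + escape_spaces(normalized)
end

# Derive the DocBook file name: only a trailing ".doc" is replaced.
def docbook_name(file)
  file.sub(/\.doc\z/i, ".xml")
end

# Return [source_url, target_url] pairs without touching OpenOffice.
def conversion_plan(dir, files)
  base = dir_to_file_url(dir)
  files.map do |file|
    [base + escape_spaces(file), base + escape_spaces(docbook_name(file))]
  end
end

# Sample directory and file names are made up for illustration.
conversion_plan("C:\\docs\\word files", ["ABC.doc", "notes.doc.old.doc"]).each do |from, to|
  puts "#{from} -> #{to}"
end
```

Anchoring the pattern to the end of the name means a file such as notes.doc.old.doc keeps the ".doc" in the middle of its name and only the final extension becomes ".xml".
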
Original script by Julian Elve: http://www.synesthesia.co.uk/blog/.../openoffice-and-ruby/.


This is the defunct blog of Stefan Saasen.