juretta.com

Convert Microsoft Word to DocBook XML using Ruby and OpenOffice

August 10, 2006
Tags: XML

The following script shows how to convert Microsoft Word files to DocBook XML using OpenOffice on Windows. The batch script uses OLE (Object Linking and Embedding) to drive OpenOffice, and it converts every .doc file in the directory it is run from.

It is assumed that you have OpenOffice installed. You also need the Ruby programming language (the script was tested with Ruby 1.8.4, the most recent version at the time of writing).

require 'win32ole'

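# File URL of the directory with the Word files (OpenOffice expects URLs, not paths).
# Run the script from this same directory: Dir.glob below searches
# the current working directory.
PATH = "file:///c|/path/to/doc/files/"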

# Converts a Word file to DocBook XML.
# The XML file is named after the original file,
# e.g.: ABC.doc -> ABC.xml
def convert_word_to_docbook(file, path)
  serviceManager = WIN32OLE.new("com.sun.star.ServiceManager")
  desktop = serviceManager.createInstance("com.sun.star.frame.Desktop")

  url = path + file
  document = desktop.loadComponentFromURL(url, "_blank", 0, [])
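  # Replace only the trailing extension; an unanchored gsub would also
  # rewrite ".doc" in the middle of a file name.
  url_to = path + file.sub(/\.doc\z/i, ".xml")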
  fprops = []
  property = serviceManager.Bridge_GetStruct("com.sun.star.beans.PropertyValue")
  property["Name"] = "FilterName"
  property["Value"] = "DocBook File"  
  fprops << property
  begin
    document.storeToURL(url_to, fprops) # export through the DocBook filter
  ensure
    document.close true
  end
end

# Convert all ".doc" files in the current working directory to DocBook XML
Dir.glob("*.doc").each do |file|
  print "converting #{file}...\n"
  $stdout.flush
  convert_word_to_docbook file, PATH
end
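
The two string manipulations the script relies on, building the file URL and deriving the output name, can be tried out without OpenOffice. The following is only a sketch: dir_to_file_url, docbook_name and conversion_plan are hypothetical helpers that are not part of the original script, and the example directory is made up. It assumes that OpenOffice accepts the file:///C:/... form of a URL, and it only handles backslashes and spaces in the directory name.

```ruby
# Hypothetical helpers for a dry run; not part of the original script.

# Turn a Windows directory path into a file URL that ends in "/".
# Assumption: only backslashes and spaces need handling; other special
# characters would require proper percent-encoding.
def dir_to_file_url(dir)
  url = dir.tr("\\", "/").gsub(" ", "%20")
  url += "/" unless url =~ %r{/\z}
  "file:///" + url
end

# ABC.doc -> ABC.xml; only a trailing ".doc" (in any case) is replaced.
def docbook_name(file)
  file.sub(/\.doc\z/i, ".xml")
end

# Pair every source URL with its target URL without starting OpenOffice.
# File names are appended as-is, just like in the script above.
def conversion_plan(dir, files)
  base = dir_to_file_url(dir)
  files.map { |f| [base + f, base + docbook_name(f)] }
end

# Example with a made-up directory and made-up file names.
conversion_plan("C:\\docs\\word files", ["ABC.doc", "notes.doc.doc"]).each do |from, to|
  puts "#{from} -> #{to}"
end
```

Printing the plan first makes it easy to check that the source and target URLs look right before any document is opened.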

Original script by Julian Elve: http://www.synesthesia.co.uk/blog/.../openoffice-and-ruby/.


About

This is the defunct blog of Stefan Saasen.