Tuesday, September 20, 2011

Convert a batch of .doc files to .docx using C# and Word 2007

I recently inherited a bunch of Microsoft Word files that are in the .doc format. I want to convert them all to .docx so I can easily parse them later without needing MS Word installed (e.g. on a server). You can do the conversion with no code at all if you have the time. All you have to do is open the file in MS Word 2007 and save it as a .docx; Word will do the work for you. This is great, but I had hundreds of files to convert, and I could not bear doing something that many times (I’m a programmer after all). Unbelievably, there are products that cost $150 or more to do this. There are also trial editions that do 5 at a time, etc., and even some command line ones. Command line might work, but it still involves me figuring out where the .doc files are in a bunch of nested directories and coming up with the command line arguments. That isn’t much better than opening Word, though I could script that solution at least.

In the end, I decided it really wasn’t that difficult to just sit down and write the code to do this. The code is very simple. I have put it in one class so that you can easily include it in your own project, whether that is a command line, WPF, or WinForms application. It doesn’t really matter. All the code does is:

  1. Take the directory path that you pass it and recursively find all the .doc files (even if they are in sub-directories of sub-directories)
  2. Open MS Word in the background (you can see winword.exe in your Processes under Task Manager)
  3. Loop through each file found
  4. Open the current file
  5. Tell MS Word 2007 to save the file as .docx
  6. Close the file
  7. Close MS Word when all files have been processed.
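Step 1 is the only part that needs no Word at all, so it is easy to try on its own. Here is a minimal sketch of that step. It is not the code from the class below: it uses the built-in recursive search instead of hand-written recursion, and the names DocFinder and FindDocFiles are made up for illustration. It compares the extension exactly because, on the .NET Framework, a three-character pattern like "*.doc" also matches longer extensions such as .docx.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class DocFinder
{
    // Returns every file under root (root itself and all sub-directories)
    // whose extension is exactly ".doc", ignoring case.
    public static List<string> FindDocFiles(string root)
    {
        return Directory
            .EnumerateFiles(root, "*.doc", SearchOption.AllDirectories)
            // "*.doc" may also return ".docx" files, so check the extension exactly.
            .Where(f => string.Equals(Path.GetExtension(f), ".doc",
                                      StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main(string[] args)
    {
        // Use the first argument as the starting directory, or the current directory.
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();

        foreach (string file in FindDocFiles(root))
        {
            Console.WriteLine(file);
        }
    }
}
```

One thing to keep in mind: the built-in recursive search throws if it reaches a directory you are not allowed to read, so this sketch assumes you have access to the whole tree.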

You will find all the new files right next to the .doc files. You can then search in Windows for .doc and delete them quickly once you are comfortable that everything went smoothly.
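Before deleting anything, it may be worth checking that every .doc really got a .docx next to it. The sketch below is one way to do that, assuming the converted file has the same name with only the extension changed. The names ConversionCheck and FindUnconverted are hypothetical. It only lists files and never deletes anything.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class ConversionCheck
{
    // Returns the .doc files under root that do NOT have a .docx file
    // with the same name sitting next to them.
    public static List<string> FindUnconverted(string root)
    {
        return Directory
            .EnumerateFiles(root, "*.doc", SearchOption.AllDirectories)
            // Keep only true .doc files (the pattern may also match .docx).
            .Where(f => string.Equals(Path.GetExtension(f), ".doc",
                                      StringComparison.OrdinalIgnoreCase))
            // Keep only those without a sibling .docx.
            .Where(f => !File.Exists(Path.ChangeExtension(f, ".docx")))
            .ToList();
    }

    public static void Main(string[] args)
    {
        string root = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
        List<string> missing = FindUnconverted(root);

        if (missing.Count == 0)
        {
            Console.WriteLine("Every .doc file has a matching .docx file.");
        }
        else
        {
            foreach (string file in missing)
            {
                Console.WriteLine("No .docx found for: " + file);
            }
        }
    }
}
```

If the list comes back empty, each .doc has a counterpart. That only proves the files exist, not that their contents are right, so opening a few of them is still a good idea.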

Things you will need to use the class below:

  • Visual Studio
  • MS Word 2007 installed on the same machine that runs the program you create

When you create your project you will need to add a reference to Microsoft.Office.Interop.Word. Be sure to choose version 12 and not version 11 like I did initially. If you choose version 11 you will get a compiler error, because version 12 is needed for the new formats.

Below is the actual code you need.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using Microsoft.Office.Interop.Word;

namespace ConvertDocToDocx
{
    public class DocToDocxConverter
    {

        List<string> AllWordFiles { get; set; }
       
        DirectoryInfo StartingDir { get; set; }

        public DocToDocxConverter(DirectoryInfo startingDir)
        {
            StartingDir = startingDir;
        }

        public void ConvertAll()
        {
            AllWordFiles = new List<string>();
           
            // NOTE: Since ".doc" is also the start of ".docx", this search can find both .doc and .docx files.
            // If the extension is different, this method can be called again with another filter to include those files.
            FindWordFilesRecursively(StartingDir.FullName, "*.doc");

            // only open and close Word once to maximize performance
            Application word = new Application();

            try
            {

                foreach (string filename in AllWordFiles)
                {
                    // exclude the .docx (only include .doc) files as we don't need to convert them. :)
                    if (filename.ToLower().EndsWith(".doc"))
                    {
                        Document doc = null;
                        try
                        {
                            var srcFile = new FileInfo(filename);

                            // convert the source file
                            doc = word.Documents.Open(srcFile.FullName);

                            // Change only the extension; a plain Replace(".doc", ".docx") would also
                            // alter ".doc" elsewhere in the path and would miss an upper-case ".DOC".
                            string newFilename = Path.ChangeExtension(srcFile.FullName, ".docx");

                            // Be sure to include the correct reference to Microsoft.Office.Interop.Word
                            // in the project references. In this case we need version 12 of Office to get the new formats.
                            doc.SaveAs(FileName: newFilename, FileFormat: WdSaveFormat.wdFormatXMLDocument);
                        }
                        finally
                        {
                            // we want to make sure the document is always closed,
                            // but only if it was actually opened
                            if (doc != null)
                            {
                                doc.Close(SaveChanges: WdSaveOptions.wdDoNotSaveChanges);
                            }
                        }
                    }
                }
            }
            finally
            {
               
                word.Quit();
            }
        }

      

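        void FindWordFilesRecursively(string sDir, string filter)
        {
            // add the matching files in this directory first, so files directly
            // inside the starting directory are not skipped
            foreach (string f in Directory.GetFiles(sDir, filter))
            {
                AllWordFiles.Add(f);
            }

            // then search each sub-directory the same way
            foreach (string d in Directory.GetDirectories(sDir))
            {
                FindWordFilesRecursively(d, filter);
            }
        }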


      
       
    }
}
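To try the class, something has to call it. Here is a minimal sketch of a console entry point, assuming the DocToDocxConverter class above is compiled into the same project and Word is installed on the machine. The Program class, the usage message, and the executable name in it are my own placeholders, not part of the converter.

```csharp
using System;
using System.IO;
using ConvertDocToDocx;

// Hypothetical console wrapper around the DocToDocxConverter class above.
public static class Program
{
    public static int Main(string[] args)
    {
        // Expect exactly one argument: the directory to start searching from.
        if (args.Length != 1 || !Directory.Exists(args[0]))
        {
            Console.Error.WriteLine("Usage: ConvertDocToDocx <existing directory>");
            return 1;
        }

        // Finds the .doc files under the directory and saves each one as .docx.
        var converter = new DocToDocxConverter(new DirectoryInfo(args[0]));
        converter.ConvertAll();

        Console.WriteLine("Done.");
        return 0;
    }
}
```

A WPF or WinForms front end would do the same thing from a button click: build a DirectoryInfo from the chosen folder and call ConvertAll().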
