Friday, September 23, 2011

A look at the topic of Artificial Intelligence (AI)

General AI

Artificial Intelligence Wiki – a good place to start to understand what AI encompasses and its different areas.

Journal of Machine Learning Research – Amazingly complex scientific papers on AI algorithms. Not for light reading.

Machine Learning Open Source Software

 

Areas of interest for me:

Reinforcement Learning – concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward.

Data Mining

Data Mining – good starting point for learning about data mining

Text Mining

Basically Text Mining is data mining on unstructured data like documents, web pages, etc. instead of a database.

Natural Language Processing – Imagine a machine being able to scour the internet and actually extract knowledge about what it read.

Text Mining – an area of Natural Language Processing that has many commercial uses even today, e.g. Bing

Semantic Web – a machine readable version of the web.

DBpedia – An effort to translate Wikipedia into a machine-understandable format to facilitate complex queries based on the meanings of words, not just matches.

Freebase – Similar to DBpedia, but it is hand crafted. DBpedia has lots of links to it as well. A database of “bar codes” for all entities on the web. Aliasing… Also powers Bing

What is Text Mining – describes how text mining is more complex than data mining since it involves natural language processing.

Carrot2 – text and search results clustering framework. Very cool way to browse search results and get to what you are looking for

Wednesday, September 21, 2011

Reading a MS Word 2007 Document in .docx format using C#

If you need to read from (or write to) a MS Word 2007 Document that has been saved in the Open XML format (.docx), then you can use the Open XML SDK 2.0 for Microsoft Office to do just that. The first thing you will need to do is download and install the SDK. In particular, you must download and install the OpenXMLSDKv2.msi. In addition, you can download the OpenXMLSDKTool.msi if you want. It has some VERY nice features like generating code from an existing .docx file.

Now that you have the files you need, open Visual Studio (2008 or 2010 works fine), open the project you want to use, and add a reference to DocumentFormat.OpenXml (I had to browse to it in the Add Reference window by going to C:\Program Files (x86)\Open XML SDK\V2.0\lib) and WindowsBase (mine was listed on the .NET tab when adding references). Please note, this code does not require MS Word to be installed and is safe to run on a server, such as with ASP.NET.

Now that you have the API, the rest is just working with the document. To get a better understanding of how to work with the parts (structure) of the Word Document, click here.  For a list of “How do I…” code samples, click here.

Here is example code on how to get the body of the document.

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Filename is the full path to the .docx file; passing false opens it read-only
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(Filename, false))
{
    var body = wordDocument.MainDocumentPart.Document.Body;
}
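
To see what MainDocumentPart corresponds to on disk, it helps to know that a .docx file is a ZIP package, and Word normally stores the main part at word/document.xml. The following is only a rough sketch that peeks at that part without the SDK, using System.IO.Compression (which needs .NET 4.5 or later) and LINQ to XML. The class and method names (DocxPeek, ReadParagraphs) are made up for illustration, and the hard-coded part name is an assumption that holds for files Word saves but is not guaranteed by the format.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class DocxPeek
{
    // Main WordprocessingML namespace (the "w:" prefix in document.xml)
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph (w:p) found in word/document.xml.
    public static string[] ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part lives at the location Word uses by default.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found");
            }

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);

                // A paragraph's text is the concatenation of its w:t runs.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(
                              p.Descendants(W + "t").Select(t => t.Value)))
                          .ToArray();
            }
        }
    }
}
```

For real work the SDK is the better choice, since it resolves the main part through the package relationships and gives you typed classes such as Body, Table and Paragraph instead of raw XML.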

 

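Here is an example of a more complex line of code that can be used to navigate the structure of the document using LINQ (ElementAt comes from System.Linq, so the file needs a using System.Linq; line). In this case the document has a table in it, and we are getting the first row of that table, the first cell of that row, and then the third child element of that cell (ElementAt is zero-based, so index 2 is the third element).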

wordDocument.MainDocumentPart.Document.Body.ChildElements.First<Table>().ChildElements.First<TableRow>().ChildElements.First<TableCell>().ElementAt(2).InnerText

I hope this gives you an idea of how to get started. There are lots of good links, examples, etc here.

Tuesday, September 20, 2011

Convert a batch of .doc files to .docx using C# and Word 2007

I recently inherited a bunch of Microsoft Word files that are in the .doc format. I want to convert them all to .docx so I can easily parse them later without needing MS Word installed (e.g. on a server). You can do the conversion with no code at all if you have the time. All you have to do is open the file in MS Word 2007 and save it as a .docx; Word will do the work for you. This is great, but I had hundreds of files to convert, and I could not bear doing something that many times (I’m a programmer after all). Unbelievably, there are products that cost $150 and more to do this. There are some trial editions that do 5 at a time, etc., and even some command line ones. Command line might work, but it still involves me figuring out where the .doc files are in a bunch of nested directories and coming up with the command line arguments. That isn’t much better than opening Word, though I could script that solution at least.

In the end, I decided it really wasn’t that difficult to just sit down and write the code to do this. The code is very simple. I have put it in one class so that you can easily include it in your own project. It could be a command line or WPF or WinForms. It doesn’t really matter. All the code does is

  1. Take the directory path that you pass it and recursively find all the .doc files (even if they are in sub-directories of sub-directories)
  2. Open MS Word in the background (You can see winword.exe in your Processes under Task Manager).
  3. Loop through each file found
  4. Open the current file
  5. Tell MS Word 2007 to save the file as .docx
  6. Close the File
  7. Close MS Word when all files have been processed.

You will find all the new files right next to the .doc files. You can then search in Windows for .doc and delete them quickly once you are comfortable that everything went smoothly.

Things you will need to use the class below.

  • Visual Studio
  • MS Word installed on the same machine that runs the program you create

When you create your project you will need to add a reference to Microsoft.Office.Interop.Word. Be sure you choose version 12 and not version 11 like I did initially. If you pick version 11, you will get a compiler error.

Below is the actual code you need.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using Microsoft.Office.Interop.Word;

namespace ConvertDocToDocx
{
    public class DocToDocxConverter
    {

        List<string> AllWordFiles { get; set; }
       
        DirectoryInfo StartingDir { get; set; }

        public DocToDocxConverter(DirectoryInfo startingDir)
        {
            StartingDir = startingDir;
        }

        public void ConvertAll()
        {
            AllWordFiles = new List<string>();
           
            // NOTE: Since .doc is also in .docx this search will find .doc and .docx files
            // If the extension is different then this can be called again to include them.
            FindWordFilesRecursively(StartingDir.FullName, "*.doc");

            // only open and close Word once to maximize performance
            Application word = new Application();

            try
            {

                foreach (string filename in AllWordFiles)
                {
                    // exclude the .docx (only include .doc) files as we don't need to convert them. :)
                    if (filename.ToLower().EndsWith(".doc"))
                    {
                        Document doc = null;
                        try
                        {
                            var srcFile = new FileInfo(filename);

                            // convert the source file
                            doc = word.Documents.Open(srcFile.FullName);

                            // Change only the extension. String.Replace would also alter any
                            // ".doc" inside a directory name and would miss an upper-case ".DOC".
                            string newFilename = Path.ChangeExtension(srcFile.FullName, ".docx");

                            // Be sure to include the correct reference to Microsoft.Office.Interop.Word
                            // in the project references. In this case we need version 12 of Office to get the new formats.
                            doc.SaveAs(FileName: newFilename, FileFormat: WdSaveFormat.wdFormatXMLDocument);
                        }
                        finally
                        {
                            // we want to make sure the document is always closed,
                            // but only if it was actually opened
                            if (doc != null)
                            {
                                doc.Close();
                            }
                        }
                    }
                }
            }
            finally
            {
               
                word.Quit();
            }
        }

      

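        void FindWordFilesRecursively(string sDir, string filter)
        {
            // collect the matches in this directory first, so files sitting
            // directly in the starting directory are not skipped
            foreach (string f in Directory.GetFiles(sDir, filter))
            {
                AllWordFiles.Add(f);
            }

            // then descend into each sub-directory
            foreach (string d in Directory.GetDirectories(sDir))
            {
                FindWordFilesRecursively(d, filter);
            }
        }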


      
       
    }
}
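
As a side note, the file discovery step can also be written without hand-rolled recursion. The sketch below is one possible alternative and is not part of the class above: it assumes .NET 4 or later for Directory.EnumerateFiles, and the names DocFileFinder, FindDocFiles and TargetPath are invented for this example. It compares the extension exactly, so .docx files never make it into the list, and it builds the target name with Path.ChangeExtension so only the extension changes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration only.
public static class DocFileFinder
{
    // Returns every file under root (the root directory included) whose
    // extension is exactly ".doc", ignoring case. ".docx" files are skipped.
    public static List<string> FindDocFiles(string root)
    {
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(f => string.Equals(Path.GetExtension(f), ".doc",
                                      StringComparison.OrdinalIgnoreCase))
            .OrderBy(f => f, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Builds the .docx path that sits right next to the given .doc file.
    public static string TargetPath(string docPath)
    {
        return Path.ChangeExtension(docPath, ".docx");
    }
}
```

With something like this, ConvertAll could loop over FindDocFiles(StartingDir.FullName) directly and would no longer need the EndsWith(".doc") check inside the loop.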

Wednesday, September 14, 2011

Email Aliases in Gmail

Gmail doesn't offer traditional aliases, but you can receive messages sent to your.username+any.alias@gmail.com. For example, messages sent to jane.doe+notes@gmail.com are delivered to jane.doe@gmail.com.

You can set up filters to automatically direct these messages to Trash, apply a label or star, skip the inbox, or forward to another email account.

This is great for testing user registration in your app.
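
If you are generating test registrations in code, a small helper can build these addresses for you. This is only a sketch; the GmailAlias class and its method names are made up for this example, and it does nothing beyond simple string handling (it does not validate the address or check that the domain is actually Gmail).

```csharp
using System;

// Hypothetical helper for illustration only.
public static class GmailAlias
{
    // Strips any "+tag" part: "jane.doe+notes@gmail.com" becomes "jane.doe@gmail.com".
    public static string BaseAddress(string address)
    {
        int at = address.LastIndexOf('@');
        if (at <= 0)
        {
            throw new ArgumentException("Not an email address", "address");
        }

        string local = address.Substring(0, at);
        int plus = local.IndexOf('+');
        if (plus >= 0)
        {
            local = local.Substring(0, plus);
        }
        return local + address.Substring(at);
    }

    // Adds (or replaces) the "+tag" part of the address.
    public static string WithTag(string address, string tag)
    {
        string baseAddress = BaseAddress(address);
        int at = baseAddress.LastIndexOf('@');
        return baseAddress.Substring(0, at) + "+" + tag + baseAddress.Substring(at);
    }
}
```

For example, GmailAlias.WithTag("jane.doe@gmail.com", "notes") gives jane.doe+notes@gmail.com, so a test run can register many distinct users (tag them user1, user2, and so on) while all the confirmation emails land in one inbox.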