Friday, September 8, 2006

Making a SharePoint Document Library Google friendly

SharePoint comes in two flavors: Windows SharePoint Services (WSS) and SharePoint Portal Server 2003 (SPS). Both have Document Libraries, and both are unfriendly to Google indexing. In particular, if you are on a corporate intranet and have a Google Search Appliance (GSA) such as the Google Mini, you will quickly realize that your GSA does not find any of the documents on your SharePoint site.

The reason is that the Document Library by default uses javascript or an ActiveX control to navigate its directory structure, and GSA will not follow links embedded in javascript. In other words, if you have a page with something like the following, GSA will not follow the link:

<a href="javascript:myGotoFunction(">This File</a>

One limited fix is to create a view, or edit an existing one, for every document library you will ever create and add the column called "Name (for use in forms)". This puts a column on the page that shows the URL of the document as a link without javascript, so Google can index the documents. The problem with this approach is that these settings are configurable at runtime by users (typically power users, but it still allows end users to break the indexing of a site).

The more systematic approach is to create a site that lists all the files from all document libraries on all sites, including the portal. This allows the documents to be indexed easily. I recommend you configure your GSA to exclude the list page we are creating unless you want people to be able to casually see how many docs exist and what they are. Security still applies to the documents, since it is defined at the site, not on the page. The titles will be shown on the list page, though, so don't put anything of a private nature in a title.

This article assumes you want to take the systematic approach. There are at least a couple of ways to tackle this problem. We will take the approach described below because it requires the least SharePoint-specific knowledge and doesn't require direct access to the SharePoint installation.

Overview of implementation

  • Create ASP.NET web site
  • Connect to Content Database for SharePoint
  • Execute query
  • Display data on page

Requirements

Microsoft Visual Studio 2005 (you could conceivably use Visual Studio 2003, or even Notepad)

Implementation

Open Microsoft Visual Studio 2005 and create a new web site. Create a connection to your content database. If you don't know which one that is, look through your SharePoint databases until you find one that has a "Docs" table.

The table we are interested in is the Docs table. It holds the metadata and binary file content for all files in all Document Libraries in both SPS and WSS.

The following query can be used to get the data. Add any file extensions that you want to index to the query. The document library contains files that you never see in the user interface, so I recommend explicitly specifying the file extensions you want to include in the indexing.

SELECT Docs.DirName + '/' + Docs.LeafName AS URL
FROM Docs
WHERE ((Docs.Type = 0) AND Docs.LeafName NOT LIKE 'template.%')
  AND ((Docs.LeafName LIKE '%.doc%')
    OR (Docs.LeafName LIKE '%.ppt%')
    OR (Docs.LeafName LIKE '%.xls%')
    OR (Docs.LeafName LIKE '%.pdf%')
    OR (Docs.LeafName LIKE '%.vsd%')
    OR (Docs.LeafName LIKE '%.txt%'))
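Since the list of extensions is the part you are most likely to change, it can help to generate the query text from a list instead of editing the SQL by hand. The sketch below is only an illustration in Python (the article's site is ASP.NET, so treat it as pseudocode for whatever language you use). The function name build_docs_query is made up for this example, and it produces the same query text as above.

```python
import re


def build_docs_query(extensions):
    """Return the Docs query text for the given file extensions.

    build_docs_query is a hypothetical helper, not part of SharePoint.
    """
    clauses = []
    for ext in extensions:
        # Accept only plain alphanumeric extensions so nothing unexpected
        # ends up inside the SQL text.
        if not re.fullmatch(r"[A-Za-z0-9]+", ext):
            raise ValueError(f"unsupported extension: {ext!r}")
        clauses.append(f"(Docs.LeafName LIKE '%.{ext}%')")
    if not clauses:
        raise ValueError("at least one extension is required")
    return (
        "SELECT Docs.DirName + '/' + Docs.LeafName AS URL "
        "FROM Docs "
        "WHERE ((Docs.Type = 0) AND Docs.LeafName NOT LIKE 'template.%') "
        "AND (" + " OR ".join(clauses) + ")"
    )


# The same six extensions the query above uses.
query = build_docs_query(["doc", "ppt", "xls", "pdf", "vsd", "txt"])
```

Adding a new file type is then a one-word change to the list passed in.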

You will probably want to specify this in code so that the following can be prepended to the URL:

http://hostnameHere/

Bind the results to a GridView or some other control that has paging built in.
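Prepending the host name could look roughly like the sketch below. It is written in Python purely as an illustration, and the function names and the sample path are made up for this example. It assumes the URL column can contain spaces and other characters that need percent-encoding before they go into a link, which is common for document names.

```python
from html import escape
from urllib.parse import quote

BASE_URL = "http://hostnameHere/"  # placeholder host name from the article


def document_url(relative_url, base_url=BASE_URL):
    """Join the host name and the URL column from the query."""
    # Percent-encode spaces and other unsafe characters, but keep the
    # slashes that separate folders.
    path = quote(relative_url.lstrip("/"), safe="/")
    return base_url.rstrip("/") + "/" + path


def document_link(relative_url, base_url=BASE_URL):
    """Render a plain anchor tag with no javascript in it."""
    url = document_url(relative_url, base_url)
    label = relative_url.rsplit("/", 1)[-1]  # file name only
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'


# Hypothetical row returned by the query.
link = document_link("sites/team/Shared Documents/Plan 2006.doc")
```

The important part is that the href is an ordinary URL, which is what lets the crawler follow it.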

At this point you should be able to run your web application and click on the links in the GridView to save or open the documents that are in SharePoint. Since there is no javascript involved, GSA should now be able to index the documents on the first page of your GridView. That is right, only your first page. Why, you ask? Because if you look at the paging output by the GridView, it uses javascript to post back, and GSA can't follow javascript.
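If you want to check which links on a rendered page a crawler could follow, a small script can split the anchors into plain links and javascript links. The following is a rough Python sketch, not a model of how GSA actually works: it simply assumes any href starting with "javascript:" is a dead end. The sample markup is made up, with the second link imitating the kind of postback link a GridView pager renders.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect anchor hrefs, split into plain and javascript links."""

    def __init__(self):
        super().__init__()
        self.crawlable = []
        self.script_only = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Assumption: a crawler ignores javascript: pseudo-URLs.
        if href.strip().lower().startswith("javascript:"):
            self.script_only.append(href)
        else:
            self.crawlable.append(href)


def classify_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.crawlable, collector.script_only


# Made-up page fragment: one document link and one pager link.
sample = (
    '<a href="http://hostnameHere/docs/report.doc">report.doc</a>'
    '<a href="javascript:__doPostBack(\'GridView1\',\'Page$2\')">2</a>'
)
crawlable, script_only = classify_links(sample)
```

Here the document link ends up in crawlable and the pager link in script_only, which is the situation described above.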

This poses another issue: how do we page our results without javascript? We can do a number of things to solve the problem.

  1. Don't use the pagers and provide our own links to all the pages in the GridView
  2. Write our own pager that doesn't use javascript (really just a simple next link works also)

If you choose option 2 and write your own pager, and you want each page to start out equal in the GSA results, I strongly recommend that the pager also have direct links to all the pages. The reason is that if you only have a next link, for example, GSA will see that page 20 is about 20 hops away from the page you originally asked it to index. GSA will still include it in the index, but it gives an extremely low page rank (basically zero) to pages more than about 10 hops away. Each page from 0 to 10 gets a progressively smaller page rank, so pages 10 to 20, for example, have a near-zero page rank.
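A pager with direct links to every page can be very small. Here is a minimal sketch in Python, again only as an illustration of the idea. The page name list.aspx and the page query string parameter are assumptions made up for this example. Every page is linked from every other page with a plain href, so no page is more than one hop from the first one.

```python
import math
from html import escape


def page_count(total_items, page_size):
    """Number of pages needed, with at least one page."""
    return max(1, math.ceil(total_items / page_size))


def render_pager(current_page, total_pages, base_href="list.aspx"):
    """Render links to all pages using plain hrefs (no javascript).

    base_href and the "page" parameter are hypothetical names.
    """
    links = []
    for page in range(1, total_pages + 1):
        if page == current_page:
            # The current page is shown as text, not as a link.
            links.append(f"<span>{page}</span>")
        else:
            links.append(
                f'<a href="{escape(base_href)}?page={page}">{page}</a>'
            )
    return " ".join(links)


pager_html = render_pager(2, page_count(25, 10))
```

The page itself then reads the page number from the query string and shows the matching slice of the results.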
