Tuesday, January 26, 2010

Using T-SQL to remove unwanted characters

Let’s pretend you have a column (in a table) in MS SQL Server and it has some stuff that you don’t want to be there. For example if you only wanted a column to have alpha or only numeric data you could essentially do a search a replace on that column using  a simplified regular expression syntax that the patindex in T-SQL provides.

Below is an example of the solution.

select top 1 * from Person

while @@rowcount > 0
update Person
set LName = replace(LName, substring(LName, patindex('%[^a-zA-Z. ]%', LName), 1), '')
where patindex('%[^a-zA-Z. ]%', LName) <> 0
The first line is only necessary for the @@rowcount to be set to something greater than 0. If you have an update on that table you need to do, you could do that instead. You could also add some more logic, add variables, etc.

The update statement is called multiple times until no more updates are needed. It is really pretty awesome the way that it keeps affecting less and less rows until no more are affected ad the while loop exits. It feels kind of recursive.

You don’t really have to understand it to make use of this. The important thing to note here is that the first parameter of patindex is the pattern it is matching on. In this case, I am matching (accepting) only lowercase a-z and uppercase A-Z and period and space (notice the space at the end before the ]). You can change this pattern to any valid pattern that patindex accepts.

References:
Docs for patindex
Pattern syntax
I got the idea from here.

2 comments:

Anonymous said...

I'm liking the @@rowcount in the while loop. Must implement this in some of my sql scripts

Nikhil S P said...

thanks... solved a similar problem..