Using regular expressions in SPSS

SPSS has a native set of string manipulation functions that suffice for many simple situations. But with the ability to call Python routines, one can use regular expressions (often shortened to regex) to accomplish more complicated searches. A recent post on Nabble discussed extracting zip codes from address data, and I figured it would make a good general example for the blog.

First, the SPSSINC TRANS command allows you to use Python functions much like native SPSS functions to manipulate a set of data. For example, if a Python function takes one parameter, say f(a), and returns one value, you can use SPSSINC TRANS to have SPSS return a new variable based on that function. Say we have a variable named X1 and want the result of f(X1) for every case in our dataset. As long as the function f() is available in the Python environment, the following code accomplishes this:

SPSSINC TRANS RESULT=f_X TYPE=0 /FORMULA f(X1).

This creates a new variable in the active SPSS dataset, f_X, holding the result of passing each value of X1 to the function. The TYPE parameter specifies the format of the resulting variable: 0 signifies that the function returns a number, and if it returns a string you pass the width of the string instead.
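As a minimal sketch of the kind of function this works with, here is a hypothetical f (the name and the squaring are purely illustrative, not from any library) that takes one value and returns one number. In SPSS it would sit between BEGIN PROGRAM Python. and END PROGRAM. before the SPSSINC TRANS call.

```python
# Hypothetical example function: one parameter in, one number out.
def f(a):
    return a * a

# Quick check in plain Python before using it with SPSSINC TRANS.
print(f(3))
```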

So using SPSSINC TRANS, we can create a function that does a regular expression search and returns the matching substring. The case on Nabble is a perfect example: extracting a set of 5 consecutive digits from a string is difficult to do with native SPSS string manipulation functions, but it is really easy to do with a regex.

The workflow is quite simple: import the re library, define your regular expression search, and then make your function. If a function returns more than one value, SPSS expects it to be a tuple. The Nabble thread discusses more complicated regexes, but here I keep it simple: \d{5}. Here \d is shorthand for any digit 0-9, and {n} means match the preceding element n times in a row.
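To see the two pieces of that pattern separately, here is a small sketch in plain Python (the sample string is made up): \d on its own matches one digit at a time, while \d{5} only matches a run of five.

```python
import re

sample = "RI 02913-1234"  # made-up test string

# \d alone matches each digit individually.
single_digits = re.findall(r'\d', sample)

# \d{5} needs five digits in a row, so the 4-digit extension is skipped.
five_in_a_row = re.findall(r'\d{5}', sample)

print(single_digits)
print(five_in_a_row)
```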

BEGIN PROGRAM Python.
import re
s = re.compile(r'\d{5}')  #five digits in a row
def SearchZip(MyStr):
  Zip = re.search(s,MyStr)  #first match, or None if there is no match
  if Zip:
    return [Zip.group()]
  else:
    return [""]
END PROGRAM. 

Let's make up some test data and try our function within Python to make sure it works correctly.

BEGIN PROGRAM Python. 
#Let's try a couple of examples.
test = ["5678 maple lane townname, md 20111",
        "123 oak st #4 someplace, RI 02913-1234",
        "9011 cedar place villagename"]

for i in test:
  print(i)
  print(SearchZip(i))
END PROGRAM. 

And this outputs in the SPSS window:

5678 maple lane townname, md 20111 
['20111'] 
123 oak st #4 someplace, RI 02913-1234 
['02913'] 
9011 cedar place villagename 
['']

At this point, applying the function to the dataset is as simple as the command below (assuming Address is the string field in the SPSS dataset we are searching):

SPSSINC TRANS RESULT=Zip TYPE=5 /FORMULA SearchZip(Address).
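One caveat with the simple pattern, sketched here with a made-up address: re.search returns the first run of five digits it finds, so a 5-digit house number at the front of the string wins over the zip code at the end.

```python
import re

simple = re.compile(r'\d{5}')

# Made-up address with a 5-digit house number before the zip code.
address = "12345 maple lane townname, md 20111"

first_hit = simple.search(address).group()
print(first_hit)  # the house number, not the zip code
```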

This just scratches the surface of what regexes can do. Based on some of the back and forth in the recent Nabble discussion, here is a more general solution: it searches for a 5-digit zip code at the end of the string, optionally followed by a dash or a space and then 4 more digits, re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$"). Note that because the middle grouping is optional, this also matches 9 digits in a row (which I think is OK in my experience cleaning string address fields, especially since the search is limited to the end of the string).
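Since that pattern is dense, here is the same expression spelled out piece by piece using Python's re.VERBOSE flag, which allows whitespace and comments inside the pattern. The names ZipVerbose and OneLine are just for illustration; I would expect the two versions to behave the same.

```python
import re

ZipVerbose = re.compile(r"""
    (\d{5})      # group 1: five digits in a row
    (?:[\s-])?   # optional separator, one whitespace character or a dash (not captured)
    (\d{4})?     # group 2: optional four more digits
    \s*          # any trailing whitespace
    $            # end of the string
    """, re.VERBOSE)

OneLine = re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$")

# Both versions should give the same groups on a made-up address.
address = "123 oak st #4 someplace, RI 02913-1234"
print(ZipVerbose.search(address).groups())
print(OneLine.search(address).groups())
```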

Here is the full function. Note that if you get errors about None type conversion, update your version of SPSSINC TRANS; see this Nabble thread for details.

BEGIN PROGRAM Python.
import re
SearchZ = re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$") #5 digits in a row @ end of string
                                                       #and optionally space or dash plus 4 digits
def SearchZip(MyStr):
  Zip = re.search(SearchZ,MyStr)
  #re.search returns None if there is no match, so in that case
  #return a tuple of two None's to fill both result variables
  if Zip:
    return Zip.groups()
  else:
    return (None,None)

#Let's try a couple of examples.
test = ["5678 maple lane townname, md 20111",
        "5678 maple lane townname, md 20111 \t",
        "123 oak st #4 someplace, RI 02913-1234",
        "9011 cedar place villagename",
        "123 oak st #4 someplace, RI 029131234",
        "123 oak st #4 someplace, RI 02913 1234"]

for i in test:
  print([i])
  print(SearchZip(i))
END PROGRAM. 
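For reference, these are the tuples I would expect SearchZip to return for each test string, written as a self-contained check in plain Python. The expected values are my own working through of the pattern, not output copied from SPSS.

```python
import re

SearchZ = re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$")

def SearchZip(MyStr):
    Zip = SearchZ.search(MyStr)
    return Zip.groups() if Zip else (None, None)

# (test string, expected result) pairs
checks = [
    ("5678 maple lane townname, md 20111", ("20111", None)),
    ("5678 maple lane townname, md 20111 \t", ("20111", None)),
    ("123 oak st #4 someplace, RI 02913-1234", ("02913", "1234")),
    ("9011 cedar place villagename", (None, None)),
    ("123 oak st #4 someplace, RI 029131234", ("02913", "1234")),
    ("123 oak st #4 someplace, RI 02913 1234", ("02913", "1234")),
]

for address, expected in checks:
    assert SearchZip(address) == expected
print("all checks passed")
```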

Because this returns two separate groups, the SPSSINC TRANS command needs to specify two result variables, so it would be something like:

SPSSINC TRANS RESULT=Zip5 Zip4 TYPE=5 4 /FORMULA SearchZip(Address).
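If updating SPSSINC TRANS is not an option, one possible workaround (my assumption, mirroring the empty string returned by the first version of the function) is to replace None with an empty string before handing the values back. SearchZipStr is a hypothetical name.

```python
import re

SearchZ = re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$")

# Hypothetical variant: always returns two strings, never None.
def SearchZipStr(MyStr):
    Zip = SearchZ.search(MyStr)
    if Zip:
        # swap any unmatched (None) group for an empty string
        return tuple(g or "" for g in Zip.groups())
    return ("", "")

print(SearchZipStr("5678 maple lane townname, md 20111"))
print(SearchZipStr("9011 cedar place villagename"))
```

It would be used in the same SPSSINC TRANS call in place of SearchZip.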