Quick SPSS Tip: Cleaning up irregular characters in strings

This is a quick tip on cleaning up string fields in SPSS. Frequently, when I am parsing a field or matching string records (such as names or addresses), I don’t want any extra ASCII characters in the field besides letters and/or numbers. For example, in a name I might want to eliminate hyphens or quotes, and in a field that is meant to be a house number I typically do not want any alpha characters at the end (geocoding databases will rarely be able to tell the difference between Apts 1A and 1B).

We can use a simple loop and the PIB variable format in SPSS to strip unwanted ASCII codes out of a string. For instance, if I wanted to remove all the digits from a string field I could use the code below (where OrigField is the original field containing the digits, and CleanField is the resulting cleaned variable).

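* The A5 width is illustrative, declare CleanField at least as wide as OrigField.
string CleanField (A5).
compute CleanField = OrigField.
* Codes 48 to 57 are the digits 0 through 9.
loop #i = 48 to 57.
compute CleanField = REPLACE(CleanField,STRING(#i,PIB),"").
end loop.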

The DEC column in a standard ASCII table corresponds to the character code that SPSS’s PIB format works with. The digits 0 through 9 are 48 to 57 in decimal, so I create a one-character string for each of those codes via the STRING(#i,PIB) function and replace it with nothing using the REPLACE function. Looping over the values 48 to 57 gets rid of all the digits.
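As a quick sanity check on that mapping, here is a minimal sketch. The variable names Code and CodeChar are made up for illustration, and I write the format as PIB1 to make the one-byte width explicit. It builds a tiny dataset of decimal codes and converts each one to its character.

* Illustrative data, four decimal codes from the ASCII table.
data list free / Code (F3.0).
begin data
48 57 65 90
end data.
* Convert each code to the one-character string it stands for.
string CodeChar (A1).
compute CodeChar = STRING(Code,PIB1).
list.

The listing should show 0, 9, A and Z next to the codes 48, 57, 65 and 90.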

This extends to potentially any set of characters. For instance, if I want to keep only uppercase letters, I could use a loop with an IF statement like the one below:

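* The A5 width is illustrative, declare CleanField at least as wide as OrigField.
string CleanField (A5).
compute CleanField = OrigField.
* Codes 65 to 90 are A through Z, so every other code gets removed.
loop #i = 1 to 255.
if #i < 65 or #i > 90 CleanField = REPLACE(CleanField,STRING(#i,PIB),"").
end loop.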

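Strictly speaking, ASCII only defines the codes 0 to 127, and what the codes 128 to 255 stand for depends on the code page. There are (a lot) more characters than that in Unicode, but looping up to 255 should suffice to clean up most string fields in English.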
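The same pattern handles the house number case from the start of the post. The sketch below is only illustrative: HouseRaw and HouseNum are hypothetical variable names, the three values are made-up sample data, and the loop stops at 127 on the assumption that the field contains nothing but plain ASCII.

* Made-up sample data, house numbers with trailing unit letters.
data list free / HouseRaw (A8).
begin data
12A 1B 305
end data.
string HouseNum (A8).
compute HouseNum = HouseRaw.
* Remove every code from 1 to 127 except the digits (48 to 57).
loop #i = 1 to 127.
if (#i < 48 or #i > 57) HouseNum = REPLACE(HouseNum,STRING(#i,PIB1),"").
end loop.
list.

With this sample data HouseNum should end up as 12, 1 and 305. Note that stripping everything but digits also removes spaces and punctuation, so a value such as 47 1/2 would collapse to 4712; check for cases like that before relying on the cleaned field for matching.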