Quick SPSS Tip: Cleaning up irregular characters in strings

This is just intended to be a quick tip on cleaning up string fields in SPSS. Frequently if I am parsing a field or matching string records (such as names or addresses) I don’t want extra ascii characters besides names and/or numbers in the field. For example, if I have a name I might want to eliminate hyphens or quotes, or if I have a field that is meant to be a house number I typically do not want any alpha character in the end (geocoding databases will rarely be able to tell the difference between Apt’s 1A and 1B).

We can use a simple loop and the PIB variable format in SPSS to clean out unwanted ascii codes in string characters. So for instance if I wanted to replace all the numbers with nothing in a string field I could use this code below (where OrigField is the original field with the numbers contained, and CleanField is the subsequent cleaned variable).

string CleanField (A5).
compute CleanField = OrigField.
loop #i = 48 to 57.
compute CleanField = REPLACE(CleanField,STRING(#i,PIB),"").
end loop.

The DEC column in the linked ascii table corresponds to the ascii character code in SPSS’s PIB format. The numbers 0 through 9 end up being 48 to 57 in decimal values, so I create a string corresponding to those characters via the string(#i,PIB) commmand and replace them with nothing in the REPLACE command. I loop through values of 48 to 57 to get rid of all numeric values.

This extends to potentially all characters, for instance if I want to return only capital alpha characters, I could use a loop with an if statement like below;

string CleanField (A5).
compute CleanField = OrigField.
loop #i = 1 to 255.
if #i < 65 or #i > 90 CleanField = REPLACE(CleanField,STRING(#i,PIB),"").
end loop.

There are (a lot) more than 255 ascii characters, but that should suffice to clean up most string fields in English.

Advertisements
Leave a comment

2 Comments

  1. Jon Peck

     /  July 9, 2013

    Be careful, since there are lots of Unicode characters that you might want to keep that are outside the 0-255 range and also should not be inspected byte by byte. For example, in Unicode mode, the British pound symbol has hex value C2A9, taking two bytes, and the Euro symbol is E282A0 – three bytes. C2 in the usual English Windows code page 1252 is capital A with hat, but in Unicode mode it is the first half of the pound character.

    A useful way to see what the codes really are for character data is to set the variable format to AHEX. The size field needs to be double the A width, so A10 would become AHEX20. This has to be done in syntax (FORMAT s(AHEX20)), since AHEX is not supported in the Data Editor type list.

    Note also that the replace function shown above can actually change the character locations in Unicode mode, although since strings are fixed width, this would result in blank padding or truncation.

    Reply
    • Yep agree Jon – if you want search/clean such characters not byte by byte you should prob. have a strong preference for using Python and regex’s unless you only have a small list. (On my to-do list to write a blog post illustrating that)

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: