Turning SPSS data into Python data

Previously I blogged about how to take Python data and turn it back into SPSS data. Here we are going to do the opposite — turn SPSS data into Python objects. First to start out we will make a simple dataset of three variables.

DATA LIST Free /X Y (2F1.0) Z (A1). 
BEGIN DATA
1 2 A
4 5 B
7 8 C
END DATA.
DATASET NAME Test.
EXECUTE.

To import this data into Python, we need to import the spss class of functions, which then you can read cases from the active dataset using the Cursor attribute. Here is an example of grabbing all of the cases.

*Importing all of the data.
BEGIN PROGRAM Python.
import spss
dataCursor = spss.Cursor()
AllData = dataCursor.fetchall()
dataCursor.close()
print AllData
END PROGRAM.

What this then prints out is ((1.0, 2.0, 'A'), (4.0, 5.0, 'B'), (7.0, 8.0, 'C')), a set of nested tuples. You can also just grab one case by replacing dataCursor.fetchall() with dataCursor.fetchone(), in which case it will just return one tuple.

To only grab particular variables from the list, you can pass a set of indices in the spss.Cursor object. Remember, Python indices start at zero, so if you want the first and second variables in the dataset, you need to grab the 0 and 1 indices.

*Only grabbing certain variables.
BEGIN PROGRAM Python.
dataNum = spss.Cursor([0,1])
spNumbers = dataNum.fetchall()
dataNum.close()
print spNumbers
END PROGRAM.

This subsequently prints out ((1.0, 2.0), (4.0, 5.0), (7.0, 8.0)). When grabbing one variable, you may want just a list of the objects instead of the nested tuples. Here I use list comprehension to turn the resulting tuples for the Z variable into a nice list.

*Converting to a nice list.
BEGIN PROGRAM Python.
dataAlp = spss.Cursor([2])
spAlp = dataAlp.fetchall()
dataAlp.close()
spAlp_list = [i[0] for i in spAlp] #convert to nice list
print spAlp
print spAlp_list
END PROGRAM.

The first print object is (('A',), ('B',), ('C',)), but the second is ['A', 'B', 'C'].

The above code works fine if you know the position of the variable in the file, but if the position can change this won’t work. Here is a one liner to get the variable names of the active dataset and plop them in a list.

*Way to get SPSS variable names.
BEGIN PROGRAM Python.
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
print varList
END PROGRAM.

Now if you have your list of variable names you want, you can figure out the index value. There are two ways to do it, iterate over the list of variable names in the dataset, or iterate over the list of your specified variables. I do the latter here (note this will result in an error if you supply a variable name not in the dataset).

*Find the indices of specific variables.
BEGIN PROGRAM Python.
LookVars = ["X","Z"]
VarInd = [varList.index(i) for i in LookVars]
print VarInd
END PROGRAM.

Now you can just supply VarInd above to the argument for spss.Cursor to grab those variables. Here I wrapped it all up in a function.

*Easy function to use.
BEGIN PROGRAM Python.
import spss
def AllSPSSdat(vars):
  if vars == None:
    varNums = range(spss.GetVariableCount())
  else:
    allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
    varNums = [allvars.index(i) for i in vars]
  data = spss.Cursor(varNums)
  pydata = data.fetchall()
  data.close()
  return pydata
END PROGRAM.

You can either supply a list of variables or None, in the latter case all of the variables are returned.

BEGIN PROGRAM Python.
MyDat = AllSPSSdat(vars=["Y","Z"])
print MyDat
END PROGRAM.

This set of nested tuples is then pretty easy to convert to other Python objects. Panda’s dataframes, Numpy arrays, and NetworkX objects are all one liners. Here is turning the entire dataset into a panda’s data frame.

*Turn into pandas data frame.
BEGIN PROGRAM Python.
import pandas as pd
MyDat = AllSPSSdat(vars=None)
allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
PanDat = pd.DataFrame(list(MyDat),columns=allvars)
print PanDat
END PROGRAM.

Which prints out.

   X  Y  Z 
0  1  2  A 
1  4  5  B 
2  7  8  C
Advertisements
Leave a comment

4 Comments

  1. Jignesh Sutar

     /  January 12, 2016

    I never remember how to specify names rather than indices (or how to get the indices from the names) as you do here. So now I have this blog post as a reference. Thank you very much Andy!

    Reply
    • Part of the reason why I write the blog posts is to reference it myself later. Same as you Jignesh it is one of those things I’ve written (and re-written) a few times but never committed to memory.

      Reply
  2. Horatiu Halmajan

     /  May 20, 2016

    Assuming that you have a data set with Value Labels attached to the numeric values,how would you turn the Value Labels into Python data ? (e.g.: “Male” / “Female” for the Gender variable, instead of the corresponding 1 / 2 numeric codes?

    Reply
    • You could change the SPSS data to the label strings before you import it to Python via the VALUELABEL function.

      You could do it all in python, but what you want the python data for will likely determine how exactly you do it. For example you can have labels within a Panda’s data frame. NetworkX you may just want them for plots, so can use a dictionary. Either of these you can still import the original data, but then just grab the value labels.

      Numpy is all numeric so is not applicable.

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: