Using local Python objects in SPSSINC TRANS – examples with network statistics

SPSSINC TRANS gives you access to a much wider array of functions to compute on cases in SPSS. Within the local session, you can create your own Python functions inside a BEGIN PROGRAM and END PROGRAM block. SPSSINC TRANS passes in values from the current dataset, but you can also write functions that use data already sitting in the local Python environment. An example use case follows, in which we create a network in the local Python environment using SPSS data, and then calculate several network statistics on the nodes. Here is a simple hierarchical network dataset that signifies managers and subordinates in an agency.

*Edge list. 
DATA LIST FREE / Man Sub (2F1.0). 
BEGIN DATA 
1 2 
2 3 
2 4 
3 5 
3 6 
4 7 
4 8 
END DATA. 
DATASET NAME Boss. 

We can subsequently turn this into a NetworkX graph with the code below. Some of my prior SPSS examples using NetworkX had more complicated code, looping over the SPSS dataset to build the network object. But the way SPSS dumps the data into Python (as tuples nested within a list) is exactly how the add_edges_from function in NetworkX expects it, so no looping is required (and it automatically creates the node list from the edge data).

BEGIN PROGRAM Python. 
import networkx as nx
import spss, spssdata

alldata = spssdata.Spssdata().fetchall()  #get SPSS data
G = nx.DiGraph()                          #create empty graph
G.add_edges_from(alldata)                 #add edges into graph
print G.nodes()
END PROGRAM.

Note that we now have the graph object G in the local Python environment for this particular SPSS session. We can then make our own functions that reference G but take other inputs. Here I have examples for the geodesic distance between two nodes, closeness and degree centrality, and the average degree of the neighbors.

BEGIN PROGRAM Python.
#path distance
def geo_dist(source,target): 
  return nx.shortest_path_length(G,source,target)
#closeness centrality
def close_cent(v):
  return nx.closeness_centrality(G,v)
#degree
def deg(v):
  return G.degree(v)
#average degree of neighbors
def avg_neideg(v):
  return nx.average_neighbor_degree(G,nodes=[v])[v]
END PROGRAM.

Here is a second SPSS dataset containing the node list on which we will calculate the mentioned statistics. For large graphs this is nice because you can select out a smaller subset of nodes and only worry about the calculations on that subset. For a crime analysis example, I may be monitoring a particular set of chronic offenders, and I want to calculate how close every person arrested within the month is to that set of chronic offenders (see the sketch after the data step below).

DATA LIST FREE / Employ (F1.0). 
BEGIN DATA 
1 
2
3
4
5
6
7
8
END DATA. 
DATASET NAME Emp. 
DATASET ACTIVATE Emp.
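
As a hypothetical sketch of that chronic offender example (the chronic set and the function name here are made up for illustration), you could define a function against G that returns the distance to the nearest member of a fixed node set. Note the node ids are floats, since that is how spssdata fetches numeric SPSS variables.

BEGIN PROGRAM Python.
#hypothetical chronic offender set for illustration
chronic = set([3.0, 4.0])

def dist_to_chronic(v):
  #geodesic distance to the closest member of the chronic set
  #returns None if no member is reachable, which should come
  #through as system missing in SPSSINC TRANS
  dists = []
  for c in chronic:
    if nx.has_path(G, v, c):
      dists.append(nx.shortest_path_length(G, v, c))
  if dists:
    return min(dists)
END PROGRAM.

You could then call dist_to_chronic in a SPSSINC TRANS formula the same way as the examples that follow.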

Now we have all the necessary ingredients to calculate our network statistics on these nodes. Here are examples of using SPSSINC TRANS to calculate the network statistics in the local SPSS dataset.

*Geodesic distance from 1.
SPSSINC TRANS RESULT=Dist TYPE=0
  /FORMULA "geo_dist(source=1.0,target=Employ)".

*closeness centrality.
SPSSINC TRANS RESULT=Cent TYPE=0
  /FORMULA "close_cent(v=Employ)".

*degree.
SPSSINC TRANS RESULT=Deg TYPE=0
  /FORMULA "deg(v=Employ)".

*Average neighbor degree.
SPSSINC TRANS RESULT=NeighDeg TYPE=0
  /FORMULA "avg_neideg(v=Employ)".

Laplacian Centrality in NetworkX (Python)

The other day I read a few papers on a new algorithm for calculating centrality in networks. Below are the two papers describing the Laplacian Centrality metric. The first is for non-weighted networks, and the second for weighted networks.

  • Qi, X., Duval, R. D., Christensen, K., Fuller, E., Spahiu, A., Wu, Q., Wu, Y., Tang, W., and Zhang, C. (2013). Terrorist networks, network energy and node removal: A new measure of centrality based on Laplacian energy. Social Networking, 2(1):19-31.
  • Qi, X., Fuller, E., Wu, Q., Wu, Y., and Zhang, C.-Q. (2012). Laplacian centrality: A new centrality measure for weighted networks. Information Sciences, 194:240-253.

The metric is fairly intuitive I think. The centrality parameter is a function of the local degree plus the degrees of the neighbors (with different weights for each). I figured it would be a quick programming exercise (which means I spent way too long trying to implement it!). To follow is some code that replicates the measures for both weighted and non-weighted graphs, using the Python networkx library.

The non-weighted graph code is easy, and is a near copy-paste from an igraph code snippet that was already available, with some updates to the idioms for NetworkX specifically. The norm option specifies whether you want solely the numerator value, that is the drop in energy between the full graph and the graph with the node removed (norm=False), or whether you want to divide that drop by the energy of the full graph. Note this function ignores the weights in the graph. The nbunch argument is for passing the function only a subset of nodes on which to calculate the centrality. (If you do that you might as well set norm=False for time savings as well.)

def lap_cent(graph, nbunch=None, norm=False):
  if nbunch is None:
    vs = graph.nodes()                        #no subset passed, use all nodes
  else:
    vs = nbunch
  degrees = graph.degree(weight=None)         #unweighted degrees
  if norm is True:
    den = sum(v**2 + v for i,v in degrees.items())  #energy of the full graph
    den = float(den)
  else:
    den = 1
  result = []
  for v in vs:
    neis = graph.neighbors(v)
    loc = degrees[v]
    nei = 2*sum(degrees[i] for i in neis)     #each neighbor loses one degree
    val = (loc**2 + loc + nei)/den
    result.append(val)
  return result

The weighted network is a bit more tricky though. I thought coding all of the 2-walks seemed a royal pain, so I developed a different algorithm that I believe is quicker. Here are three functions, but the last one, lap_cent_weighted, is the one of interest. The options are similar to the unweighted version, with the exception that you can pass a weight attribute (which is by default named 'weight' in NetworkX graphs).

def lap_energy(graph, weight='weight'):
  degrees = graph.degree(weight=weight)
  d1 = sum(v**2 for i,v in degrees.items())   #sum of squared weighted degrees
  wl = 0
  for i in graph.edges(data=True):
    wl += (i[2].get(weight))**2               #sum of squared edge weights
  return [d1,2*wl]

def cw(graph,node,weight='weight'):
  neis = graph.neighbors(node)
  ne = graph[node]
  cw,sub = 0,0
  for i in neis:
    we = ne[i].get(weight)                    #weight of the edge to neighbor i
    od = graph.degree(i,weight=weight)        #neighbor's weighted degree
    sub += -od**2 + (od - we)**2              #change in neighbor's contribution
    cw += we**2                               #squared weights of removed edges
  return [cw,sub]

def lap_cent_weighted(graph, nbunch=None, norm=False, weight='weight'):
  if nbunch is None:
    vs = graph.nodes()
  else:
    vs = nbunch
  if norm == True:
    fe = lap_energy(graph,weight=weight)      #energy of the full graph
    den = float(fe[0]+fe[1])
  else:
    den = 1
  result = []
  for i in vs:
    d2 = graph.degree(i,weight=weight)
    w2 = cw(graph,i,weight=weight)
    fin = d2**2 - w2[1] + 2*w2[0]             #total drop in energy when i is removed
    result.append(fin/den)
  return result

For a brief overview of the new algorithm (in some quick and dirty text math), the energy of the entire graph is defined as:

sum(di^2) + 2*sum(wij^2)   (1)

where di is the weighted degree of node i, and the second term is 2 times the sum of the squared edge weights. So when you take out a particular node, say 'A', the drop in the second term is easy: just iterate over the neighbors of 'A', calculate 2*sum(waj^2), and subtract that from the second term in equation 1.

The first term is slightly more complex. First there is a decrease due simply to the degree of the removed node, da^2. There is also a decrease in the degree of each neighboring node, so you need to calculate their updated contributions. The necessary info is available while you iterate over the neighbor list though: if a neighbor's original contribution is di^2 and the weight of the removed edge is wia, then the change in that neighbor's contribution is -di^2 + (di - wia)^2. You can accumulate this term at the same time you calculate the decrease in the weights in the second term.

I believe this algorithm is faster than the one originally written in the second paper. It should be something like O(n*a), where a is the average number of neighbors in the graph and n is the number of nodes. In the worst case a is the maximum number of neighbors any node has in the graph (which should be less than or equal to the max degree in a weighted graph).
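
If you want to convince yourself the shortcut agrees with the definition, a brute force check is simple to write with the lap_energy function above: delete each node in turn and difference the energies. This is a minimal sketch (the function name is my own) that should match lap_cent_weighted with norm=False.

def lap_cent_brute(graph, weight='weight'):
  full = sum(lap_energy(graph, weight=weight))  #energy of the full graph
  result = []
  for v in graph.nodes():
    H = graph.copy()
    H.remove_node(v)                            #drop node and its incident edges
    result.append(full - sum(lap_energy(H, weight=weight)))
  return result

It recomputes the energy once per node though, so it is only useful for testing on small graphs.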

Here is an example of using the functions with the small, toy example in the weighted network paper. Note that the lap_cent function ignores the weights.

import networkx as nx

Gp = nx.Graph()
ed = [('A','B',4),('A','C',2),('C','B',1),('B','D',2),('B','E',2),('E','F',1)]
Gp.add_weighted_edges_from(ed)

x = lap_cent(Gp)
xw = lap_cent_weighted(Gp, norm=True)
for a,b,c in zip(Gp.nodes(),x,xw):
  print a,b,c

Which prints out at the console:

A 18 0.7 
C 18 0.28 
B 34 0.9 
E 16 0.26 
D 10 0.22 
F 6 0.04
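
As a quick sanity check on node A: its unweighted degree is 2 and its neighbors B and C have degrees 4 and 2, so the unweighted drop is 2^2 + 2 + 2*(4+2) = 18. For the weighted version the full graph energy is 140 + 60 = 200, and removing A drops it by 6^2 + (9^2 - 5^2) + (3^2 - 1^2) + 2*(4^2 + 2^2) = 140, giving the normalized value 140/200 = 0.7.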

If you want to see that the graph is the same as the graph in the weighted Qi paper, use the code below.

import matplotlib.pyplot as plt
pos=nx.spring_layout(Gp) # positions for all nodes
nx.draw(Gp,pos=pos)
nx.draw_networkx_labels(Gp,pos=pos)
nx.draw_networkx_edge_labels(Gp,pos=pos)
plt.show()

Finding subgroups in a graph using NetworkX and SPSS

This is a task I've had to conduct under several guises in the past. Given a set of edges, reduce those edges into unique subgroups based on the transitive closure of those edges. That is, find the groups in which all nodes can reach one another (via however many steps are necessary) but are completely separated from all other nodes.

This is steeped in some network language jargon, so I will give a few examples in data analysis where this might be useful:

  • Find cliques of offenders (that may resemble a gang) given a set of co-offenders in a police incident database.
  • Reduce a large set of items that appear together into smaller subsets. An example may be if you have a multiple response set with a very large number of possible choices. You may identify subgroups of items that occur together.
  • Given a set of linked near match names, reduce the database so all of those near match names share the same unique id.
  • For my dissertation I aggregate crime incidents to street midpoints and intersections. This creates some overlapping or nearly overlapping points (e.g. at T intersections). You might want to further aggregate points that are within a prespecified distance, but there may be multiple nodes all within a short distance. These create a string of networked locations that are probably not appropriate to simply aggregate, especially when they include a large number of locations.

One (typical) way to find the transitive closure is to represent your edges in a binary adjacency matrix and then take subsequent higher powers of that matrix until the diffusion ceases. This is difficult to impossible though with node lists of any substantial size. In this post what I will do instead is use the NetworkX Python library, which contains a handy function named connected_components that solves this problem for me.
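
For intuition, here is a minimal sketch of the matrix power approach using numpy (the toy adjacency matrix is made up for illustration). You keep multiplying the reachability matrix by itself until it stops changing, and the resulting block structure identifies the subgroups.

import numpy as np

#adjacency for edges (0,1),(1,2) plus a separate pair (3,4)
A = np.array([[0,1,0,0,0],
              [1,0,1,0,0],
              [0,1,0,0,0],
              [0,0,0,0,1],
              [0,0,0,1,0]])
R = A + np.eye(5, dtype=int)   #seed reachability with self-loops
for _ in range(4):             #repeated squaring, more than enough for 5 nodes
  R = (R.dot(R) > 0).astype(int)
print R                        #rows 0-2 and rows 3-4 give the two subgroups

Even on this toy example you can see why it does not scale, as the matrix is n by n no matter how sparse the edge list is.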

So first, for illustration, let's make a small edge list in SPSS.

DATA LIST FREE / A B.
BEGIN DATA
1 2
2 3
3 4
5 6
7 8
4 9
7 9
8 10
END DATA.
DATASET NAME Test.
FORMATS A B (F5.0).
EXECUTE.

Now, using the functions described in this StackOverflow post, we will be able to turn a set of nested lists into a NetworkX graph object in Python.

BEGIN PROGRAM.
import networkx 
from networkx.algorithms.components.connected import connected_components

def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a Graph and returns it's edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    
END PROGRAM.
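
As a quick check on a plain nested list, these functions build a graph whose connected components are exactly what we are after. (The printed output assumes networkx 1.x, where connected_components returns a list of lists.)

BEGIN PROGRAM.
#quick check of the helper functions on a plain nested list
l = [[1,2],[2,3],[4,5]]
Gt = to_graph(l)
print connected_components(Gt)   #two components, [1, 2, 3] and [4, 5]
END PROGRAM.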

Now this Python code 1) imports our edge list from the SPSS dataset and turns it into a NetworkX graph, 2) reduces the set of edges into connected components, 3) makes a new SPSS dataset where each row contains the nodes of one of those subgraphs, and 4) makes a macro variable to identify the last variable name (for subsequent transformations).

DATASET DECLARE Int.
BEGIN PROGRAM.
#grab SPSS data
import spss, spssdata
alldata = spssdata.Spssdata().fetchall()

#turn SPSS data into graph
G = to_graph(alldata)
results = connected_components(G)
print results
ml = max(map(len,results))

#now make an SPSS dataset
spss.StartDataStep()
datasetObj = spss.Dataset(name='Int')
for i in range(ml):
  v = 'V' + str(i+1) 
  datasetObj.varlist.append(v,0)
for j in results:
  datasetObj.cases.append(j)
spss.EndDataStep()

#make a macro value to signify the last variable
macroValue=[]
macroName="!VEnd"
macroValue.append('V' + str(ml)) 
spss.SetMacroValue(macroName, macroValue)
END PROGRAM.

Now we can take that subgroup dataset, named Int, reshape it so all of the nodes are in one column with a second column identifying the subgraph each node belongs to, and then merge this info back into the edge dataset named Test.

DATASET ACTIVATE Int.
COMPUTE Group = $casenum.
FORMATS Group (F5.0).
VARSTOCASES
  /MAKE A FROM V1 TO !VEnd.
FORMATS A (F5.0).
SORT CASES BY A.

DATASET ACTIVATE Test.
SORT CASES BY A.
MATCH FILES FILE = *
  /TABLE = 'Int'
  /BY A.
EXECUTE.
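
For this toy edge list the routine finds two subgroups: the 4 9 and 7 9 edges link the 1-2-3-4 chain to the 7-8-10 cluster, so one group contains nodes 1, 2, 3, 4, 7, 8, 9, 10 and the other just nodes 5 and 6.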

From here we can make some nice sociogram charts of our subgroups. SPSS's documentation for layout.network is not very specific about the type of layout algorithm it uses, but it does a good job here of laying out a nice planar graph.

GGRAPH
  /GRAPHDATASET NAME="edges" DATASET = "Test" VARIABLES=A B Group
  /GRAPHDATASET NAME="nodes" DATASET = "Int" VARIABLES=A Group
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: e=userSource(id("edges"))
 DATA: Ae=col(source(e), name("A"), unit.category())
 DATA: Be=col(source(e), name("B"), unit.category())
 DATA: Groupe=col(source(e), name("Group"), unit.category())
 SOURCE: n=userSource(id("nodes"))
 DATA: An=col(source(n), name("A"), unit.category())
 DATA: Groupn=col(source(n), name("Group"), unit.category())
 GUIDE: axis(dim(1), null())
 GUIDE: axis(dim(2), null())
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 ELEMENT: edge(position(layout.network(node(An), from(Ae), to(Be))))
 ELEMENT: point(position(layout.network(node(An), from(Ae), to(Be))), color.interior(Groupn), size(size."14"), label(An))
END GPL.

At the end of the post I have some more code to illustrate this with a slightly larger random network of 300 potential nodes and 100 random edges. Again SPSS does quite a nice job of laying out the graph, and the colors by group reinforce that our solution is correct.

My most recent application of this had a set of 2,000+ edges and this code returned the solution instantaneously. Definitely a speed improvement over taking powers of a binary adjacency matrix in MATRIX code.

I wanted to make this network graph using small multiples by group, but I can't figure out the correct code for the faceting (an example is commented out at the end of the code snippet). So if anyone has an example of making an SPSS network graph with small multiples, let me know.

*Similar graphs for larger network.
DATASET CLOSE ALL.
INPUT PROGRAM.
COMPUTE #Nodes = 300.
LOOP #i = 1 TO 100.
  COMPUTE A = TRUNC(RV.UNIFORM(0,#Nodes+1)).
  COMPUTE B = TRUNC(RV.UNIFORM(0,#Nodes+1)).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Test.
FORMATS A B (F5.0).
EXECUTE.

DATASET DECLARE Int.
BEGIN PROGRAM.
#grab SPSS data
import spss, spssdata
alldata = spssdata.Spssdata().fetchall()

#turning SPSS data into NetworkX graph
#functions are already defined
G = to_graph(alldata)
results = connected_components(G)
ml = max(map(len,results))
print ml

#now make an SPSS dataset
spss.StartDataStep()
datasetObj = spss.Dataset(name='Int')
for i in range(ml):
  v = 'V' + str(i+1) 
  datasetObj.varlist.append(v,0)
for j in results:
  datasetObj.cases.append(j)
spss.EndDataStep()

#make a macro value to signify the last variable
macroValue=[]
macroName="!VEnd"
macroValue.append('V' + str(ml)) 
spss.SetMacroValue(macroName, macroValue)
END PROGRAM.

*Now merging groups back into original set.
DATASET ACTIVATE Int.
COMPUTE Group = $casenum.
FORMATS Group (F5.0).
VARSTOCASES
  /MAKE A FROM V1 TO !VEnd.
FORMATS A (F5.0).
SORT CASES BY A.

DATASET ACTIVATE Test.
SORT CASES BY A.
MATCH FILES FILE = *
  /TABLE = 'Int'
  /BY A.
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="edges" DATASET = "Test" VARIABLES=A B Group
  /GRAPHDATASET NAME="nodes" DATASET = "Int" VARIABLES=A Group
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
 SOURCE: e=userSource(id("edges"))
 DATA: Ae=col(source(e), name("A"), unit.category())
 DATA: Be=col(source(e), name("B"), unit.category())
 DATA: Groupe=col(source(e), name("Group"), unit.category())
 SOURCE: n=userSource(id("nodes"))
 DATA: An=col(source(n), name("A"), unit.category())
 DATA: Groupn=col(source(n), name("Group"), unit.category())
 GUIDE: axis(dim(1), null())
 GUIDE: axis(dim(2), null())
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 ELEMENT: edge(position(layout.network(node(An), from(Ae), to(Be))))
 ELEMENT: point(position(layout.network(node(An), from(Ae), to(Be))), color.interior(Groupn), size(size."11"))
END GPL.

*This small multiple faceting is not working.
*Error is Groupe & Groupn are not same faceting structure.
 * GGRAPH
  /GRAPHDATASET NAME="edges" DATASET = "Test" VARIABLES=A B Group
  /GRAPHDATASET NAME="nodes" DATASET = "Int" VARIABLES=A Group
  /GRAPHSPEC SOURCE=INLINE.
 * BEGIN GPL
 SOURCE: e=userSource(id("edges"))
 DATA: Ae=col(source(e), name("A"), unit.category())
 DATA: Be=col(source(e), name("B"), unit.category())
 DATA: Groupe=col(source(e), name("Group"), unit.category())
 SOURCE: n=userSource(id("nodes"))
 DATA: An=col(source(n), name("A"), unit.category())
 DATA: Groupn=col(source(n), name("Group"), unit.category())
 GUIDE: axis(dim(1), null())
 GUIDE: axis(dim(2), null())
 GUIDE: legend(aesthetic(aesthetic.color.interior), null())
 ELEMENT: edge(position(layout.network(1*1*Groupe, node(An), from(Ae), to(Be))))
 ELEMENT: point(position(layout.network(1*1*Groupn, node(An), from(Ae), to(Be))), color.interior(Groupn), size(size."14"), label(An))
END GPL.