Making an Edge List in SPSS

I have planned a series of posts on some data manipulation for network data. Here I am going to show how to go from either a list of network relations in long or wide format to a list of (non-redundant) edges in SPSS.

So to start off lets define what I mean by long, wide and edge format. Long format would consist of a table where there is an ID column defining a shared instance, and a sperate column defining the nodes that have relations within that event. Imagine an event is a criminal incident, and a sharing relation can be offender(s) or victim(s).

So a long table might look like this;


Incident# Name Status
--------------
1 Mary O
1 Joe O
1 Steve V
2 Bob O
2 Ted V
2 Jeff V

Here, incident 1 has three nodes, Mary, Joe and Steve. The Status field represents if the node is either an offender, O, or a victim, V. They are all related through the incident number field. Wide format of this data would have only one record for each unique incident number, and the people "nodes" would span across multiple columns. It might then look something like below, where a . represents missing data.


Incident# Offender1 Offender2 Victim1 Victim2
---------------------------------------------
1 Mary Joe Steve . 
2 Bob . Ted Jeff .

I’ve encountered both of these types of data formats in police RMS databases, and the following solution I propose to make an edge list with go back and forth between the two to produce the final list. So what do I mean by an edge list? Below is an example;


Incident# FromID ToID FromStatus ToStatus
-----------------------------------------
1 Mary Joe O O
1 Mary Steve O V
1 Joe Steve O V
2 Bob Ted O V
2 Bob Jeff O V
2 Jeff Ted V V

Here we define all possible relationship among the two incidents ignoring the FromID and the ToID fields order (e.g. Mary Joe is equivalent to Joe Mary). Why do we want an edge list like this? In further posts I will show to do some data manipulation to find neighbors of different degrees using data formatted like this, but suffice to say many graph drawing algorithms need data in this format (or return data in this format).

So below I will show an example in SPSS going from the long format to the edge list format. In doing show I will transform the long list to the wide format, so it is trivial to adapt the code to go from the wide format to the edge list (instead of generating the wide table from the long table, you would generate the long table from the wide table).

So to start lets use some data in long format.


data list free / MyID (F1.0) Name (A1) Status (A1).
begin data
1 A O
1 B O
1 C V
1 D V
2 E O
2 F O
2 G V
end data.
dataset name long.

Now I will make a copy of this dataset, and reshape to the wide format. Then I merge the wide dataset into the long dataset.


*make copy and reshape to one row.
dataset copy wide.
dataset activate wide.
casestovars
/id MyID
/seperator = "".

*merge back into main dataset.
dataset activate long.
match files file = *
/table = 'wide'
/by MyID.
dataset close wide.

From here we will reshape the dataset to wide again, and this will create a full expansion of possible pairs. This produces much redundancy though. So first, before I reshape wide to long, I get rid of values in the new set of Name? variables that match the original Name variable (can’t have an edge with oneself). You could technically do this after VARSTOCASES, but I prefer to make as little extra data as possible. With big datasets this can expand to be very big – a case with n people would expand to be n^2, by eliminating self-referencing edges it will only expand n(n-1). Also I eliminate cases simply based on the sort order between Name and the wide XName? variables. By eliminating cases based on the ordering it reduces it to n(n-1)/2 total cases after the VARSTOCASES command (which by default drops missing data).


*Reshape to long again!
do repeat XName = Name1 to Name4 /XStatus = Status1 to Status4.
DO IF Name = XName OR  Name > XName. 
    compute Xname = " ".
    compute XStatus = " ".
END IF.
end repeat.
VARSTOCASES
/make XName from Name1 to Name4
/make XStatus from Status1 to Status4.

So you end up with a list of non-redundant edges with supplemental information on the nodes (note you can change the DO IF command to just Name > XName, here I leave it as is to further distinguish between them). To follow are some more posts about manipulating this data further to produce neighbor lists. I’d be interested to see in anyone has better ideas about how to make the edge list. It is easier to make pairwise comparisons in MATRIX programs, but I don’t go that route here because my intended uses are datasets too big to fit into memory. My code will certainly be slow though (CASESTOVARS and VARSTOCASES are slow operations on large datasets). Maybe an efficient XSAVE? (Not sure – let me know in the comments!) The Wikipedia page on SQL joins has an example of using a self join to produce the same type of edge table as well.


Below is the full code from the post without the text in between (for easier copying and pasting).


data list free / MyID (F1.0) Name (A1) Status (A1).
begin data
1 A O
1 B O
1 C V
1 D V
2 E O
2 F O
2 G V
end data.
dataset name long.

*make copy and reshape to one row.
dataset copy wide.
dataset activate wide.
casestovars
/id MyID
/seperator = "".

*merge back into main dataset.
dataset activate long.
match files file = *
/table = 'wide'
/by MyID.
dataset close wide.

*Reshape to long again! - and then get rid of duplicates.
do repeat XName = Name1 to Name4 /XStatus = Status1 to Status4.
DO IF Name = XName OR  Name > XName. 
    compute Xname = " ".
    compute XStatus = " ".
END IF.
end repeat.
VARSTOCASES
/make XName from Name1 to Name4
/make XStatus from Status1 to Status4.
Advertisements
Leave a comment

4 Comments

  1. Kirill

     /  February 3, 2014

    +1. This is a straightforward combinatorial task. In MATRIX, you can do shed a single line doing Cartesian product:
    MATRIX.
    COMPUTE X= {‘Mary’,’O’;’Joe’,’O’;’Steve’,’V’}.
    PRINT X /format= a8.
    COMPUTE XX= {KRONEKER(X,MAKE(NROW(X),1,1)),KRONEKER(MAKE(NROW(X),1,1),X)}.
    PRINT XX /format= a8.
    END MATRIX.
    Well, that leaves unnecessary pairing like ‘Mary’ ‘Mary’. But you can loop through and delete them.

    Reply
    • Thank you Kirill, the Kronecker product is a good trick to know. I’ve pondered a bit how to coerce the original vectors to only return the “lower half” but have never quite figured it out.

      The original datasets I performed these operations on were around 150,000+ individuals. If you check out the next post in the series, Querying Graph Neighbors in SPSS, I adapted this code to take a flag variable so it will find all neighbors of a subset of individuals (essentially egocentric networks minus the ties of the individuals outside of the subset). This was meant to work with very large sets of individuals, but I did not get much time to try it out.

      Reply
  1. Querying Graph Neighbors in SPSS | Andrew Wheeler
  2. Making and Exploring Crime Networks (Access and Excel) | Andrew Wheeler

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: