Using card sorts to elicit software developers’ categorizations of project risks

Satvere Sanghera & Gordon Rugg

c. 3,700 words

Abstract

Although risk has been extensively described in several literatures, there has been surprisingly little research into how individuals categorize risks. This article describes how we used card sorts to elicit software developers’ own categorizations of a variety of factors which could pose risks to software development projects. The results were an interesting mixture of the expected and the unexpected: the developers’ categorizations were in some ways richer than those of the comparison group, as might be expected, but in other ways less rich. The developers’ categorizations were largely based on project management concepts, but were phrased in idiosyncratic ways, suggesting that the developers were using these concepts as a starting point for their own categorization, rather than applying them “by the book”. We conclude that this approach to investigating perception of risk is an effective one, and merits wider use.

Keywords: software project management; risk; card sorts

Introduction

Risk in general, and software project risks in particular, have received considerable attention from practitioners and the research community. One praiseworthy result of this is an extensive literature on risk management, together with corresponding methods and guidelines for best practice, within software engineering. Another praiseworthy result is an extensive literature on perception of risk, together with corresponding methods and guidelines for best practice, within psychology. This would be a good thing, except that in practice the two research communities have proceeded largely independently, often with quite different conclusions. A further complication is that there is yet another extensive literature, with its own body of practice, focusing on disasters and disaster management; this, like the literature on human error, has evolved its own methods and guidelines for best practice. This article describes how we used card sorts to investigate how software developers actually perceived risk, with a view to establishing how much relationship this bore to the various literatures on risk.

One relevant literature is the normative literature on risk analysis and risk management for information systems (IS). This topic is covered in most IS textbooks, usually as part of IS security. Standard approaches in this tradition include matrices of likelihood versus severity of risks. These will presumably be familiar to most readers, so are not discussed in further detail here.

The second literature which we took into account is the considerable literature on the psychology of risk perception, much of it relating to judgement/decision-making (J/DM) in the tradition of Kahneman, Slovic & Tversky’s classic text [1]. This literature has produced a substantial set of results showing that people are prone to serious and predictable errors in areas such as estimation of probabilities. A classic example involves framing: people tend to react differently to the same underlying risk depending on the way in which it is presented, even though the different presentations are formally equivalent. Another classic example is that people tend to be over-optimistic about their own prospects, for instance when estimating their own life expectancy, or the likelihood of failure when setting up a new company.

The J/DM literature is significant for two reasons. One is that it clearly demonstrates a predictable set of shortcomings and biases in risk estimation which apply even to experts. These findings go unmentioned in the vast majority of the IS risk literature. The second significant issue is that, although the reliability of the J/DM effects is not in dispute, there has been considerable debate within the literature about their validity. Researchers such as Gerd Gigerenzer have argued that the classic J/DM effects are largely an artefact of asking people to estimate risks as probabilities; when the same problem is re-cast as a formally equivalent frequency estimation, the effects typically vanish. The balance of opinion in the field has recently shifted towards the frequentist position, but there remains a general consensus that people have problems with risk estimation when the task is presented in certain ways. An excellent overview of work by Gigerenzer and other leading researchers in this tradition is provided by [2].

One consistent finding from the J/DM literature is that people’s reaction to risks is dominated by a few main underlying factors, the most prominent of which are severity of outcome, the extent to which the risk is novel and unknown, and the amount of control that the individual has over the situation. In general, the more prominent these factors are in a given situation, the more likely it is that people will react in ways which might appear irrational. Given the same severity of possible outcome, for instance, people tend to be much more concerned about an unfamiliar risk which is outside their control than about a much more likely risk which is familiar and perceived to be under their control – the classic example is people’s nervousness about air travel (statistically very safe) as opposed to automobile accidents (a much more common cause of death).

Taken in conjunction with other issues such as framing effects, this has serious implications for public policy decisions involving new technological developments, such as automated aviation systems to replace human operatives, since it can be extremely difficult to predict how the public will react. However, despite the seriousness of this issue, comparatively little research has involved asking people to categorize risks in their own terms. Most research has involved researchers deriving underlying factors from statistical analysis of experimental results – even when researchers have looked at individual differences in risk-taking behaviour, this has typically involved statistical analysis of individuals’ behaviour in experiments, rather than asking the individuals to explain their mental models of risk [3].

The third literature which we considered is the “disaster literature” in the tradition of researchers such as Charles Perrow [4] and Nancy Leveson and Clark Turner [5]. This literature typically examines the history of a particular incident in considerable detail, and frequently uses the results to guide policy formulation. It can lead both into very specific technical detail and into the realities of actual working practices, as opposed to official working practices – Leveson and Turner’s examination of the Therac-25 incident is a classic example of this, with a detailed examination both of the programming errors that led to the device killing several people, and of the working practices that allowed those errors to occur.

We wanted to find out which issues were perceived by developers as being important ones in software development project risk, and to see how these corresponded to the three main literatures described above, plus the literature on human error, which is discussed below.

One promising way of investigating this was via card sorts. Card sorts have been used informally for many years, but the formalization of one variety [6] has led to its more systematic use in areas such as elicitation of software metrics [7] and of programmers’ categorization of software problems [8]. This variety of card sorts involves showing a respondent a set of cards representing domain entities, and asking the respondent to sort the cards into groups which they consider significant. After the respondent has sorted the cards in relation to one criterion, they are asked to re-sort the same cards using a different criterion, and to repeat this process until all the relevant criteria have been covered. The cards may bear images, such as screen dumps of Web pages [7], or words, such as verbal descriptions of programming problems [8].
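
The data that this procedure yields have a simple structure: each respondent produces a sequence of sorts, each sort pairing a criterion name with the groups into which the cards were placed. The sketch below shows one minimal way of recording this in Python; the class and field names are our own illustrative choices, not part of the method as described in [6]:

from dataclasses import dataclass, field

@dataclass
class Sort:
    # One pass through the full deck against a single criterion.
    criterion: str                  # e.g. "familiarity"
    groups: dict[str, list[int]]    # group name -> numbers of the cards placed in it

@dataclass
class Session:
    respondent: str
    sorts: list[Sort] = field(default_factory=list)

# The respondent re-sorts the same deck once per criterion; the session ends
# when they can think of no further criteria worth mentioning.
session = Session(respondent="R1")
session.sorts.append(Sort(criterion="familiarity",
                          groups={"familiar": [1, 4, 5], "unfamiliar": [2, 3, 6]}))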

This method offered several promising advantages for research into how developers actually categorize risks. One is that it allows respondents to choose whichever criteria and groups they want. Another is that the responses from card sorts are typically short, discrete phrases, which are less ambiguous and vague than typical responses to interviews and to open questions in questionnaires. A third advantage is that the format of the method makes it possible to include quite detailed descriptions of programming problems, project problems, etc., in a tractable way. Previous researchers using this technique have consistently reported that respondents reacted positively to the method, and that respondents would often say explicitly when they thought they had listed all the criteria worth mentioning; this provides useful insurance against the risk of respondents giving trivial answers because of issues such as demand characteristics in a research context.

For this study, we concentrated on the three literatures described above. We did not focus primarily on the literature on human error, because some important types of human error occur at a subconscious level and are therefore not amenable to examination via card sorts. An example is frequency capture errors, where someone executes an action which they perform frequently, rather than a rarer action which is correct in the circumstances. A typical example is someone who usually enters their home through the front door, and who mistakenly gets out the front door key when they want to enter via the back door instead.

The method we used was as follows.

The case study

There were two groups, each of six respondents, with both groups containing a mix of Caucasian and Asian respondents, and with both groups containing three male and three female respondents to allow for possible gender effects in card sorting [8]. One group consisted of software developers with at least one year of commercial experience in information systems projects. The other was a control group consisting of respondents with no information systems experience beyond home computer use. The developers’ ages ranged from 22 to 38, and the control group’s from 20 to 36.

The materials used were nine standard 15 cm × 10 cm filing cards, each numbered, and each bearing a different description of a potential source of problems on a project. The full list, with the same numbering as on the cards, is as follows:

1: Failure to understand who the project is for

2: Failure to appoint an executive user responsible for sponsoring the project

3: Failure to appoint a fully qualified and supported project manager

4: Failure to define the objectives of the project

5: Failure to secure commitments from people who are needed to assist with the project

6: Failure to estimate costs accurately

7: Failure to specify very precisely the end users’ requirements

8: Failure to provide a good working environment

9: Failure to tie in all the people involved in the project with contracts

The procedure used was the standard one described in [6]. Respondents were shown the sorting process using pictures of houses as the example domain (to reduce the risk of cueing respondents towards a particular set of risk-related responses). They were told that they could use categories such as “Don’t know” or “Not applicable” if they wished. At the end of the session, they were asked to identify the card which in their opinion bore the most important risk for the outcome of a project, and the card which they felt bore the least important risk.
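
For concreteness, a single recorded response under this procedure might look like the following sketch; the criterion and grouping shown are hypothetical, not taken from any of our respondents:

# The nine cards, keyed by the numbers printed on them (wording abbreviated;
# the full texts are given in the list above).
cards = {
    1: "who the project is for",
    2: "executive user / sponsor",
    3: "qualified project manager",
    4: "project objectives",
    5: "commitments from people",
    6: "cost estimates",
    7: "end users' requirements",
    8: "working environment",
    9: "contracts",
}

# A hypothetical dyadic sort, with the optional "Don't know" category left empty.
recorded_sort = {
    "criterion": "human resources",
    "groups": {
        "people problems": [2, 3, 5, 9],
        "other": [1, 4, 6, 7, 8],
        "Don't know": [],
    },
}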

Results

The number of sorts performed ranged from two to twelve for the developers, and from two to four for the control group. This is consistent with the developers having more expertise, and therefore being able to generate more criteria for sorting. In total, the developers generated thirty-five criteria, and the control group generated eighteen. There were no significant gender differences within either group in the number of criteria generated: the male developers generated seventeen criteria compared to eighteen from the female developers, and the males and females in the control group generated nine criteria each.

There were some interesting differences in the number of categories (i.e. groups of cards) which respondents used within each sort. The most striking difference was that the developers used dyadic sorts (i.e. sorting the cards into two groups) on eighteen occasions (51% of their sorts), whereas the control group did so on only four occasions (22% of their sorts). The next most common pattern was sorting into three groups, used by the developers on nine occasions and by the control group on ten occasions; both groups made some use of sorts into one, four, five and six groups, but never on more than three occasions.

The reason for this difference in dyadic sorting, against no major differences in the other sorts, is an interesting question. Similar differences in the use of dyadic categories were reported by Sue Gerrard [9] in the domain of perceptions of women’s working dress, with males more likely than females to use dichotomous sorts; one possible explanation is that the males were less expert than the females in that domain, resulting in less rich categorization, but this is the opposite of the pattern reported here. Other studies in a variety of domains have also found clear differences between groups in the use of dyadic categories, relating to gender, expertise and ethnicity, but the results do not form a consistent pattern. For instance, gender has been clearly implicated in at least one domain where there were no a priori reasons to expect a gender difference (categorization of teaching materials), but in other domains, such as this one, there is no visible gender effect. This issue is the subject of some debate in the card sorts research community, but full discussion of it and its implications is beyond the scope of this article.
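
Deriving the sort-size distribution reported above from session records of the kind sketched earlier is mechanical; a minimal sketch, assuming each sort is stored as a (criterion, groups) pair (the records below are invented placeholders, not actual study data):

from collections import Counter

# Invented placeholder records: respondent -> list of (criterion, groups) pairs.
sessions = {
    "D1": [("human resources",
            {"people problems": [2, 3, 5, 9], "other": [1, 4, 6, 7, 8]})],
    "C1": [("environment",
            {"physical": [8], "organisational": [2, 3], "other": [1, 4, 5, 6, 7, 9]})],
}

# Distribution of the number of groups used per sort (dyadic sorts = 2, etc.).
sort_sizes = Counter(len(groups)
                     for sorts in sessions.values()
                     for _criterion, groups in sorts)
print(sort_sizes)   # e.g. Counter({2: 1, 3: 1})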

Content analysis began with analysis of verbatim agreement among names of criteria. This is a useful way of checking for the use of codified knowledge, typically learned via formal instruction. In some domains, different respondents typically use identical wording for criterion names, reflecting what they have been taught on courses or at university; in others, respondents vary widely in their terminology, reflecting independent learning. In this domain, we found only four instances of verbatim agreement, each involving only two respondents. The criteria involved were:

  • requirements vs non-requirements problems
  • environment
  • point of failure/cause of failure
  • human resources
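
Checking for verbatim agreement is straightforward once the criterion names have been transcribed; a minimal sketch, assuming one list of criterion names per respondent (the names below are placeholders, not our transcripts):

from collections import Counter

criteria_by_respondent = {
    "D1": ["human resources", "environment"],
    "D2": ["human resources", "point of failure/cause of failure"],
    "C1": ["environment"],
}

def normalize(name: str) -> str:
    # Verbatim agreement means identical wording; we ignore only case
    # and surrounding whitespace.
    return " ".join(name.lower().split())

counts = Counter(normalize(name)
                 for names in criteria_by_respondent.values()
                 for name in names)

# Criterion names used, word for word, by more than one respondent.
shared = {name: n for name, n in counts.items() if n > 1}
print(shared)   # {'human resources': 2, 'environment': 2}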

The next stage of content analysis involved grouping the individual criteria into coarser-grained superordinate criteria, by aggregating criteria which were closely related. This was done by an independent judge. The results from this are shown in Table 1 below.

Table 1: superordinate criteria

Superordinate criterion / Developers / Control Group / Total
Responsibility / 3 / 1 / 4
Outcome / 2 / 5 / 7
People involvement / 3 / 3 / 6
Project / 9 / 3 / 12
Measures / 2 / 0 / 2
Resources / 2 / 1 / 3
Objectives / 1 / 1 / 2
Problem / 2 / 0 / 2
Requirements / 1 / 1 / 2
Sponsorship / 1 / 0 / 1
Commitment / 1 / 0 / 1
Environment / 1 / 1 / 2
Analysis vs interpersonal / 1 / 0 / 1
Challenge / 1 / 0 / 1
Expenditure / 1 / 0 / 1
Consequences / 1 / 0 / 1
Relevance with regard to list* / 1 / 0 / 1
Business steps to develop a project / 1 / 0 / 1
Objective vs cost / 0 / 1 / 1
Risk / 0 / 1 / 1
Levels of management / 0 / 1 / 1
Total / 34 / 19 / 53

Note: *“Relevance with regard to list” refers to the relevance of an item considering the presence of the other items on the list – a somewhat idiosyncratic criterion generated by one respondent.

There are various other ways in which content analysis can be performed on the criteria and categories. An interesting result emerges when the criteria are categorized as “objective” or “subjective”, where “objective” criteria involve observable and measurable factors such as “resources”, and “subjective” criteria do not (for instance, “interesting problems”). Table 2 shows the results of this analysis for the criteria as originally stated.

Table 2: objective and subjective criteria

Respondent group / Objective criteria / Subjective criteria
Developers / 30 / 4
Control Group / 17 / 2
Total / 47 / 6
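
The split shown in Table 2 is a simple cross-tabulation once each criterion has been hand-coded as objective or subjective; a minimal sketch, assuming the hand-coding is supplied as (group, label) pairs (the pairs below are invented placeholders):

from collections import Counter

# Invented placeholder codings, one pair per elicited criterion.
coded = [
    ("Developers", "objective"),
    ("Developers", "subjective"),
    ("Control Group", "objective"),
]

table = Counter(coded)
for group in ("Developers", "Control Group"):
    print(group,
          table[(group, "objective")],
          table[(group, "subjective")])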

Discussion

The results show an interesting mixture of the expected and the unexpected. The developers used considerably more criteria than the control group, presumably reflecting greater expertise in dealing with projects and with project risks, as might be expected. However, the developers used considerably less rich categorization within each sort than the control group, with a preference for dichotomous categories; this was a surprising result, since both the literature on expertise and the findings from other work with card sorts would normally predict richer categorization by experts than by non-experts.

Similarly, the developers used a high proportion of superordinate criteria such as “measures”, “sponsorship”, “expenditure” and “business steps to develop a project” which were not used by the control group, and which are consistent with classic risk management factors. This is what might be expected if the developers are experts and have been trained in classic risk management; however, the low figures for verbatim agreement within the developer group suggest that members of this group were not simply applying well-practised standard procedures to categorizing the risks on the cards.

Another interesting finding was the high proportion of “objective” criteria used by both the developer group and the control group. As with the absence of gender differences in dichotomous categorization, the reasons for this are obscure, but the finding has practical implications which would merit further research.

Significantly absent from these results was any mention of the classic factors reported in the J/DM literature (severity of outcome, novelty, and lack of control), even though card sorting appears eminently suitable for allowing respondents to use these as criteria for sorting. A single study such as this one is not a sufficient basis for questioning the validity of these factors, but it does demonstrate that card sorting provides a good methodological basis for further investigation of this issue.