entered into the data matrix, but the researcher decides to exclude them from the analysis. To create them for any particular variable, from the Variable View select the little blue box in the Missing column against the variable you want and obtain the Missing Values dialog box. This enables you either to pick out particular codes to be treated as missing values by clicking on the Discrete missing values radio button and entering up to three codes, or to select a range of missing values.
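Outside SPSS, the same idea of user-defined missing values can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how SPSS works internally: the function name apply_missing_codes and the codes 8 and 9 are hypothetical, and the limit of three discrete codes simply mirrors the dialog box described above.

```python
def apply_missing_codes(values, discrete=(), low=None, high=None):
    """Return a copy of values with user-defined missing codes replaced by None.

    discrete:  individual codes to treat as missing (at most three,
               mirroring the limit described in the text).
    low, high: optional inclusive range of codes to treat as missing.
    """
    if len(discrete) > 3:
        raise ValueError("at most three discrete missing codes are allowed")
    cleaned = []
    for v in values:
        in_range = low is not None and high is not None and low <= v <= high
        cleaned.append(None if v in discrete or in_range else v)
    return cleaned


# Hypothetical codes: 8 = 'Not applicable', 9 = 'No response'.
answers = [1, 2, 9, 3, 8, 2]
print(apply_missing_codes(answers, discrete=(8, 9)))
```

The values remain in the original list, just as they remain in the data matrix; only the cleaned copy passed on to the analysis treats them as missing.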
Open-ended questions
Most surveys contain one or more open-ended questions where responses are recorded as words, phrases, sentences or even more extended text. To be used in quantitative data analysis, the responses need to be categorized and each category given a code. The result should be either a binary or a nominal measure such that the values are exhaustive and mutually exclusive, or a fuzzy set giving degrees of membership of a defined category.
The approach to coding can be split into two situations. In the first situation, the open-ended question is being used to capture factual information, since listing all the options for responses in a closed question would take up too much space. Where respondents can give their answer in numerical form, for example putting in their age, then no additional coding is necessary. The actual age can simply be put into the data matrix. Where responses are in words, like brand purchased last time, then coding will involve creating a list of all the possible answers, assigning a code to each and recording a code for each respondent’s answer. It may be necessary to develop coding rules which specify codes to be allocated when the answer does not fit any of the obvious categories. For example, if respondents are asked ‘Not counting yourself, how many other people were you with?’ then most will give a clear number, but some may say ‘30–40’ or ‘a lot’. In this situation, one rule might be to give the mid-point of a range of values, so the answer ‘30–40’ will be coded as 35.
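A coding rule of this kind can be written down explicitly so that it is applied in the same way to every case. The Python sketch below is one possible version, assuming answers arrive as text: the function name code_count_answer is hypothetical, and the code -9 for answers that cannot be turned into a number (such as 'a lot') is an assumption made for illustration, since the text gives a rule only for ranges.

```python
import re


def code_count_answer(answer, missing_code=-9):
    """Code a free-text answer to a 'how many' question as a number."""
    text = answer.strip()

    # A range such as '30-40' or '30–40' is coded as its mid-point.
    match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (low + high) / 2

    # A clear whole number is entered as it stands.
    if re.fullmatch(r"\d+", text):
        return int(text)

    # Anything else (for example 'a lot') gets the assumed missing code.
    return missing_code


for raw in ["4", "30–40", "a lot"]:
    print(raw, "->", code_count_answer(raw))
```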
Where open-ended questions are being used not to capture factual information but to record respondent opinions, attitudes, views, knowledge, and so on, then creating a sensible code frame is the most important part of the analysis. By definition this is likely to get quite complex – if it were easy then the question could no doubt be pre-coded! The aim is to formulate a set of categories that accurately represents the answers and where each category includes an appreciable number of responses. Ideally, the set of categories should be exhaustive and mutually exclusive, and should minimize the loss of information. Furthermore, the categories should be meaningful, consistent and relatively straightforward to apply. There may also need to be separate codes for ‘No response’, ‘Not applicable’ and ‘Don’t know’. Where the information is very detailed there may need to be many codes.
Developing a frame may require several ‘passes’ over the data. It is probably a good idea to have all the comments collected and typed out, but this may not be possible. A method of constant comparison is probably best. Begin by looking at a few of the comments and see whether they should be put into separate categories. Then look at a few more and see if some can be put into the same category or whether more categories will need to be developed. When too many categories begin to emerge, look for similarities so that some categories can be brought together. If there is a large number of responses then it may not be sensible to look through all of them to develop the frame; instead, take a sample. Thus if there are 500 cases, a sample of 50–100 should enable the frame to be finalized. It also helps if more than one person develops a code frame separately; they should then work together on a final code frame. This maximizes the validity and reliability of the process.
It helps if the researcher sets up the objectives for which the code frame is to be used before beginning the process. Thus if the objective is to look for positive and negative statements about a situation or a product then answers will be coded along this dimension, perhaps with categories of very positive, vaguely positive, mixed, vaguely negative and very negative. Sometimes answers to open-ended questions can be coded in several ways according to different dimensions. Thus a study of injuries following an earthquake could look at the way injuries occurred, the parts of the body affected, where the injury occurred, what the person was doing at the time, and so on. Each of these aspects may need to be recorded separately in a different variable.
At one time researchers had to code all open-ended questions before data entry could begin. With modern survey analysis packages like SPSS, however, this may be done after all the pre-coded questions have been entered. This is a big advantage because researchers are not always sure how responses to open-ended questions should be coded until they have started analysis of the data. In short, it is sometimes better to delay coding of open-ended responses until they are needed for analysis.
Key points and wider issues
Before engaging in the description of a dataset, or even following an initial overview of the distribution for each variable one at a time or each case one at a time, the researcher may wish to transform variables in a number of ways: for example, regrouping values on a nominal or ordered category measure to create fewer categories, creating class intervals from metric variables, computing totals or other scores from combinations of several variables, treating groups of variables as a multiple response question, upgrading or downgrading measures, handling missing values and non-committal responses, or coding open-ended questions. Some transformations might involve creating crisp or fuzzy set memberships as was explained in Chapter 1.
Data transformation is an important part of the data analysis process. There are no ‘right’ or ‘wrong’ ways of engaging in data transformation and there are usually several different ways in which it can be done. Perhaps the best strategy is what is sometimes called ‘sensitivity analysis’, whereby transformations may be tried in different ways to see how sensitive the results are to such processes. This is particularly true for how missing cases and ‘Don’t know’ answers are handled.
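A very simple form of such a sensitivity check is to compute the same summary under two different treatments of ‘Don’t know’ and compare the results. The Python sketch below assumes a five-point rating with a hypothetical code of 9 for ‘Don’t know’; the ratings are made up for illustration, and the two treatments shown (excluding the cases, or recoding them to the scale mid-point of 3) are only two of several possibilities.

```python
from statistics import mean

DONT_KNOW = 9  # hypothetical code for 'Don't know'


def mean_excluding(values, dk=DONT_KNOW):
    """Mean rating with 'Don't know' cases left out of the calculation."""
    return mean(v for v in values if v != dk)


def mean_recoding(values, dk=DONT_KNOW, replacement=3):
    """Mean rating with 'Don't know' recoded to an assumed mid-point value."""
    return mean(replacement if v == dk else v for v in values)


# Made-up ratings on a 1-5 scale, with two 'Don't know' answers.
scores = [1, 2, 9, 4, 5, 9, 5, 4]

print("Excluding 'Don't know':", mean_excluding(scores))
print("Recoding to mid-point: ", mean_recoding(scores))
```

If the two figures lead to the same conclusion, the choice of treatment matters little for that result; if they diverge, the choice needs to be reported and justified.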
Implications of this chapter for the alcohol marketing data
Many of the codings in the original dataset were illogical or inconsistent; for example, for some questions relating to whether or not they had done particular activities, respondents were given the choice between ‘Yes’, ‘No’ and ‘Don’t know’, while for others it was just ‘Yes’ and ‘No’. In a question asking respondents to indicate how often they had come across adverts for a range of different products, ‘Very often’ was coded 1 and ‘Never’ coded as 6, with ‘Don’t know’ as 7. Besides being counter-intuitive (the higher the score, the less often), this way of coding makes it impossible to use the codes as metric values for summation since ‘Don’t know’ has the highest value. Accordingly, a number of data transformations were needed before analysis could begin. It was also necessary to create new variables additional to those in the questionnaire, for example the number of channels on which respondents had seen adverts for alcohol.
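The two kinds of transformation just described can be sketched in Python as follows. This is an illustration only, not the recoding actually carried out on the alcohol marketing dataset: the function names are hypothetical, the reversed scale (7 minus the original code) is one reasonable choice among several, and the channel indicators are assumed to be coded 1 for ‘Yes’ and 0 for ‘No’.

```python
DONT_KNOW = 7  # original code for 'Don't know' on the frequency question


def recode_frequency(code):
    """Reverse a 1-6 frequency code so that a higher score means more often.

    'Don't know' (7) becomes None so that it cannot enter a summed score.
    """
    if code == DONT_KNOW:
        return None
    if not 1 <= code <= 6:
        raise ValueError(f"unexpected code: {code}")
    return 7 - code  # 1 ('Very often') -> 6, 6 ('Never') -> 1


def count_channels(seen_flags):
    """Count the channels on which a respondent saw adverts (1 = 'Yes')."""
    return sum(1 for flag in seen_flags if flag == 1)


original = [1, 6, 7, 3]
print([recode_frequency(c) for c in original])

# Hypothetical Yes/No indicators for five advertising channels.
print(count_channels([1, 0, 1, 1, 0]))
```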
Chapter summary
Before researchers can proceed with the next stages of data analysis, the data need to be prepared by checking questionnaires or other instruments of data capture for usability, editing responses for legibility, completeness and consistency, coding any responses that are not pre-coded, and assembling the data together by entering all the values for all the variables for all the cases into a data matrix. Data entry into the survey analysis package SPSS was explained in some detail.
Before analysis of the data can begin, some of the variables may need to be transformed in various ways and decisions may have to be made about how to handle missing values. The careful preparation of data ready for analysis should never be neglected. If poor-quality data are entered into the analysis, then no matter how sophisticated the statistical techniques applied, a poor or untrustworthy analysis will result. Handled with care, data preparation can substantially enhance the quality and usefulness of data analysis: paying inadequate attention to it can seriously compromise the validity of the results.
Exercises and questions for discussion
1 To what extent can treating