4
This chapter shows how to get data into the R Commander from a variety of sources, including entering data directly at the keyboard, reading data from a plain-text file, accessing data stored in an R package, and importing data from an Excel or other spreadsheet, or from other statistical software. I also explain how to save and export R data sets from the R Commander, and how to modify data—for example, how to create new variables and how to subset the active data set.
Managing data is an admittedly dry but nevertheless vitally important topic. As experienced researchers will attest, collecting, assembling, and preparing data for analysis are typically much more time consuming than analyzing the data. It makes sense to take up data management early because you’ll always have to read data, and usually have to modify them, before you can perform statistical analysis. I suggest that you read as much of this chapter as you need to get started with your work, but that you at least familiarize yourself with the balance of its contents, so that you know where to look when a particular data management problem presents itself.
A graphical interface such as the R Commander is frankly not the best data management tool. Even common data management tasks are very diverse, and data sets often have unique characteristics that require customized treatment. GUIs, in contrast, excel at tasks where the choices are limited and can be anticipated.
As a flexible programming language, R is well suited to data management tasks, and there are many R packages that can help you with these tasks. Often, the most straightforward solution is to write a script of R commands or a simple R program to prepare a data set for analysis. The data management script or program need not be elegant or efficient; it simply must work properly, because it typically will be used only once. What does it matter if your data management script takes a few minutes to run? That’s just a small part of the time you’ll invest in preparing and analyzing your data. Learning a little bit of R programming (see ), therefore, goes a long way to making you a more efficient data analyst, even if you choose to use the R Commander for routine data analysis tasks.
Recall from that data sets in the R Commander are stored in R data frames: rectangular data sets in which the rows represent cases and the columns are usually either numeric variables or factors (categorical variables). There are many ways to read data into the R Commander, and I will describe several of them in the following subsections, including entering data directly at the keyboard into the R Commander data editor, reading data from plain-text files, importing data from other software including spreadsheets such as Excel, and accessing data sets stored in R packages.
Plain-text or spreadsheet data must be in a simple, rectangular form, with one row or line for each case, and one column or “field” for each variable. There may be an initial row with variable names, or an initial column with case names, or both. If the data are in an irregular or other more complex form, you’ll likely have to do some work before you can read them into the R Commander. The R Commander, however, can handle simple merges of separate data sets with the same variables or cases (see ).
One way to enter a small data set into the R Commander is to type the data into a plain-text file using an editor or into a spreadsheet, and then to use the input methods described in or . My personal preference is to keep small data sets in plain-text files.
Alternatively, it is also simple to type small data sets directly into the R Commander data editor. To illustrate, I’ll enter a data set that appears in an exercise in an introductory statistics text, Moore et al. (2013, Exercise 5.4). The data, for 12 women in a study of dieting, comprise two numeric variables: each woman’s lean body mass, in kilograms, and her resting metabolic rate, in calories burned per hour.
I begin by selecting Data > New data set from the R Commander menus. That brings up the New Data Set dialog shown at the top-left of . I replace the default data-set name Dataset with the more descriptive name Metabolism. Pressing OK opens the R Commander data editor with an empty data set, as shown at the top-right. There is initially one case (row) and one variable (column) in the data set.
Pressing the Add column button once and the Add row button 11 times produces the still-empty data set shown at the bottom-left of , with all of the values in the body of the data table initially missing (NA). You can also at any point press the Enter or return key to add a new row to the bottom of the data set, or the Tab key to add a new column at the right of the data. Using these keys rather than the buttons can be convenient for initial data entry: Simply Tab as you enter numbers in the first row until the data set contains columns for all the variables, and then Enter (or return) after you complete typing in the data in each row.
I next replace the generic variable names V1 and V2 with mass and rate, and enter the values for each variable, as given in the text by Moore et al., into the cells of the data table, as shown at the bottom-right. Once finished, I press the OK button in the editor, making Metabolism the active data set. The following message appears in the R Commander Messages pane: NOTE: The dataset Metabolism has 12 rows and 2 columns.
To see whether and how the variables are related, I suggest that you make a scatterplot of the data, via Graphs > Scatterplot, with lean body mass on the horizontal axis and metabolic rate on the vertical axis. If you wish, then perform a linear least-squares regression of rate on mass: Statistics > Fit models > Linear regression.
FIGURE 4.1: Using the R Commander data editor to enter the Metabolism data set, proceeding from upper-left to lower-right.
Here is some additional information about using the R Commander data editor for direct data entry:
• In this data set, the row names are simply numbers supplied by the editor, but you can also type names into the rownames column of the data editor. If you do so, however, make sure that each name is unique (i.e., that there are no duplicate names) and that any names with embedded blanks are enclosed in quotes (e.g., “John Smith”); alternatively, just squeeze out the blanks (JohnSmith), or use a period, dash, or other character as a separator (John.Smith, John-Smith).
• You can use the arrow keys on your keyboard to move around the cells of the data editor, or simply left-click in any cell. When you do so, text that you type will replace whatever is currently in the cell—initially, the default variable names, the row numbers, and the NAs in the body of the data table.
• If a column entered into the data editor consists entirely of numbers, it will become a numeric variable. Conversely, a column containing any non-numeric data (with the exception of the missing data indicator NA) will become a factor. Character data with embedded blanks must be enclosed in quotes (e.g., “agree strongly”).
• If the initial column width of the data editor cells is insufficient to contain a variable name, a row name, or a data value, the cell in question won’t be displayed properly. Left-click on either the left or right border of the data editor and drag the border until all cells are sufficiently wide for their contents. If the width (or height) of the data editor window is insufficient to display simultaneously all columns (or rows) of the data set, then the horizontal (or vertical) scrollbar will be activated.
• The Edit menu in the data editor supports some additional actions, such as deleting rows or columns, and the Help menu provides access to information about using the editor.
• The data editor is a modal dialog: Interaction with the R Commander is suspended until you press the OK or Cancel button in the editor.
• After typing in a data set in this manner, it is, as always, a good idea to press the View data set button in the R Commander toolbar to confirm that the data have been entered correctly.
You can also use the R Commander data editor to modify an existing data set—for example, to fix an incorrect data value: Press the Edit data set button in the R Commander toolbar to edit the active data set.
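The typing rules in the list above correspond closely to how R assigns column types in a data frame. The following is a minimal sketch in plain R, not the command that the data editor itself generates; the data-set name Toy, the variable names, the row names, and all of the values are invented for illustration and are not the Moore et al. data.

```r
# Hypothetical values, invented for illustration only
Toy <- data.frame(
  mass = c(40.3, 51.9, NA, 36.1),                          # numbers and NA only: numeric
  opinion = c("agree", "agree strongly", NA, "disagree"),  # non-numeric values: factor
  row.names = c("Ann", "Bea", "Cat", "Dee"),               # row names must be unique
  stringsAsFactors = TRUE
)

str(Toy)                # shows one numeric variable and one factor
is.numeric(Toy$mass)
is.factor(Toy$opinion)
```

The missing-data indicator NA is allowed in either kind of column without changing its type.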
In , I demonstrated how to read data into the R Commander from a plain-text, comma-separated-values (CSV) data file. To recapitulate, each line in a CSV file represents one case in the data set, and all lines have the same number of values, separated by commas. The first value in each line may be a case name, and the first line of the data file may contain variable names, also separated by commas. If there are both case names and variable names in the data, then there will be one fewer variable name in the first line than values in subsequent lines; otherwise, there must be the same number of fields (values) in each line of the file. On input to R, empty fields (i.e., produced by adjacent commas) or fields consisting only of spaces will translate into missing data (NAs).
CSV files are a kind of least common denominator of data storage. Almost all software that deals with rectangular data sets, including spreadsheet programs like Excel and statistical software like SPSS, is capable of reading and writing CSV files. CSV files, therefore, are often the simplest file format for moving data from one program to another.
You may have to do minor editing of a CSV file produced by another program before reading the data into the R Commander—for example, you may find it convenient to change missing data codes to NA—but these operations are generally straightforward in any text editor. Do use an editor intended for plain-text (ASCII) files, however: Word processors, such as Word, store documents in special files that include formatting information, and these files cannot in general be read as plain text. If you must edit a data file in a word processor, be very careful to save the file as plain text.
An advantage of CSV files is that character data fields may include embedded blanks (as, e.g., in strongly agree) without having to enclose the fields in quotes (“strongly agree” or ‘strongly agree’). Fields that include commas, however, must be quoted (e.g., “don’t know, refused”). By the way, fields that include single (‘) or double (“) quotes must be quoted using the other quote character (as in “don’t know, refused”); otherwise, either single or double quotes may be used (as long as the same quote character is used at both ends of a character value—e.g., “strongly agree’ isn’t legal).
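To make the CSV conventions concrete, here is a small sketch that reads a few invented lines with the standard R function read.csv. The case names, variable names, and values are all hypothetical, and the call is only an approximation of what the Read Text Data dialog does.

```r
# Invented lines standing in for a CSV file; all names and values are hypothetical
csv_lines <- c(
  "income,education,answer",             # three variable names ...
  "case1,62,86,strongly agree",          # ... but four values: the first is the case name
  "case2,72,,\"don't know, refused\"",   # an empty field, and a quoted field with a comma
  "case3,75,92,agree"
)

Survey <- read.csv(text = csv_lines, strip.white = TRUE)

rownames(Survey)     # the case names
Survey$education     # the empty numeric field is read as NA
Survey$answer        # embedded blanks and the quoted comma are preserved
```

Because the first line has one fewer field than the lines that follow, R treats the first value in each subsequent line as a row name rather than as a variable.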
The R Commander Read Text Data dialog also supports reading data from plain-text files with fields separated by white space (one or more blanks), tabs, or arbitrary characters, such as colons (:) or semicolons (;). Moreover, the data file may reside on the Internet rather than the user’s computer, or may be copied to and read from the clipboard (as illustrated for spreadsheet data in ).
shows a few lines from a small illustrative text data file, Duncan.txt, with white space separating data fields. Multiple spaces are employed to make the data values line up vertically, but this is inessential—one space would suffice to separate adjacent data values. Periods are used in the case names instead of spaces (e.g., mail.carrier) so that the names need not be quoted; commas (,) and dashes (-) are similarly used for the categories of type.
The data, on 45 U. S. occupations in 1950, are drawn mostly from Duncan (1961); I added the variable type of occupation to the data set. The variables are defined in . The income and education data were derived by Duncan from the U. S. Census, while occupational prestige was obtained from ratings of the occupations in a social survey of the population. Duncan used the least-squares linear regression of prestige on income and education to compute predicted prestige scores for the majority of occupations in the Census for which there weren’t direct prestige ratings.
FIGURE 4.2: The Duncan.txt file, with white-space-delimited data from Duncan (1961) on 45 U. S. occupations in 1950. Only a few of the 46 lines in the file are shown; the widely spaced ellipses (…) represent elided lines. The first line in the file contains variable names. I added the variable type to the data set.
TABLE 4.1: Variables in Duncan’s occupational prestige data set.
Variable | Values
type | blue-collar; white-collar; prof, tech, manag (professional, technical, or managerial)
income | percentage of occupational incumbents earning $3500 or more
education | percentage of occupational incumbents with high-school education or more
prestige | percentage of prestige ratings of good or better
TABLE 4.2: Variables in the Canadian occupational prestige data set.
Variable | Values
education | average years of education of occupational incumbents
income | average annual income of occupational incumbents, in dollars
prestige | average prestige rating of the occupation (0–100 scale)
women | percentage of occupational incumbents who were women
census | the Census occupation code
type | bc, blue-collar; wc, white-collar; prof, professional, technical, or managerial
Reading a white-space-delimited, plain-text data file in the R Commander is almost identical to reading a comma-delimited file: Select Data > Import data > from text file, clipboard, or URL from the R Commander menus. Fill in the resulting dialog box (see on ) to reflect the structure and location of the input file. In the case of Duncan.txt, I would take all of the defaults, including the default White space field separator, with the exception of the name of the data set, where I’d substitute a descriptive name like Duncan for the default name Dataset.
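For readers curious about what happens behind the dialog, the following sketch reads a white-space-delimited file with the standard R function read.table. I write a few invented lines to a temporary file as a stand-in for a file like Duncan.txt; the occupations and values shown are hypothetical, not Duncan's data, and the exact command that the R Commander generates may differ.

```r
# Write a few invented lines to a temporary file (a stand-in for a file like Duncan.txt)
txt_file <- tempfile(fileext = ".txt")
writeLines(c(
  "        type  income education",   # variable names only: one fewer field than below
  "job.a     bc      20        30",
  "job.b     wc      45        60",
  "job.c   prof      80        90"
), txt_file)

# The default separator for read.table is white space (one or more blanks or tabs)
Occupations <- read.table(txt_file, header = TRUE, stringsAsFactors = TRUE)

str(Occupations)        # type is a factor; income and education are numeric
rownames(Occupations)   # the case names from the first field of each line
```

The extra spaces that line up the columns are ignored on input, which is why one space between values would do just as well.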
Many researchers enter, store, and share small data sets in spreadsheet files. To illustrate, I’ve prepared two Excel files, the older format file Datasets.xls and the newer format Datasets.xlsx, both containing two data sets: Duncan’s U. S. occupational prestige data, and a similar data set for Canada circa 1970, which is described by Fox and Suschnigg (1989). The Canadian occupational prestige data include the variables in , and the spreadsheet containing the data appears in , which shows the first 21 rows in the spreadsheet; there are 103 rows in all, including the initial row of variable names. The education, income, and occupational gender composition data come from the 1971 Canadian Census, while the prestige scores are the average ratings of the occupations on a 0–100 “thermometer” scale in a mid-1960s Canadian national survey. The structure of the spreadsheet is similar to that of a plain-text input file: There is one row in the spreadsheet for each case, there’s an optional row of variable names at the top, and there’s an optional initial column of case names. When case names are present in the first column, there should be no variable name at the top of the column.
To read the Excel spreadsheet, I select Data > Import data > from Excel file from the R Commander menus, which brings up the dialog box at the left of . I complete the dialog box to reflect the structure of the Prestige spreadsheet—including retaining the default <empty cell> Missing data indicator—and I enter the descriptive name Prestige for the data set, replacing the generic default name Dataset.
Pressing OK in the Import Excel Data Set dialog leads to a standard Open file dialog box, where I navigate to the location of the Excel file containing the data, select it, and press the Open button, producing the Select one table dialog at the right of . I left-click on the Prestige table and click OK to read the data into the R Commander, making the resulting Prestige data frame the active data set.
FIGURE 4.3: The Excel file Datasets.xlsx showing the first 21 (of 103) rows in the Prestige spreadsheet.
FIGURE 4.4: The Import Excel Data Set dialog box (left), and the Select one table sub-dialog (right), choosing the Prestige spreadsheet (table).
There are two other simple procedures for reading rectangular data stored in spreadsheets:
• Export the spreadsheet as a comma-delimited plain-text file. Editing the spreadsheet before exporting it, to make sure that the data are in a suitable form for input to the R Commander, can save some work later. For example, you don’t want commas within cells unless the contents of the cells are quoted.
Likewise, you might have to edit the resulting CSV file before importing the data into the R Commander. For example, if the first row of the spreadsheet contains variable names and the first column contains row names, the exported CSV file will have an empty first field corresponding to the empty cell at the upper-left of the spreadsheet (as in the Prestige spreadsheet in ). Simply delete the extra initial comma in the first row of the CSV file; otherwise, the first column will be treated as a variable rather than as row names.
• Alternatively, select the cells in the spreadsheet that you want to import into the R Commander: You can do this by left-clicking and dragging your mouse over the cells, or by left-clicking in the upper-left cell of the selection and then Shift-left-clicking in the lower-right cell. Selecting cells in this manner is illustrated in for the Excel spreadsheet containing Duncan’s occupational prestige data. Then copy the selection to the clipboard in the usual manner (e.g., Ctrl-c on a Windows system or command-c on a Mac); in the R Commander, choose Data > Import data > from text file, clipboard, or URL, and press the Clipboard radio button in the resulting dialog, leaving the default White space selected as the Field Separator. The data are read from the clipboard as if they reside in a white-space-separated plain-text file.
In addition to plain-text files and Excel spreadsheets, the Data > Import data menu includes menu items to read data from SPSS internal and portable files, from SAS xport files, from Minitab data files, and from Stata data files. You can practice with the SPSS portable file Nations.por, which is on the web site for this book.
Many R packages include data sets, usually in the form of R data frames, and these are suitable for use in the R Commander. When the R Commander starts up, some packages containing data sets are loaded by default. If you’ve installed other packages with data that you want to use, you can load the packages subsequently via Tools > Load package(s).
Selecting Data > Data in packages > List data sets in packages from the R Commander menus opens a window listing all data sets available in currently loaded R packages. Selecting Data > Data in packages > Read data from an attached package produces the dialog box in . Initially, the dialog appears as at the top of the figure.
If you know the name of the data set you want to read, you can type it into the Enter name of data set box, as in the middle of , where I entered the name of the Duncan data set. As it turns out, this data set, containing Duncan’s occupational prestige data, is supplied by the car package (Fox and Weisberg, 2011). Because data sets in R packages are associated with documentation, pressing the Help on selected data set button will now bring up the help page for the Duncan data set. Clicking OK reads the data and makes Duncan the active data set in the R Commander.
FIGURE 4.5: Selecting the cells in Duncan’s data set from the Duncan spreadsheet, to be copied to the clipboard.
FIGURE 4.6: Reading a data set from an R package. The initial state of the Read Data From Package dialog is shown at the top; in the middle, I typed Duncan as the data set name; at the bottom, I selected car from the package list and Prestige from the data set list.
Alternatively, the left-hand list box in the Read Data From Package dialog includes the names of currently attached packages that contain data sets. Double-clicking on a package in this list, as I’ve done for the car package at the bottom of , displays the data sets in the selected package in the list box at the right of the dialog. You can scroll this list in the normal manner, using either the scrollbar at the right of the list or clicking in the list and pressing a letter key on your keyboard—I pressed the letter p in and subsequently double-clicked on Prestige to select it, which transferred the name of the data set to the Enter name of data set box. Finally, pressing the OK button in the dialog reads the Prestige data set and makes it the active data set in the R Commander. This is the Canadian occupational prestige data set, with which we are already familiar.
When a data set residing in an attached package is the active data set in the R Commander, you can access documentation for the data set via either Data > Active data set > Help on active data set or Help > Help on active data set.
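The same operations can be sketched with R commands. So that the example runs without installing anything, I use the datasets package, which ships with R, and its small women data set, rather than the car package discussed above; that substitution is mine, purely for illustration.

```r
# List the data sets in a package (the datasets package ships with R)
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Read one data set by name into the workspace
data("women", package = "datasets")
str(women)   # a data frame with the variables height and weight

# Open the documentation page for the data set (for interactive use)
# help("women", package = "datasets")
```

Substituting another package name and data-set name should work in the same way, provided that the package is installed.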
You can save the active data set in an efficient internal format by selecting Data > Active data set > Save active data set from the R Commander menus, bringing up the Save As dialog shown in .
The file name suggested for the saved data is Prestige.RData because Prestige is the active data set. Before pressing the Save button in the dialog, navigate to the location in your file system where you want to save the file. In a subsequent session, you can load the saved data set via Data > Load data set, navigating your file system to the location of the data, and selecting the previously saved Prestige.RData file.
There are two common reasons for wanting to save a data set in internal format, the second of which is unlikely to apply to data analyzed in the R Commander: (1) You have modified the data—for example, creating new variables, as described in the next section—and you want to be able to continue in a subsequent session without having to repeat your data management work. (2) The data set is so large that reading it from a plain-text file is time consuming.
You can also export the current data set as a plain-text file by choosing Data > Active data set > Export active data set from the R Commander menus, which produces the dialog box in Figure 4.8. Complete the dialog to reflect the form in which you want to export the data—the default choices are shown in the figure—and click OK, subsequently navigating to the location where you want to store the exported data (in the resulting Save As dialog box, which isn’t shown). If you select Commas as the field separator, the R Commander suggests the name Prestige.csv for the exported data file; otherwise it suggests the file name Prestige.txt.
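As a rough illustration of the difference between saving and exporting, the sketch below does both for a small invented data frame, using the standard R functions save, load, and write.table. The data frame Toy and the file locations are hypothetical, and the commands generated by the R Commander dialogs may differ in their details.

```r
# A small invented data frame standing in for the active data set
Toy <- data.frame(x = c(1.5, 2.5, NA), g = factor(c("a", "b", "a")))

# Save in R's internal format, then restore it (as one would in a later session)
rdata_file <- file.path(tempdir(), "Toy.RData")
save("Toy", file = rdata_file)
rm(Toy)
load(rdata_file)     # re-creates the object Toy, with the factor intact

# Export as plain text: comma-separated, with variable names but no row names
csv_file <- file.path(tempdir(), "Toy.csv")
write.table(Toy, csv_file, sep = ",", col.names = TRUE,
            row.names = FALSE, quote = TRUE, na = "NA")
readLines(csv_file)  # the exported file, one line per element
```

The internal format preserves the data frame exactly as it was, whereas the exported text file has to be read and interpreted again on input.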
FIGURE 4.7: Saving the active data set.
FIGURE 4.8: Exporting the active data set as a plain-text file.
4.4 Modifying Variables
Menu items in the R Commander Data > Manage variables in active data set menu are devoted to modifying variables and creating new variables. In Section 3.4, I explained how to use the Recode Variables dialog to change the levels of a factor and to create a factor from a numeric variable. I also showed how to use the Reorder Factor Levels dialog to alter the default alphabetic ordering of factor levels. In this section, I provide additional information on recoding variables and describe other facilities in the R Commander for transforming data.
4.4.1 Recoding Variables
Two common uses of the Recode Variables dialog to create new factors were illustrated in Section 3.4: Figure 3.9 (page 31) shows how to transform a numeric variable into a factor, and Figure 3.10 (page 32) shows how to reorganize the levels of a factor.
Recode directives in the Recode Variables dialog take the general form old-value(s) = new-value, where old-value(s) (i.e., original values of the variable being recoded) are specified in one of the several patterns enumerated in Table 4.3. Here is some additional information about formulating recodes:
• If an old value of the variable being recoded satisfies none of the recode directives, the value is simply carried over into the recoded variable. For example, if the directive "strongly agree" = "agree" is employed, but the old value "agree" is not recoded, then both old values "strongly agree" and "agree" map into the new value "agree".
• If an old value of the variable to be recoded satisfies more than one of the recode directives, then the first applicable directive is applied. For example, if the variable income is recoded with lo:25000 = "low" and 25000:75000 = "middle" (specified in that order), then a case with income = 25000 receives the new value "low".
• As illustrated in the recode directive lo:25000 = "low", the special value lo may be used to represent the smallest value of a numeric variable; similarly, hi may be used to represent the largest value of a numeric variable.
• The special old value else matches any value (including NA) that doesn’t explicitly match a preceding recode directive. If present, else should therefore appear last.
• If there are several variables to be recoded identically, then they may be selected simultaneously in the Variables to recode list in the Recode Variables dialog.
• The Make (each) new variable a factor box is initially checked, and consequently the Recode Variables dialog creates factors by default. If the box is unchecked, however, you may specify numeric new values (e.g., "strongly agree" = 1), or even character (e.g., 1:10 = "low") or logical (e.g., "yes" = TRUE) new values.
• As a general matter, in specifying recode directives, factor levels and character values on both sides of = must be quoted using double quotes (e.g., "low"), while numeric values (e.g., 10), logical values (TRUE or FALSE), and the missing data value NA are not quoted.
• Recall that one recode directive is entered on each line of the Recode directives box: After you finish typing a recode directive, press the Enter or Return key to move to the next line.
TABLE 4.3: Recode directives employed in the Recode Variables dialog.
Old Value(s) | Example Recode Directives
an individual value (a) | 99 = NA; NA = "missing"; "strongly agree" = "agree"
a set of values (a, b, …, k) | 1, 3, 5 = "odd"; "strongly agree", "agree somewhat" = "agree"
a numeric range (a : b) | 1901:2000 = "20th Century"; lo:20000 = "low income"; 100000:hi = "high income"
anything else (else; must appear last) | else = "other"
The special old values lo and hi may be used to represent the smallest and largest values of a numeric variable, respectively.
FIGURE 4.9: The Compute New Variable dialog.
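The rule that the first applicable directive wins can be mimicked in plain R. The sketch below is not the command produced by the Recode Variables dialog; it uses nested calls to the standard ifelse function, with invented income values, to imitate the directives lo:25000 = "low", 25000:75000 = "middle", and 75000:hi = "high".

```r
income <- c(10000, 25000, 50000, 75000, 120000, NA)   # invented values

# The conditions are tested in order, so a value on a shared boundary
# (25000 or 75000) is claimed by the first range that contains it
level <- ifelse(income <= 25000, "low",
         ifelse(income <= 75000, "middle", "high"))

income.level <- factor(level, levels = c("low", "middle", "high"))
table(income.level, useNA = "ifany")
```

Note one difference from the dialog: in this sketch a missing income stays missing, whereas an else directive in the dialog would also match NA.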
4.4.2 Computing New Variables
Selecting Data > Manage variables in active data set > Compute new variable from the R Commander menus produces the dialog box displayed in Figure 4.9. At the top of the dialog is a list of variables in the active data set (the Prestige data set, read from the car package in Section 4.2.4). Notice that the variable type is identified as a factor in the variable list; the other variables in the data set are numeric.
At the bottom of the dialog are two text fields, the first of which contains the name of the new variable to be created (initially, variable): I typed log.income as the name of the new variable. If the name of the new variable is the same as that of an existing variable in the current data set, then, when you press the OK or Apply button in the dialog, the R Commander will ask whether you want to replace the existing variable with the new one.
The second text box, which is initially empty, contains an R expression defining the new variable; you can double-click on variable names in the Current variables list to enter names into the expression, or you can simply type in the complete expression. In the example, I typed log10(income) to compute the log base 10 of income. Pressing OK or Apply causes the expression to be evaluated, and (if there are no errors) the new variable log.income is added to the Prestige data set.
New variables may be simple transformations of existing variables (as in the log transformation of income), or they may be constructed straightforwardly from two or more existing variables. For example, the data set DavisThin in the car package contains seven variables, named DT1 to DT7, that compose a “drive-for-thinness” scale. The items are each scored 0, 1, 2, or 3. To compute the scale, I need to sum the items, which I may do with the simple expression DT1 + DT2 + DT3 + DT4 + DT5 + DT6 + DT7 (after reading the DavisThin data set into the R Commander, of course, as described in Section 4.2.4).
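In R terms, computing a new variable amounts to evaluating an expression using the variables in a data frame and storing the result as a new column. Here is a minimal sketch with an invented data frame; the name Toy, the item names, and the values are hypothetical and are not the Prestige or DavisThin data, and the command generated by the dialog may be written differently.

```r
# Invented data: three cases, an income variable, and three scale items
Toy <- data.frame(income = c(1000, 10000, 100000),
                  item1 = c(0, 1, 3),
                  item2 = c(2, 2, 3),
                  item3 = c(1, 0, 3))

# A transformation of a single existing variable
Toy$log.income <- with(Toy, log10(income))

# A scale formed by summing several existing variables
Toy$scale <- with(Toy, item1 + item2 + item3)

Toy
```

Each expression yields one value per row, which is the only real requirement for a new variable.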
Table 4.4 displays R arithmetic, comparison, and logical operators, along with some commonly employed arithmetic functions. Both relational and logical operators return logical values (TRUE or FALSE). These operators and functions may be used individually or in combination to formulate more or less complex expressions to create new variables; for example, to convert Celsius temperature to Fahrenheit, 32 + 9*celsius/5 (supposing, of course, that the active data set contains the variable celsius). The precedence of operators in complex expressions is the conventional one: For example, multiplication and division have higher precedence than addition and subtraction, so 1 + 2*6 is 13. Where operators, such as multiplication and division, have equal precedence, an expression is evaluated from left to right; for example, 2/4*5 is 2.5. Parentheses can be used, if desired, to alter the order of evaluation of an expression: For example (1 + 2)*6 is 18, and 2/(4*5) is 0.1. When in doubt, parenthesize! Also remember (from Section 3.5) that the double equals sign (==), not the ordinary equals sign (=), is used to test for equality.
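The arithmetic in the preceding paragraph is easy to check by typing the expressions into R. In this short sketch, the celsius values are invented for illustration.

```r
1 + 2*6       # 13: multiplication before addition
2/4*5         # 2.5: equal precedence, evaluated left to right
(1 + 2)*6     # 18: parentheses change the order of evaluation
2/(4*5)       # 0.1

celsius <- c(-40, 0, 100)          # invented temperatures
fahrenheit <- 32 + 9*celsius/5     # -40, 32, 212
fahrenheit

celsius == 0                       # == tests for equality: FALSE TRUE FALSE
```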
4.4.2.1 Complicated Expressions in Computing New Variables*
The Compute New Variable dialog is more powerful than it appears at first sight, because any R expression may be specified as long as it produces a variable with the same number of values as there are rows in the current data set. Suppose, for example, that the active data set is the Prestige data set, which includes the numeric variable education, in years. The following expression uses the factor and ifelse functions to recode education into a factor (as an alternative to the Recode Variables dialog):
factor(ifelse(education > 12, "post-secondary", "less than post-secondary"))
Here’s another example of the use of ifelse, which selects the larger of husband’s and wife’s income in an imagined data set of heterosexual married couples:
ifelse(hincome > wincome, hincome, wincome)
TABLE 4.4: R operators and common functions useful for formulating expressions to compute new variables.
Symbol | Explanation | Examples
Arithmetic Operators (return numbers)
- | negation (unary minus) | -loss
+ | addition | husband.income + wife.income
- | subtraction | profit - loss
* | multiplication | hours.worked*wage.rate
/ | division | population/area
^ | exponentiation | age^2
Relational Operators (return TRUE or FALSE)
< | less than | age < 21
<= | less than or equal to | age <= 20
== | equal to | age == 21; gender == "male"
>= | greater than or equal to | age >= 21
> | greater than | age > 20
!= | not equal to | age != 21; marital.status != "married"
Logical Operators (return TRUE or FALSE)
& | and | age > 20 & gender == "male"
| | or (inclusive) | age < 21 | age > 65
! | not (unary) | !(age < 21 | age > 65)
Common Arithmetic Functions (return numbers)
log | natural log | log(income)
log10 | log base 10 | log10(income)
log2 | log base 2 | log2(income)
sqrt | square root | sqrt(area) (equivalent to area^0.5)
exp | exponential function, e^x | exp(rate)
round | rounding | round(income) (to the nearest integer); round(income, 2) (to two decimal places)
If hincome exceeds wincome, then the corresponding value of hincome is used; otherwise, the corresponding value of wincome is returned.
The general form of the ifelse command is ifelse(logical-expression, value-if-true, value-if-false), where
• logical-expression is a logical expression that evaluates to TRUE or FALSE for each case in the data set. Thus, in the first example above, education > 12 evaluates to TRUE for those with more than 12 years of education and to FALSE for those with 12 or fewer years of education. In the second example, hincome > wincome evaluates to TRUE for couples for whom the husband’s income exceeds the wife’s income, and FALSE otherwise.
• value-if-true gives the value(s) to be assigned to cases for which logical-expression is TRUE. This may be a single value, as in the first example (the character string “post-secondary”), or a vector of values, one for each case, as in the second example (the vector of husbands’ incomes for the couples); if value-if-true is a vector, then the corresponding entry of the vector is used where logical-expression is TRUE.
• value-if-false gives the value(s) to be assigned to cases for which logical-expression is FALSE; it too may be a single value (e.g., “less than post-secondary”), or a vector of values (e.g., wincome).
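The two uses of ifelse described above can be tried outside the dialog. What follows is a minimal sketch on a made-up three-row data frame (couples and its values are hypothetical, chosen only for illustration). In the Compute New Variable dialog you would type just the expression; at the R command line the variables must be qualified with the data set name.

```r
# Hypothetical toy data; the variable names mirror the examples above.
couples <- data.frame(
  hincome   = c(40, 25, 60),
  wincome   = c(35, 50, 60),
  education = c(10, 12, 16)
)

# A single value for each branch: one label per case, converted to a factor.
couples$educ.level <- factor(ifelse(couples$education > 12,
                                    "post-secondary",
                                    "less than post-secondary"))

# A vector for each branch: the larger of the two incomes, case by case.
couples$max.income <- ifelse(couples$hincome > couples$wincome,
                             couples$hincome, couples$wincome)

# pmax() computes the same elementwise maximum more directly.
couples$max.income.2 <- pmax(couples$hincome, couples$wincome)

print(couples)
```

Note that a case with exactly 12 years of education falls in the "less than post-secondary" category, because the test is education > 12.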
Most of the remaining items in the Data > Manage variables in active data set menu (see Figure A.3) are reasonably straightforward:
• Add observation numbers to data set creates a new numeric variable named ObsNumber, with values 1, 2,…, n, where n is the number of rows in the active data set.
• Standardize variables transforms one or more numeric variables to mean 0 and standard deviation 1.
• Reorder factor levels permits you to change the default alphabetic ordering of factor levels, and was illustrated in : See in particular ().
• It sometimes occurs—for example, after subsetting a data set (an operation described in )—that not all levels of a factor actually appear in the data. Drop unused factor levels removes empty levels, which occasionally cause problems in analyzing the data.
• Rename variables and Delete variables from data set do what they say.
• I discuss Define contrasts for a factor in on statistical models in the R Commander (see ).
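For readers curious about what these menu items amount to in R, here is a rough base-R sketch on a small made-up data frame. The data frame d, its values, and the name Z.income are assumptions for illustration, and the commands that the R Commander itself generates may differ.

```r
# Hypothetical toy data set with one numeric variable and one factor.
d <- data.frame(
  income = c(20, 30, 40, 50, 60),
  type   = factor(c("bc", "wc", "prof", "bc", "wc"))
)

# Observation numbers 1, 2, ..., n.
d$ObsNumber <- seq_len(nrow(d))

# Standardize to mean 0 and standard deviation 1;
# as.numeric() drops the matrix attributes that scale() returns.
d$Z.income <- as.numeric(scale(d$income))

# Subsetting leaves "prof" behind as an empty level; droplevels() removes it.
d.sub <- d[d$type != "prof", ]
levels(d.sub$type)                     # still includes "prof"
d.sub$type <- droplevels(d.sub$type)
levels(d.sub$type)                     # only "bc" and "wc" remain

# Reorder the levels, replacing the default alphabetic order.
d$type <- factor(d$type, levels = c("bc", "wc", "prof"))
```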
The two remaining items in the Manage variables in active data set menu convert numeric variables to factors:
• Bin numeric variable allows you to categorize a possibly continuous numeric variable into class intervals, called bins. The resulting Bin a Numeric Variable dialog is shown in Figure 4.10, where I select income as the variable to bin, name the factor to be created income.level (the default name is variable), select 4 bins (the default is 3), opt for Equal-count bins (the default is Equal-width bins), and select Numbers for the level names (the default is to specify the level names in a sub-dialog). Clicking OK adds the factor income.level to the data set, where level “1” represents the (rough) fourth of cases with the lowest income, “2” the next fourth, and so on.
FIGURE 4.10: The Bin a Numeric Variable dialog, creating the factor income.level from the numeric variable income.
• Some data sets use numeric codes, typically consecutive integers (e.g., 1, 2, 3, etc.) to represent the values of categorical variables. Such variables will be treated as numeric when the data are read into the R Commander. Convert numeric variables to factors allows you to change these variables into factors, either using the numeric codes as level names (“1”, “2”, “3”, etc.) or supplying level names directly (e.g., “strongly disagree”, “disagree”, “neutral”, etc.).
I’ll illustrate with the UScereal data set in the MASS package; this data set contains information about 65 brands of breakfast cereal marketed in the U.S. To access the data set, I first load the MASS package by the menu selection Tools > Load package(s), selecting the MASS package in the resulting dialog (Figure 4.11). I then input the UScereal data and make them the active data set via Data > Data in packages > Read data set from an attached package (as described in ).
Figure 4.12 demonstrates the conversion of the numeric variable shelf (the supermarket shelf on which the cereal is displayed), originally coded 1, 2, or 3, to a factor with corresponding levels “low”, “middle”, and “high”. Clicking OK in the main dialog (at the left of Figure 4.12) brings up the sub-dialog (at the right), into which I type the level names corresponding to the original numbers. Having been converted into a factor, the variable shelf can now be used, for example, in a contingency table in the R Commander (as described in ), and will be treated appropriately as a categorical variable if it appears as a predictor in a regression model (see and ).
FIGURE 4.11: The Load Packages dialog, selecting the MASS package.
FIGURE 4.12: The Convert Numeric Variables to Factors dialog (left) and Level names sub-dialog (right), converting shelf in the UScereal data set to a factor.
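Both conversions can be approximated with base-R functions. The sketch below uses a made-up data frame (d and its values are hypothetical, and the name income.width is mine); it is meant to show the idea behind equal-count and equal-width bins and behind named factor levels, not to reproduce the exact commands that the dialogs generate.

```r
# Hypothetical toy data: eight incomes and a shelf code of 1, 2, or 3.
d <- data.frame(
  income = c(12, 18, 25, 31, 40, 55, 70, 95),
  shelf  = c(1, 3, 2, 1, 3, 3, 2, 1)
)

# Equal-count bins: cut at the quartiles so that each bin holds about a
# fourth of the cases; labels = FALSE returns the bin numbers 1 to 4.
breaks <- quantile(d$income, probs = seq(0, 1, by = 0.25))
d$income.level <- factor(cut(d$income, breaks = breaks,
                             include.lowest = TRUE, labels = FALSE))

# Equal-width bins: let cut() divide the range of income into 4 intervals.
d$income.width <- factor(cut(d$income, breaks = 4, labels = FALSE))

# Numeric codes to a factor with named levels.
d$shelf <- factor(d$shelf, levels = c(1, 2, 3),
                  labels = c("low", "middle", "high"))

table(d$income.level)
table(d$shelf)
```

With tied values at the quantiles the break points can coincide, in which case cut() reports an error; equal-count binning is therefore only approximate for variables with many ties.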
In contrast to the operations on variables discussed in the preceding section, selections in the Data > Active data set menu (see Figure A.3) act on data sets as a whole or on rows of data sets. Some of the items in the Active data set menu are entirely straightforward, and I’ll simply explain briefly what they do:
• Select active data set allows you to choose from among the data frames in your workspace if there is more than one; selecting this menu item is equivalent to pressing the Data set button in the R Commander toolbar.
• Refresh active data set resets the information that the R Commander maintains about the active data set, such as the variable names in the data set, which variables are numeric and which are factors, and so on. This information is used, for example, in variable list boxes and to determine which menu items are active. You may need to refresh the active data set if you make a change to the data set outside of the R Commander menus—for example, if you type in and execute an R command that adds a variable to the data set. In contrast, when changes to the active data set are made via the R Commander GUI, the data set is refreshed automatically.
• Help on active data set opens the documentation for the data set if it was read from an R package.
• Variables in active data set lists the names of the variables in the data set in the Output pane.
• Set case names opens a dialog to set the row (case) names of the active data set to the values of a variable in the data set. This operation can be useful if the row names weren’t established when the data were read into the R Commander. The row names variable may be a factor, a character variable, or a numeric variable, but its values must be unique (i.e., no two rows can have the same name). Once row names are assigned, the row names variable is removed from the data set.
• The actions performed by Save active data set and Export active data set were described in Section 4.3.
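Setting case names is simple enough to sketch in a few lines of R. The data frame d and its values below are made up for illustration, and this is one plausible way of doing the job rather than the command that the dialog necessarily produces.

```r
# Hypothetical data read without row names; 'occupation' identifies each case.
d <- data.frame(
  occupation = c("accountant", "pilot", "architect"),
  income     = c(62, 72, 75),
  education  = c(86, 76, 92)
)

# Row names must be unique, so check before assigning them.
stopifnot(!anyDuplicated(d$occupation))

rownames(d) <- as.character(d$occupation)
d$occupation <- NULL      # the naming variable is then dropped

d["pilot", ]              # rows can now be addressed by name
```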
Two of the items in the Active data set menu have specialized functions, and so I’ll describe them briefly as well:
• Aggregate variables in active data set summarizes the values of one or more variables according to the levels of a factor, producing a new data set with one case for each level. Aggregation proceeds by applying some function—the mean, the sum, or another function that returns a single value—to the values of the variable for cases in each level of the factor. For example, starting with a data set in which the cases represent individual Canadians and that contains a factor for their province of residence, along with other variables such as years of education and dollars of annual income, you can produce a new data set in which the cases represent provinces and the variables include mean education and mean income in each province.
FIGURE 4.13: The Subset Data Set dialog.
• Stack variables in active data set creates a data set in which two or more variables are “stacked,” one on top of the other, to produce a single variable. If there are n cases in the active data set, and if k variables are to be stacked, the new data set will contain one variable and n × k cases, along with a factor whose levels are the names of the stacked variables in the original data set. This transformation of the data set is occasionally useful in drawing graphs.
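Aggregation and stacking both have close base-R counterparts, aggregate() and stack(). The following sketch uses a made-up data set of five individuals (people and its values are hypothetical) to show what the resulting data sets look like.

```r
# Hypothetical individual-level data; province is a factor.
people <- data.frame(
  province  = factor(c("ON", "ON", "QC", "QC", "BC")),
  education = c(12, 16, 10, 14, 13),
  income    = c(40, 60, 30, 50, 45)
)

# One row per province, holding the mean of each variable.
by.province <- aggregate(cbind(education, income) ~ province,
                         data = people, FUN = mean)
by.province

# Stack two variables into one: n = 5 cases and k = 2 variables give
# n * k = 10 rows, with the factor 'ind' recording the source variable.
stacked <- stack(people[, c("education", "income")])
dim(stacked)
levels(stacked$ind)
```

Replacing FUN = mean with FUN = sum, or with any other function that returns a single value, changes the summary that is computed for each level of the factor.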
Three items in the Active data set menu create subsets of cases, most directly Subset active data set, which brings up the dialog box shown in Figure 4.13, where the active data set is the Canadian occupational prestige data (Prestige) from the car package. I complete the dialog so that the subsetted data will include all variables in the original data set, which is the default. I change the Subset expression from the default <all cases> to the logical expression type == "prof" to select professional, technical, and managerial occupations. More generally, the subset expression should return a logical value for each case (TRUE or FALSE; see the discussion of R expressions in ). I also change the default name of the new data set <same as active data set> to Prestige.prof.
The Subset Data Set dialog can also be used to create a subset of variables in the active data set: Just uncheck the Include all variables box, use the Variables list box to select the variables to be retained, and leave the Subset expression at the default <all cases>.
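The two kinds of subsetting, by cases and by variables, correspond to the subset and select arguments of R's subset() function. Here is a minimal sketch on a made-up stand-in for an occupational data set (occ and its values are hypothetical).

```r
# Hypothetical stand-in for an occupational data set with a 'type' factor.
occ <- data.frame(
  type      = factor(c("prof", "bc", "wc", "prof", NA)),
  education = c(15, 8, 11, 14, 9),
  prestige  = c(70, 30, 45, 65, 25),
  row.names = c("chemist", "baker", "clerk", "lawyer", "farmer")
)

# Subset of cases: the logical expression is evaluated once per row.
occ.prof <- subset(occ, subset = type == "prof")

# Subset of variables: keep every case but only some columns.
occ.small <- subset(occ, select = c(education, prestige))

rownames(occ.prof)
names(occ.small)
```

A case for which the subset expression evaluates to NA (here, the occupation with missing type) is excluded, just like a case for which it is FALSE.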
Remove row(s) from the active data set in the Data > Active data set menu leads to the dialog box in Figure 4.14, where the active data set is the Duncan occupational prestige data set from the car package. I type in the case names "minister" "conductor" and replace the default name for the new data set, <same as active data set>, with Duncan.1. Clicking OK deletes these two cases from the Duncan data, so Duncan.1 contains 43 of the original 45 cases. Cases can be deleted by number as well as by name. For example, because “minister” and “conductor” are the 6th and 16th cases in the original Duncan data, I could have specified the cases to be deleted as 6 16.
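Deleting rows by name and by number are equivalent operations, as the following sketch suggests. It uses a made-up four-row data frame (d and its values are hypothetical), so the row numbers differ from those in the Duncan example.

```r
# Hypothetical data set with occupations as row names.
d <- data.frame(
  income    = c(62, 21, 76, 80),
  prestige  = c(82, 87, 38, 90),
  row.names = c("accountant", "minister", "conductor", "dentist")
)

# Delete by name: keep the rows whose names are not listed.
d.1 <- d[!(rownames(d) %in% c("minister", "conductor")), ]

# Delete by number: negative indices drop rows 2 and 3.
d.2 <- d[-c(2, 3), ]

rownames(d.1)
rownames(d.2)
```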
FIGURE 4.14: The Remove Rows from Active Data Set dialog.
FIGURE 4.15: The Remove Missing Data dialog.
Selecting Data > Active data set > Remove cases with missing data produces the dialog shown in Figure 4.15, again for the Prestige data set from the car package. In completing the dialog, I leave the default Include all variables checked and retain the default name for the new data set, <same as active data set>. On clicking OK, the R Commander warns me that I’m replacing the existing Prestige data set, asking for confirmation. The new data set contains 98 of the 102 rows in the original Prestige data frame, eliminating the four occupations with missing type; none of the other variables in the original data set has missing values.
You may wish to remove missing data in this manner to analyze a consistent subset of complete cases. For example, suppose that you fit several regression models to the full Prestige data, and that some of the models include the variable type and others do not. The models that include type would be fit to 98 cases, and the others to 102 cases, making it inappropriate to compare the models (e.g., by likelihood ratio tests—see ).
Two caveats: (1) Filtering missing data carelessly can needlessly eliminate cases. For example, if there’s a variable in the data set that you do not intend to use in your data analysis, then it’s not sensible to remove cases with missing data on this variable. You should only filter out cases with missing data on variables that you plan to use. (2) There are better general approaches to handling missing data than analyzing complete cases (see, e.g., Fox, 2016, Chapter 20), but they are beyond the scope of this book, and—in the absence of a suitable R Commander plug-in package—beyond the current scope of the R Commander.
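The first caveat is easy to demonstrate. In the sketch below, which uses a made-up data frame (d, notes, and the values are hypothetical), filtering on all variables discards a case that would have been perfectly usable in an analysis of prestige and type.

```r
# Hypothetical data: 'type' is missing for one case, 'notes' for another.
d <- data.frame(
  prestige = c(70, 30, 45, 25),
  type     = factor(c("prof", "bc", NA, "bc")),
  notes    = c("a", NA, "c", "d")
)

# Dropping every incomplete row loses two cases ...
all.complete <- na.omit(d)

# ... but if 'notes' plays no part in the analysis, filter only on the
# variables that will be used, which loses just one.
used <- c("prestige", "type")
used.complete <- d[complete.cases(d[, used]), ]

nrow(all.complete)
nrow(used.complete)
```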
4.5.3 Merging Data Sets*
The R Commander allows you to combine data from two data frames residing in the R workspace. Both simple column (variable) merges and simple row (case) merges are supported. I’ll begin with the former.
To illustrate a column merge, I’ve divided the variables in the Canadian occupational prestige data into two plain-text, white-space-delimited files. The first file, Prestige-1.txt,[18] includes data on the numeric variables education, income, women, prestige, and census (see Table 4.2 on page 61 for definitions of the variables in the Canadian occupational prestige data). The first line of the data file contains variable names, and the first field in each subsequent line contains the case (occupation) name; there are thus 103 lines in this file, for the 102 occupations. The second file, Prestige-2.txt, is also a white-space-delimited, plain-text file, with data on the single variable type of occupation. The first line of the file contains only the variable name type, while the 98 subsequent lines each contain the name of an occupation followed by its occupational type (i.e., prof, wc, or bc); the four occupations in the data set that are unclassified by occupational type do not appear in Prestige-2.txt.
So as to provide an uncluttered example, I start a new R and R Commander session, and proceed to read the Prestige-1.txt and Prestige-2.txt data files into the data frames Prestige1 and Prestige2 (as described in Section 4.2.2).[19] Choosing Data > Merge data sets from the R Commander menus produces the dialog in Figure 4.16. I select Prestige1 as the first data set and Prestige2 as the second data set, type Prestige as the name of the merged data set replacing the default name MergedDataset, press the Merge columns radio button, and leave the Merge only common rows or columns box unchecked. Clicking OK merges the data sets, matching cases by row name, and producing a data set with 102 rows. The four cases that are present in Prestige1 but absent from Prestige2 have missing values (NA) for type. Had I checked the Merge only common rows or columns box, the merged data set would have included only the 98 cases present in both original data sets.
As demonstrated in this example, to merge variables from two data frames, the R Commander uses the row names of the data frames as the merge key. You may have to do some preliminary data management work on the two data sets to ensure that their row names are consistent.
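A merge keyed on row names can be sketched with the base-R merge() function. The two small data frames below are made-up stand-ins (part1, part2, and their values are hypothetical), and the commands are an illustration of the idea rather than what the Merge Data Sets dialog necessarily generates.

```r
# Hypothetical stand-ins for two data frames that share row names;
# 'part2' lacks one of the cases in 'part1'.
part1 <- data.frame(income = c(50, 20, 35),
                    row.names = c("chemist", "farmer", "clerk"))
part2 <- data.frame(type = c("prof", "wc"),
                    row.names = c("chemist", "clerk"))

# all = TRUE keeps every case; unmatched cases get NA.
merged <- merge(part1, part2, by = "row.names", all = TRUE)

# merge() returns the key as a column called Row.names; restore it.
rownames(merged) <- as.character(merged$Row.names)
merged$Row.names <- NULL

# all = FALSE keeps only the cases present in both data frames.
common <- merge(part1, part2, by = "row.names", all = FALSE)

merged
nrow(common)
```

Setting all = TRUE plays the role of leaving the Merge only common rows or columns box unchecked, and all = FALSE the role of checking it.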
Row merges can also be performed via the Merge Data Sets dialog by leaving the default Merge rows radio button pressed. If, as is typically the case, there are common variables in the two data sets to be merged, then these variables should have the same names in both data sets. You can choose to merge only variables that are common to both data sets or to merge all variables in each. In the latter event, variables that are in only one of the data sets will be filled out with missing values for the cases originating in the other data set.
To illustrate a row merge, I divided Duncan’s occupational prestige data into three data files: Duncan-prof.txt, containing data for 18 professional, technical, and managerial occupations; Duncan-wc.txt, with data for 6 white-collar occupations; and Duncan-bc.txt with data for 21 blue-collar occupations. All three are plain-text, white-space-delimited files with variable names in the first line and case names in the first field of each subsequent line. All three files contain data for the variables type, income, education, and prestige (see Table 4.1 on page 60).
FIGURE 4.16: Using the Merge Data Sets dialog to combine variables from two data sets with some common cases.
To merge the three parts of the Duncan data set, I begin by reading the partial data sets into the R Commander in the now-familiar manner, creating the data frames Duncan.bc, Duncan.wc, and Duncan.prof.[20] Next, I select Data > Merge data sets from the R Commander menus, and pick two of the three parts of the Duncan data, Duncan.bc and Duncan.wc, as illustrated at the left of Figure 4.17, creating the data frame Duncan. Finally, I perform another row merge, of Duncan and Duncan.prof, as shown at the right of Figure 4.17. Because I specify an existing data-set name (Duncan) for the merged data set, the R Commander asks me to confirm the operation. The result of the second merge is the complete Duncan data set with 45 cases and four variables, which is now the active data set in the R Commander.
FIGURE 4.17: Using the Merge Data Sets dialog twice to merge three data sets by rows: Merging Duncan.bc and Duncan.wc to create Duncan (left); and then merging Duncan with Duncan.prof to update Duncan (right).
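A two-step row merge of this kind can be imitated in base R with rbind(). The three small data frames below are made up (d.bc, d.wc, d.prof, and their values are hypothetical), and the second half of the sketch shows one simple way to handle a variable that appears in only one of the parts.

```r
# Hypothetical stand-ins for three partial data sets with the same variables.
d.bc   <- data.frame(type = "bc",   income = c(21, 14),
                     row.names = c("carpenter", "janitor"))
d.wc   <- data.frame(type = "wc",   income = 29,
                     row.names = "bookkeeper")
d.prof <- data.frame(type = "prof", income = c(62, 72),
                     row.names = c("accountant", "pilot"))

# With identical variable names, rbind() appends the cases;
# the merge is done in two steps, as in the dialog.
d.all <- rbind(d.bc, d.wc)
d.all <- rbind(d.all, d.prof)

# If one part has an extra variable, fill it with NA in the other
# part before binding so that the columns match.
d.wc$education <- 84
d.bc$education <- NA_real_
d.both <- rbind(d.bc, d.wc)

dim(d.all)
d.both
```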
[1] This is a newer edition of the basic statistics text that I used as the original “target” for the R Commander, in that I aimed to cover all of the methods described in the text. As I mentioned in Chapter 1, however, the R Commander has since broadened its scope.
[2] The R Commander session in each chapter of this book is independent of the sessions in preceding chapters; if you’re following along with the book on your computer, restart R and the R Commander for each chapter.
[3] See Section 5.4.1 for more information about making scatterplots in the R Commander, and Section 7.1 on least-squares regression.
[4] The Read Text Data dialog allows you to specify an input missing data indicator different from NA (which is the default), but it will not accommodate different missing data codes for different variables or multiple missing data codes for an individual variable. In these cases, you could redefine other codes as missing data after reading the data, using Data > Manage variables in active data set > Recode variables (see Sections 3.4 and 4.4.1), or, as suggested, edit the data file prior to reading it into the R Commander to change all missing data codes to NA or another common value.
[5] There are many plain-text editors available. Windows systems come with the Notepad editor, and Mac OS X with TextEdit. If you enter data in TextEdit on Mac OS X, be sure to convert the data file to plain text, via Format > Make Plain Text, prior to saving it. The RStudio programming editor for R (discussed in Section 1.4) can also be used to edit plain-text data files.
[6] Both CSV files and data files that employ other field delimiters are plain-text files. Conventionally, the file type (or extension) .csv is used for comma-separated data files, while the file type .txt is used for other plain-text data files.
[7] Duncan.txt and other files used in this chapter may be downloaded from the web site for the text, as described in Section 1.5.
[8] Duncan’s occupational prestige regression is partly of interest because it represents a relatively early use of least-squares regression in sociology, and because Duncan’s methodology is still employed for constructing socioeconomic-status scales for occupations. For further discussion of this regression, see Fox (2016, especially Chapter 11).
[9] Four of the occupations (athletes, newsboys, babysitters, and farmers) have missing occupational type, and the corresponding cells of the spreadsheet are empty. If, for example, NA were used to represent missing data in the spreadsheet, I’d type NA as the Missing data indicator in the dialog box.
[10] In subsequent chapters, I’ll frequently draw data sets from the car package for examples.
[11] Thus, if you want to retain NAs where they currently appear and still want to use else, then you can specify the directive NA = NA prior to else.
[12] Remember the rules for naming variables in R: Names may only contain lower- and upper-case letters, numerals, periods, and underscores, and must begin with a letter or period.
[13] Don’t be concerned if you’re unfamiliar with logarithms (logs); I’m using them here simply to illustrate a data transformation. Logs are often employed in data analysis to make the distribution of a strictly positive, positively skewed variable, such as income, more symmetric.
[14] These data were generously made available by Caroline Davis of York University in Toronto, who studies eating disorders.
[15] I’m grateful to an anonymous referee for suggesting this example.
[16] As I’ve explained, you can make Prestige the active data set by pressing the Data set button in the R Commander toolbar, selecting Prestige from the list of data sets currently in memory.
[17] Recall that the double equals sign == is used in R to test equality.
[18] Recall that all data files used in this book are available on the web site for the book; see Section 1.5.
[19] Notice that I used the data set names Prestige1 and Prestige2 even though the corresponding data files have names containing hyphens (Prestige-1.txt and Prestige-2.txt): Hyphens aren’t legal in R data set names.
[20] Again, the data sets must have legal R names that can’t, for example, contain hyphens.