STATISTICS IN SYSTAT 5.0




WHAT IS SYSTAT? SYSTAT 5.0 is a DOS-based statistical package. It works in the plain DOS environment (not the colorful Windows environment) and uses COMMANDS to carry out its operations. We will use the commands. But SYSTAT also has a menu, which you can access with the command "SYSTAT" from the C:\SYSTDOS folder (we will check out this menu later).

Note: SYSTAT 6.0 and up are Windows-based versions of the package. However, they can also work with command prompts in the main screen. The commands are, in most cases, the same as in the DOS version. I will point out changes in command syntax where I have come across differences.

PRELIMINARIES. Before turning on the computer we need to think about what we are going to do with the data that needs analysis. Our first task is an initial exploration of the batches. We will start exploring the recorded data with stem-and-leaf plots, which are a first approach to establishing patterns or modes in a sample of artifacts. First, however, we need to enter the data in SYSTAT so that stem-and-leaf plots can be produced.

1. ENTERING DATA IN SYSTAT. The first module we will use is the EDIT module. EDIT creates and edits SYSTAT files in the Data Editor Worksheet (a worksheet is a table composed of columns and rows). The first thing you will do is name the columns with codes for your variables (the first row of the worksheet, which starts with "Case", is reserved for variable names). Numerical and alphabetical data get labeled differently: codes for numerical data can be any combination of letters; codes for alphabetical data (where your data is recorded with letters) need a "$" suffix. Say we have data on length and width of lithic tools (numerical information, perhaps with two decimals if you used a very precise caliper); in addition, we have data on the stone used for the artifacts (this can be alphabetical data, C, F, and O, or numerical data, 1, 2, and 3, for Chert, Flint, and Obsidian, respectively). The variables row for this dataset could be: L, W, M$; or simply L, W, M. With the first set we would enter C, F, or O in the M$ column rather than numbers.
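
For illustration, the first few rows of such a worksheet might look like this (the measurements and materials are invented values):

Case    L       W      M$
1       11.25   4.10   C
2        9.80   3.75   O
3       12.40   4.55   F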

Important distinction: the variable M$ is a categorical or grouping variable (C, F, and O are the categories). In contrast, L and W are continuous variables (they are measurements of things). When we analyze the data it is important to distinguish L and W for each of the categories (an analysis of all the data together is only a first step).

Then you can start entering data. Familiarize yourself with the keys to move around the Data Editor Worksheet (arrows, PageUp, PageDown). When all your data is entered (and even before that) you should save the data in a SYSTAT file. Hit "ESC" and the cursor will move to the ">" prompt at the bottom of the worksheet. Write "SAVE filename" (no extension!) and hit "ENTER". You have just saved the data, which is now in "filename.sys" (all SYSTAT data files have the .sys extension). Now you can proceed to produce more detailed statistics on the data. Exit the EDIT module by typing "QUIT".
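
A minimal sketch of the whole sequence (the file name LITHICS is only an example):

EDIT [opens an empty Data Editor Worksheet; type the variable names L, W, M$ and then the data]
SAVE LITHICS [at the > prompt, after hitting ESC; creates LITHICS.SYS]
QUIT [leaves the EDIT module]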

MISTAKES? If something went wrong while entering the data, changes can be made by simply reopening your dataset: write "USE filename" at the prompt, and then "EDIT". (Or, if you are already in the worksheet, hit "ESC" and at the prompt write "USE filename".) It is very easy to move around the worksheet with the arrows to make the necessary corrections.

2. EXPLORING THE DATA IN SYSTAT

STATISTICS OF A SYSTAT DATA FILE: THE STATS MODULE. In the STATS module the command STATS L (or STATS <var1>,<var2>,<...>) produces basic statistics for the listed variables: the number of cases, minimum, maximum, mean, and standard deviation.

The command STATS L/all produces a complete list of all the available calculations. More importantly, the command STATS L/SE produces only the standard error, a number you will need for many procedures.
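
A minimal sketch of a STATS session (LITHICS is the example file from the data-entry section):

STATS [runs the STATS module]
USE LITHICS [reads the data file and lists its variables]
STATS L W [N, minimum, maximum, mean and SD for length and width]
STATS L/SE [standard error of the mean for L only]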

If you have a batch of numbers grouped by materials, you will want the statistics for each group. You will then use the BY command before STATS. But before using the BY command (and for other more complex operations) you will need to sort on the grouping variable, so that its cases appear together in ascending or descending alphanumeric order. Use the SORT command in the DATA module:

DATA [Type DATA at the > prompt; this runs the DATA module]
USE datafile [reads a data file]
SORT MAT$ [grouping variable in this example is material or MAT$; specify /A or /D
(ascending or descending) for each variable being sorted; sorting of a var$ will be alphabetic.]
RUN [initiates the sort]
EDIT [to see the sorted file: it will be a temporary file, like AABBFFCC]
SAVE datafile [save it to the original file; or give the sorted file a new name]
STATS [go to the STATS module to start calculations by groups]
USE datafile [reads a data file; and lists the variables]
BY MAT$ [indicates that you want data by MAT$ groups]
STATS L [will show numbers for each of the groups]

At this point you should copy the information onto your data sheet so you can use the data later.

Some definitions:

MEAN. The arithmetic mean of a variable is its "average." The sum of the values is divided by the number of (nonmissing) values.
MEDIAN. The median is a description of the center of a distribution. If the data were sorted in increasing order, the median is the value above which half the values fall.
SD. Standard deviation, a measure of spread, is the square root of the sum of the squared deviations (of the values from the mean) divided by (n-1).
SEM. The standard error of the mean is the standard deviation divided by the square root of the sample size. It is the estimation error, or the average deviation of sample means from the expected value of a variable.
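
A small worked example with an invented batch of five lengths (10, 12, 14, 16, 18 mm):

MEAN = (10+12+14+16+18)/5 = 70/5 = 14
MEDIAN = 14 [the middle value once the batch is sorted]
SD = square root of ((16+4+0+4+16)/(5-1)) = square root of 10 = 3.16
SEM = 3.16 / square root of 5 = 1.41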



THE GRAPH MODULE

The command STEM creates a stem-and-leaf plot for one or more variables in a SYSTAT file. The plot shows the distribution of a variable graphically. Stem-and-leaf plots also list the median (M in stem), minimum, lower-hinge (H in stem), upper-hinge (H in stem), and maximum values of the sample. Unlike histograms, stem-and-leaf plots show actual numeric values to the precision of the leaves. A stem-and-leaf is produced by the command:

"STEM <var1>,<var2>,<...>

(By default STEM will use its own stem scale; add "/ LINES=<#>" to set the number of lines of the scale; SYSTAT will not use the exact number you give but an appropriate one close to it.)

As with STATS, the BY command allows you to produce stems for each group (do not forget to sort the file first).
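
For example, assuming the lithic data file (here called LITHICS) has already been sorted by M$:

GRAPH [runs the GRAPH module]
USE LITHICS [reads the data file]
BY M$ [one stem-and-leaf per material group]
STEM L / LINES=10 [stem-and-leaf of length, on a scale of roughly 10 lines]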

Other commands in the GRAPH module include BOX and HISTOGRAM:

The BOX command creates box plots. The length of each box shows the range within which the central 50% of the values fall: the midspread, with its borders at the lower and upper quartiles. The whiskers show the range of values that fall within 1.5 box-lengths of the box limits. Values between the inner fences (1.5 box-lengths) and the outer fences (3 box-lengths) are plotted with asterisks (*). Values outside the outer fences, the OUTLIERS, are plotted with empty circles (O).

The command HISTOGRAM creates a graph that shows the sample density of a continuous variable with vertical bars. The height of each bar shows the number of cases whose values fall in an interval of values of the variable:

HISTOGRAM <varlist> [draws histograms for each variable specified].
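
For example, with the lithic data file in use:

BOX L [box plot of length for the whole batch]
HISTOGRAM L W [one histogram for each variable listed]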


PRODUCING ASCII FILES WITH INFORMATION FROM SYSTAT. The information you see on your screen can be sent to a file that you can open in a word processor to edit and produce tables for a text. The procedure is the same in all SYSTAT modules (the GRAPH module is used here):

GRAPH [you enter the module]
USE filename [you call the file with the data; it will list the variables]
OUTPUT filename.txt [the file where the output will be written]
STEM var1 var2 <enter> [the command for the plot with the variables you need]

How to see this file: if you are in C:\SYSTDOS, type "cd .." <enter>; at C:\ type "edit c:\systdos\filename.txt" <enter>. You will then see the saved output in a basic text editor.

MANIPULATING A SYSTAT DATA FILE. The command IF .... THEN LET .... is used for conditional transformations and deletion of data. It allows you to conditionally transform variables: IF <condition> THEN LET <var> = <expression>. For all cases where the condition is true, SYSTAT executes the action. You can use any mathematically valid combination of variables, numbers, functions, and operators. The IF ... command can be run from the prompt in the EDIT-worksheet screen.

A simple form of this command lets you copy one variable onto a new column with a different name: "LET newvar=oldvar". This is very handy when you have to make changes to a variable: the copy serves as a backup of the data. Another example: say that all your measurements of length in variable L came out .15 mm short. You can remedy this with the LET command alone: first make the copy with "LET L2=L", then write "LET L2=L2+.15" (L2 is the new variable on which you make the changes, leaving L untouched). Commands using the full IF ... form would look like this:

"IF L=11.0 THEN LET L=L+.15" [to change only those values equal to 11.0];
"IF L>11.0 THEN LET L=L+.15" [to change only those GREATER than 11.0];
"IF CASE > 500 THEN LET x = x^2" [to change cases starting with #500]
"IF AGE > 80 THEN LET AGE$ = 'ELDERLY' "
"IF X = 99 THEN LET X = ."
"IF SEX$ = 'MALE' AND AGE > 30 THEN LET GROUP = 1"
"IF CASE=45 THEN DELETE" [will delete case # 45]

SAVE the file with these changes

The DROP command can eliminate variables from the worksheet: at the > prompt, write DROP var1.
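
Putting these together, a short correction session at the > prompt of the worksheet might look like this (the variable names follow the lithic example):

LET L2=L [L2 is a backup copy of L]
LET L2=L2+.15 [corrects the copy; the original L is untouched]
IF CASE=45 THEN DELETE [removes case 45]
SAVE datafile [saves the file with these changes]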


MORE STATISTICAL ANALYSIS

METHODS FOR COMPARING GROUPS:

T-TEST in the STATS module. The two-sample t test (or independent t test) is ideal for comparing the means of two groups of cases. For example, are the floor areas of structures at the Black and Smith sites significantly different? We will use the dataset in the Group 3 assignment sheet.

USE G3EX2

TTEST A * S$ [the probability p will tell you the significance of the difference in A between the two groups defined by S$]

I quote the following paragraphs from Drennan's 1996 Statistics for Archaeologists: A Common Sense Approach, Chapter 11:

When presenting the result of a significance test, it is always necessary to say just what significance test was used and to provide the resulting statistic and the associated probability. For the example in the text, we might say, "The 2.5 m2 difference in mean house floor area between the Black and Smith sites is very significant (t = -2.69, .01 > p > .005)." This one sentence really says everything that needs to be said. No further explanation would be necessary if we were writing for a professional audience whom we can assume to be familiar with basic statistical principles and practice. The "statistic" in this case is t, and providing its value makes it clear that significance was evaluated with a t test, which is quite a standard technique that does not need to be explained anew each time it is used. The probability that the observed difference between the two samples was just a consequence of the vagaries of sampling is the significance, or the associated probability. Ordinarily p stands for this probability, so in this case we have provided the information that the significance is less than 1%. This means the same thing as saying that our confidence in reporting a difference between the two periods is greater than 99%.

If, instead of performing a t test, we simply used the bullet graph to compare estimates of the mean and their error ranges, as in Figure 11.1, we might say "As Figure 11.1 shows, we can have greater than 99% confidence that mean house floor area changed between Formative and Classic periods." The notion of estimates and their error ranges for different confidence levels is also a very standard one which we do not need to explain every time we use it. Bullet graphs, however, are less common than, say, box-and-dot plots, so we cannot assume that everyone will automatically understand the specific confidence levels of the different widths of the error bars. A key indicating what the confidence levels are, as in Figure 11.1, is necessary.

In an instance like the example in the text, a bullet graph and a t test are alternative approaches. Using and presenting both in a report qualifies as statistical overkill. Pick the one approach that makes the simplest, clearest, most relevant statement of what needs to be said in the context in which you are writing; use it; and go on. Presentation of statistical results should support the argument you are making, not interrupt it. The simplest, most straightforward presentation that provides complete information is the best.

Let's prepare then a BULLET GRAPH with the MEAN and the Standard Error for each group (for the t-test example, mean floor area for each site; the file below illustrates the format with the lithic materials example). Create a new data file that would look like this (remember that the SE given by STATS is one standard error and represents 68% confidence). Save the file.

M$ Mean SE
Chert 17.4 3.4
Obsidian 12.3 2.1
Flint 23.7 5.7

If you want to graph the SE at more precise confidence levels you will need to create columns SE2 and SE3 for the 95% and 99% confidence levels, respectively. Use the LET command. The rule of thumb is to multiply the SE by 1.96 and 2.57, respectively. But be aware that this is good only for large samples. The t-table might tell you that for a confidence level of 99% and 15 df (degrees of freedom, which is n-1: for example, the number of obsidian cases minus 1) you should multiply the SE by 2.947.
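
A sketch of the transformation, at the > prompt of the worksheet with the bullet-graph file in use (large-sample multipliers):

LET SE2=SE*1.96 [95% error range; e.g., for Chert, 3.4 * 1.96 = 6.66]
LET SE3=SE*2.57 [99% error range; e.g., for Chert, 3.4 * 2.57 = 8.74]
SAVE filename [saves the new columns]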

After saving, type SYGRAPH at the command prompt. The file filename.sys should still be in use; otherwise enter "USE filename". To produce the graph type:

"CPLOT mean*m$ / error=se". If you want a graph SE95 or SE 99 put their code instead of SE.

 

Figure comparing graphs produced with STEM&LEAF, BOX-PLOT, and BULLET GRAPH techniques (From Drennan 1996: Figure 1.11).


ANALYSIS OF VARIANCE or ANOVA is also a test for comparing groups; it has the same purpose as the t test but it is adequate for more than two groups. One of the questions we can investigate with ANOVA is whether there is some preference for making projectile points of different sizes out of different raw materials (we follow the lithic example). The independent or grouping variable is M$ (material); the dependent variable is a size measurement, in this example WT (weight). First SORT the file by the grouping variable in the DATA module and SAVE that file (see the procedure under the SORT command). The following commands produce the analysis of variance:

STATS [runs the STATS module]
USE sortedfile [sorted filename]
BY groupingvariable [M$]
PRINT=LONG [produces more detailed output for the ANOVA analysis]
STATISTICS depvariable [WT; this produces N, Min, Max, Mean & SD for each group]

The output shows those statistics for each of the groups, followed by the analysis-of-variance results.

The key information here is P. A P of 0.01 means that there is 1 chance in 100 of randomly selecting three subsamples with the means and SDs that these have from three populations whose means are the same; there is only a minimal possibility that the observed differences are due to the vagaries of sampling. In this case we have high confidence that the lithic artifacts made of different stones really do have different mean weights.

REGRESSION ANALYSIS. Regression is a statistical procedure for determining the relationship between a dependent variable and corresponding values of an independent variable. This analysis is well suited to measuring the relationship between size (X, in cm) and weight (Y, in g) of lithic artifacts. We assume that this relationship will be positive, that is, the longer the artifact the heavier it is. A negative relationship can be expected from a premise such as: "the number of artifacts (X, a count) decreases as we walk away from the site (Y, distance in m)".

So the regression measures how much of Y is explained by X (the relationship between size and weight). This analysis is also widely used for PREDICTIVE ANALYSIS: after you analyze a sample of, say, 50 artifacts, you get an equation that allows you to predict, with a certain confidence level, the weight of an artifact from its size.

First, create the data file in EDIT with size (X) and weight (Y). Next, produce in GRAPH a scatter plot of this relationship: "PLOT Y*X". (Scatter plots need to show only a very rough shape, an oval distribution for example, for a regression analysis. If there is a clear tendency toward a curved pattern it is recommended to perform transformations to straighten that curve; we will not develop this requisite of the analysis here.) Then proceed with the regression analysis, using the MGLH ("Multivariate General Linear Hypothesis") module:

MGLH
USE datafile
MODEL depvar (Y)= CONSTANT + indepvar (X)
ESTIMATE [starts calculations and generates output]

The output includes, among other things, the regression coefficients (the constant and the slope of X), the squared multiple R, and the significance test (F and its probability).

This output has much more information than you really need here. The information we need for a regression analysis, however, is included. And we can draw the following conclusions:

1. The relationship between the independent variable (X) and the dependent variable (Y) is expressed mathematically by the regression equation (the ideal straight line): Y = 34.156 X + 8.223. This is the equation used to predict Y when we know X (a worked prediction is given after this list).

2. R squared (the strength of the relationship) is .484, so 48.4% of the variance of Y is "explained" by X (or, 51.6% of the variation of Y is "unexplained" by X).

3. The value of Pearson's r, .696, indicates the direction and strength of the relationship between X and Y: positive, and somewhat below a perfect positive relationship of r = 1.

4. Test of significance (on a sample of size 20): F = 16.093 and p = .001. The probability value indicates that there is only a 0.1% chance of getting a sample with this r-squared value (.484) from a universe where there is no relationship between X and Y. In other words, we can report a relationship between X and Y with greater than 99.9% confidence.
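
As a worked example of point 1 (the value of X is invented): for an artifact with X = 2.0, the regression equation predicts Y = 34.156(2.0) + 8.223 = 76.535.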

When you are finished you will exit SYSTAT with the command "QUIT".

