 |
 |
Wednesday, Nov 25
|
|
|
|
User Guide Tutorial #5-B
Intergenerational Correlations in Health Outcomes
Ngina Chiteji and Frank Stafford
December, 2003
I.
Introduction and Overview of the Panel Study of Income Dynamics
Many people are interested in
correlations in health outcomes across generations. We often joke about living till 100 if we
have an elderly relative who has lived a long time, or we speculate about
whether a child will inherit mommy's strong constitution or daddy's height. Such interest often stems from the fact that
ailments and physical attributes can be transmitted genetically, although this
is not the only reason that health outcomes may be similar across the
generations. Environmental factors also
affect health, and in many instances parents and children share the same
physical environment. Irrespective of
the cause, it is clear that many health outcomes exhibit some correlation
across the generations. The purpose of
this tutorial exercise is to demonstrate how one might look for such
correlations across three generations of the same family. The tutorial's focus is on general
health. It demonstrates how one might
conduct the type of analysis that is necessary to determine whether children who
are healthy tend to have mothers who are also generally healthy, and grandmothers
in a similar situation. It will draw on
data from the 1997 Child Development Supplement (CDS) for the youngest
generation, and data from the 1986 family files to obtain information about the
two older generations. Upon completion,
you should have a good idea of how the Panel Study of Income Dynamics (PSID)
can be used to make extended intergenerational comparisons.
As noted in the companion tutorial
(5-A), the PSID is a nationally representative, longitudinal dataset that
contains a wide variety of economic and socio-demographic data about
individuals and families. It has
followed families and their offspring, and the families set up by these
offspring, since 1968, and in 1997 and 2001 special supplements were devoted
specifically to gathering information about dependent children in PSID
families. Accordingly, the data can
support a number of possible intergenerational analyses.
II. The
Sample Exercise Examining Correlation in Health Outcomes
When analyzing the situation of the
youngest generation (the children) low birthweight will be used as a proxy for
poor health. Because low birthweight has
been linked to medical problems later in life, it can be taken as an early
guide to an individual's health situation (see FIFCFS, 2001, for example). For the adults, the PSID contains an actual
question about general health. Once the
exercise has been completed, you will be able to fill in the blank cells of the
Table 1, to determine whether the likelihood of a child's having a low
birthweight appears to be associated with having a mother or a grandmother
whose health is also poor.
Table 1. Intergenerational Health Patterns
|
|
All Children
|
Mother's
Health
|
Grandmother's
health
|
|
Birthweight
|
|
poor
|
not poor
|
poor
|
not poor
|
|
% low
|
|
|
|
|
|
|
% not low
|
|
|
|
|
|
III. Getting your data at the Data Center
To do our analysis we need to obtain
three different customized datasets at the Data Center. The information from these datasets will be
merged eventually, however when using Excel it is easiest to begin working with
separate datasets for each generation.
Accordingly, we start by obtaining a dataset that contains information
about the children (the third generation).
Then, we obtain a dataset that contains information about the health
status of these children's moms (the second generation). And, finally, we have the Data Center output
a dataset containing health data for grandmothers (generation one). To then use these three datasets you will
need to carry out pairwise matching (or linking) based on identifier variables.
What variables will you need for
your kids dataset?
In the dataset for generation three,
you need six variables. If you have ever
seen a birth announcement, you probably know that birthweight is reported as a
pound-ounce combination, as 10 lbs and 11 ounces for example (for a rather
large baby). Accordingly, one has to
factor in information from two variables from the Child Development
Supplement--Q1A8P (pounds) and Q1A8O (ounces)--in order to determine how much a
given child weighed at birth. Because
the CDS also contains a question about the child's general health at birth, we
choose this variable (Q1A11) too, to go along with our birthweight information,
for purposes of comparison.
Next we need some variables that
allow us to link each child to his/her mom.
When using CDS data one can establish this link by obtaining three
variables that contain information about the child's primary caregiver
(PCG). Two of these variables will be
used simultaneously. They are the 1968
family identification number of the PCG (PCG68ID) and the PCG's person number
(PCGPN). In the PSID, an individual's
1968 family id number and person number are variables which are critical
because they are the variables that are used to uniquely identify each individual
ever included in the survey. The crucial
point to remember, when contemplating use of the 1968 family id and the 1968
person number, is that one can use these two variables to locate an individual
throughout the different years of the PSID (and its supplements), and to
determine the original sample family from which an individual has descended. (A refresher sample of post-1968 immigrants,
or their adult children, was added to the PSID in 1997. They too have identifier variables.) In the CDS one is able to obtain these
identification variables for each child, and for the child's primary
caregiver. (One also can obtain them for
each child's secondary caregiver--the OCG--but we will not be working with the
OCG data). This facilitates the process
of matching information about moms with information about children. Each PCG's unique identifier can be used to
locate information about her from different years in the PSID, and this can
then be matched to the information reported about the child directly in the
CDS. Finally, we will use the variable
recording the PCG's relationship to the child (the variable "PCG") to
identify the cases in which the mom is the primary caregiver of the child. This allows us to be certain that when we use
the PCG68ID and PCGPN to match information about adults' health status (taken
from the PSID's individual files) to the records of children (taken from the
CDS), we are doing such in a way that attaches information about the mother
(rather than the father or some other relative) to the information about the
birthweight of the child. (The PCG is
not the mom by construction in the CDS, so there are some instances in which a
child's primary caregiver is not the mom.
In this tutorial we consider the cases where the PCG is the mom,
including an adoptive mother, for ease of exposition. As noted in Tutorial 4, we do not lose too
many cases by imposing this requirement.)
What variables will you need for
the moms dataset?
The information about mothers'
health will be obtained from a question about general health conditions that
was asked for different members of the family in the 1986 wave of the
PSID. These are questions H15, H37, and
H69e. While additional data on
individuals' health status have been collected in more recent years, in 1997
for example, the tutorial focuses on the 1986 data. This will allow it to show how data from
different years can be combined for analysis purposes.
In the dataset for generation two,
you need a total of 6 variables. These
variables are (1) ER32000 (sex of the individual), (2) ER30001 (1968 family
identification number for the individual at hand), (3) ER30002 (1968 person
number for this individual) (4) ER30527 (H1/37/69 Health Status in '86), (5)
ER32009 (the 1968 family identification number of the mother of the individual
at hand), and (6) ER32010 (the person number of the individual's mother).
Why do you want these particular
variables? The first variable will be
used in a subsetting command to create our customized dataset for moms. The information about general health status
is reported at the individual-level, and is contained in files containing
information about all individuals in the PSID in 1986. Because our objective is to obtain health
information for moms only, we know that we are only interested in obtaining
this individual-level data for women.
Accordingly, it behooves us instruct the Data Center to restrict our
output to cases in which the individual in question is female. These will correspond to cases in which the
ER32000 variable takes on a code value of two.
(You can confirm these code values by examining the documentation once
you have selected your variables, as explained below.) The second and third variables will be used
to construct the PSID's unique identifier for each woman whose record we
obtain. These will then be matched to
the identifiers for the PCGs that we have in our kids dataset when we put our
three datasets together. We want the
fifth and sixth variables because they are variables that can be used to
construct an identification variable that uniquely identifies the mothers of
the individuals for whom we are gathering information. This is critical because our three-generation
analysis requires us to match information about kids, moms, and grandmothers,
however we cannot link the grandmothers of the CDS kids together with those
kids in a direct fashion because there is no variable designated as
"grandmother's" 1968 family ID or person number. Instead, we have to link grandmothers
(generation one) and grandchildren (generation three) indirectly--by using the
children's moms as the bridge. As noted
above, the CDS provides the 1968 family ID and person numbers for each child's
PCG, who is the mom in most instances.
In the step that we are discussing here we will be ensuring that we draw
the 1968 identification variables for these moms' mothers, so that when we put
our moms together with their CDS children, we can also use the identifiers
found for moms' mothers to simultaneously attach the information about each
child's grandmother to the child. So,
just to make sure we are all thinking about our strategy in the same way and
that we understand how the 4 identification variables that we are including in
the dataset for generation two will be used, let's review a bit: We will use the moms' identification
variables to link moms with the CDS kids, and then we will use moms' mothers
identification variables to add in information about the grandmothers.
Finally, remember that the entire
point of this tutorial exercise is to compare the health of the members of the
different generations. ER30527 is the
general health status variable that we will use to characterize moms'
health. It contains the responses to a
query asking whether the health of the individual in question is in excellent,
very good, good, fair or poor.
One last point to note: Note that for the type of exercise outlined
in this tutorial we do not need to obtain any information from the PSID yearly
family files, and that this allows us to forego worrying about the
"relationship to head" variables that some users may be familiar with
from other PSID tutorials. In some
instances data about individuals is collected for the heads and wives of PSID
families only, and recorded within the family file (rather than in the
individual files). If one ever wants to
work with any of this type of information ---race, age, or an individual's labor income, for example---one would
have to restrict one's analysis to individuals who are heads (or wives), and
one would use the relationship to head variable for the year in question to do
such. While you might have expected us
to have to take this route, i.e., to restrict ourselves to instances in which
the mom was a head or wife in a 1986 family, for example, to make sure that she
was in her own household and not a dependent individual in another person's
household, it is not necessary to impose such a restriction here because the
1986 health data was collected for all individuals in the family, not just the head
and wife. (For 1999, 2001, and 2003,
however, these data are collected for heads and wives only.)
What variables will you need for
the grandmothers' dataset?
The information that we want for
generation one is very similar to the information that we obtained for
generation two. Our primary interest is
in obtaining information about individuals' health status, and this information
will come from the 1986 individual file.
However, we also need identification variables so that we can link
grandmothers with their offspring (the moms).
Furthermore, to construct this particular dataset, we make use of the
relationship-to-head information provided in the PSID. (This is largely to keep the size of the
resultant Excel file manageable. It is not a variable that is central to the
analysis.) In total then, we need
the following six variables:
(1) ER32000--the variable listing
the individual's sex, (2) ER30527--the variable characterizing the individual's
general health status, (3) ER30001--the 1968 family ID number of the
individual, (4) ER30002--the 1968 person number for an individual, (5)
ER30500--the variable characterizing an individual's relationship to the head
of the 1986 family that he/she resided in, and (6) ER30499--a variable which is
used in conjunction with the relationship-to-head variable in instances in
which we want to make sure the individual listed as household head is currently
in the family.
What is the general strategy for
compiling the grandmothers dataset? As
with the moms dataset, we draw information from two "data groups" in
the Data Center. The gender information
comes from the "Sampling Variables" data group, and the remainder of
the variables come from the 1986 individual file. Additionally, we use the sex variable when
the Data Center prompts us for subsetting criteria. We also will use the relationship-to-head
variable and the sequence number at this juncture. As a consequence, instead of drawing all
individuals who exist in the PSID in 1986, the Data Center will produce a
customized dataset that only contains information for individuals who happen to
be women, and for women who are old enough to have set up their own households
only (rather than someone who is a six-year old daughter of someone else, for
example). This is important because the
PSID contains information on a large number of individuals (close to 70,000
people in 1997 for example), yet we know that we only want individuals who
happen to be grandmothers in 1997, which means we know we will not need the health
status of every individual in the PSID.
Instead, we can go a long way toward eliminating individual records that
do not stand a chance of making a match to our CDS kids right from the start
(individuals who are male for example).
How (and where) do you get your
data for the kids dataset?
The PSID Data Center can be found at
http://simba.isr.umich.edu To construct the dataset with the children's
information, you want to choose "variable selection" and then check
the box for 1997 Child Development Supplement (CDS) under "select data
groups."
Screenshot #1 Select
Data Group(s)

If you hit continue
you will be taken to a screen where you can select specific files from the 1997
CDS:
Screenshot #2 Select data categories and
years

Click on the box
next to "Demographic file," and the box beside "Primary
Caregiver Child File." This will
take you to a screen allowing you to select specific variables:
Screenshot #3 Variable Selection

Now select the
following 7 variables. From the 1997 CDS
Demographic File select (1) PCG (the
relationship of the child's primary caregiver to the child), (2) PCGID68 (the
1968 family ID number of the primary caregiver), (3) PCGPN (the person number
of the primary caregiver), and (4) CH97PRWT.
(This last variable provides the sampling weights for the CDS children,
who are age 0-12, as of 1997.) From the
CDS Child File you want to select your last three variables: (5) Q1A8P (birthweight-pounds), (6) Q1A8O
(birthweight-ounces), and (7) Q1A11 (health at birth). In case you are wondering about the
"view" option beside each variable name, this is a feature that
allows you to remind yourself what a
particular variable is, and how it is coded (to make sure you really have the
correct variable, for example). If you
click on the word "view" the Data Center will respond by displaying a
window containing codes and the actual question asked along with other
pertinent documentation for the variable that you have inquired about. Note that the Data Center automatically adds
two variables that you need to your list--these are the child's 1968 family
identification number and the child's person number. (Note that the Data Center also automatically
adds three other variables--ER33401, ER33402, and ER33403. You will not need these, but we will refrain
from deleting them until later. They are
often critical for analysis using PSID data, as seen in the companion tutorial
5-A for example, so the Data Center is designed to ensure that they are
included by default, even though this is one of the rare cases in which they
are not needed.)
Screenshot #4 PSID/CDS Data Cart Contents


Next hit "Get Data and/or
Codebook" and a screen depicting several output options will be
displayed.
Screenshot #5
There are 5
important steps to take here. First,
one of the options that you have surrounds the format of your codebook (and
whether you even want one). The default
is no codebook, however if you want a codebook to be delivered to you so that
you can remember how the variables that you have selected are coded, and what
questions were asked to get your data, you should specify a codebook type (html
or pdf). The second option you
have concerns the format of your data.
Click on the circle next to Microsoft Excel Spreadsheet to follow this
tutorial. Third, in the
subsetting box, you want to enter the following command: PCG = 1
The number "one" is the code for an instance in which the
child's primary caregiver is the mother, so this command instructs the Data
Center to limit the output to these cases.
(As noted in Tutorial 4, over 95 percent of children have primary
caregivers who are mothers, so we do not lose too many cases by restricting the
analysis in this fashion, while we gain much in terms of ease of
exposition. If we were to include
children whose mothers were not their PCGs we could not make use of the PCGID68
and PCGPN variables that are provided in the children's records from the CDS to
locate these moms' health information elsewhere in the PSID. This would complicate the task of matching
children's and moms' information, though it does not render the task
impossible. The inquisitive user can
think about how one might do such matching after he/she understands the direct
route that we illustrate here in the tutorial.)
The fourth step is to provide your e-mail address if you want the
Data Center to e-mail you to alert you when your dataset is ready. The e-mail will include the location of the Excel
file containing your dataset, labels, and a link to your codebook if you have
asked for one. (If you do not want to
provide an e-mail address, you have the option of simply waiting at the Data
Center while it produces your dataset. A
job completion notice, such as the one depicted below, will be displayed on the
screen once the dataset has been created.)
The fifth and last step for you to take at this point is to click
on the "CDS kids only" box at the bottom of the screen. This tells the Data Center that you want
information about children only at this point.
Now you can hit "submit"
and the Data Center will get to work to put together your customized dataset
containing information about children.
At this point, you can simply wait for the Data Center to finish, or you
can close this window, and begin a new
internet session so you can
get to work compiling your moms dataset.
Screenshot #6 Job completion notice

How to construct the moms dataset
To obtain the data for your dataset
for generation two, begin a new session at the Data Center, and select
"PSID individual" as your data group, along with the "Sampling
variables" data group option.
Screenshot #7 Select Data Group(s)

Next, under the list of year options
specified for the individual-level data (the PSID individual files), check the
box next to 1986. Additionally, you want
to check the box beside "All years" under the "sampling
variables" heading.
Screenshot #8 Select data categories and
years

Next you need to select three
variables from the "All years--Sampling variables" list. These variables are (1) ER32000 (sex of
individual), (2) ER32009 (1968 family ID of mother), and (3) ER32010 (person
number of mother). Furthermore, you need
to select one variable from the 1986 individual file. That variable is ER30527 (H1/37/69 Health
Status in '86). After you have selected
these variables hit "continue" and a datacart depicting the
variables in your dataset will appear.
Note that the Data Center adds the last two variables that we want,
ER30001 and ER30002, automatically. You
do not need to select them separately. (Also,
note that the Data Center also adds three other variables automatically. They are ER30498, ER30499 and ER30500. As noted earlier, these three are sometimes
critical for PSID analyses so the Data Center assumes that you might need them,
even though this tutorial presents one of the rare instances in which you will
not. We will delete them later in the
exercise.)
Now hit "Get Data" and the
Data Center will display a screen listing several output options for you to
choose from.
Screenshot #9

Once again, there
are 5 important steps to take at this stage. First, you want to indicate whether
you would like a codebook and what format you would like it to be in. Second, to follow this tutorial, you
want to instruct the Data Center to deliver your dataset in Excel format. Third, you want to enter a subsetting
command. That command is, ER32000 =2
This instructs the Data Center to only output information for individuals
who are female. Fourth you want
to enter your e-mail address if you are having your dataset, labels, and
codebook e-mailed to you. Finally,
in the CDS intergenerational individuals section in the bottom right
hand corner of the screen, you want to fill in the circle next to the words
"PCG." This last step
instructs the Data Center to impose a further restriction on your dataset. It will not only restrict the output to cases
in which the individuals are women, it also will limit the output to women who
happen to be PCGs. (This means you won't
be troubled with records for individuals who happen to be women but not to have
had any children age 0-12 as of 1997. We
do not want these individuals because we know a priori that their unique
individual identifiers will not match any of the child records that are
contained in our dataset for generation three.)
Now you can hit the submit command
and the Data Center will get to work to create your customized dataset with
information about moms. Again, at this
point you have the option of simply closing the window and opening a new internet session so that you can get
to work selecting the variables for your third and final dataset without having
to wait for the Data Center to finish your moms dataset first. (And, again, if you entered an e-mail address
you will know when your moms dataset is ready because you will receive it via
e-mail. Or, as noted under the
guidelines for the kids dataset, if you did not provide an e-mail address you
will see a screen with a job completion notice once the Data Center has
finished.)
How to construct the grandmothers
dataset
To obtain the data for the
generation-one dataset, you will again need to begin a new session at the Data
Center. And, once again, you select
"PSID Individual" and "Sampling Variables" as your two data
groups. Under the list of options
specified after you hit continue, you want to check the box next to 1986 as you
did for the moms dataset, and to check the box beside the phrase "All years."
Next choose the following four
variables from the list of options presented on the screen: (1) ER32000 from the "All years"
category, (2) ER30499 from the 1986 Individual file, (3) ER30500 from the 1986
Individual file, and (4) ER30527 from the 1986 Individual file. Your last two desired variables, ER30001 and
ER30002, will be added automatically by the Data Center. (Note that the Data Center also will add
ER30498 to your variable list, under the assumption that you might also need
it, because it is typically a critical one for PSID analyses. We will delete it later in the exercise.) At this point you will see a screen of the
sort presented above in Screenshot # 8.
Now hit "Get Data" and the
screen listing the output options will materialize. At this point, you probably can guess what
you need to do next. First, you need to
tell the Data Center whether you want a codebook created, that you want an
Excel file created, and if/where to e-mail your dataset, labels and codebook. Additionally, you want to enter a subsetting
command. That command is as follows:
ER32000
= 2 and ((ER30500 = 10 and ER30499 = 1) or ER30500 = 20 or ER30500 = 22)
This subsetting
command instructs the Data Center to limit the output to individual data for
women, and to restrict output to records for females who were living
independently (as heads or wives rather than dependent children) in 1986. The subsetting commands need to be typed
carefully and there is help for subsetting formats in the Data Center. (Again, this is done largely to restrict our
file size to one that is manageable in Excel, yet it should not dramatically
affect the ultimate results.) Finally,
at the bottom of the screen you can select the box for "all
individuals."
Screenshot #10

Next hit
"submit" and the Data Center will begin creating your dataset with
information for generation-one (the grandmothers).
IV. Using Excel on your output subsets
In working toward the answer to your
research question, you will need to perform some operations on each dataset
separately, before you merge your information from the three generations
together. The steps that you need to
take are outlined below.
A. Working with your CDS
kids dataset
1. First, after opening your Excel file containing
your first dataset, you want to make a duplicate copy of the worksheet to use
for calculations. This way, if you make
a mistake, you will still have your original dataset to go back to. To make a copy of your worksheet choose
"edit" from the command menu at the top of the screen, then select
"move or copy sheet," then highlight the phrase "move to
end," and click beside "create a copy."
2. Next, you may want to use the "freeze
pane" command to permanently lock your column labels into place so that
they remain visible if you scroll down through the dataset. To do this, position your cursor in cell A2,
then choose "window" from the command menu, and then select
"freeze panes". A line should
appear separating row one from the remaining rows of the dataset, and you will
see that if you move downward in your dataset the row of column headings
remains visible, even if you go all the way down to the bottom of the dataset. ADDITIONALLY, at this point you can
delete the three variables that you will not need, ER33401, ER33402 and
ER33403, which were automatically added by the Data Center. You can delete them by selecting highlighting
their variable names, and then choosing "edit" from the command menu,
and then "delete" and then "entire column(s)".
3. Now let's really get to work. Create a variable that combines the pound and
ounces data on birthweight. To do this,
we want to create a new column at I to hold the composite birthweight
measure. The new column can be created
by positioning the cursor in column I and then choosing "insert" from
the command menu, and then column. This
will create a new column to the left of the one initially containing the
cursor. Label this column "total
birthweight," and enter the following formula in cell I2:
=G2*16 + H2
Then, copy this
formula into the remaining cells. One
easy way to do this is by using Excel's "fill down" feature. If you are not familiar with this feature
from work with previous tutorials, all you need to do is to position your
cursor at the bottom right corner of cell I2.
Then, press down on your mouse and drag the cursor down to the end of
the column. As you move downward you
should see that Excel is highlighting cells as you go. Once you reach the last cell in the column
you can release your grip on the mouse and Excel will automatically fill in all
the cells you have highlighted with the formula. (And, it automatically adjusts the formula so
that it makes the appropriate row references in each cell to give you the
correct total birthweight formula each time.)
Now use Excel's "copy" and "paste special" feature
to convert the formulas into values.
(You do this by highlighting the cells you want to copy--the entire
column that is--and then choosing "edit" from the command menu at the
top of your screen, and then choosing copy.
Then you reposition your cursor in the first cell of column I, and then
return to the "edit" option on the command menu. Select "paste special" and Excel
will display a box with options. Choose
"values." This instructs Excel
to copy the values computed by the formula into each cell.)
4. Now take a moment to glance down at some of
the values you are getting for your new, total birthweight measure. Some of them are outrageous--as high as
996004 ounces. This is not
reasonable! You're getting such numbers
because some of the responses for the birthweight (pounds) variable and the
birthweight (ounces) variable given by the Data Center have values like 998 and
999. These are not real
birthweights. The PSID simply uses these
high values to indicate instances in which the child's birthweight was not
known or the individual did not respond to the question, and this is explained
for another application in Tutorial #1 (see http://psidonline.isr.umich.edu/Guide/tutorials/tutorials_home.html).
Accordingly, you now need to eliminate the cases with such values. To do this, you have to sort the dataset by
birthweight (pounds) and delete the rows containing either 998 or 999. Then you want to re-sort the dataset by
birthweight (ounces) and delete the rows containing either 998 or 999. After you do this, you'll se that you have
eliminated such extreme and unreal values from your "total
birthweight" column.
How do you sort
your dataset? To sort a dataset one has to select
"data" from the command menu and then "sort." Excel should highlight your entire dataset in
response, and display a box allowing you to indicate which variable (or column)
you want to use as the sort variable. If
you have your cursor positioned in the column containing the variable that you
want to sort by before you select "data" from the command menu, Excel
will list this particular variable as the default sort option. (Otherwise, you have to choose from the list
of variables given.) Also, make sure to
indicate that your first row contains labels--rather than data--by checking on
the box next to header row.
How do you
delete rows? You do this by highlighting the rows you wish
to delete and then choosing "edit" from the command menu at the top
of your screen, and then "delete," and then "entire row"
from the small box displayed in the middle of the screen.
5. Next you want to create a composite variable
that combines the PCG's 1968 identification variable with her person number, in
order to create the special identifier variable that can be used to uniquely
identify each individual in the PSID.
Create a new, blank column beside the PCGPN column, at F, using the
technique you learned in the previous step.
Label this column "mom's unique ID." In cell F2 enter the following formula:
=(D2*1000)
+ E2
Now copy this
formula into the remaining cells of the column by using the "fill
down" feature. Then make sure you
convert your formulas into values.
6. Before we move on to actual calculations, we
need to create a unique identifier for each child whose record appears in our
dataset. The formula and procedure for
doing such is similar to the one that we used to create the unique identifier
for PCGs. First create a new column at
C. (By now you should be an expert at
creating new columns.) Then enter the
following formula in cell C2:
=(A2*1000)
+ B2
Then copy the
formula in to the remaining cells of column C by using Excel's "fill
down" feature. Remember to
transform the formulas into actual values using the copy/paste feature when you
have finished.
7. Now delete some columns to eliminate
variables that you no longer need. Here
is a list of the variables to keep:
(1) child's unique id, (2) mom's unique id, (3) CH97PRWT (4) total
birthweight, and (5) Q1A11 (the health at birth variable). Not much has been said about the CH97PRWT
variable at this point. It is a sampling
or statistical weight. (Do not confuse
it with the birthweight variable, which tells us something about each
individual's health condition. As
discussed in previous tutorials and in the PSID User Guide (found at http://psidonline.isr.umich.edu/Guide/UG/tablcont.html)
the statistical weight is a mathematical tool that is employed when we get to
the calculations phase of the analysis to ensure that the results are
nationally representative.)
8. Next we want to create an index to indicate
whether a child's birthweight was low, as defined by the Center for Disease
Control. Insert a blank column beside
your total-birthweight column, at column E.
Label it "low birthweight" and enter the following formula in
cell E2:
=if(D2 <
88, 1, 0)
What is the point
of this formula? Medical standards
define a child as having a low birthweight if his/her birthweight is below 5.5
pounds. This is equivalent to 88 ounces,
and since the total birthweight measure that appears in column D tracks
birthweights in ounces, we want to compare the figure in that column to
88. The formula tells Excel to assign a
value of 1 for any child whose birthweight is low, and to assign a value of 0
otherwise for our index. (We can think
of this as having a "1" to indicate "yes, the birthweight is
low," and a zero to indicate "no." Those of you who have some experience working
with data will no doubt recognize that the low birthweight variable essentially
represents an example of data that is qualitative--since a child's birthweight
is either "low" or "not low." However, we want to use numbers to represent
these qualitative responses, because it is easier for Excel to work with data
entries that are numerical, rather than verbal.) OK, now copy the formula into the
remaining cells of the column by using the "fill down"
procedure. Then remember to transform
your formulas into values using Excel's "copy/paste" special feature.
9. Next use the health status at birth variable
to create one more index that summarizes information about children's early
health. Use the blank column beside
Q1A11, at column G, and label it "bad health." Enter the following formula in cell G2:
=(F2 < 3,
0, if(F2 > 3, ".",1))
What does this
formula do? It instructs Excel to assign
a value of zero if the health at birth was said to be the same as other babies
(denoted with a code of 1 for the Q1A11 variable), or better (denoted by a 2
for Q1A11). In the instances in which there
was missing data for Q1A11 (when the variable takes on a value of either 8 or 9
because the respondent either did not know the answer to the question or
refused to answer it), the "bad-health" index variable will be
assigned the universal symbol for missing values--a period ("."). Finally, this formula assigns a value of one
to the index variable if the health at birth was coded as either 3, which means
worse than other babies.
What else? Make sure you copy the formula to the
remaining cells in the column. And, make
sure you convert your formulas to values after that.
10. Next, to get ready for the three-way merging
that will eventually be done to put all three generations together, we want to label
the current worksheet "gen3 merge." To label a worksheet you simply position your
cursor at the bottom of the screen where each worksheet name appears and
double-click. This allows you to type a
name of your choice.
11. Now as a final step we want to resort the
dataset by mom's unique id (or whatever name you assigned to the unique identifier
for PCGs that we created back in step 5).
To do this, position your cursor in column B. Select "data" from the commands
menu, and then choose "sort."
This should prompt Excel to ask if you want to sort the entire dataset
by "moms unique id." Hit
"OK." (If you see alphabetical
column heading options instead, you need to take a moment to tell Excel that
you have a header row--by checking the circle next to this question. FURTHERMORE, note that you want to sort the
entire dataset by this variable--not just the column. If you only sort the column, you will be
erroneously adjusting the mom ids for each of the children in your dataset.)
B. Working with the moms dataset
1. After opening your moms dataset, the first
thing to do is to delete the three variables that we will not be using. As noted above, for this tutorial exercise
ER33401, ER33402, and ER33403 are not needed (although they frequently are
needed for analyses using the PSID).
Accordingly, you can delete the columns containing these variables. (Remember, to delete a column you position
your cursor in the column you wish to delete, or highlight the variable name,
and then you choose "edit" from the command menu, then
"delete" then "entire column.")
2. Now we are ready to prepare the dataset for
generation two for use in the analysis.
The first thing one wants to do is to create the unique identifier for
moms by combining the information contained in the 1968 family ID number (ID
'68) and the person number (PN) for each individual whose record appears in the
dataset. The relevant variables here are
ER30001 (ID '68) and ER30002 (PN).
Remember, to create the unique identifier you multiply the '68 ID by one
thousand and then add the person number to that value. Accordingly, you need to insert a new blank
column at C, and in cell C2 you want to enter the following formula:
=A2*1000 + B2
Label this column
"mom's unique id" and then copy the formula to the remaining rows of
the column by using the "fill down" procedure, and then convert your
formulas into values using the "copy/paste special" feature.
3. Next, create a unique identifier for each
mom's mother (i.e., our grandmothers) by using the variables ER320009 and
ER32010. You will need a new, blank
column to hold this new variable, so insert a blank column at G, and label it
"mom's mother's id." Enter the
following formula in cell G2:
= E2*1000 + F2
Next copy the
formula into the remaining cells of the column and then convert the formulas
into values.
4. Finally, we want to use the information
contained in the '86 Health Status variable (ER30527) to create an index that
reveals whether an individual is in relatively good or poor health. If you glance at the documentation for the variable
(ER30527) you will see that the ranges of responses for this categorical
variable are 1 to 5. Values from 1 to 3
indicate good or better health, while instances in which the response is coded
as 4 or 5 correspond to situations in which the individual's health is fair or
poor. If you scroll down your column,
you notice that the '86 Health Status variable also can take on a value of 8 or
9. These correspond to cases in which
the health status was unknown or where the question about health status was not
applicable. Both are examples of missing
data and when we create our index variable it will be coded to indicate
such. One final note about the '86
Health Status variable: For some records
this variable has been assigned a value of 0.
This denotes a case of absent data--individuals who were not asked the
health question (for example, because they were not in the PSID in 1986). As noted in Tutorial #4 (found at http://psidonline.isr.umich.edu/Guide/tutorials_home.html),
while this does not constitute missing data in the traditional sense, it will
be treated as such when we create our new index variable.
OK, before creating
the new variable, we need a blank column to put it in. Accordingly, use the blank column at H and
label it "bad health--moms."
Now enter the following formula in cell H2:
=if(and(G2
> 0, G2 < 4), 0, if(and(G2 > 3, G2 < 6), 1, ".")))
Remember to copy
this formula into the remaining cells of the column and to convert the formulas
into values.
5. As a final step, we want to eliminate the
individuals for which we do not have useful health information. Accordingly, we want to sort the dataset by
the new bad-health variable, and then to delete all rows with a "."
entered for this variable. (Once you
sort the dataset in this fashion the records with missing values should all be
grouped toward the end.)
6. Now create a new copy of this worksheet. Label it "gen2 merge." (Remember that to create a copy of a
worksheet, you simply select "edit" from the command menu, and then
"move or copy sheet," and then you instruct Excel to "create a
copy" and to place this copy at the end of the workbook (by selecting
"move to end" from the list of options that appears just above the
words "create a copy"). Then
delete all the columns in this worksheet except the ones containing
the following variables: mom's id, mom's mother's id, and
bad-health--moms.
7. The final step to take is to re-sort the
dataset contained in this gen2-merge worksheet by
"mom's-mothers-id,"
so that the data will be ready when we get to the step where we merge this
generation's data with data from the other two generations.
C.
Working with the grandmothers dataset
1. After opening the dataset containing the
information for grandmothers (the dataset for generation 1), the first thing to
do is to delete the column containing the variable ER30498.
2. Next, with this grandmothers dataset one
wants to take many of the steps that we took for the moms dataset. First, we need to create a unique identifier
for each individual in the dataset. Call
it "g-mom's id" and use the procedure that we used to create the
unique identifier in the previous cases (step B1 for example). Second, we want to create a variable that
indicates whether the grandmother had bad health. For this, we again turn to the '86 Health
Status variable (ER30527). We need to
transform it as we did above, for moms, in order to generate a new measure that
applies to grandmothers. (You can label
this "g-moms bad health.") Use
the same procedures that you used to construct the bad health--moms measure to
create the new variable for grandmothers, and be sure to sort the dataset
afterwards so that you can eliminate the rows containing missing data--as
indicated by the symbol "."
3. Next create a copy of this worksheet, and
label it "gen1 merge."
(Remember that to create a copy of a worksheet, you simply select
"edit" from the command menu, and then "move or copy
sheet," and then you instruct Excel to "create a copy" and to
place this copy at the end of the workbook--by selecting "move to
end" from the list of options that appears just above the words
"create a copy"). In this new
worksheet you want to delete all columns except the
following: g-moms id, and g-moms bad
health.
4. Finally, you want to re-sort the dataset
contained in the gen1-merge worksheet by the g-mom's id variable.
D. Merging the data from the three different
datasets to create a 3-generational dataset
1. For this process we want to open a new, blank
workbook in Excel. You probably want to
give it a name. (How about
"all_3_generations"?)
2. Next, open the moms dataset so that you can
copy information from it into your new workbook. Go to the gen2-merge worksheet, highlight it
and then select copy from the command menu.
Then return to your empty workbook (the "all_3_generations"
file) and paste this new information from generation 2 into it. (You may want to hit "save" at this
point so that this modification is permanently captured in your file.)
3. Now open your grandmoms dataset and go to the
gen3-merge worksheet in that file.
Highlight all of the information contained in it and select
"copy" from the command menu.
Then return to your new file (the "all_3_generations" workbook)
and paste the information from the grandmoms dataset into it. Make sure you copy this information into a
BLANK space towards the right of the dataset--beginning at column E for
example--so that you do not overwrite any existing information. If you do this correctly, you should now be
looking at a file that displays your information about the moms on the left
hand side of the dataset and your information about grandmothers on the
right-hand side.
4. Now, if you do not already have one, insert a
blank column in between the moms information and the information that came from
the grandmothers dataset. This should be
at column D. Label this blank column
"assign g-mom health." In cell 2 of that column enter the following
formula:
=if(lookup(B2, $E$2:$E$1423)=B2,lookup($E$2:$F$1423),".")
And, copy this
formula into the remaining cells by using the fill down procedure. Then convert your formulas to values with the
copy/paste special feature.
What is the purpose
of this formula? It instructs Excel to
look at the first entry in your column containing the moms' mothers' id numbers
(column B), and then to look up and down the column containing entries for
g-mom's ids (column E, which should have 1423 rows) to search for a match. Additionally, it tells Excel that when a
match is found, Excel should assign the corresponding value from column F
(which contains the index of g-mom bad-health index) in the blank column. This means Excel is attaching the
grandmother's health information to each mom's record. If there is no match, the formula instructs
Excel to assign the missing value symbol (to indicate that there is no health
mom available for the grandmother).
NOTE: If you did not remember to sort both sets of
information--the information taken from your moms dataset and the information
taken from your grandmothers dataset--by the respective grandmother's unique
identifier variables before you copied it your matching will not proceed
correctly. If you get funny results at
this stage, this may be the reason. If
so, start again, and remember to sort the information from the gen2-merge and
gen1-merge worksheets prior to copying it.
(As specified in steps B7 and C4, both sets were supposed to have been
sorted, by "mom's mother's id" and "g-mom's id" respectively.)
5. Now you can eliminate all of the columns to
the right of the "assign g-mom health" variable. Since you have now attached each
grandmother's health information to their offspring (to each mom that is), you
no longer need the gen1-merge portion of your dataset (the portion that was
previously storing all your information about grandmothers for you).
6. Finally, you want to resort this dataset--this
time using the "mom's id" variable.
This readies us for matching these records with records from our child
dataset.
7. Now insert 7 blank columns in the leftmost
portion of your worksheet, i.e., at column A..
You are going to copy the information about the children (generation
three) in this spot.
8. Now open up the child dataset that we were
working with earlier (in part A). Go to
the gen1-merge worksheet of the child dataset and copy its information into the
blank columns that you created above in step 7.
(Remember, you accomplish this by highlighting all of the data in the
"gen1-merge" worksheet and then selecting "copy" from the
command menu. After this, you close the
window and return to your "all_3_generations" workbook. Then you select "paste" from the
command menu. Make sure you do not
overwrite any existing data. If you
have followed the instructions exactly, 7 columns should be enough to hold your
child-related information without spilling over to the pre-existing information
about moms. However, if you have more
than 6 columns in your "gen1-merge worksheet" you will need additional
blank columns to copy your information into.)
At this point, you
should have a worksheet before you that contains information about the first
generation in columns A through G, followed by information about the moms (with
grandmother's health information merged on) in columns I through L.
9. Now insert three blank columns in between the
two different sets of information, at columns I, J, and K. Label the first blank column "attach
mom's bad health" and label the second "attach g-mom's bad health." In the first column enter the following
formula:
=if(lookup(B2,$L$2:$L$1423)=B2,lookup(B2,$L$2"$N$1423),".")
Copy this formula
to the remaining cells of your column (column I presumably) by using the fill
down procedure, and then don't forget to convert your formulas to values.
What does this
formula do? It instructs Excel to take each PCG id number
(contained in column B) and to use it to search for a match in the range of
values for "mom's id" (in column L).
Then, if a match is found, Excel knows to assign the corresponding value
from column N in column I. This means
that Excel is taking the value for the moms-bad-health variable and placing it
in column I. In doing such, it is
attaching the measure of mom's health to the record for each child that we
have.
10. Now, in cell 2 of column J you want to enter
the following formula, which will allow you to attach the grandmother's health
information to each child record:
=if(lookup(B2,$L$2:$L$1423)=B2(lookup(B2,$L$2:$O$1423),".")
This formula
instructs Excel to take each PCG id number and to search for it in column H and
to attach the corresponding information from the grandmom's bad health variable
(from column N) when a match is found.
This amounts to assigning the value for the measure of grandmother's
health to each child's record.
Again, you want to
copy your formula to the remaining cells of the column (using the fill down
procedure), and to convert the formulas into values (using Excel's copy/paste
special feature).
11. Now you are done merging your 3 generations'
worth of information together. All you
need to do now is to do your calculations so you can fill in Table 1.
12. First, to determine what percentage of
children have a low birthweight, you need to find a blank cell in your
worksheet and to enter the following formula into it:
=sumproduct(C2:C3296,E2:E3296)/sum(C2:C3296)
This computes
the percent of children who were born with a low birthweight. The
formula does such by combining the information from the CH97PRWT column (the
statistical weights in column C) with the information about whether a child had
a low birthweight that is contained in column E.
13. Next, to compute the percentage of children
with low birthweight for moms who have poor health and for grandmothers whose
health is poor, it is easiest to create 2 separate new worksheets in which to
do the calculations for each.
Accordingly, you want to use the "edit" then "move/copy
worksheet" feature to make a duplicate copy of the worksheet (and a
triplicate copy) after the current one (at the end).
Once you have
created your two new copies of the worksheet, name the worksheet containing the
first copy "kid-mom connections" and the second one
"kid-grandmom connections."
Now, using the kid-mom connections worksheet, resort your entire dataset
by the "attach mom bad health" variable. This will instruct Excel to order the
information so that the cases where this variable takes on a value of 0 appear
first, and the cases in which it takes on a value of 1 appear afterwards, and
the cases in which there is missing data (the "." cases) come at the
end. You want to delete all the rows
with the "." symbol. Then, in
a blank cell somewhere (preferably towards the right end of the dataset) you
want to enter the following formula:
=sumproduct(C2:C1935,E2:E1935)/sum(C2:C1935)
This computes
the incidence of low birthweight among children whose moms have good health.
In another empty
cell enter the following formula:
=sumproduct(C1936:C2026,E1936:E2026)/sum(C1936:C2026)
This computes
the proportion of kids who had a low birthweight among kids whose mothers'
health is poor.
14. Now go to the worksheet that you labeled
"kids grandmoms connections."
Sort this worksheet by the attach-grandmom-bad-health variable. Then delete all rows with missing data for
this variable (the ones with "." entered instead of numerical
values). Afterwards, find an empty cell
and enter the following formula into it:
=sumproduct(C2:C1052,E2:E1052)/sum(C2:C1052)
This computes
the proportion of kids with low birthweight among those whose grandmothers are
healthy.
In another blank
cell enter the following formula:
=sumproduct(C1053:C1425,E1053:E1425)/sum(C1053:C1425)
This calculation
yields the percent of kids who had a low birthweight who have grandmothers who
have poor health.
15. What do you make of your calculations? It looks like the likelihood of having a low
birthweight (our poor health measure for the youngest generation) is higher
among kids whose grandmothers are not very healthy either. For moms, however, the difference between
children's health situations by whether or not the mom has good health is only
slight.
16. If you are really interested in this issue,
you can repeat these calculations using the "worse health" measure
instead of the low birthweight measure as your measure of each child's health.
17. How about comparing mothers and
grandmothers? To compute the percent of
moms with poor health among those whose own mothers also had poor health enter
the following formula in a blank cell:
References
Federal Interagency
Forum on Child and Family Statistics (FIFCFS) 2001. America's Children: Key National Indicators of Well-being 2001. Federal Interagency Forum on Child and Family
Statistics, Washington, DC: U.S. Government
Printing Office. (This publication can
be found on-line at the following URL: http://www.childstats.gov/ac2001/ac01.asp
)
Institute for Social Research |
University of Michigan |
Privacy |
Conditions of Use