Tuesday, November 27, 2012

Stata tip: Plotting similar graphs on the same graph

Suppose you want to make a bar graph of a variable, such as consumption, for two mutually exclusive groups such as males and females, represented by one categorical variable ("male"). This is easy enough: use the graph bar command with an over() option. What if you want to plot over two categorical variables, one within the other? For example, you want to plot average consumption for males and females who are self-employed, and average consumption for males and females who are not self-employed. Easy enough: just include a second over() option with the extra categorical variable. In this example, the groups within each over() variable are mutually exclusive: you are either male or female, but can't be both, and you are either self-employed or not self-employed, but can't be both. The variables in your dataset are:

consumption, male, self_employed
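As a rough sketch of the nested case, the snippet below builds a small made-up dataset and plots mean consumption over both variables. The data, the seed, the label names and the variable name self_employed (Stata variable names cannot contain a hyphen) are assumptions for illustration only.

```stata
* Toy data, purely illustrative: 200 simulated people
clear
set seed 12345
set obs 200
gen male = runiform() < 0.5
gen self_employed = runiform() < 0.3
gen consumption = 1000 + 200*male + 150*self_employed + rnormal(0, 100)

* Value labels so the bars are readable
label define sexlbl 0 "Female" 1 "Male"
label define selflbl 0 "Not self-employed" 1 "Self-employed"
label values male sexlbl
label values self_employed selflbl

* Mean consumption for males and females within each employment group
graph bar (mean) consumption, over(male) over(self_employed)
```

The first over() varies fastest, so each employment group contains one bar for females and one for males.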

Suppose, however, that you want to combine what are essentially separate graphs into one graph, as follows: you want to plot average consumption over three categorical variables that are NOT mutually exclusive, so you don't want to plot one within the other. For example, imagine that you are considering three different policy options for awarding a social assistance benefit: the current policy ("currentpolicy"), alternative A ("alternative_a") and alternative B ("alternative_b"). Each policy option divides the population into those who qualify and those who don't qualify for the social assistance benefit. Thus each policy option is represented in the data by a binary variable (a.k.a. a dummy variable, which is just a categorical variable with two levels: 0 for those who do not qualify and 1 for those who do qualify). Of the three policy options, the current policy is the least pro-poor, alternative A is more pro-poor (more of the benefits go to the poor), and alternative B is the most pro-poor.

Now, the three policy options are not mutually exclusive. It is possible to qualify under all three policy options, to be excluded under all three policy options, or to qualify under only one or two of the policy options. This is unlike the groups self-employed and not self-employed, or male and female, which are mutually exclusive. Essentially, suppose you want to combine the first three graphs below onto the same graph, as shown in the fourth graph:






The program below will do this. Make sure you have value labels defined for the categorical variables. Also note that:

a) The value labels for each of the categorical variables should use the same numbering (i.e. they should all be 0, 1, 2, or all be 0, 2, 5; it should not be the case that one has 0, 1, 3 and another has 1, 3, 4).

b) The groups defined by the categorical variables will be plotted in the order that you specify them. So in the fourth graph above, the order is: current policy, alternative a, alternative b, since that is what is given in the option catvarlist below: catvarlist("currentpolicy alternative_a alternative_b")

c) You need to specify new value labels with the same numbering as the original value labels for the categorical variables. This is necessary for the plot to turn out nice (the spacing gets mucked up if I don't force this). You can do this in the v() option below. So the original value labels could be:

current policy: 0 "non-exempt" 1 "exempt"
alternative a: 0 "does not receive" 1 "receives"
alternative b: 0 "Ineligible" 1 "eligible"

You might want to relabel this using the v() option as follows:

v(0 "non-beneficiary" 1 "beneficiary")
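Point (a) can be checked before calling the program. The sketch below is only an illustration on simulated data: it compares the codes of each policy variable against the first one with levelsof, then defines one shared value label. The label name benef and the toy data are assumptions; attaching the shared label to the variables is optional, since the program takes the new labels through v().

```stata
* Toy data, purely illustrative: three overlapping 0/1 policy dummies
clear
set seed 2012
set obs 100
gen currentpolicy = runiform() < 0.4
gen alternative_a = runiform() < 0.5
gen alternative_b = runiform() < 0.6

* Check that every categorical variable uses the same set of codes
levelsof currentpolicy, local(refcodes)
foreach v in alternative_a alternative_b {
    levelsof `v', local(codes)
    assert "`codes'" == "`refcodes'"
}

* One shared value label (hypothetical name "benef") with the same numbering
label define benef 0 "Non-beneficiary" 1 "Beneficiary"
foreach v in currentpolicy alternative_a alternative_b {
    label values `v' benef
}
```

If one of the assert lines fails, the codes differ across variables and should be recoded before plotting.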

To generate the fourth graph:

. overlappingcatgraphmean pccd using "$WHO_KG_reports/eraseme.dta", gc(graph bar (asis)) catvarlist("currentpolicy alternative_a alternative_b")     v(0 "Non-beneficiary" 1 "Beneficiary") over2options(lab(angle(0) labs(vsmall))) replace go(note(`"Source: Some data source "') asy asc title("Average consumption of groups within population") subtitle("Simulation of policy options") ytitle("Per capita HH consumption (LCU)", margin(medium)) legend(size(small)) blabel(total, format(%9.0fc)))
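Since the program computes its means with svy: mean rather than taking weights, the survey design has to be declared with svyset before the call. A minimal sketch on simulated data follows; it assumes that expfact is the sampling weight and that there is no further design information (strata, clusters), which may not match your survey.

```stata
* Toy data, purely illustrative
clear
set seed 2012
set obs 500
gen pccd = exp(rnormal(9, 0.5))
gen expfact = 1 + 4*runiform()
gen currentpolicy = runiform() < 0.4

* Declare the design: here only a sampling weight (an assumption)
svyset [pweight = expfact]

* The kind of estimate the program runs for each level of each variable
svy: mean pccd if currentpolicy == 1
display "Weighted mean for beneficiaries: " %9.0fc _b[pccd]
```

Restricting with if gives the same point estimate as svy, subpop(); the two can differ in their standard errors, which this graph does not use.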

To generate the first three graphs:

. graph bar pccd [aw=expfact], over(currentpolicy) asy title("Mean annual consumption comparison") subtitle("Current Policy") ytitle("Consumption")  note("Source: some data source") blabel(total, format(%9.0fc))

. graph bar pccd [aw=expfact], over(alternative_a) asy title("Mean annual consumption comparison") subtitle("Alternative A Policy") ytitle("Consumption") note("Source: some data source") blabel(total, format(%9.0fc))

. graph bar pccd [aw=expfact], over(alternative_b) asy title("Mean annual consumption comparison") subtitle("Alternative B Policy") ytitle("Consumption") note("Source: some data source") blabel(total, format(%9.0fc))

The program:

program define overlappingcatgraphmean

    // Written by Shafique Jamal (shafique.jamal@gmail.com). 27 Nov 2012
    // I want to plot the mean of a variable over categorical variables on the same plot. Of course, these categorical variables will not be mutually exclusive between them (though the levels are mutually exclusive within each of them)
    // "using" should specify a .dta file - this program will save a dataset
    // doesn't take weights - uses svy mean to calculate the mean
    //
    // You call it like this:
    //
    // overlappingcatgraphmean varname using "filename.dta", gc(graph bar (asis)) catvarlist(categoricalvar1 categoricalvar2) v(0 "Label for 0" 1 "Label for 1") go(asc title("My Title") ...) [over1options(over_subopts) over2options(over_subopts)] replace
    //    
    // Note that:
    // 1. the program creates the variables catvariablelabel, catvariablelevel and catvariablelevel_n and builds both over() options from them itself
    // 2. use over1options() for suboptions of the inner over() (the levels) and over2options() for suboptions of the outer over() (the categorical variables)
    //
    // UPDATE 12-07-2012: Best way is to call it with a long dataset like this: graph bar v, over(eligible) over(avg). Also note that I haven't tested whether this works with "if"

    syntax varname using/ [if] [in], GCmd(string) GOptions(string asis) CATvarlist(varlist) Valuelabelsforlevels(string asis) [replace over1options(string asis) over2options(string asis) ]
    version 9.1
    marksample touse
    tempname tempmat
    tempname variablelabel
    local `variablelabel' : variable label `varlist'   
   
    // foreach category, find the mean
    foreach catvar of local catvarlist {
        tempfile tf_`catvar'
       
        // this is a pain: get the name of the variable's value label
        // (the macro name is built from `catvar'; the original used an undefined macro here)
        tempname tn_`catvar'
        local `tn_`catvar'' : value label `catvar'
        label save ``tn_`catvar''' using `"`tf_`catvar''"', replace
       
        // UPDATE: None of this is necessary. The user will pass a list of value labels, separated by spaces, and these will be assumed to be the same for all the categorical variables specified
        //    e.g. user can pass v(0 "Qualifies" 1 "Does not Qualify"), where the categorical variables and corresponding value labels are:
        //    exempt     : 0 "Exempt"     1 "Non-exempt"
        //    PMT        : 0 "Eligible"    1 "Non-eligible"
        //    MBPF    : 0 "Receives"    1 "Does not receive"

        di "cat = `catvar'"
        tempname catvarlabel_`catvar'
        local `catvarlabel_`catvar'': variable label `catvar'
        tempname levels_`catvar'
        levelsof `catvar', local(`levels_`catvar'')
        foreach level of local `levels_`catvar'' {
            svy: mean `varlist' if `catvar' == `level' & `touse'
            matrix `tempmat' = r(table)
            tempname mean_`catvar'_`level'
            local `mean_`catvar'_`level'' = `tempmat'[1,1]
            tempname vl`catvar'_`level'
            local `vl`catvar'_`level'' : label (`catvar') `level'
            // di "Mean of var: ``mean_`catvar'_`level'''"
        }
    }

    tempname valueslabels
    label define `valueslabels' `valuelabelsforlevels'
    tempfile tf_valuelabelsforlevels
    label save `valueslabels' using `"`tf_valuelabelsforlevels'"', replace
   
    // I'll now make a dataset out of this with the following variables: mean of the variable; category name; category level
    //     The latter two will be numeric, categorical variables with variable labels attached.
    preserve
    clear
    do `"`tf_valuelabelsforlevels'"'
    gen meanofvariable = .
    label var meanofvariable `"``variablelabel''"'
    gen catvariablelabel = ""
    gen catvariablelevel = ""
    gen catvariablelevel_n = .
    gen sortorder = .
   
    // create the sort order - it will be the order in which the categorical variables were specified
   
    tempname count sortcount
    local `count' = 0
    local `sortcount' = 0
    foreach catvar of local catvarlist {
        // di "cat = `catvar'"
        local `sortcount' = ``sortcount'' + 1
        foreach level of local `levels_`catvar'' {
            local `count' = ``count'' + 1
            set obs ``count''
            replace meanofvariable = ``mean_`catvar'_`level''' in ``count''
            replace catvariablelabel = `"``catvarlabel_`catvar'''"' in ``count''
            replace catvariablelevel_n = `level' in ``count''
            replace catvariablelevel = `"``vl`catvar'_`level'''"' in ``count''
            replace sortorder = ``sortcount'' in ``count''
           
            // di "Mean of var: ``mean_`catvar'_`level'''"
        }
    }
    label values catvariablelevel_n `valueslabels'
    save `"`using'"', `replace'
    `gcmd' meanofvariable,  over(catvariablelevel_n, sort(catvariablelevel_n) `over1options') over(catvariablelabel, sort(sortorder) `over2options') `goptions'
    restore

end
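The UPDATE note in the program's comments suggests working from a long dataset of means and calling graph bar on it directly. Here is a rough sketch of that idea on simulated data, using unweighted collapse instead of svy: mean; it is not the program above. The variable names meanpccd, beneficiary, policy and order, and the label name benef, are made up for this illustration.

```stata
* Toy data, purely illustrative: overlapping policy dummies
clear
set seed 2012
set obs 500
gen pccd = exp(rnormal(9, 0.5))
gen currentpolicy = runiform() < 0.4
gen alternative_a = pccd < 7000
gen alternative_b = pccd < 5000

* Stack one small dataset of means per policy variable
tempfile stacked
local k = 0
foreach v in currentpolicy alternative_a alternative_b {
    local ++k
    preserve
    collapse (mean) meanpccd = pccd, by(`v')
    rename `v' beneficiary
    gen policy = "`v'"
    gen order = `k'
    if `k' > 1 append using `stacked'
    save `stacked', replace
    restore
}

* Plot the stacked means: levels inside, policies outside, in the order given
use `stacked', clear
label define benef 0 "Non-beneficiary" 1 "Beneficiary"
label values beneficiary benef
graph bar (asis) meanpccd, over(beneficiary) over(policy, sort(order)) asyvars
```

Each policy contributes its own pair of rows, so the same household can count towards several bars, which is exactly what a single nested over() cannot express.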
