Sunday, January 26, 2014

My ado files are all available on GitHub. The link is here: https://github.com/shafiquejamal/stata_ado_files

My ado files are all available on GitHub. The link is here:

https://github.com/shafiquejamal/stata_ado_files

To get these files, visit the above link and click download zip (on the right).

To get individual files, click on any of the files, and then click "Raw". You can then copy the text and past into your do file editor (e.g. https://raw.github.com/shafiquejamal/stata_ado_files/master/collapseandpreserve.ado)

Saturday, December 8, 2012

Stata tip: "Mega If" : a simple command to generate long if conditions

Suppose you need to type the following command that has many conditions in the if condition:

replace group = 1 if (benefitNumber == 1 | benefitNumber == 2 | benefitNumber == 3 | benefitNumber == 4 | benefitNumber == 5 | benefitNumber == 6 | benefitNumber == 11 | benefitNumber == 12 | benefitNumber == 17 | benefitNumber == 19 | benefitNumber == 21 | benefitNumber == 22 | benefitNumber == 23)

It's a bit much to type. The program megaif below will generate this long line from the much shorter command:

megaif 1 2 3 4 5 6 11 12 17 19 21 22 23, v(benefitNumber) c(replace group = 1)

The above has lots of "or equals", but you can also generate lots of "and does not equal", for example:

replace group = 1 if (benefitNumber != 1 & benefitNumber != 2 & benefitNumber != 3 & benefitNumber != 4 & benefitNumber != 5 & benefitNumber != 6 & benefitNumber != 11 & benefitNumber != 12 & benefitNumber != 17 & benefitNumber != 19 & benefitNumber != 21 & benefitNumber != 22 & benefitNumber != 23)

using the following command:

megaif 1 2 3 4 5 6 11 12 17 19 21 22 23, v(benefitNumber) c(replace group = 1) e(!=) s(&)

It works for numeric variables and for string variables too. Check this out:

megaif "a b" b "cc" d `"e"', v(benefit_stringvar) c(replace group = 1) e(!=) s(&)

executes the following command:

cmd to execute: replace group = 1 if (benefit_stringvar != "a b" & benefit_stringvar != "b" & benefit_stringvar != "cc" & benefit_stringvar != "d" & benefit_stringvar != "e")

As you can see, for string variables the quotes are optional unless you're checking for text that has a space in it. The program is below. Enjoy!

program define megaif

    // By Shafique Jamal
    // e.g.
    // sysuse auto, clear
    // megaif 0 1, v(foreign) c(drop) e(~=) // this will drop all the observations. Just for illustrative purposes to show how the command could be used
    // another e.g.
    // The command:
    //    megaif 14 15 16 17 18 19 20 21 22, c(gen priv1 = 1) var(income_type2)
    // would execute the following command:
    //     gen priv1 = 1 if (income_type2 == "14" | income_type2 == "15" | income_type2 == "16" | income_type2 == "17" | income_type2 == "18" | income_type2 == "19" | income_type2 == "20" | income_type2 == "21" | income_type2 == "22")


    syntax anything(id="variable and values" name=arguments), Var(varname) Cmd(string) [Equality(string) Separator(string)]
   
    // The default is equality
    if ("`equality'" == "") {
        local equality "=="
    }
   
    if ("`separator'" == "") {
        local separator " | "
    }
    else {
        local separator " `separator' "
    }
   
    cap confirm numeric variable `var'
    if (_rc == 0) { // variable is numeric
        local numericvar = 1
    }
    else {
        local numericvar = 0
    }
    // di "numericvar = `numericvar'"
   
    local count = 0
    local orcondition ""
    foreach w of local arguments {
        local count = `count' + 1

        // di `"w = `w'"'
        if (`numericvar' == 0) {
            local orcondition `"`orcondition'`orseparator'`var' `equality' "`w'""'
        }
        else {
            local orcondition `"`orcondition'`orseparator'`var' `equality' `w'"'
        }
        local orseparator "`separator'"
    }
   
    // di `"orcondition = `orcondition'"'
    di `"cmd to execute: `cmd' if (`orcondition') "'
    // set trace on
    // set traced 1
    `cmd' if (`orcondition')
    set trace off

end




Thursday, December 6, 2012

Stata tip: Quickly, and in one command, rename all variable labels of variables generated with the 'xi' command to reflect the value labels of the xi'd variable

When you use the xi command on categorical variables, even on those that have a value label associated with them, you get this:

. xi: svy, subpop(rural): reg logpccd rural_regressors refrigerator car livingroomsper numchild11 numchild11_sq i.oblast i.typeofdwell i.roof i.coldwatermeterinstalled i.soc_

. d

_Ioblast_3      byte   %8.0g                  oblast==3
_Ioblast_4      byte   %8.0g                  oblast==4
_Ioblast_5      byte   %8.0g                  oblast==5
_Ioblast_6      byte   %8.0g                  oblast==6
_Ioblast_7      byte   %8.0g                  oblast==7
_Ioblast_8      byte   %8.0g                  oblast==8
_Ioblast_11     byte   %8.0g                  oblast==11
_Itypeofdwe_2   byte   %8.0g                  typeofdwelling==2
_Itypeofdwe_3   byte   %8.0g                  typeofdwelling==3
_Itypeofdwe_4   byte   %8.0g                  typeofdwelling==4
_Itypeofdwe_5   byte   %8.0g                  typeofdwelling==5
_Itypeofdwe_6   byte   %8.0g                  typeofdwelling==6
_Itypeofdwe_7   byte   %8.0g                  typeofdwelling==7
_Itypeofdwe_8   byte   %8.0g                  typeofdwelling==8
_Itypeofdwe_9   byte   %8.0g                  typeofdwelling==9


The value labels are not much more informative than are the variable names. Below is a program that will automatically rename the variable label of these variables that result from the xi command so that they include the corresponding value label, as follows:

_Ioblast_3      byte   %8.0g                  oblast=Jalalabat
_Ioblast_4      byte   %8.0g                  oblast=Naryn
_Ioblast_5      byte   %8.0g                  oblast=Batken
_Ioblast_6      byte   %8.0g                  oblast=Osh
_Ioblast_7      byte   %8.0g                  oblast=City of Osh
_Ioblast_8      byte   %8.0g                  oblast=Chui
_Ioblast_11     byte   %8.0g                  oblast=City of Bishkek
_Itypeofdwe_2   byte   %8.0g                  typeofdwelling=Apartment or room in a residential hotel
_Itypeofdwe_3   byte   %8.0g                  typeofdwelling=Separate house
_Itypeofdwe_4   byte   %8.0g                  typeofdwelling=Part of a house
_Itypeofdwe_5   byte   %8.0g                  typeofdwelling=Dormitory
_Itypeofdwe_6   byte   %8.0g                  typeofdwelling=Lodge or a tied cottage (temporary tenure dwelling)
_Itypeofdwe_7   byte   %8.0g                  typeofdwelling=Other non-residential premises used for residence
_Itypeofdwe_8   byte   %8.0g                  typeofdwelling=Other residential premises
_Itypeofdwe_9   byte   %8.0g                  typeofdwelling=Barracks
 

There are actually two programs - mine is a wrapper for a program that Nicholas J. Cox wrote. Both of these are below.

Usage (run this after the xi command):

. varsformyrelabel

Programs:

program define varsformyrelabel

    // Written by Shafique Jamal (shafique.jamal@gmail.com), 12-07-2012
    // UPDATE 12-07-2012: Need to change how the variable name for the list of `allunxidvariables' is determined. Need to get it from the variable label, rather than the variable name
    //

    // Get list of variables that were xi'd
    local xivars "`_dta[__xi__Vars__To__Drop__]:'"
    // di `"xivars:`xivars'"'
   
    // Now just need to get list of un-xi'd variables from this list
    // Here is the first one
    local currentdummyvar : word 1 of `xivars'
    // di `"currentdummyvar:`currentdummyvar'"'
   
    // This will get the full variable name
    local currentunxidvar = regexr("`: variable label `currentdummyvar''","==.*$","")
    // di `"currentunxidvar:`currentunxidvar'"'
    local allunxidvars "`currentunxidvar'"
    // di `"allunxidvars:`allunxidvars'"'
   
    // This will get the _I`var' name, without the _# suffix - I need this for the first argument to the myrelabel routine. Variable name gets shortened
    local currentunxidvarwith_I = regexr("`currentdummyvar'","_[0-9]+$","")
    // di `"currentunxidvar:`currentunxidvarwith_I'"'
    local allunxidvarswith_I "`currentunxidvarwith_I'"
    // di `"allunxidvarswith_I:`allunxidvarswith_I'"'
   
    // Now loop through the rest
    local count = 0
    foreach var of local xivars {
        local count = `count' + 1
        if (`count' != 1) {
            local w : word `count' of `xivars'
            // di "w: `w'"
           
            // check whether the next xi'd var is related to the current one
            // if (regexm("`w'","^_I`currentunxidvar'_[0-9]+$")) { // yes, this is part of the same family as the current _I.... variable under consideration
            if (regexm("`: variable label `w''","^`currentunxidvar'==.*$")) { // yes, this is part of the same family as the current _I.... variable under consideration
                // di "skip"
            }
            else { // no, it is different. add to the list
                // this gets the full variable name
                local currentunxidvar = regexr("`: variable label `w''","==.*$","")
                // di `"currentunxidvar:`currentunxidvar'"'
                local allunxidvars "`allunxidvars' `currentunxidvar'"
                // di `"allunxidvars:`allunxidvars'"'
               
                // This gets the _Ivar name
                local currentunxidvarwith_I = regexr("`w'","_[0-9]+$","")
                // di `"currentunxidvar:`currentunxidvarwith_I'"'
                local allunxidvarswith_I "`allunxidvarswith_I'  `currentunxidvarwith_I'"
                // di `"allunxidvarswith_I:`allunxidvarswith_I'"'
            }       
        }
    }

    di "allunixidvars: `allunxidvars'"
    di `"allunxidvarswith_I:`allunxidvarswith_I'"'
    local count = 0   
    foreach var of local allunxidvars {
        local count = `count' + 1
        local varwith_I : word `count' of `allunxidvarswith_I'
        myrelabel `varwith_I'_* `var'
    }
   
end


program def myrelabel
*! NJC 1.0.0 15 July 2003
    version 7
    syntax varlist(numeric)

    tokenize `varlist'
    local nvars : word count `varlist'
    local last ``nvars''
    local vallabel : value label `last'
    if "`vallabel'" == "" {
        di as err "`last' not labelled"
        exit 498
    }

    local `nvars'
    local varlist "`*'"

    foreach v of local varlist {
        local varlabel : variable label `v'
        local eqs = index(`"`varlabel'"', "==")
        if `eqs' {
            local value = real(substr(`"`varlabel'"', `eqs' + 2, .))
            if `value' < . {
                local label : label `vallabel' `value'
                label var `v' `"`last'=`label'"'
            }
        }
    }

end


Monday, December 3, 2012

Stata tip: Using perl compatible regular expressions (PCRE) in Stata

UPDATE 12-07-2012: Thanks to Nicholas J Cox who the problem I was having with the -marksample- command. I replaced the code below with the new, fixed code.

Stata's regular expression engine is too limited for my needs. I asked the statalist about how to change Stata's regular expression engine, but apparently it is not possible. So I wrote a Stata program (.ado file) to call a perl script to run a regular expression on a variable.

Matching and substitution are supported. Named captures/groups are not, but non-named captures (e.g. $1, $2, etc.) ARE supported. I think quantifiers are supported. Anyways, if it works as I think it should based on my design and testing, it should be a decent improvement over Stata's built-in regular expression engine (I hope they update it soon).

You can download the perl script here. Download it and place it in any directory - just remember the directory because you will have to specify when you call the program.

The Stata program is here and below. Put this in your personal ado folder.

Usage:

Match only:
        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/i") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/")

Substitution:
        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/gi") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/") repl("firstone_$1_secondone_$2") 

Notes:

1. The arguement for re() should be a regular expression enclosed in double quotes. You can use only the forward slash for a delimiter. Named captures/groups don't work yet (I can't figure out why. Any ideas?)

2. The arguement for repl() should be the replacement part of s//THIS_PART/. It should be enclosed in double quotes. Do NOT include the forward slashes or any delimiters. Option modifiers do NOT go here. You can use backreferences $1, $2, etc. but NOT named groups/named captures (i.e. you can't use \g{1}, \g{name}, etc. The \g{} notation doesn't work at all).

3. You can specify the path to your perl installation in pa() (Be sure to include the trailing forward slash). If you don't, it will use whatever version of perl is accessible from the command line in a terminal in whatever path this is run from.

4. You should specify the path of the perl script that this program calls: stataregex.pl. You can download this from my blog: shafiquejamal.blogspot.com. The default is the /Applications/STATA12/ directory. Be sure to include the trailing forward slash.

5. This will generate a binary/dummy variable the match was a success, and variables prefixed by this same variable name with _1, _2, _3 ... , _16 appended to store the named captures/groups.

program define pcre

     // 30101990
    // Written by Shafique Jamal (shafique.jamal@gmail.com), 01 Dec 2012. Use at own risk :-p
    //
    // This program allows the user to use perl compatible regular expressions on a (single) string VARIABLE (not a scalar string) for matching, obtaining captures from memory parenthesis, and
    //    subsitutions. Its not perfect... I think it supports quantifiers, it does support options/option modifiers, but it does not support named captures/groups.
    //
    // Usage:
    //
    //    Match only:
    //        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/i") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/")
    //  Substitution:
    //        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/gi") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/") repl("firstone_$1_secondone_$2")
    //
    // Note:
    //
    //    1. The arguement for re() should be a regular expression enclosed in double quotes. You can use only the forward slash for a delimiter. Named captures/groups don't work yet (I can't
    //        figure out why. Any ideas?)
    //  2. The arguement for repl() should be the replacement part of s//THIS_PART/. It should be enclosed in double quotes. Do NOT include the forward slashes or any delimiters.
    //        Option modifiers do NOT go here. You can use backreferences $1, $2, etc. but NOT named groups/named captures (i.e. you can't use \g{1}, \g{name}, etc. The \g{} notation doesn't work at all). 
    //  3. You can specify the path to your perl installation in pa() (Be sure to include the trailing forward slash). If you don't, it will use whatever version of perl is accessible from the command line in a terminal in whatever path this
    //        is run from.
    //  4. You should specify the path of the perl script that this program calls: stataregex.pl. You can download this from my blog: shafiquejamal.blogspot.com
    //        The default is the /Applications/STATA12/ directory. Be sure to include the trailing forward slash. 
    //    5. This will generate a binary/dummy variable the match was a success, and variables prefixed by this same variable name with _1, _2, _3 ... , _16 appended to store the named captures/groups.
    //        It will also store (NEW_VAR_NAME)_s to store the new string with the substitution
    //
    // Steps:
    // 1. generate a merge variable based on _n. This is to make sure that the newly generated variable matches up by observations with the argument variable
    // 2. outsheet the merge variable and the argument variable into a csv file
    // 3. read the file into memory using perl
    // 4. perform the reg exp mach querry on each observation. Store result (0 or 1) in an array, whose index is the observation number as given in the merge variable
    // 5. save a new datafile, with the orignal merge var, and the match results variable, with the variable names in the headings
    // 6. merge this
    //
    // 02-12-2012: go ahead and pass the full regular expression with delimiters and options in the option REgularexpression(string asis)
    // Next step: detect whether a variable or string is the first arguement
    //
    //
    //
    // 1. generate a merge variable based on _n. This is to make sure that the newly generated variable matches up by observations with the argument variable
   
    syntax varname(string) [if], GENerate(name) REgularexpression(string asis) [Perlprogramdirwithfinalslash(string asis) PAthroperlwithfinalslash(string asis) REPLacement(string asis)]
    version 9.1
    marksample touse, strok
    // di `"`0'"'
   
    // 2. outsheet the merge variable and the argument variable into a csv file
    tempvar mergevar
    tempname _m
    // tempname touse2
    tempfile tfoutsheet
    tempfile tfinsheet
    tempfile tfinsheed_dta
    gen `mergevar' = _n
    // for some reason, marksample is not working
    // gen `touse2' = 0
    // qui replace `touse2' = 1 `if'
    cap drop `generate'
    // this is the variable that will hold the string with subsitutions
    cap drop `generate'_*
   
   
    // count if `touse'
    // count if `touse2'
    // di `"`if'"'
    // list hhid `mergevar' `touse'
   
    // qui outsheet `mergevar' `varlist' `touse' using "tfoutsheet.csv", c replace
    qui outsheet `mergevar' `varlist' `touse' using "`tfoutsheet.csv'", c replace
   
    // check options passed
    if (`"`optionmodifiers'"'==`""') {
        local optionmodifiers `""'
    }
   
    // check for perl program directory
    if (`"`perlprogramdirwithfinalslash'"'==`""') {
        local perlprogramdirwithfinalslash "/Applications/STATA12/"
    }
   
    // 3. Perl operations. Need to supply arguments in this order: inputfilename outputfilename nameofnewvariablegenerated regularexpressionpattern regularexpressionoptions
     // shell `pathroperlwithfinalslash'perl -v
     // di `"shell `pathroperlwithfinalslash'perl "`perlprogramdirwithfinalslash'stataregex.pl" "`tfoutsheet.csv'" "`tfinsheet.csv'" "`generate'" `regularexpression'"'
     qui shell `pathroperlwithfinalslash'perl "`perlprogramdirwithfinalslash'stataregex.pl" "`tfoutsheet.csv'" "`tfinsheet.csv'" "`generate'" `regularexpression' '`replacement''
   
    preserve
    qui insheet using "`tfinsheet.csv'", c clear
    sort `mergevar'
    qui save `"`tfinsheed_dta'"', replace
    restore
   
    sort `mergevar'
    qui merge 1:1 `mergevar' using `"`tfinsheed_dta'"', gen(`_m')
    qui drop `_m'
   
    foreach var of varlist `generate'* {
        cap confirm numeric var `var'
        if (_rc == 0) {
            qui replace `var' = . if `touse' == 0
        }
        else {
            qui replace `var' = "" if `touse' == 0
        }
    }   

end program

Wednesday, November 28, 2012

Stata tip: collapse dataset while preserving variable and value labels

When I use the collapse command, I loose the variable and value labels associated with my variables. The following program does everything that the collapse command does, but preserves the variable and value labels. It also has an option to refrain from putting the stat in the variable label.

Usage:

collapseandpreserve hdff=zarpl time_id (last) obraz wouer=soc_st, by(hh_code resp) o

Program:

program define collapseandpreserve

    // Written by Shafique Jamal (shafique.jamal@gmail.com).
    // This will collapse the dataset and preserve the variable and value labels. The syntax for using this is just like with the collapse command.
    // There is one additional optional option: show stat. If you add this option to the command (collapseandperserve ... ,by(...) omitstatfromvarlabel
    //    then it will not show the statistic (i.e. (fist), (mean), (last), etc.) in the variable label
   
    syntax anything(id="variable and values" name=arguments equalok), by(string asis) [cw fast Omitstatfromvarlabel]
    version 9.1
   
    // save all the value labels
    tempfile tf
    label save using `"`tf'"', replace
   
    // get the list of variables to be collapse, and keep track of the value label - variable correspondence
    tempname precollapse_listofvars
    tempname postcollapse_listofvars
    tempname listofvaluelabels
    tempname valuelabelname
    tempname stat
    tempname oldvarname
    tempname newvarname
    local `stat' "(mean)"
    foreach a of local arguments {
        di `"word: `a'"'
        if (regexm(`"`a'"',"^\(.*\)$")) { // if there is something like (first), (mean), etc.
            local `stat' = `"`a'"'   
        }
        else { // This is a variable. Store the associated variable label and value label name
           
            // What if there is an = in the term? then need two list of variables: a precollapse list and a postcollapse list
            if (regexm(`"`a'"',"^(.*)=(.*)$")) {
                // di "Regex match!"
                local `oldvarname' = regexs(2)
                // di "oldvarname: ``oldvarname''"
                local `newvarname' = regexs(1)
                // di "newvarname: ``newvarname''"
            }
            else {
                // di "NO regex match!"
                local `oldvarname' `"`a'"'
                // di "oldvarname: ``oldvarname''"
                local `newvarname' `"`a'"'
                // di "newvarname: ``newvarname''"
            }
           
            local `precollapse_listofvars'   `"``precollapse_listofvars'' ``oldvarname''"'
            local `postcollapse_listofvars'   `"``postcollapse_listofvars'' ``newvarname''"'
            local `valuelabelname' : value label ``oldvarname''
            tempname vl_``newvarname''
            local `vl_``newvarname''' : variable label ``oldvarname''
            if (`"``vl_``newvarname''''"' == `""') {
                local `vl_``newvarname''' `"``newvarname''"'
            }
            di `"omitstatfromvarlabel = `omitstatfromvarlabel'"'
            if (`"`omitstatfromvarlabel'"'==`""') {
                local `vl_``newvarname''' `"``stat'' ``vl_``newvarname''''"'
                di "not omitting"
            }
            else {
                local `vl_``newvarname''' `"``vl_``newvarname''''"'
                di "omitting"
            }
           
            if (`"``valuelabelname''"' == `""') { // variable has no value label
                local `listofvaluelabels' `"``listofvaluelabels'' ."'
            }
            else {
                local `listofvaluelabels' `"``listofvaluelabels'' ``valuelabelname''"'
            }
        }
    }
   
    collapse `arguments', by(`by') `cw' `fast'
    // macro list
   
    // retrieve the valuelabels
    qui do `"`tf'"'
   
    // reapply the variable labels and the value labels
    tempname count
    local `count' = 0
    di "------------------------------------------------"
    foreach var of local `postcollapse_listofvars' {
        di `"var: `var'"'
        di `"its variable label: ``vl_`var'''"'
        // reapply the variable labels
        local `count' = ``count'' + 1
        label var `var' `"``vl_`var'''"'
       
        // reapply the value labels
        local `valuelabelname' : word ``count'' of ``listofvaluelabels''
        if (`"``valuelabelname''"' != `"."') {
            label values `var' ``valuelabelname''
        }
    }
end program

Tuesday, November 27, 2012

Stata tip: Plotting simillar graphs on the same graph

Suppose you want to make a bar graph of a variable, such as consumption, for two mutually exclusive groups such as males and females, represented by one categorical variable ("male"). This is easy enough: use the graph bar command with an over() option. What if you want to plot over two categorical variables, one within the other: for example you want to plot average consumption for males and females that are self-employed, and average consumption for males and females that are not self-employed. Easy enough, just include an extra over() option with the extra categorical variable. In this example, each over group is mutually exclusive: you are either male or female, but can't be both, and you are either self-employed or not self-employed, but can't be both. In this example, the variables in your dataset are:

consumption, male, self-employed

Suppose, however, that you want to essentially combine two separate graphs into one graph as follows: you want to plot average consumption over three categorical variables that are NOT mutually exclusive, so you don't want to plot one within the other. For example, imagine that you are considering three different policy options for awarding a social assistance benefits: the current policy ("currentpolicy"), alternative A ("alternative_a") and alternative B ("alternative_b"). Each mechanism divides the population into those who qualify and those who don't qualify for the social assistance benefit. Thus each policy option is represented in the data by a binary variable (a.k.a. a dummy variable, which is just a categorical variable with two levels: 0 for those who do not qualify and 1 for those who do qualify). Of the three policy options, the current policy is the least pro-poor, alternative A is more pro-poor (more of the benefits go to the poor), and alternative B is the most pro-poor.

Now, the three policy options are not mutually exclusive. It is possible to qualify under all three policy options, to be excluded under all three policy options, or to qualify under only one or two of the policy options. This is not true of the groups self-employed and not self-employed, and of the groups male and female. Essentially, suppose you want to combine the first three graphs below onto the the same graph, as shown in the fourth graph:






The program below will do this. Make sure you have label values defined for the categorical values. Also note that:

a) The value labels for each of the categories should have the same numbering (i.e. they should all be 0, 1, 2 or 0, 2, 5. It should not be the case that one has 0, 1, 3 and the other has 1, 3, 4).

b) The groups defined by the categorical variable will be plotted the in the order that you specify them. So in the fourth graph above, the order is: current policy, alternative a, alternative b, since that is what is given in the option catvarlist below: catvarlist("currentpolicy alternative_a alternative_b")

c) You need to specify new value labels with the same numbering as the original value labels for the categorical variables. This is necessary for the plot to turn out nice (the spacing gets mucked up if I don't force this). You can do this in the v() option below. So the original value labels could be:

current policy: 0 "non-exempt" 1 "exempt"
alternative a: 0 "does not receive" 1 "receives"
alternative b: 0 "Ineligible" 1 "eligible"

You might want to relabel this using the v() option as follows:

v(0 "non-beneficiary" 1 "beneficiary")

To generate the fourth graph:

. overlappingcatgraphmean pccd using "$WHO_KG_reports/eraseme.dta", gc(graph bar (asis)) catvarlist("currentpolicy alternative_a alternative_b")     v(0 "Non-beneficiary" 1 "Beneficiary") over2options(lab(angle(0) labs(vsmall))) replace go(note(`"Source: Some data source "') asy asc title("Average consumption of groups within population") subtitle("Simulation of policy options") ytitle("Per capita HH consumption (LCU)", margin(medium)) legend(size(small)) blabel(total, format(%9.0fc)))

To generate the first three graphs:

. graph bar pccd [aw=expfact], over(currentpolicy) asy title("Mean annual consumption comparison") subtitle("Current Policy") ytitle("Consumption")  note("Source: some data source") blabel(total, format(%9.0fc))

. graph bar pccd [aw=expfact], over(alternative_a) asy title("Mean annual consumption comparison") subtitle("Alternative A Policy") ytitle("Consumption") note("Source: some data source") blabel(total, format(%9.0fc))

. graph bar pccd [aw=expfact], over(alternative_b) asy title("Mean annual consumption comparison") subtitle("Alternative B Policy") ytitle("Consumption") note("Source: some data source") blabel(total, format(%9.0fc))

The program:

program define overlappingcatgraphmean

    // Written by Shafique Jamal (shafique.jamal@gmail.com). 27 Nov 2012
    // I want to plot the mean of a variable over categorical values on the same plot. Of course, these categorical variables will not be mutually exclusive between them (though the are within them)
    // "using" should specify a .dta file - this program will save a dataset
    // doesn't take weights - uses svy mean to calculate the mean
    //
    // You call it like this:
    //
    // overlappingcatgraphmean varname using "filename.dta", gc(graph bar (asis)) go(over(catvariablelabel, [over_subopts]) over(catvariablelevel, [over_subopts]) asc title("My Title") ...) catvarlist(categoricalvar1 categoricalvar2) replace
    //    
    // Note that:
    // 1. 'catvariablelabel', 'catvariablelevel_n' 'catvariablelevel' must be entered exactly as is (without the quotes) - these are names of variables that the program creates
    // 2. the order in which you enter the over() options is up to you.
    //
    // UPDATE 12-07-2012: Best way is to call it with a long dataset like this: graph bar v, over(eligible) over(avg). Also note that I haven't tested whether this works with "if"

    syntax varname using/ [if] [in], GCmd(string) GOptions(string asis) CATvarlist(varlist) Valuelabelsforlevels(string asis) [replace over1options(string asis) over2options(string asis) ]
    version 9.1
    marksample touse
    tempname tempmat
    tempname variablelabel
    local `variablelabel' : variable label `varlist'   
   
    // foreach category, find the mean
    foreach catvar of local catvarlist {
        tempfile tf_`catvar'
       
        // this is a pain: get the name of the variable's value label
        tempname tn_`catvarvaluelabel'
        local `tn_`catvarvaluelabel'' : value label `catvar'
        label save ``tn_`catvarvaluelabel''' using `"`tf_`catvar''"', replace
       
        // UPDATE: None of this is necessary. The user will pass a list of value labels, separated by spaces, and these will be assumed to be the same for all the categorical variables specified
        //    e.g. user can pass v(0 "Qualifies" 1 "Does not Qualify"), where the categorical variables and corresponding value lables are:
        //    exempt     : 0 "Exempt"     1 "Non-exempt"
        //    PMT        : 0 "Eligible"    1 "Non-eligible"
        //    MBPF    : 0 "Receives"    1 "Does not receive"

        di "cat = `catvar'"
        tempname catvarlabel_`catvar'
        local `catvarlabel_`catvar'': variable label `catvar'
        tempname levels_`catvar'
        levelsof `catvar', local(`levels_`catvar'')
        foreach level of local `levels_`catvar'' {
            svy: mean `varlist' if `catvar' == `level' & `touse'
            matrix `tempmat' = r(table)
            tempname mean_`catvar'_`level'
            local `mean_`catvar'_`level'' = `tempmat'[1,1]
            tempname vl`catvar'_`level'
            local `vl`catvar'_`level'' : label (`catvar') `level'
            // di "Mean of var: ``mean_`catvar'_`level'''"
        }
    }

    tempname valueslabels
    label define `valueslabels' `valuelabelsforlevels'
    tempfile tf_valuelabelsforlevels
    label save `valueslabels' using `"`tf_valuelabelsforlevels'"', replace
   
    // I'll now make a dataset out of this with the following variables: mean of the variable; category name; category level
    //     The latter two will be numeric, categorical variables with variable labels attached.
    preserve
    clear
    do `"`tf_valuelabelsforlevels'"'
    gen meanofvariable = .
    label var meanofvariable `"``variablelabel''"'
    gen catvariablelabel = ""
    gen catvariablelevel = ""
    gen catvariablelevel_n = .
    gen sortorder = .
   
    // create the sort order - it will be the order in which the categorical variables were specified
   
    tempname count sortcount
    local `count' = 0
    local `sortcount' = 0
    foreach catvar of local catvarlist {
        // di "cat = `catvar'"
        local `sortcount' = ``sortcount'' + 1
        foreach level of local `levels_`catvar'' {
            local `count' = ``count'' + 1
            set obs ``count''
            replace meanofvariable = ``mean_`catvar'_`level''' in ``count''
            replace catvariablelabel = `"``catvarlabel_`catvar'''"' in ``count''
            replace catvariablelevel_n = `level' in ``count''
            replace catvariablelevel = `"``vl`catvar'_`level'''"' in ``count''
            replace sortorder = ``sortcount'' in ``count''
           
            // di "Mean of var: ``mean_`catvar'_`level'''"
        }
    }
    label values catvariablelevel_n `valueslabels'
    save `"`using'"', `replace'
    `gcmd' meanofvariable,  over(catvariablelevel_n, sort(catvariablelevel_n) `over1options') over(catvariablelabel, sort(sortorder) `over2options') `goptions'
    restore

end program

Stata tip: fixing the legend on bar graphs to display variable labels instead of variable names

Check out the legends on these two graphs (the first one is the problem legend, the second one is the better legend):



For the first one, I used the command:

. graph bar (mean) appliedmed hospitalized [aw=expfact], over(exempt) ...

and in the legend, it used "mean of [variable name]" instead of using the variable label. If you use the option nolabel after the graph bar command, you would just get "[variable name]" in the legend. How do you get stata to use the variable labels in the legend instead of the variable names, like in the second graph above? (note that, in the second graph, my program makes the variable labels go over two lines when they are long, and makes the line break at a space, not in the middle of a word). Use the following code:

Usage:

/ local vlist appliedmed hospitalized
. makelegendlabelsfromvarlabels `vlist', local(relabellegend) c(30)
. graph bar (mean) `vlist' [aw=expfact], over(exempt) title(`"Share reporting applied for medical assistance in the past 30 days"') ytitle("Fraction of group", margin(medium)) blabel(total, format(%9.2fc)) subtitle("Average for each group") legend(size(vsmall) `relabellegend')


Where the program makelegendlabelsfromvarlabels is defined as below. In the above, the option c(30) tells stata that the first line should have only 30 characters, and that the rest of the value label should be placed on the line below.

program define makelegendlabelsfromvarlabels

    // Written by Shafique Jamal (shafique.jamal@gmail.com). 25 Nov 2012
    //
   
    // Wrote it to fix an annoyance with graph bar. I want graph bar to use variable labels, not variable names, in the legend, but it won't do this if I am using a "(stat)" rather than "(asis)"
    syntax varlist, local(name local) [c(integer 30)]
    version 9.1
   
    // local charlength = 30
   
    tempname count
    local `count' = 0
    tempname labeloptions
    tempname variablelabel
    foreach var of local varlist {
        local `count' = ``count'' + 1
        local `variablelabel' : variable label `var'
       
        // It would be great to break this up at a word boundary if the length is > 34 characters
        if (length(`"``variablelabel''"') > `c') {
            tempname variablelabel_part1
            tempname variablelabel_part2
            tempname variablelabel_tochange
            tempname positionofspace
            tempname positionofspace_prev
            tempname exitwhileloop
            local `exitwhileloop'   = 0
            local `positionofspace' = 0
            local `variablelabel_tochange' `"``variablelabel''"'
            while (``exitwhileloop'' == 0) {
           
                local `positionofspace' = strpos(`"``variablelabel_tochange''"', " ")
                if (``positionofspace'' >= `c' | ``positionofspace''==0) {
                    local `exitwhileloop'   = 1
                }
                else {
                    local `positionofspace_prev' = ``positionofspace''
                    local `variablelabel_tochange' = subinstr(`"``variablelabel_tochange''"'," ",".",1)
                }
           
            }
           
            local `variablelabel_part1' = substr(`"``variablelabel''"', 1, ``positionofspace_prev'')
            local `variablelabel_part2' = substr(`"``variablelabel''"', ``positionofspace_prev'' + 1, . )
            local `labeloptions' `"``labeloptions'' label(``count'' `"``variablelabel_part1''"' `"``variablelabel_part2''"') "'
        }
        else {
            local `labeloptions' `"``labeloptions'' label(``count'' `"``variablelabel''"') "'
        }
    }
   
    // di `"labeloptions: ``labeloptions''"'
    // need to return this in a local macro
    c_local `local' `"``labeloptions''"'
   

end program