Saturday, December 8, 2012

Stata tip: "Mega If" : a simple command to generate long if conditions

Suppose you need to type the following command, which has many comparisons in its if qualifier:

replace group = 1 if (benefitNumber == 1 | benefitNumber == 2 | benefitNumber == 3 | benefitNumber == 4 | benefitNumber == 5 | benefitNumber == 6 | benefitNumber == 11 | benefitNumber == 12 | benefitNumber == 17 | benefitNumber == 19 | benefitNumber == 21 | benefitNumber == 22 | benefitNumber == 23)

It's a bit much to type. The program megaif below will generate this long line from the much shorter command:

megaif 1 2 3 4 5 6 11 12 17 19 21 22 23, v(benefitNumber) c(replace group = 1)

The above has lots of "or equals", but you can also generate lots of "and does not equal", for example:

replace group = 1 if (benefitNumber != 1 & benefitNumber != 2 & benefitNumber != 3 & benefitNumber != 4 & benefitNumber != 5 & benefitNumber != 6 & benefitNumber != 11 & benefitNumber != 12 & benefitNumber != 17 & benefitNumber != 19 & benefitNumber != 21 & benefitNumber != 22 & benefitNumber != 23)

using the following command:

megaif 1 2 3 4 5 6 11 12 17 19 21 22 23, v(benefitNumber) c(replace group = 1) e(!=) s(&)
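A note on why e(!=) is paired with s(&): a chain of "does not equal" tests joined with | is true for every observation, whereas joining with & excludes exactly the listed values. A minimal check of that logic, assuming the auto dataset that ships with Stata (rep78 and the values 1 and 2 are stand-ins for illustration, not part of the example above):

```stata
sysuse auto, clear

// "!= ... & != ..." is the negation of "== ... | == ..."
assert (rep78 != 1 & rep78 != 2) == !inlist(rep78, 1, 2)

// joined with | instead, the condition holds for every observation
assert (rep78 != 1 | rep78 != 2)
```

Both assert commands should pass silently if the reasoning holds.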

It works for numeric variables and for string variables too. Check this out:

megaif "a b" b "cc" d `"e"', v(benefit_stringvar) c(replace group = 1) e(!=) s(&)

displays and then executes the following command:

cmd to execute: replace group = 1 if (benefit_stringvar != "a b" & benefit_stringvar != "b" & benefit_stringvar != "cc" & benefit_stringvar != "d" & benefit_stringvar != "e")

As you can see, for string variables the quotes are optional unless you're checking for text that has a space in it. The program is below. Enjoy!
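The optional quotes follow from how Stata splits a list into tokens: quotes bind words together and are stripped from each token, and the program then adds its own quotes for string variables. A small sketch of the tokenizing step on its own (the local name arguments mirrors the one used in the program; the values are made up):

```stata
// A value list containing one quoted token with a space in it
local arguments `""a b" b "cc""'

foreach w of local arguments {
    // each token arrives without its outer quotes
    display `"token: [`w']"'
}
```

This should show three tokens, with "a b" kept together as one.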

program define megaif

    // By Shafique Jamal
    // e.g.
    // sysuse auto, clear
    // megaif 0 1, v(foreign) c(drop) e(~=) // this will drop all the observations. Just for illustrative purposes to show how the command could be used
    // another e.g.
    // The command:
    //    megaif 14 15 16 17 18 19 20 21 22, c(gen priv1 = 1) var(income_type2)
    // would execute the following command:
    //     gen priv1 = 1 if (income_type2 == "14" | income_type2 == "15" | income_type2 == "16" | income_type2 == "17" | income_type2 == "18" | income_type2 == "19" | income_type2 == "20" | income_type2 == "21" | income_type2 == "22")


    syntax anything(id="variable and values" name=arguments), Var(varname) Cmd(string) [Equality(string) Separator(string)]
   
    // The default is equality
    if ("`equality'" == "") {
        local equality "=="
    }
   
    if ("`separator'" == "") {
        local separator " | "
    }
    else {
        local separator " `separator' "
    }
   
    cap confirm numeric variable `var'
    if (_rc == 0) { // variable is numeric
        local numericvar = 1
    }
    else {
        local numericvar = 0
    }
    // di "numericvar = `numericvar'"
   
    local count = 0
    local orcondition ""
    foreach w of local arguments {
        local count = `count' + 1

        // di `"w = `w'"'
        if (`numericvar' == 0) {
            local orcondition `"`orcondition'`orseparator'`var' `equality' "`w'""'
        }
        else {
            local orcondition `"`orcondition'`orseparator'`var' `equality' `w'"'
        }
        local orseparator "`separator'"
    }
   
    // di `"orcondition = `orcondition'"'
    di `"cmd to execute: `cmd' if (`orcondition') "'
    // set trace on
    // set traced 1
    `cmd' if (`orcondition')
    // set trace off

end
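
The core of the program is the loop that glues the comparisons together with a separator. A stripped-down sketch of just that idea, assuming the auto dataset that ships with Stata (rep78 and the values 1 2 3 are stand-ins for illustration):

```stata
sysuse auto, clear

// Build "rep78 == 1 | rep78 == 2 | rep78 == 3" one value at a time
local values 1 2 3
local cond ""
local sep ""
foreach w of local values {
    local cond "`cond'`sep'rep78 == `w'"
    local sep " | "          // empty before the first term, " | " afterwards
}

display "`cond'"
count if (`cond')

// Cross-check against the built-in inlist() function
assert (`cond') == inlist(rep78, 1, 2, 3)
```

For short lists, inlist() by itself may be all you need; see help inlist for the limits on the number of arguments.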




Thursday, December 6, 2012

Stata tip: Quickly, and in one command, rename all variable labels of variables generated with the 'xi' command to reflect the value labels of the xi'd variable

When you use the xi command on categorical variables, even on those that have a value label associated with them, you get this:

. xi: svy, subpop(rural): reg logpccd rural_regressors refrigerator car livingroomsper numchild11 numchild11_sq i.oblast i.typeofdwell i.roof i.coldwatermeterinstalled i.soc_

. d

_Ioblast_3      byte   %8.0g                  oblast==3
_Ioblast_4      byte   %8.0g                  oblast==4
_Ioblast_5      byte   %8.0g                  oblast==5
_Ioblast_6      byte   %8.0g                  oblast==6
_Ioblast_7      byte   %8.0g                  oblast==7
_Ioblast_8      byte   %8.0g                  oblast==8
_Ioblast_11     byte   %8.0g                  oblast==11
_Itypeofdwe_2   byte   %8.0g                  typeofdwelling==2
_Itypeofdwe_3   byte   %8.0g                  typeofdwelling==3
_Itypeofdwe_4   byte   %8.0g                  typeofdwelling==4
_Itypeofdwe_5   byte   %8.0g                  typeofdwelling==5
_Itypeofdwe_6   byte   %8.0g                  typeofdwelling==6
_Itypeofdwe_7   byte   %8.0g                  typeofdwelling==7
_Itypeofdwe_8   byte   %8.0g                  typeofdwelling==8
_Itypeofdwe_9   byte   %8.0g                  typeofdwelling==9


The variable labels are not much more informative than the variable names. Below is a program that automatically rewrites the variable labels of the variables created by the xi command so that they include the corresponding value label, as follows:

_Ioblast_3      byte   %8.0g                  oblast=Jalalabat
_Ioblast_4      byte   %8.0g                  oblast=Naryn
_Ioblast_5      byte   %8.0g                  oblast=Batken
_Ioblast_6      byte   %8.0g                  oblast=Osh
_Ioblast_7      byte   %8.0g                  oblast=City of Osh
_Ioblast_8      byte   %8.0g                  oblast=Chui
_Ioblast_11     byte   %8.0g                  oblast=City of Bishkek
_Itypeofdwe_2   byte   %8.0g                  typeofdwelling=Apartment or room in a residential hotel
_Itypeofdwe_3   byte   %8.0g                  typeofdwelling=Separate house
_Itypeofdwe_4   byte   %8.0g                  typeofdwelling=Part of a house
_Itypeofdwe_5   byte   %8.0g                  typeofdwelling=Dormitory
_Itypeofdwe_6   byte   %8.0g                  typeofdwelling=Lodge or a tied cottage (temporary tenure dwelling)
_Itypeofdwe_7   byte   %8.0g                  typeofdwelling=Other non-residential premises used for residence
_Itypeofdwe_8   byte   %8.0g                  typeofdwelling=Other residential premises
_Itypeofdwe_9   byte   %8.0g                  typeofdwelling=Barracks
 

There are actually two programs - mine is a wrapper for a program that Nicholas J. Cox wrote. Both of these are below.

Usage (run this after the xi command):

. varsformyrelabel
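
What the relabeling does for a single dummy variable can be sketched with built-in commands. This is only an illustration on the auto dataset that ships with Stata (foreign and its value label come from that dataset, not from the survey data above), not a substitute for the programs:

```stata
sysuse auto, clear
xi i.foreign                               // creates _Iforeign_1, labelled "foreign==1"

local vallabel : value label foreign       // name of the value label attached to foreign
local text : label `vallabel' 1            // text that the value label assigns to 1
label variable _Iforeign_1 "foreign=`text'"

describe _Iforeign_1
```

The programs do the same thing for every dummy that xi created, looking up the code after "==" in each variable label.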

Programs:

program define varsformyrelabel

    // Written by Shafique Jamal (shafique.jamal@gmail.com), 12-07-2012
    // UPDATE 12-07-2012: Need to change how the variable name for the list of `allunxidvariables' is determined. Need to get it from the variable label, rather than the variable name
    //

    // Get list of variables that were xi'd
    local xivars "`_dta[__xi__Vars__To__Drop__]:'"
    // di `"xivars:`xivars'"'
   
    // Now just need to get list of un-xi'd variables from this list
    // Here is the first one
    local currentdummyvar : word 1 of `xivars'
    // di `"currentdummyvar:`currentdummyvar'"'
   
    // This will get the full variable name
    local currentunxidvar = regexr("`: variable label `currentdummyvar''","==.*$","")
    // di `"currentunxidvar:`currentunxidvar'"'
    local allunxidvars "`currentunxidvar'"
    // di `"allunxidvars:`allunxidvars'"'
   
    // This will get the _I`var' name, without the _# suffix - I need this for the first argument to the myrelabel routine. Variable name gets shortened
    local currentunxidvarwith_I = regexr("`currentdummyvar'","_[0-9]+$","")
    // di `"currentunxidvar:`currentunxidvarwith_I'"'
    local allunxidvarswith_I "`currentunxidvarwith_I'"
    // di `"allunxidvarswith_I:`allunxidvarswith_I'"'
   
    // Now loop through the rest
    local count = 0
    foreach var of local xivars {
        local count = `count' + 1
        if (`count' != 1) {
            local w : word `count' of `xivars'
            // di "w: `w'"
           
            // check whether the next xi'd var is related to the current one
            // if (regexm("`w'","^_I`currentunxidvar'_[0-9]+$")) { // yes, this is part of the same family as the current _I.... variable under consideration
            if (regexm("`: variable label `w''","^`currentunxidvar'==.*$")) { // yes, this is part of the same family as the current _I.... variable under consideration
                // di "skip"
            }
            else { // no, it is different. add to the list
                // this gets the full variable name
                local currentunxidvar = regexr("`: variable label `w''","==.*$","")
                // di `"currentunxidvar:`currentunxidvar'"'
                local allunxidvars "`allunxidvars' `currentunxidvar'"
                // di `"allunxidvars:`allunxidvars'"'
               
                // This gets the _Ivar name
                local currentunxidvarwith_I = regexr("`w'","_[0-9]+$","")
                // di `"currentunxidvar:`currentunxidvarwith_I'"'
                local allunxidvarswith_I "`allunxidvarswith_I'  `currentunxidvarwith_I'"
                // di `"allunxidvarswith_I:`allunxidvarswith_I'"'
            }       
        }
    }

    di "allunxidvars: `allunxidvars'"
    di `"allunxidvarswith_I:`allunxidvarswith_I'"'
    local count = 0   
    foreach var of local allunxidvars {
        local count = `count' + 1
        local varwith_I : word `count' of `allunxidvarswith_I'
        myrelabel `varwith_I'_* `var'
    }
   
end


program def myrelabel
*! NJC 1.0.0 15 July 2003
    version 7
    syntax varlist(numeric)

    tokenize `varlist'
    local nvars : word count `varlist'
    local last ``nvars''
    local vallabel : value label `last'
    if "`vallabel'" == "" {
        di as err "`last' not labelled"
        exit 498
    }

    local `nvars'
    local varlist "`*'"

    foreach v of local varlist {
        local varlabel : variable label `v'
        local eqs = index(`"`varlabel'"', "==")
        if `eqs' {
            local value = real(substr(`"`varlabel'"', `eqs' + 2, .))
            if `value' < . {
                local label : label `vallabel' `value'
                label var `v' `"`last'=`label'"'
            }
        }
    }

end


Monday, December 3, 2012

Stata tip: Using perl compatible regular expressions (PCRE) in Stata

UPDATE 12-07-2012: Thanks to Nicholas J Cox, who solved the problem I was having with the -marksample- command. I replaced the code below with the new, fixed code.

Stata's regular expression engine is too limited for my needs. I asked on Statalist whether Stata's regular expression engine can be changed, but apparently that is not possible. So I wrote a Stata program (.ado file) that calls a Perl script to run a regular expression on a variable.

Matching and substitution are supported. Named captures/groups are not, but non-named captures (e.g. $1, $2, etc.) ARE supported. I think quantifiers are supported. Anyways, if it works as I think it should based on my design and testing, it should be a decent improvement over Stata's built-in regular expression engine (I hope they update it soon).

You can download the Perl script here. Place it in any directory, but remember which one, because you will have to specify it when you call the program.

The Stata program is here and below. Put this in your personal ado folder.

Usage:

Match only:
        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/i") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/")

Substitution:
        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/gi") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/") repl("firstone_$1_secondone_$2") 

Notes:

1. The argument for re() should be a regular expression enclosed in double quotes. You can use only the forward slash as a delimiter. Named captures/groups don't work yet (I can't figure out why. Any ideas?)

2. The argument for repl() should be the replacement part of s//THIS_PART/. It should be enclosed in double quotes. Do NOT include the forward slashes or any delimiters. Option modifiers do NOT go here. You can use backreferences $1, $2, etc. but NOT named groups/named captures (i.e. you can't use \g{1}, \g{name}, etc. The \g{} notation doesn't work at all).

3. You can specify the path to your Perl installation in pa() (be sure to include the trailing forward slash). If you don't, it will use whatever version of Perl is accessible from the command line in a terminal in whatever path this is run from.

4. You should specify the path of the Perl script that this program calls, stataregex.pl (this is the Perlprogramdirwithfinalslash() option). You can download the script from my blog: shafiquejamal.blogspot.com. The default is the /Applications/STATA12/ directory. Be sure to include the trailing forward slash.

5. This will generate a binary/dummy variable indicating whether the match was a success, and variables prefixed by this same variable name with _1, _2, _3 ... , _16 appended to store the numbered captures/groups ($1, $2, etc.). For a substitution, the new string is stored in the variable with _s appended.

program define pcre

     // 30101990
    // Written by Shafique Jamal (shafique.jamal@gmail.com), 01 Dec 2012. Use at own risk :-p
    //
    // This program allows the user to use perl compatible regular expressions on a (single) string VARIABLE (not a scalar string) for matching, obtaining captures from memory parentheses, and
    //    substitutions. It's not perfect... I think it supports quantifiers, it does support options/option modifiers, but it does not support named captures/groups.
    //
    // Usage:
    //
    //    Match only:
    //        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/i") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/")
    //  Substitution:
    //        pcre SOME_STRING_VARIABLE, re("/^(\d)(\w)/gi") gen(NEW_VARIABLE_TO_BE_GENERATED) pa("/usr/local/ActivePerl-5.16/bin/") repl("firstone_$1_secondone_$2")
    //
    // Note:
    //
    //    1. The argument for re() should be a regular expression enclosed in double quotes. You can use only the forward slash as a delimiter. Named captures/groups don't work yet (I can't
    //        figure out why. Any ideas?)
    //    2. The argument for repl() should be the replacement part of s//THIS_PART/. It should be enclosed in double quotes. Do NOT include the forward slashes or any delimiters.
    //        Option modifiers do NOT go here. You can use backreferences $1, $2, etc. but NOT named groups/named captures (i.e. you can't use \g{1}, \g{name}, etc. The \g{} notation doesn't work at all).
    //    3. You can specify the path to your perl installation in pa() (Be sure to include the trailing forward slash). If you don't, it will use whatever version of perl is accessible from the command line in a terminal in whatever path this
    //        is run from.
    //    4. You should specify the path of the perl script that this program calls: stataregex.pl. You can download this from my blog: shafiquejamal.blogspot.com
    //        The default is the /Applications/STATA12/ directory. Be sure to include the trailing forward slash.
    //    5. This will generate a binary/dummy variable indicating whether the match was a success, and variables prefixed by this same variable name with _1, _2, _3 ... , _16 appended to store the numbered captures/groups.
    //        It will also generate (NEW_VAR_NAME)_s to store the new string with the substitution
    //
    // Steps:
    // 1. generate a merge variable based on _n. This is to make sure that the newly generated variable matches up by observations with the argument variable
    // 2. outsheet the merge variable and the argument variable into a csv file
    // 3. read the file into memory using perl
    // 4. perform the reg exp match query on each observation. Store result (0 or 1) in an array, whose index is the observation number as given in the merge variable
    // 5. save a new datafile, with the original merge var, and the match results variable, with the variable names in the headings
    // 6. merge this
    //
    // 02-12-2012: go ahead and pass the full regular expression with delimiters and options in the option REgularexpression(string asis)
    // Next step: detect whether a variable or string is the first argument
    //
    //
    //
    // 1. generate a merge variable based on _n. This is to make sure that the newly generated variable matches up by observations with the argument variable
   
    syntax varname(string) [if], GENerate(name) REgularexpression(string asis) [Perlprogramdirwithfinalslash(string asis) PAthroperlwithfinalslash(string asis) REPLacement(string asis)]
    version 9.1
    marksample touse, strok
    // di `"`0'"'
   
    // 2. outsheet the merge variable and the argument variable into a csv file
    tempvar mergevar
    tempname _m
    // tempname touse2
    tempfile tfoutsheet
    tempfile tfinsheet
    tempfile tfinsheed_dta
    gen `mergevar' = _n
    // for some reason, marksample is not working
    // gen `touse2' = 0
    // qui replace `touse2' = 1 `if'
    cap drop `generate'
    // this is the variable that will hold the string with substitutions
    cap drop `generate'_*
   
   
    // count if `touse'
    // count if `touse2'
    // di `"`if'"'
    // list hhid `mergevar' `touse'
   
    // qui outsheet `mergevar' `varlist' `touse' using "tfoutsheet.csv", c replace
    qui outsheet `mergevar' `varlist' `touse' using "`tfoutsheet.csv'", c replace
   
    // check options passed
    if (`"`optionmodifiers'"'==`""') {
        local optionmodifiers `""'
    }
   
    // check for perl program directory
    if (`"`perlprogramdirwithfinalslash'"'==`""') {
        local perlprogramdirwithfinalslash "/Applications/STATA12/"
    }
   
    // 3. Perl operations. Need to supply arguments in this order: inputfilename outputfilename nameofnewvariablegenerated regularexpressionpattern regularexpressionoptions
     // shell `pathroperlwithfinalslash'perl -v
     // di `"shell `pathroperlwithfinalslash'perl "`perlprogramdirwithfinalslash'stataregex.pl" "`tfoutsheet.csv'" "`tfinsheet.csv'" "`generate'" `regularexpression'"'
     qui shell `pathroperlwithfinalslash'perl "`perlprogramdirwithfinalslash'stataregex.pl" "`tfoutsheet.csv'" "`tfinsheet.csv'" "`generate'" `regularexpression' '`replacement''
   
    preserve
    qui insheet using "`tfinsheet.csv'", c clear
    sort `mergevar'
    qui save `"`tfinsheed_dta'"', replace
    restore
   
    sort `mergevar'
    qui merge 1:1 `mergevar' using `"`tfinsheed_dta'"', gen(`_m')
    qui drop `_m'
   
    foreach var of varlist `generate'* {
        cap confirm numeric var `var'
        if (_rc == 0) {
            qui replace `var' = . if `touse' == 0
        }
        else {
            qui replace `var' = "" if `touse' == 0
        }
    }   

end
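
The steps listed in the program's comments (export with a merge key, process outside Stata, read back, merge) can be sketched without Perl. In this rough illustration the external step is replaced by Stata's built-in regexm(), and the names obsid, matched, base and results, the auto dataset and the pattern "^AMC" are all assumptions made for the example:

```stata
sysuse auto, clear
gen long obsid = _n                      // merge key: original observation number

tempfile base results
outsheet obsid make using "`base'.csv", comma replace

preserve
insheet using "`base'.csv", comma names clear
// stand-in for the external Perl step: flag makes that start with "AMC"
gen byte matched = regexm(make, "^AMC")
keep obsid matched
save "`results'", replace
restore

// bring the result back, lined up by the original observation number
merge 1:1 obsid using "`results'", nogenerate
erase "`base'.csv"
```

Keying on the observation number keeps the results attached to the right rows even if the file comes back in a different order.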