Summary statistics
In this short part, we are going to learn how to make summary statistics, especially by group, to export them in .tex format.
Simple statistics
We use the very standard and famous auto.dta database:
// Load the data
sysuse auto, clear
// Take a quick look at the data
describe
summarize
list in 1/10We are going to make simple summary statistics first:
summarize price
summarize price, detailTo make it more compact and to have several variables: tabstat.Use by() to split by a categorical variable, tabstat gives flexible stats and layout:
tabstat price mpg weight length, statistics(mean sd min p50 max n) columns(statistics)
tabstat price mpg weight, by(foreign) statistics(mean sd p50 n) columns(statistics)To make it frequency tables, we can use tabulate with one or two variables:
tabulate foreign
tabulate foreign rep78, row columnIf you want summary statistics (e.g., mean of mpg) by category use tabulate with summarize():
tabulate foreign, summarize(mpg)To make simple comparison t-test (mean of price by foreign):
ttest price, by(foreign) unequalLaTeX export
There are two widely used approaches: estpost + esttab (from the estout package) — flexible, works with summarize, tabstat, tabulate, ttest, etc., and can write LaTeX directly.
A large amount of work can be achieved with base Stata (historically user-written packages have been added to the “main” Stata version). However, some packages have to installed before hand. You need to do it only once on your computer. To use the aforementioned packages, you need to install them. To do so:
ssc install estoutHere is a short summary of the key estout commands:
Clear any stored estimates:
eststo clear(shortcut forestimates store clear)Post summary stats (overall)
estpost summarize price mpg weight length, listwiseWrite LaTeX table (replace file if exists)
esttab . using "summary_overall.tex", replace ///
title("Summary statistics (overall)") ///
cells("count(fmt(%0.0f)) mean(fmt(%9.2f)) sd(fmt(%9.2f)) min(fmt(%9.0f)) p50(fmt(%9.2f)) max(fmt(%9.0f))") ///
label booktabs noobs nomtitleThis creates summary_overall.tex you can \input{} into your LaTeX document.
Notice the use of /// at the end of the line. It breaks long line of code. Hence, you need to run all the code bunch. In more recent of Stata, the lines break automatically.
It also works by group: let’s say you want the mean and standard deviation of price, mpg, and weight by foreign:
estpost tabstat price mpg weight, by(foreign) statistics(mean sd p50 n) nototal columns(statistics)Export to LaTeX
esttab . using "summary_by_foreign.tex", replace cells(mean sd) label nogaps booktabs noobs nomtitle
This produces a LaTeX table where each group appears and mean/sd are shown. Note that you can export in other text formats .txt, .smcl (the Stata format), .xlsx, etc.
Finally, we can export a frequency table for foreign:
estpost tabulate foreign
esttab . using "freq_foreign.tex", replace ///
cells("b(lab(freq)) pct(fmt(2))") label booktabs noobs nomtitleRegression tables
To make regression tables, we need to make regressions! The function is simple: regress y x, options, usually abbreviated with reg. Many other functions exist: areg, ivregress, reghdfe, etc. depending on the model you want to use but this is beyond the scope of this lecture.
Let’s run some simple example by assessing the correlation between the cars’ characteristics and their price. Hence, the model writes:
\[ \texttt{price}_i = \beta_0 + \beta_1 \texttt{foreign}_i + \beta_2 \texttt{mpg}_i + \texttt{rep78}_i + e_i \]
The associated Stata code is:
sysuse auto.dta, clear
reg price i.foreign mpg i.rep78We use the i. operator because foreign and rep78 are discrete variables (even though coded as a numeric).
We also want to estimate the same model and include the interaction between foreign and mpg, to check if the gasoline consumption differs by origin.
\[ \texttt{price}_i = \beta_0 + \beta_1 \texttt{foreign}_i + \beta_2 \texttt{mpg}_i + \beta_3 \texttt{foreign}_i \times \texttt{mpg}_i + e_i \]
The Stata code with the interaction term writes:
reg price i.foreign mpg i.rep78 i.foreign#c.mpgNotice that we use the c. operator to ensure that mpg is considered as a numeric variable and such that the interaction is not done on all combinations of foreign and mpg.
Now that we ran our models, how to export the results to LaTeX? We use the estimates store, abbreviated in eststo, syntax. Two ways to proceed:
eststo: reg price i.foreign mpg i.rep78 i.foreign#c.mpg
/* or */
reg price i.foreign mpg i.rep78 i.foreign#c.mpg
eststo
/* if we want to give a name */
eststo my_name : reg price i.foreign mpg i.rep78 i.foreign#c.mpg
reg price i.foreign mpg i.rep78 i.foreign#c.mpg
eststo my_nameWhile Stata is opened and no eststo clear is ran, we can print again the results by esttab est1. To export the results, we also use esttab with options, notably the tex or preferably the booktabs options (better LaTeX rendering).
eststo no_inter: reg price i.foreign mpg i.rep78
eststo inter: reg price i.foreign mpg i.rep78 i.foreign#c.mpg
esttab no_inter inter
esttab no_inter, booktabs // make it LaTeX
esttab no_inter, booktabs se // add the SE
esttab no_inter, booktabs se(%9.0f) drop(_cons) // format of se + drop cons
esttab no_inter, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) // change sig stars
esttab no_inter, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) ///
r2 indicate("Repair record dummies = *.rep78") // add R2 and add a "Yes/No" ligne for dummies
esttab no_inter, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) ///
r2 indicate("Repair record dummies = *.rep78") ///
label // add labels
esttab no_inter, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) ///
r2 indicate("Repair record dummies = *.rep78") ///
label nobaselevels // no base levels
esttab no_inter inter, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) ///
r2 indicate("Repair record dummies = *.rep78") ///
label nobaselevels // for two models
esttab no_inter inter, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) ///
r2 indicate("Repair record dummies = *.rep78") ///
label nobaselevels nonum // no model num
esttab no_inter inter using $path/output/reg.tex, booktabs se(%9.0f) drop(_cons) star(* 0.1 ** 0.05 *** 0.001) ///
r2 indicate("Repair record dummies = *.rep78") ///
label nobaselevels nonum replace // no model num + latex exportAdvanced users
A package I particularly like is tabout, which documentation can be found here. It is not straightforward to use but comes very handy to make complex descriptive statistics table.
Exercise
To wrap up on data manipulation and summary statistics, we are going to make an exercise using first the built-in census.dta.
- Set-up your dofile and create two globals: one containing your path and one containing the string “1980 2010”
- Tabulate the region name (there are four census regions in the US: Northwest, Midwest, West and South)
- Regions’ names are not informative enough: create a new label containing the right regions’ names and update the
regionlabel. - Using
decode var, gen(new_var), create a new variable containg the regions’ names as string - Keep only the
state,regionandpopvariables - Rename
poptopop1980 - Save in
your/path/temp/census1980.dta
Now we are going to use another dataset, an Excel file.
- Open the
nst-est2019-01.xlsxfrom the raw data folder usingimport exceland the path you created! Pay attention to the range you want to open: we keep only A10 to B60 - Rename the first two columns
stateandpop2010 - State names all start with a “.”. Using
subinstrcommand, remove the dots - Save in
your/path/temp/census2010.dta
Now we merge the two datasets.
- Clear the open datasets, open
census2010.dtaand merge withcensus1980.dtawithm:mandstateas joining variable. What happened? Tabulate_mergeto investigate a bit more. - Some states are not named the same depending on the dataset. Write a loop that runs through the
yearglobal: open the dataset, replace “W.” by “West”, “S.” by “South”, and “N.” by North, save the dataset. - Merge the datasets. Remove the only non-merged observation.
Finally, we make some descriptive statistics:
- Compute the population change
- Make a table summarizing (mean, sd and median) the change in population by region
- Reshape the dataset in long format to have two lines per state: one for 1980 and one for 2010
Bonus:
- Collapse
changebyregion(compute the mean) - Using
graph bar var, over(group), make a graph bar of change by region