Tutorial 6: Autonomous replication of an IV paper

Arceo, Hanna, and Oliva (The Economic Journal, 2015)

Does air pollution affect child mortality? Is this relationship linear? Air pollution is a major public health concern and a cause of various diseases. Most estimates in the literature come from developed countries, which limits their external validity for developing countries. In this paper, the authors propose a novel estimate of the effect of air pollution on infant mortality in a developing-country context: Mexico.

Note

For this PC, you are asked to upload a .pdf file at the end of the day. You can work in groups. The exercise is not graded, but the quality of the output will be taken into account for the participation grade. Please upload the file on Moodle with the following naming convention: “PC6_GR1_NAME1_NAME2.pdf” or “PC6_GR2_NAME1_NAME2.pdf” (names in alphabetical order).

Variables used in the replication
Variable	Description
`w_tmp_mean`	Average temperature
`w_precip`	Precipitation
`w_evap`	Evaporation
`w_invterm`	Thermal inversion
`rw_infant_1y`	Infant mortality (under age 1) in Mexico
`grw_infant_1y`	Infant mortality (under age 1) in Guadalajara
`pm10_max24hr`	PM10 pollution
`co_max8hr`	CO (carbon monoxide) pollution
`so2_mean`	Sulfur dioxide (SO2) pollution
`o3_mean`	Ozone pollution
`m`	Municipal ID
`week`, `month`, `year`	Time ID

Exercise 1: Intuition

Most answers are in the introduction of the paper.

Why would a simple OLS regression of child mortality on pollution lead to a biased estimate?
A common IV strategy for pollution is to use regulation. Why would it lead to a weak first stage?
The authors argue that the external validity of the results found in developed countries is low. Why? Would we over- or under-estimate the real effect if we were to use the coefficients found in developed/less polluted countries?
Instead of running a simple OLS regression, the authors suggest adding month and month-area fixed effects. Why does this improve the quality of the estimation?
The authors suggest using thermal inversion as an instrument for air pollution. Discuss the exogeneity and the relevance conditions of this instrument.
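
As a notational reminder for the last question, a generic two-stage least squares setup can be written as below. The symbols are illustrative assumptions rather than the paper's exact specification: y_{mt} is mortality in municipality m and period t, P_{mt} is pollution, Z_{mt} is the thermal inversion measure, X_{mt} collects the weather controls, and the remaining terms are fixed effects and error terms.

```latex
\begin{align}
P_{mt} &= \pi Z_{mt} + X_{mt}'\theta + \alpha_m + \delta_t + v_{mt}
  && \text{(first stage)} \\
y_{mt} &= \beta \widehat{P}_{mt} + X_{mt}'\gamma + \mu_m + \lambda_t + \varepsilon_{mt}
  && \text{(second stage)}
\end{align}
```

In this notation, relevance requires \pi \neq 0, and exogeneity requires that Z_{mt} is uncorrelated with \varepsilon_{mt} once the controls and fixed effects are accounted for.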

Exercise 2: Data cleaning and visual representation

Open the raw data
Control variables include w_tmp_mean, w_precip, w_cloud, w_evap, w_invterm. Keep only observations for which those controls are not missing.
Remove observations for which w_tmp_impute is 1 (i.e., the temperature is imputed).
Compute the monthly average of thermal inversions (w_invterm) and the monthly average temperature (w_tmp_mean).
Replicate Figure 3 using ggplot. The mortality variables are rw_infant_1y and grw_infant_1y. Export to your .tex file.
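
The cleaning and aggregation steps above could be sketched as follows. This is a minimal base R sketch (R is assumed because the exercise asks for ggplot), run on a small made-up data frame: only the column names come from this tutorial, the values are invented, and grouping by year and month is one possible reading of "monthly average". The plotting and export steps are not covered.

```r
# Toy stand-in for the raw data; the values are invented for illustration.
raw <- data.frame(
  m            = c(1, 1, 1, 2, 2, 2),
  year         = c(2000, 2000, 2000, 2000, 2000, 2000),
  month        = c(1, 1, 2, 1, 2, 2),
  week         = c(1, 2, 5, 1, 5, 6),
  w_tmp_mean   = c(14.0, 15.0, 16.5, NA, 17.0, 18.0),
  w_precip     = c(0.1, 0.0, 0.3, 0.2, 0.0, 0.1),
  w_cloud      = c(3, 4, 2, 5, 3, 2),
  w_evap       = c(2.1, 2.4, 2.8, 2.0, 3.0, 3.1),
  w_invterm    = c(5, 3, 2, 4, 1, 0),
  w_tmp_impute = c(0, 0, 0, 0, 1, 0)
)

controls <- c("w_tmp_mean", "w_precip", "w_cloud", "w_evap", "w_invterm")

# Keep only rows where every control variable is observed.
clean <- raw[complete.cases(raw[, controls]), ]

# Drop rows whose temperature was imputed (%in% treats NA flags as "not 1").
clean <- clean[!(clean$w_tmp_impute %in% 1), ]

# Average thermal inversions and temperature within each year-month cell.
monthly <- aggregate(
  cbind(w_invterm, w_tmp_mean) ~ year + month,
  data = clean,
  FUN  = mean
)

print(monthly)
```

Dropping `year` from the right-hand side of the formula would instead average by calendar month across all years; which of the two matches Figure 3 should be checked against the paper.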

Exercise 3: Empirical analysis

Pay close attention to the footnotes of the tables you want to replicate. Note that the authors drop observations below the 1st and above the 99th percentile (to account for outliers). They also weight the regressions by the number of births. The pollution variables are pm10_max24hr, co_max8hr, so2_mean, and o3_mean.

Note that you will need to create some variables (the polynomials and the fixed effects). The dependent variables are scaled by 1000 in the paper.
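
The preparation steps described above might look like the base R sketch below. Several choices here are assumptions to be checked against the table footnotes: which variable the percentile trimming applies to (pm10_max24hr is used purely as an example), whether "scaled by 1000" means multiplying the outcome, the order of the polynomial, and the exact set of fixed effects. The helper `keep_1_99` and the new column names are hypothetical, and the data are invented.

```r
# Toy data with invented values; column names follow the variable table.
dat <- data.frame(
  m            = rep(1:2, each = 50),
  month        = rep(rep(1:5, each = 10), times = 2),
  pm10_max24hr = 1:100,
  rw_infant_1y = seq(0.001, 0.1, length.out = 100),
  w_tmp_mean   = rep(c(10, 20), times = 50)
)

# Hypothetical helper: TRUE for values between the 1st and 99th percentiles.
keep_1_99 <- function(x) {
  q <- quantile(x, probs = c(0.01, 0.99), na.rm = TRUE)
  !is.na(x) & x >= q[[1]] & x <= q[[2]]
}

# Trim outliers (assumption: trimming is based on the pollution variable).
kept <- dat[keep_1_99(dat$pm10_max24hr), ]

# Rescale the outcome (assumption: "scaled by 1000" means multiplied).
kept$rw_infant_1y_k <- kept$rw_infant_1y * 1000

# Polynomial term for a weather control (assumption: second order).
kept$w_tmp_mean_sq <- kept$w_tmp_mean^2

# Fixed effects as factors: month, and municipality-by-month cells.
kept$fe_month   <- factor(kept$month)
kept$fe_m_month <- interaction(kept$m, kept$month, drop = TRUE)

str(kept)
```

Factors created this way can be passed directly to a regression formula, where R expands them into dummy variables.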

Replicate Table 2. Predict the value of pollution. Export. Interpret. Does the IV seem valid?
Replicate Table 3 (Columns 1 to 4 only). Export. Interpret. Are you confident with these results?
[Bonus] Rerun the analysis without dropping the outliers. Interpret.
[Bonus] Plot the main point estimates and the confidence intervals. Which pollutant is the worst?
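
To see the mechanics behind Tables 2 and 3, the following base R sketch runs a weighted two-stage least squares by hand on simulated data. It is not the paper's specification: the data-generating process, the true effect of 0.5, the confounder `u`, and the weight column `births` are all assumptions made for illustration, and the fixed effects are left out to keep the example short.

```r
set.seed(42)
n <- 5000

# Simulated data: the instrument is independent of the unobserved confounder u.
sim <- data.frame(
  w_invterm  = rnorm(n),
  w_tmp_mean = rnorm(n, mean = 16, sd = 3),
  births     = sample(50:500, n, replace = TRUE)  # hypothetical weight column
)
u <- rnorm(n)  # unobserved confounder driving both pollution and mortality
sim$pm10_max24hr <- 2 * sim$w_invterm + u + rnorm(n)
sim$rw_infant_1y <- 0.5 * sim$pm10_max24hr + 2 * u + rnorm(n)

# Naive weighted OLS: biased because u is omitted.
ols <- lm(rw_infant_1y ~ pm10_max24hr + w_tmp_mean,
          data = sim, weights = births)

# First stage: pollution on the instrument and the control.
first <- lm(pm10_max24hr ~ w_invterm + w_tmp_mean,
            data = sim, weights = births)
sim$pm10_hat  <- fitted(first)  # predicted pollution
first_stage_F <- summary(first)$coefficients["w_invterm", "t value"]^2

# Second stage: outcome on predicted pollution and the control.
second <- lm(rw_infant_1y ~ pm10_hat + w_tmp_mean,
             data = sim, weights = births)

beta_ols <- coef(ols)[["pm10_max24hr"]]
beta_iv  <- coef(second)[["pm10_hat"]]

cat("OLS estimate:", beta_ols, "\n")
cat("IV estimate: ", beta_iv, "\n")
cat("First-stage F on the instrument:", first_stage_F, "\n")
```

The point estimate from the manual second stage matches 2SLS, but its standard errors are not valid because they ignore that pm10_hat is itself estimated. For the actual replication, a dedicated IV routine (for example `ivreg` from the AER package or `feols` from fixest) is the safer choice for inference.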