Tutorial 1

CREST - Institut Polytechnique de Paris

March 5, 2025

Roadmap

For this first tutorial, we are going to

  1. Download R and Rstudio
  2. Discover the software
  3. Make our first data visualisation
  4. Learn how to export them and input them in a .tex file.

What is R? Set-up and launch

R is a free and open-source software used in many contexts: data cleaning, data visualisation, econometrics, machine and deep learning, among others.

It is supported by a lively and active user community. R is an object-based language, meaning that we are going to manipulate objects through a series of commands.

R can be used directly from a command terminal, but we prefer to use RStudio, an IDE, to make our lifes easier. An IDE is an interface that eases the use of a software. The most popular one is RStudio. To use it, we need to:

  1. Download R
  2. Then, download RStudio

You can download both from here https://posit.co/download/rstudio-desktop/.

General best practices

Before jumping to coding per se, let me recall you some best practices when performing computer-based tasks.

Folder organization

Computers are organized around folder. We are going to work in a working directory (wd), but we need to identify it. My folder tree looks like it:

/Users/mmoglia/Dropbox/
├── Documents/
│   ├── perso/
│   │   ├── banque/
│   │   ├── admin/
│   │   └── festival/
├── Downloads/
├── Music/
├── Pictures/
│   ├── famille/
│   ├── vacances/
└── courses/
│   ├── polytechnique/
|   |   ├── 2024_eco102/
|   |   ├── 2025_eco1s002/
├── research/

Here we are going to work in ~/courses/2025_eco1s002/. This folder may (and should!) contains subfolders, for instance: /code, /output, /raw_data, etc.

Naming convention

A typical tip is to choose simple and short titles for the files and the scripts. For instance, this file is named tutorial1.qmd. Your code can be named code_tutorial1.R. It should be self-explanatory.

Tip

Avoid at all cost to use spaces or special characters in your file names. Prefer instead an underscore or a score.

When writing code

  • Always comment your code, make it readable for your future self but also for others. You may use your code in a week, a month, a year, and should be able to directly understands it!

  • When creating objects, functions, or whatever, give it simple but understandable names.

  • For instance, if you create a data frame containing wages for individuals aged between 25 and 30, you way call it wage_25_30 and not w2530 (too short) or wage_individuals_aged_25_30 (too long).

  • Moreover, especially for large project, you want to have different scripts. Always keep scripts short and with clear names. For instance 1_clean_data.R, 2_import_wages.R, 3_merge_datasets.R, and 4_data_analysis.R.

First step in R/Rstudio

Some useful terms

  • So far, we talked about script.

  • We are going to manipulate object using functions.

  • Objects can be of different classes: matrix, vector, character, numeric, data.frame, etc. Some functions are already in R but some are not. Hence, we need to load them thanks to packages.

First step in R/Rstudio

Open a script

  • We are going to create our first script (the extension is .R).

  • This script contains all commands you want to run. R has built-in functions but most useful functions should be called using library().

  • These libraries should be first downloaded then loaded into your project.

  • First, open RStudio and opens a new script.

  • You should have an empty page. You may want to write some words at the top of it to have an idea of what this code does. You can use # to comment code.

First step in R/Rstudio

Script

#-------------------------------------------------------------------------------
# This script opens a dataset and proceeds
# to data visualization.
#
# Author: Matéo Moglia
# Date: 12/02/2025
#-------------------------------------------------------------------------------

# Set your working directory
setwd("C:/Users/mateomoglia/Dropbox/courses/polytechnique/2025_eco1s002/site")
path <- "/Users/mateomoglia/Dropbox/courses/polytechnique/2025_eco1s002/site"

Here, I set the working directory using setwd(). Notice that we created our first object, named path using the assignment operator <- (I personally prefer to use = which works exactly the same).

Tip

What is the class of path? Hint: use the function class()

First step in R/Rstudio

Our first data manipulation

We can load a very famous built-in data set, the iris dataset, into an object called data.

data = iris # Attribute to data the data set iris
class(data) # Check its class
[1] "data.frame"
head(data) # Preview the first lines of data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Notice that the two objects we created so far can be seen in the Environment pane.

First step in R/Rstudio

Data description

By clicking on the object in the Environement panel, you can visualize it, but you can also write some code to describe it. As we just check, data is a data.frame object. It is made of several columns and rows.

To call a specific column in a dataframe, we use the $ operator. To know all the column names in data, we can use the function names(). To know the “size” of the dataframe, the right function is dim(): the first element is the number of rows and the second the number of columns.

Tip

You can also call a column by its position in the dataframe. For instance, to call the second column you can type data[,2]. For the second row: data[,2]. What should you write to print the 10th element of the 3rd column?

To describe the data, we can use several built-in functions.

table(data$Species) # To make a frequency table

    setosa versicolor  virginica 
        50         50         50 
summary(data$Sepal.Length) # To provide summary statistics
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 

Now, I want to visualize the data. To do so, we can use base R function plot:

plot(iris$Sepal.Length,iris$Sepal.Width)

First step in R/Rstudio

Our first graph!

We can also use a very popular library for dataviz, ggplot2. First, we install it, then we load it, and we are going to be able to use it.

install.packages("ggplot2",repos = "http://cran.us.r-project.org")
library(ggplot2)

Note

All packages have to be downloaded and loaded this way. Note that once you’ve downloaded a package on your computer, you do not need to install it but you still need to load it through the library function.

g1 <- ggplot(data = data, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  theme_minimal()
g1

This looks better! To produce this graph we used different information:

  • The ggplot command takes data and mapping as input.
    • data should be the dataset you want to visualize, here we named it data
    • mapping takes aes (for aesthetics)
      • Here, the aesthetics are the x and y axis, and the color one, to distinguish between the species.
    • geom_point allows to plot… points
    • theme_minimal makes the graph tidy, with a white background, etc.

Now, we can save it:

ggsave(filename = paste0(path,"/output/graph_iris.png"),plot = g1)

Tip

The function paste0 is useful to concatenate text. Here, instead of writing the full path, we concatenate the object path (which happens to be our path) and the end of the path, including the name of the output.

First step in R/Rstudio

Data: base vs. dplyr

To finish this -short- introduction to R, I introduce a new package, probably the most popular one in R: dplyr. As before, we install then load the library.

install.packages("tidyverse")
library(tidyverse)

Let’s imagine we want to filter the dataset to keep only the setosa. In base R, we need to extract the lines of iris for which Species=="setosa". Because R is object based, we can multiple objects, here: base_setosa.

The data frame iris is close to a matrix, hence, we can extract using notations close to the matrix ones. Hence data[1,1] would give us the first line of the first column (in this order) of data. Here, we filt the lines using a condition: data$Species=="setosa". Do not forget the ,.

# Base R: Filter rows where Species is "setosa"
base_setosa <- data[data$Species == "setosa", ]
head(base_setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Using dplyr and the pipe operator %>%, we apply the function filter on Species.

# dplyr: Filter rows where Species is "setosa"
dplyr_setosa <- iris %>% 
  filter(Species == "setosa")
head(dplyr_setosa)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Next, we can select specific columns.

# Base R: Select only Sepal.Length and Sepal.Width columns
base_columns <- iris[, c("Sepal.Length", "Sepal.Width")]
head(base_columns)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
# dplyr: Select only Sepal.Length and Sepal.Width columns
dplyr_columns <- iris %>% 
  select(Sepal.Length, Sepal.Width)
head(dplyr_columns)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9
# Base R: Create a new column with Sepal.Area (Sepal.Length * Sepal.Width)
iris$Sepal.Area <- iris$Sepal.Length * iris$Sepal.Width
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
1          5.1         3.5          1.4         0.2  setosa      17.85
2          4.9         3.0          1.4         0.2  setosa      14.70
3          4.7         3.2          1.3         0.2  setosa      15.04
4          4.6         3.1          1.5         0.2  setosa      14.26
5          5.0         3.6          1.4         0.2  setosa      18.00
6          5.4         3.9          1.7         0.4  setosa      21.06
# dplyr: Create a new column with Sepal.Area (Sepal.Length * Sepal.Width)
dplyr_iris <- iris %>% 
  mutate(Sepal.Area = Sepal.Length * Sepal.Width)
head(dplyr_iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
1          5.1         3.5          1.4         0.2  setosa      17.85
2          4.9         3.0          1.4         0.2  setosa      14.70
3          4.7         3.2          1.3         0.2  setosa      15.04
4          4.6         3.1          1.5         0.2  setosa      14.26
5          5.0         3.6          1.4         0.2  setosa      18.00
6          5.4         3.9          1.7         0.4  setosa      21.06

The advantage of dplyr is to allow for many computations:

dplyr_grouped <- iris %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  mutate(is.Sepal.Large = ifelse(Sepal.Area > median(Sepal.Area),1,0)) %>% 
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length),
            count_sepal_large = sum(is.Sepal.Large))
print(dplyr_grouped)
# A tibble: 3 × 3
  Species    mean_sepal_length count_sepal_large
  <fct>                  <dbl>             <dbl>
1 setosa                  5.01                21
2 versicolor              5.94                18
3 virginica               6.59                36

This dataset can be, of course represented visually:

ggplot(dplyr_grouped, aes(x = Species, y = count_sepal_large, fill = Species)) +
  geom_bar(stat = "identity") +
  labs(title = "Histogram of Sepal Count by Species",
       x = "Species", y = "Count of large sepals") +
  theme_minimal()

But we can also export it to LaTeX:

install.packages("xtable")
library(xtable)
colnames(dplyr_grouped) <- c("Species", "Mean_Sepal_Length", "Count_Sepal_Large")
print(xtable(dplyr_grouped), type = "latex", file = paste0(path,"/output/sepal_large.tex"))

First step in LaTeX

What is LaTeX

LaTeX is very popular software to produce scientic writings. It is relatively easy to use. From now on, it is mandatory to produce output you would hand using LaTeX. You have three options to use LaTeX:

  1. Use any text editor and compile the file using the console
  2. Use an IDE, like TexStudio or TexLive
  3. Use an online IDE like Overleaf

There are plenty of tutorials over the web (or ChatGPT) to know how to use LaTeX, so I am going to be quick.

First step in LaTeX

Overleaf

Open a blank project in Overleaf. It should look like that:

\documentclass{article}

%% PREAMBULE ------------------

\usepackage{graphicx} % Required for inserting images

%% METADATA -------------------

\title{Tutorial one}
\author{Matéo Moglia}
\date{February 2024}

%% DOCUMENT START ------------

\begin{document}

\maketitle

\section{Introduction}

\end{document}

Three elements are mandatory here: the \documentclass{} call, the \begin{document} and the \end{docoment} parts. All texts should be these last two commands. The preambule is of high importance: it allows you to load packages to perform specific tasks. Here, graphicx is useful to input images. Other popular packages include:

  • geometry allows you to change the margins of the pages,
  • booktab to build nice-looking tables
  • tikz to draw graphs directly in LaTeX
  • xcolor to create and choose your favorite colors

You can then directly write any text you like after \begin{document}.

First step in LaTeX

Math mode

Most interesting is the math mode. You can write in line using $ x = y$ or in an equation environment:

\begin{equation}
  \mu_k = \int_0^{+\infty} x^k f(x) dx = \int_0^{+\infty} t^{-\frac{k}{\alpha}} \exp ^{-t} dt
\end{equation}

Tip

An environment starts with \begin{name} and ends with \end{name}. The environment can be an equation, to center a large chunk of text, a figure, a table, etc.

First step in LaTeX

Include our results (fig)

\begin{figure}
  \centering # To center the graph
  \includegraphics[width=8,height=10]{graph/graph_iris.png}
  \caption{Iris Sepals Length and Width}
\end{figure}

We work within the figure environment, center the graphic, include it with some options and add a caption. The graph should be uploaded in the project first! Here, I stored it in the folder graph.

First step in LaTeX

Include our results (table)

To input the table, we can use \input{}:

\input{table/sepal_large.tex}

No need to specify the environment as it was directly created by the xtable function earlier on!

First step in LaTeX

Compile the file

LaTeX is not a “What You See Is What You Get” software, like Word or Canvas.

  • You need to compile the code to obtain the results (usually a .pdf), that you can further download.

Quick exercise

  1. Open a new script, add the relevant information (path, date, packages) and load the swiss dataset
  2. Create a grouping variable if Catholic is below median or under median
  3. Using the dplyr syntax, compute the number of observations, the mean, median, and inter-quartile range of Fertility, Agriculture, and Education by group. Save as a .tex table.
  4. Plot the relation between Education and Infant.Mortality (ggplot2 syntax) for each group
    1. With a scatter plot and lines that connects all points
    2. Add a title and a subtitle (centered and in bold)
    3. Put the legend at the bottom of the slide
    4. Export as a .pdf
  5. Export everything in a TeX file