Learn how to export them and input them in a .tex file.
What is R? Set-up and launch
R is a free and open-source software used in many contexts: data cleaning, data visualisation, econometrics, machine and deep learning, among others.
It is supported by a lively and active user community. R is an object-based language, meaning that we are going to manipulate objects through a series of commands.
R can be used directly from a command terminal, but we prefer to use RStudio, an IDE, to make our lifes easier. An IDE is an interface that eases the use of a software. The most popular one is RStudio. To use it, we need to:
Here we are going to work in ~/courses/2025_eco1s002/. This folder may (and should!) contains subfolders, for instance: /code, /output, /raw_data, etc.
Naming convention
A typical tip is to choose simple and short titles for the files and the scripts. For instance, this file is named tutorial1.qmd. Your code can be named code_tutorial1.R. It should be self-explanatory.
Tip
Avoid at all cost to use spaces or special characters in your file names. Prefer instead an underscore or a score.
When writing code
Always comment your code, make it readable for your future self but also for others. You may use your code in a week, a month, a year, and should be able to directly understands it!
When creating objects, functions, or whatever, give it simple but understandable names.
For instance, if you create a data frame containing wages for individuals aged between 25 and 30, you way call it wage_25_30 and not w2530 (too short) or wage_individuals_aged_25_30 (too long).
Moreover, especially for large project, you want to have different scripts. Always keep scripts short and with clear names. For instance 1_clean_data.R, 2_import_wages.R, 3_merge_datasets.R, and 4_data_analysis.R.
First step in R/Rstudio
Some useful terms
So far, we talked about script.
We are going to manipulate object using functions.
Objects can be of different classes: matrix, vector, character, numeric, data.frame, etc. Some functions are already in R but some are not. Hence, we need to load them thanks to packages.
First step in R/Rstudio
Open a script
We are going to create our first script (the extension is .R).
This script contains all commands you want to run. R has built-in functions but most useful functions should be called using library().
These libraries should be first downloaded then loaded into your project.
First, open RStudio and opens a new script.
You should have an empty page. You may want to write some words at the top of it to have an idea of what this code does. You can use # to comment code.
First step in R/Rstudio
Script
#-------------------------------------------------------------------------------# This script opens a dataset and proceeds# to data visualization.## Author: Matéo Moglia# Date: 12/02/2025#-------------------------------------------------------------------------------# Set your working directorysetwd("C:/Users/mateomoglia/Dropbox/courses/polytechnique/2025_eco1s002/site")path <-"/Users/mateomoglia/Dropbox/courses/polytechnique/2025_eco1s002/site"
Here, I set the working directory using setwd(). Notice that we created our first object, named path using the assignment operator <- (I personally prefer to use = which works exactly the same).
Tip
What is the class of path? Hint: use the function class()
First step in R/Rstudio
Our first data manipulation
We can load a very famous built-in data set, the iris dataset, into an object called data.
data = iris # Attribute to data the data set irisclass(data) # Check its class
Notice that the two objects we created so far can be seen in the Environment pane.
First step in R/Rstudio
Data description
By clicking on the object in the Environement panel, you can visualize it, but you can also write some code to describe it. As we just check, data is a data.frame object. It is made of several columns and rows.
To call a specific column in a dataframe, we use the $ operator. To know all the column names in data, we can use the function names(). To know the “size” of the dataframe, the right function is dim(): the first element is the number of rows and the second the number of columns.
Tip
You can also call a column by its position in the dataframe. For instance, to call the second column you can type data[,2]. For the second row: data[,2]. What should you write to print the 10th element of the 3rd column?
To describe the data, we can use several built-in functions.
table(data$Species) # To make a frequency table
setosa versicolor virginica
50 50 50
summary(data$Sepal.Length) # To provide summary statistics
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
Now, I want to visualize the data. To do so, we can use base R function plot:
plot(iris$Sepal.Length,iris$Sepal.Width)
First step in R/Rstudio
Our first graph!
We can also use a very popular library for dataviz, ggplot2. First, we install it, then we load it, and we are going to be able to use it.
All packages have to be downloaded and loaded this way. Note that once you’ve downloaded a package on your computer, you do not need to install it but you still need to load it through the library function.
g1 <-ggplot(data = data, mapping =aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +geom_point() +theme_minimal()g1
This looks better! To produce this graph we used different information:
The ggplot command takes data and mapping as input.
data should be the dataset you want to visualize, here we named it data
mapping takes aes (for aesthetics)
Here, the aesthetics are the x and y axis, and the color one, to distinguish between the species.
geom_point allows to plot… points
theme_minimal makes the graph tidy, with a white background, etc.
The function paste0 is useful to concatenate text. Here, instead of writing the full path, we concatenate the object path (which happens to be our path) and the end of the path, including the name of the output.
First step in R/Rstudio
Data: base vs. dplyr
To finish this -short- introduction to R, I introduce a new package, probably the most popular one in R: dplyr. As before, we install then load the library.
install.packages("tidyverse")library(tidyverse)
Let’s imagine we want to filter the dataset to keep only the setosa. In base R, we need to extract the lines of iris for which Species=="setosa". Because R is object based, we can multiple objects, here: base_setosa.
The data frame iris is close to a matrix, hence, we can extract using notations close to the matrix ones. Hence data[1,1] would give us the first line of the first column (in this order) of data. Here, we filt the lines using a condition: data$Species=="setosa". Do not forget the ,.
# Base R: Filter rows where Species is "setosa"base_setosa <- data[data$Species =="setosa", ]head(base_setosa)
This dataset can be, of course represented visually:
ggplot(dplyr_grouped, aes(x = Species, y = count_sepal_large, fill = Species)) +geom_bar(stat ="identity") +labs(title ="Histogram of Sepal Count by Species",x ="Species", y ="Count of large sepals") +theme_minimal()
But we can also export it to LaTeX:
install.packages("xtable")library(xtable)
colnames(dplyr_grouped) <-c("Species", "Mean_Sepal_Length", "Count_Sepal_Large")print(xtable(dplyr_grouped), type ="latex", file =paste0(path,"/output/sepal_large.tex"))
First step in LaTeX
What is LaTeX
LaTeX is very popular software to produce scientic writings. It is relatively easy to use. From now on, it is mandatory to produce output you would hand using LaTeX. You have three options to use LaTeX:
Use any text editor and compile the file using the console
Use an IDE, like TexStudio or TexLive
Use an online IDE like Overleaf
There are plenty of tutorials over the web (or ChatGPT) to know how to use LaTeX, so I am going to be quick.
First step in LaTeX
Overleaf
Open a blank project in Overleaf. It should look like that:
Three elements are mandatory here: the \documentclass{} call, the \begin{document} and the \end{docoment} parts. All texts should be these last two commands. The preambule is of high importance: it allows you to load packages to perform specific tasks. Here, graphicx is useful to input images. Other popular packages include:
geometry allows you to change the margins of the pages,
booktab to build nice-looking tables
tikz to draw graphs directly in LaTeX
xcolor to create and choose your favorite colors
You can then directly write any text you like after \begin{document}.
First step in LaTeX
Math mode
Most interesting is the math mode. You can write in line using $ x = y$ or in an equation environment:
An environment starts with \begin{name} and ends with \end{name}. The environment can be an equation, to center a large chunk of text, a figure, a table, etc.
First step in LaTeX
Include our results (fig)
\begin{figure}
\centering # To center the graph
\includegraphics[width=8,height=10]{graph/graph_iris.png}
\caption{Iris Sepals Length and Width}
\end{figure}
We work within the figure environment, center the graphic, include it with some options and add a caption. The graph should be uploaded in the project first! Here, I stored it in the folder graph.
First step in LaTeX
Include our results (table)
To input the table, we can use \input{}:
\input{table/sepal_large.tex}
No need to specify the environment as it was directly created by the xtable function earlier on!
First step in LaTeX
Compile the file
LaTeX is not a “What You See Is What You Get” software, like Word or Canvas.
You need to compile the code to obtain the results (usually a .pdf), that you can further download.
Quick exercise
Open a new script, add the relevant information (path, date, packages) and load the swiss dataset
Create a grouping variable if Catholic is below median or under median
Using the dplyr syntax, compute the number of observations, the mean, median, and inter-quartile range of Fertility, Agriculture, and Education by group. Save as a .tex table.
Plot the relation between Education and Infant.Mortality (ggplot2 syntax) for each group
With a scatter plot and lines that connects all points