Introduction to ECD

In this vignette I introduce you to the basic functions of the ecdata package. You can download the latest stable releases of the packages from CRAN (R) and PyPI (Python).

R:

install.packages('ecdata')
library(ecdata)
library(dplyr)

Python:

%pip install ecdata
%pip install polars

import ecdata as ec
import polars as pl

`load_ecd`

The primary function that is shared across the Python and R distributions of the package is the load_ecd function. This function accepts four primary arguments:

Argument	R-Specific Quirks	Python-Specific Quirks
country	A string or a string vector	A string, dictionary, or list
language	A string or a string vector	A string, dictionary, or list
full_ecd	A boolean. If TRUE, downloads the full dataset. Defaults to FALSE.	A boolean. If True, downloads the full dataset. Defaults to False.
ecd_version	A character string giving the ECD version to download. Defaults to the latest version.	A string giving the ECD version to download. Defaults to the latest version.

In practice, the ecd_version argument is not very useful yet, since there has been only one release of the data.
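Since the Python country and language arguments accept a string, a dictionary, or a list, it can help to see how such inputs might be reduced to one plain list of names. The sketch below is only an illustration: the helper as_name_list is hypothetical and not part of ecdata, and treating a dictionary's values (rather than its keys) as the names is an assumption.

```python
from typing import Union


def as_name_list(value: Union[str, list, tuple, dict, None]) -> list:
    """Flatten a string, list, or dict into a plain list of names.

    Hypothetical helper for illustration; not part of ecdata.
    Assumption: for a dict, the values (not the keys) hold the names.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        value = list(value.values())
    names = []
    for item in value:
        # A dict value may itself be a list of names, so flatten one level.
        if isinstance(item, (list, tuple)):
            names.extend(item)
        else:
            names.append(item)
    return names


print(as_name_list('Turkey'))
print(as_name_list(['South Korea', 'Turkey']))
print(as_name_list({'case_1': ['South Korea'], 'case_2': 'Turkey'}))
```

Normalizing the input once, up front, means the rest of a loader only ever has to deal with a list.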

Say we only want data for South Korea.¹ We can simply set the country argument like this:

R:

rok = load_ecd(country = 'Republic of Korea')

✔ Successfully downloaded Republic of Korea.

head(rok, 2)

# A tibble: 2 × 17
  country   url   text  date                title executive type  language file 
  <chr>     <chr> <chr> <dttm>              <chr> <chr>     <chr> <chr>    <chr>
1 Republic… http… 위대… 2022-03-10 00:00:00 정직… Yoon Suk… Spee… Korean   <NA> 
2 Republic… http… 위대… 2022-03-10 00:00:00 정직… Yoon Suk… Spee… Korean   <NA> 
# ℹ 8 more variables: isonumber <dbl>, gwc <chr>, cowcodes <chr>,
#   polity_v <chr>, polity_iv <chr>, vdem <dbl>, year_of_statement <dbl>,
#   office <chr>

Python:

rok = ec.load_ecd(country = 'Republic of Korea', ecd_version = '1.0.0', cache = False)

rok.head(2)

shape: (2, 17)

country	url	text	date	title	executive	type	language	file	isonumber	gwc	cowcodes	polity_v	polity_iv	vdem	year_of_statement	office
str	str	str	datetime[μs, UTC]	str	str	str	str	str	f64	str	str	str	str	f64	f64	str
"Republic of Korea"	"https://www.president.go.kr/pr…	"위대하고 자랑스러운 국민 여러분! 고맙습니다. 다시 …	2022-03-10 00:00:00 UTC	"정직한 정부, 정직한 대통령 되겠습니다."	"Yoon Suk Yeol"	"Speech"	"Korean"	null	410.0	"ROK"	"ROK"	"ROK"	"ROK"	42.0	2022.0	null
"Republic of Korea"	"https://www.president.go.kr/pr…	"위대하고 자랑스러운 국민 여러분! 고맙습니다. 다시 …	2022-03-10 00:00:00 UTC	"정직한 정부, 정직한 대통령 되겠습니다."	"Yoon Suk Yeol"	"Speech"	"Korean"	null	410.0	"ROK"	"ROK"	"ROK"	"ROK"	42.0	2022.0	null

We implement caching by default, so in R you will get a fairly loud warning every few hours. load_ecd also tolerates common names, abbreviations, and mixed punctuation in country names, so RK, ROK, and South Korea will all download the South Korean data.

R:

sk = load_ecd(country = 'South Korea')

✔ Successfully downloaded Republic of Korea.

Python:

sk = ec.load_ecd(country = 'South Korea')
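To make the name tolerance described above more concrete, here is a minimal sketch of one way an alias lookup could work. It is not ecdata's actual implementation: the ALIASES table, normalize_key, and resolve_country are all hypothetical names, and the cleanup rules (drop periods, turn other punctuation into spaces, ASCII only) are assumptions.

```python
import re

# Hypothetical alias table for illustration; ecdata's real lookup may differ.
ALIASES = {
    'republic of korea': 'Republic of Korea',
    'south korea': 'Republic of Korea',
    'rok': 'Republic of Korea',
    'rk': 'Republic of Korea',
    'turkey': 'Turkey',
}


def normalize_key(name: str) -> str:
    """Lowercase, drop periods, and turn other punctuation into single spaces."""
    cleaned = name.lower().replace('.', '')
    # ASCII-only cleanup: anything that is not a letter, digit, or space
    # becomes a space, so 'south-korea' and 'south_korea' match 'south korea'.
    cleaned = re.sub(r'[^a-z0-9\s]', ' ', cleaned)
    return ' '.join(cleaned.split())


def resolve_country(name: str) -> str:
    """Map a user-supplied name to its canonical name, or raise ValueError."""
    key = normalize_key(name)
    if key not in ALIASES:
        raise ValueError(f'Unknown country: {name!r}')
    return ALIASES[key]


for alias in ['RK', 'ROK', 'South Korea', 'south-korea', 'R.O.K.']:
    print(alias, '->', resolve_country(alias))
```

Every alias in the loop resolves to the same canonical name, which mirrors the "Successfully downloaded Republic of Korea" message shown for the South Korea request.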

If you are not interested in single-country case studies, you can pass multiple countries to the country argument. In R, use a string vector; in Python, use a list.


list_version = ec.load_ecd(country = ['South Korea', 'Turkey'])

The same functionality is extended to the language argument too!

`lazy_load_ecd`

Both versions of the package support lazy loading, which defers computation until you are done querying the dataset. To do this, all you need to do is call lazy_load_ecd.

R:

turkey_korea_lazy = lazy_load_ecd(country = c('South Korea', 'Turkey'))

✔ Note: Data for: South Korea and Turkey was successfully downloaded. To bring data into memory call dplyr::collect()

turkey_korea_lazy |>
  filter(country == 'Turkey') |>
  collect() |>
  head(2)

# A tibble: 2 × 17
  country url     text  date                title executive type  language file 
  <chr>   <chr>   <chr> <dttm>              <chr> <chr>     <chr> <chr>    <chr>
1 Turkey  https:… Bugü… 2023-04-08 00:00:00 Başa… Recep Ta… Spee… Turkish  <NA> 
2 Turkey  https:… Noks… 2023-04-08 00:00:00 Başa… Recep Ta… Spee… Turkish  <NA> 
# ℹ 8 more variables: isonumber <dbl>, gwc <chr>, cowcodes <chr>,
#   polity_v <chr>, polity_iv <chr>, vdem <dbl>, year_of_statement <dbl>,
#   office <chr>

Python:

turkey_rok_lazy = ec.lazy_load_ecd(['South Korea','Turkey'])

turkey_rok_lazy.filter(pl.col('country') == 'Turkey').collect().head(2)

shape: (2, 17)

country	url	text	date	title	executive	type	language	file	isonumber	gwc	cowcodes	polity_v	polity_iv	vdem	year_of_statement	office
str	str	str	datetime[μs, UTC]	str	str	str	str	str	f64	str	str	str	str	f64	f64	str
"Turkey"	"https://www.tccb.gov.tr/konusm…	"Türkiye Cumhuriyeti’nin 11. Cu…	2014-08-28 00:00:00 UTC	"Devir Teslim Töreni’nde Yaptık…	"Recep Tayyip Erdogan"	"Speech"	"Turkish"	null	792.0	"TUR"	"TUR"	"TUR"	"TUR"	99.0	2014.0	null
"Turkey"	"https://www.tccb.gov.tr/konusm…	"Çok Değerli Abdullah Gül Karde…	2014-08-28 00:00:00 UTC	"Devir Teslim Töreni’nde Yaptık…	"Recep Tayyip Erdogan"	"Speech"	"Turkish"	null	792.0	"TUR"	"TUR"	"TUR"	"TUR"	99.0	2014.0	null
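If the idea of deferring computation until collect() is unfamiliar, the following standard-library sketch may help. It is only a toy model of lazy evaluation, not how ecdata, dplyr, or Polars implement it: the LazyRows class is hypothetical, and the rows are made up to mimic two of the dataset's columns.

```python
from typing import Callable, Iterable


class LazyRows:
    """Queue filters now and run them only when collect() is called.

    Hypothetical illustration of lazy evaluation; not part of ecdata.
    """

    def __init__(self, rows: Iterable[dict], filters: tuple = ()):
        self._rows = rows
        self._filters = filters

    def filter(self, predicate: Callable[[dict], bool]) -> 'LazyRows':
        # No row is inspected here; the predicate is only queued.
        return LazyRows(self._rows, self._filters + (predicate,))

    def collect(self) -> list:
        # All queued predicates run in a single pass over the rows.
        return [row for row in self._rows
                if all(check(row) for check in self._filters)]


# Made-up rows that only mimic two of the dataset's columns.
rows = [
    {'country': 'Turkey', 'year_of_statement': 2014},
    {'country': 'Republic of Korea', 'year_of_statement': 2022},
    {'country': 'Turkey', 'year_of_statement': 2023},
]

query = LazyRows(rows).filter(lambda r: r['country'] == 'Turkey')
query = query.filter(lambda r: r['year_of_statement'] > 2020)

# Nothing has been evaluated until this call.
print(query.collect())
```

Chaining filters before collecting is what lets a lazy backend skip rows it never needs to bring into memory, which matters for the larger country files.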

Footnotes

¹ I chose South Korea because the underlying file is relatively small compared to some of the other country files.