Introduction to ECD

In this vignette I introduce the basic functions of the ecdata package. The latest stable releases are available from CRAN (R) and PyPI (Python).


%pip install ecdata
%pip install polars

import ecdata as ec
import polars as pl

load_ecd

The primary function shared by the Python and R distributions of the package is load_ecd. It accepts four main arguments:

| Argument | R Specific Quirks | Python Specific Quirks |
|---|---|---|
| country | A String/A String Vector | String, Dictionary, or List |
| language | A String/A String Vector | String, Dictionary, or List |
| full_ecd | A boolean if set to TRUE downloads full dataset. Defaults to FALSE | A boolean if set to True downloads full dataset. Defaults to False |
| ecd_version | A character string of the ECD version you want to download. Defaults to latest version | A character string of the ECD version you want to download. Defaults to latest version |

For now the ecd_version argument has little practical effect, since there has only been one release of the data.

Say we only want data for South Korea.[1] We can simply set the country argument like this:

rok = load_ecd(country = 'Republic of Korea')
✔ Successfully downloaded Republic of Korea.
head(rok, 2)
# A tibble: 2 × 17
  country   url   text  date                title executive type  language file 
  <chr>     <chr> <chr> <dttm>              <chr> <chr>     <chr> <chr>    <chr>
1 Republic… http… 위대… 2022-03-10 00:00:00 정직… Yoon Suk… Spee… Korean   <NA> 
2 Republic… http… 위대… 2022-03-10 00:00:00 정직… Yoon Suk… Spee… Korean   <NA> 
# ℹ 8 more variables: isonumber <dbl>, gwc <chr>, cowcodes <chr>,
#   polity_v <chr>, polity_iv <chr>, vdem <dbl>, year_of_statement <dbl>,
#   office <chr>
rok = ec.load_ecd(country = 'Republic of Korea', ecd_version = '1.0.0', cache = False)

rok.head(2)
shape: (2, 17)
| country | url | text | date | title | executive | type | language | file | isonumber | gwc | cowcodes | polity_v | polity_iv | vdem | year_of_statement | office |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | datetime[μs, UTC] | str | str | str | str | str | f64 | str | str | str | str | f64 | f64 | str |
| "Republic of Korea" | "https://www.president.go.kr/pr… | "위대하고 자랑스러운 국민 여러분! 고맙습니다. 다시 … | 2022-03-10 00:00:00 UTC | "정직한 정부, 정직한 대통령 되겠습니다." | "Yoon Suk Yeol" | "Speech" | "Korean" | null | 410.0 | "ROK" | "ROK" | "ROK" | "ROK" | 42.0 | 2022.0 | null |
| "Republic of Korea" | "https://www.president.go.kr/pr… | "위대하고 자랑스러운 국민 여러분! 고맙습니다. 다시 … | 2022-03-10 00:00:00 UTC | "정직한 정부, 정직한 대통령 되겠습니다." | "Yoon Suk Yeol" | "Speech" | "Korean" | null | 410.0 | "ROK" | "ROK" | "ROK" | "ROK" | 42.0 | 2022.0 | null |

Caching is enabled by default, so in R you will get a fairly loud warning every few hours. load_ecd also tolerates common names, abbreviations, and mixed punctuation in country names, so RK, ROK, and South Korea will all download the South Korean data.

sk = load_ecd(country = 'South Korea')
✔ Successfully downloaded Republic of Korea.

sk = ec.load_ecd(country = 'South Korea')
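
To make the idea of tolerant country matching concrete, here is a minimal sketch of how such a lookup could work. It is not ecdata's actual implementation: the ALIASES table, normalize, and resolve_country are hypothetical names invented for this illustration, and the alias entries are assumptions based only on the examples mentioned above.

```python
import re

# Hypothetical alias table: these entries are illustrative assumptions,
# not the lookup that ecdata actually ships.
ALIASES = {
    "republic of korea": "Republic of Korea",
    "south korea": "Republic of Korea",
    "rok": "Republic of Korea",
    "rk": "Republic of Korea",
    "turkey": "Turkey",
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())


def resolve_country(name: str) -> str:
    """Map a user-supplied name to a canonical country name."""
    key = normalize(name)
    if key not in ALIASES:
        raise ValueError(f"Unrecognized country name: {name!r}")
    return ALIASES[key]


# Different spellings of the same country resolve to one canonical name.
for candidate in ["South Korea", "ROK", "R.O.K.", "rk"]:
    print(candidate, "->", resolve_country(candidate))
```

Normalizing before the lookup keeps the alias table small: one entry covers every casing and punctuation variant of the same abbreviation.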

If you are not interested in single-country case studies, you can pass multiple countries to the country argument. In R, use a string vector. In Python, you can use a list!


list_version = ec.load_ecd(country = ['South Korea', 'Turkey'])

The same functionality is extended to the language argument too!
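
One way a function can accept either a single string or a list for arguments like country and language is to coerce the input into a list up front. The sketch below shows that pattern in plain Python. The helper name as_list is hypothetical and this is not taken from ecdata's source. The dictionary form is left out because the text above does not say how its keys and values are interpreted.

```python
from typing import Iterable, List, Optional, Union


def as_list(value: Optional[Union[str, Iterable[str]]]) -> List[str]:
    """Coerce None, a single string, or an iterable of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is one item, not a sequence of characters.
        return [value]
    return list(value)


print(as_list("South Korea"))
print(as_list(["South Korea", "Turkey"]))
print(as_list(["Korean", "Turkish"]))
```

The string check has to come first, because a Python string is itself iterable and would otherwise be split into single characters.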

lazy_load_ecd

Both versions of the package allow you to use lazy loading to defer computation until you are done querying the dataset. To do this, all you need to do is call lazy_load_ecd.

turkey_korea_lazy = lazy_load_ecd(country = c('South Korea', 'Turkey')) 
✔ Note: Data for: South Korea and Turkey was successfully downloaded. To bring data into memory call dplyr::collect()
turkey_korea_lazy |>
  filter(country == 'Turkey') |>
  collect() |>
  head(2)
# A tibble: 2 × 17
  country url     text  date                title executive type  language file 
  <chr>   <chr>   <chr> <dttm>              <chr> <chr>     <chr> <chr>    <chr>
1 Turkey  https:… Bugü… 2023-04-08 00:00:00 Başa… Recep Ta… Spee… Turkish  <NA> 
2 Turkey  https:… Noks… 2023-04-08 00:00:00 Başa… Recep Ta… Spee… Turkish  <NA> 
# ℹ 8 more variables: isonumber <dbl>, gwc <chr>, cowcodes <chr>,
#   polity_v <chr>, polity_iv <chr>, vdem <dbl>, year_of_statement <dbl>,
#   office <chr>
turkey_rok_lazy = ec.lazy_load_ecd(['South Korea','Turkey'])

turkey_rok_lazy.filter(pl.col('country') == 'Turkey').collect().head(2)
shape: (2, 17)
| country | url | text | date | title | executive | type | language | file | isonumber | gwc | cowcodes | polity_v | polity_iv | vdem | year_of_statement | office |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | datetime[μs, UTC] | str | str | str | str | str | f64 | str | str | str | str | f64 | f64 | str |
| "Turkey" | "https://www.tccb.gov.tr/konusm… | "Türkiye Cumhuriyeti’nin 11. Cu… | 2014-08-28 00:00:00 UTC | "Devir Teslim Töreni’nde Yaptık… | "Recep Tayyip Erdogan" | "Speech" | "Turkish" | null | 792.0 | "TUR" | "TUR" | "TUR" | "TUR" | 99.0 | 2014.0 | null |
| "Turkey" | "https://www.tccb.gov.tr/konusm… | "Çok Değerli Abdullah Gül Karde… | 2014-08-28 00:00:00 UTC | "Devir Teslim Töreni’nde Yaptık… | "Recep Tayyip Erdogan" | "Speech" | "Turkish" | null | 792.0 | "TUR" | "TUR" | "TUR" | "TUR" | 99.0 | 2014.0 | null |
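
If the filter-then-collect pattern is unfamiliar, the toy class below illustrates the general idea of deferred evaluation in plain Python. It is only a conceptual sketch and is not how polars, dplyr, or ecdata implement lazy queries. LazyRows and the sample rows are made up for this example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


class LazyRows:
    """Toy stand-in for a lazy frame: filters are recorded, not run."""

    def __init__(self, rows: List[Row], predicates: Tuple[Predicate, ...] = ()):
        self._rows = rows
        self._predicates = predicates

    def filter(self, predicate: Predicate) -> "LazyRows":
        # Only remember the predicate; no row is inspected yet.
        return LazyRows(self._rows, self._predicates + (predicate,))

    def collect(self) -> List[Row]:
        # All recorded filters are applied here, in a single pass.
        return [
            row
            for row in self._rows
            if all(predicate(row) for predicate in self._predicates)
        ]


# Made-up sample rows, not real ECD records.
rows = [
    {"country": "Turkey", "language": "Turkish"},
    {"country": "Republic of Korea", "language": "Korean"},
    {"country": "Turkey", "language": "Turkish"},
]

query = LazyRows(rows).filter(lambda row: row["country"] == "Turkey")
result = query.collect()  # the work happens only at this point
print(len(result))  # prints 2
```

Because nothing runs until collect() is called, a real lazy engine can look at the whole query first and skip data it does not need, which is what makes this style attractive for large datasets.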

Footnotes

  1. I chose South Korea because the underlying file is relatively small compared to some of the other country files.