This post shows how to access the Wikipedia knowledge graph through the Wikidata API.
wikipedia
scrape
Author
Luke Heley
Published
March 26, 2023
Objective
Extract the knowledge graph for a Chinese strategic submarine.
Data
We use the Wikidata API to extract the knowledge graph for a Wikipedia topic.
The approach is to search Wikidata for the item of interest and select the best match from the returned list of search results.
# Function to search Wikidata and return a tibble of search results
search_wikidata <- function(search) {
  base <- "https://www.wikidata.org"
  path <- "/w/api.php"
  query <- list(
    action = "query",
    list = "search",
    format = "json",
    srsearch = search
  )
  httr::GET(base, path = path, query = query) |>
    httr::content() |>
    purrr::pluck("query", "search") |>
    purrr::map_df(~ {
      ul <- unlist(.x)
      name <- names(ul)
      value <- as.character(ul)
      dplyr::tibble(name, value) |>
        tidyr::pivot_wider()
    })
}

(search_results <- search_wikidata("Type-094 submarine"))
# Extract the item title for the item of interest
root_item_title <- search_results$title[1]
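Taking the first row assumes the search returned at least one hit. As a minimal sketch, the selection step could be wrapped in a small guard. The helper name `pick_best_match` and the mock results below are hypothetical and only illustrate the idea; the item IDs are made-up placeholders, not real search results.

```r
# Hypothetical helper: return the title of the first search hit,
# failing loudly when the search returned nothing.
pick_best_match <- function(results) {
  if (nrow(results) == 0) {
    stop("No Wikidata search results returned.")
  }
  results$title[[1]]
}

# Mock search results with placeholder item IDs (not real API output).
mock_results <- dplyr::tibble(
  title = c("Q111", "Q222"),
  snippet = c("first hit", "second hit")
)

best <- pick_best_match(mock_results)
best
```

With the real data this would be called as `pick_best_match(search_results)`.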
We then extract the Wikidata claims associated with the item.
# Extract the entity data for a chosen item.
get_entity_data <- function(item = "Q1203377") {
  base <- "https://www.wikidata.org/"
  path <- glue::glue("wiki/Special:EntityData/{item}.json")
  query <- list(flavor = "simple")
  req <- httr::GET(base, path = path, query = query)
  httr::content(req) |>
    purrr::pluck("entities", item, "claims") |>
    purrr::map_df(~ {
      value <- unlist(.x)
      name <- names(value)
      dplyr::tibble(name, value) |>
        tidyr::pivot_wider() |>
        tidyr::unnest()
    })
}

entity_data <- get_entity_data(root_item_title)
Warning: `cols` is now required when using unnest().
Please use `cols = c()`
Warning: Values from `value` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(name) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
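The warnings above come from two places: `pivot_wider()` meets repeated names when a property has several statements, and `unnest()` is called without `cols`. A possible warning-free variant of the per-claim step, following the hints in the warning text, is sketched below. The helper name `claim_to_tibble` and the mock claim are hypothetical, and the item IDs are placeholders rather than real Wikidata values.

```r
# Hypothetical helper: flatten one claim (a list of statements) into a tibble.
claim_to_tibble <- function(claim) {
  value <- unlist(claim)
  name <- names(value)
  dplyr::tibble(name, value) |>
    # Collect repeated names into list-columns explicitly.
    tidyr::pivot_wider(
      names_from = name,
      values_from = value,
      values_fn = list
    ) |>
    # Expand every list-column, naming the columns explicitly.
    tidyr::unnest(cols = dplyr::everything())
}

# Mock claim with two statements for the same property (placeholder IDs).
mock_claim <- list(
  list(
    mainsnak = list(
      property = "P279",
      datavalue = list(value = list(id = "Q111"))
    ),
    rank = "normal"
  ),
  list(
    mainsnak = list(
      property = "P279",
      datavalue = list(value = list(id = "Q222"))
    ),
    rank = "normal"
  )
)

flat <- claim_to_tibble(mock_claim)
flat
```

Each statement becomes one row. As in the original code, this relies on every statement of a property exposing compatible sets of fields, so it is a sketch rather than a general parser.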
Next, we look up the English labels of the associated properties.
properties <- entity_data |>
  dplyr::pull("mainsnak.property") |>
  unique()

# Look up the English labels for one or more entity IDs.
get_entity_id <- function(id = "P373") {
  # The API accepts several IDs separated by "|".
  if (length(id) > 1) id <- paste(id, collapse = "|")
  base <- "https://www.wikidata.org/"
  path <- "w/api.php"
  query <- list(
    action = "wbgetentities",
    ids = id,
    languages = "en",
    props = "labels",
    format = "json"
  )
  req <- httr::GET(base, path = path, query = query)
  status <- httr::status_code(req)
  if (status != 200) stop(glue::glue("Error returned status: {status}"))
  httr::content(req) |>
    purrr::pluck("entities") |>
    purrr::map_df(~ {
      value <- unlist(.x)
      name <- names(value)
      dplyr::tibble(name, value) |>
        tidyr::pivot_wider()
    })
}

(prop_label <- get_entity_id(properties))
# A tibble: 16 × 5
type datatype id labels.en.language labels.en.value
<chr> <chr> <chr> <chr> <chr>
1 property string P373 en Commons category
2 property wikibase-item P516 en powered by
3 property string P561 en NATO reporting name
4 property wikibase-item P31 en instance of
5 property wikibase-item P279 en subclass of
6 property wikibase-item P910 en topic's main category
7 property external-id P646 en Freebase ID
8 property time P729 en service entry
9 property wikibase-item P156 en followed by
10 property wikibase-item P155 en follows
11 property commonsMedia P18 en image
12 property wikibase-item P176 en manufacturer
13 property wikibase-item P137 en operator
14 property monolingualtext P1813 en short name
15 property wikibase-item P520 en armament
16 property wikibase-item P495 en country of origin
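Sixteen IDs fit comfortably in one request, but `wbgetentities` caps the number of IDs per call (50 for ordinary, non-bot clients, to the best of my knowledge). For items with more properties, the IDs could be split into batches first. The helper `chunk_ids` below is a hypothetical sketch, and the generated IDs are only there to show the batching.

```r
# Hypothetical helper: split a vector of IDs into batches of at most `size`.
chunk_ids <- function(ids, size = 50) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Illustrative input: 120 generated property IDs.
batches <- chunk_ids(paste0("P", 1:120))
lengths(batches)
```

Each batch could then be passed to `get_entity_id()` in turn, for example with `purrr::map_df(chunk_ids(properties), get_entity_id)`.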
# A tibble: 17 × 2
property_label item_label
<chr> <chr>
1 Commons category Type 09IV submarines
2 powered by nuclear marine propulsion
3 NATO reporting name Jin
4 instance of submarine class
5 subclass of ballistic missile submarine
6 subclass of nuclear submarine
7 topic's main category Category:Type 094 submarines
8 Freebase ID /m/09wz47
9 service entry +2010-01-01T00:00:00Z
10 followed by Type 096 submarine
11 follows Type 092 Daqingyu
12 image Jin (Type 094) Class Ballistic Missile Submarine.JPG
13 manufacturer Bohai Shipyard
14 operator People's Liberation Army Navy
15 short name Type 094
16 armament JL-2
17 country of origin People's Republic of China
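The code that produced the table above is not shown. One way to build a table of this shape is to join the statements to the property labels and the item labels by ID, as in the sketch below. All three tibbles are mock data: the property IDs and labels match the property table earlier in the post, while the item IDs are made-up placeholders and the column names are assumptions.

```r
# Mock statements: one row per (property, item) pair, placeholder item IDs.
statements <- dplyr::tibble(
  property = c("P31", "P176"),
  item = c("Q111", "Q222")
)

# Mock label lookups for properties and items.
prop_labels <- dplyr::tibble(
  id = c("P31", "P176"),
  property_label = c("instance of", "manufacturer")
)
item_labels <- dplyr::tibble(
  id = c("Q111", "Q222"),
  item_label = c("submarine class", "Bohai Shipyard")
)

# Attach both labels, then keep only the readable columns.
labelled <- statements |>
  dplyr::left_join(prop_labels, by = c("property" = "id")) |>
  dplyr::left_join(item_labels, by = c("item" = "id")) |>
  dplyr::select(property_label, item_label)

labelled
```

Statements whose value is a literal (a string, a date, a file name) have no item ID to join on, so a real version would need to fall back to the raw value for those rows.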
[[1]]
# A tibble: 22 × 2
X1 X2
<chr> <chr>
1 Profile of the Type 094 Profile of the Type 094
2 Type 094 submarine Type 094 submarine
3 Class overview Class overview
4 Name Type 094 (Jin class)
5 Builders Bohai Shipyard, Huludao, China[2]
6 Operators People's Liberation Army Navy
7 Preceded by Type 092 (Xia class)
8 Succeeded by Type 096
9 Cost $750 million per unit[1]
10 In commission 2007–present[2]
# … with 12 more rows
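The list output with columns `X1` and `X2` has the shape that `rvest::html_table()` returns for a table without header cells, which suggests the infobox was scraped from the Wikipedia article. The scraping code is not shown, so the following is only a sketch of that technique, run on a small inline HTML snippet instead of the live page. The two rows are copied from the output above.

```r
# Inline stand-in for the article's infobox (not the real page HTML).
html <- "<table>
  <tr><td>Name</td><td>Type 094 (Jin class)</td></tr>
  <tr><td>Succeeded by</td><td>Type 096</td></tr>
</table>"

# Parse the HTML and convert every <table> into a tibble.
tables <- rvest::read_html(html) |>
  rvest::html_table()

infobox <- tables[[1]]
infobox
```

For the live article, the inline string would be replaced by the page URL, and the infobox would need to be picked out of the list of tables returned.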