Intro
Recently I had a task to take higher education organizations (e.g., universities and research institutes) and obtain their name, country, city, address and geographical coordinates to be matched against Web of Science information. It was similar to what I did in Key Actors in Higher Education Research and Science Studies (HERSS), a Flexdashboard developed in R. To tackle the task, I downloaded and used a full data dump from Wikidata (see here for info on their database downloads and here for a description of their data model/structure).
I searched a lot online for best practices on how to handle this relatively large compressed file (33 GB, with close to 57 million items; each line is a valid JSON object for one item with all its properties, as of the dump of 27 March 2019) and read and process it in a reasonable time. Here I am documenting my experience as a way to pay back the online community I am always learning from. Hopefully someone will read this and benefit from it (or future me will be able to replicate the steps, if needed).
Main requirements
- You need to download a full dump from here: https://dumps.wikimedia.org/wikidatawiki/entities/. The following guide uses the .bz2 JSON version, for which the latest dump is named “latest-all.json.bz2”.
- Following these steps doesn’t require you to know how to code in Python, but you need to have Python 3 installed on your machine (e.g., via the Anaconda distribution)
- You need to be familiar with the Wikidata data structure, specifically items and their properties (a simplified example of a single item is sketched right after this list)
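To give a concrete picture of what “items and their properties” means, below is a heavily trimmed sketch of how a single item line from the dump looks once parsed, written as a Python dict. The item ID, names and coordinate values here are made up for illustration; real items carry many more fields (aliases, sitelinks, references, qualifiers, etc.):

# illustrative only: a heavily trimmed Wikidata item as parsed from one dump line
item = {
    "id": "Q12345",  # hypothetical item ID
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Example University"}},
    "descriptions": {"en": {"language": "en", "value": "a made-up higher education organization"}},
    "claims": {
        # properties live under "claims"; each property maps to a list of values
        "P625": [  # P625 = coordinate location
            {"mainsnak": {"datavalue": {"value": {"latitude": 52.5, "longitude": 13.4}}}}
        ]
    },
}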
Steps to parse the full dump
In order to use the dump you downloaded and obtain the information you want, follow these steps:
- Note the local path where you saved the full Wikidata dump (33 GB in size), e.g. \your_local_directory\wikidata\, containing the file named latest-all.json.bz2
- The script below opens the .bz2 file directly, decompressing it on the fly rather than extracting the whole archive first. It reads the dump line by line and extracts the requested information (based on the property names you specify, as discussed in the following steps)
- Open my sample Python script (copied below) in an editor of your choice (if you code in Python, you don’t need the next steps; modify it the way you want and export your intended data). It is a script I have adapted and modified with others’ help (thanks to Roland, Arno and Otmane) from here
- Replace the property names (P followed by a number) with the ones you are interested in
- The line starting with “if pydash.has(record, 'claims.P625'):” specifies that if the item currently being read doesn’t have property P625 (geographical coordinates), it is not processed and the script skips to the next item
- In the Wikidata structure of items and claims (which is where the properties are stored), my property of interest sits in a nested list under “claims.P625”, and a property can hold more than one value per item (saved as a list). I therefore use latitude = pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.latitude') to obtain only the first value (designated by [0]); a short runnable sketch of these pydash calls follows this list
- For the main item information, such as the English label and English description, I use english_label = pydash.get(record, 'labels.en.value'), which takes only the en label. If you are interested in labels in other languages, replace en with the two-letter language codes used in Wikidata, e.g. de, es or it.
- See here for an example of what the underlying data (one JSON object per line) looks like: https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON#Example
- You will need to modify the line df_record_all = pd.DataFrame(columns=['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc']), which builds an empty table to save the data. Provide/modify the column names based on the data table you intend to build
- To run the script, open a command prompt (the terminal on Mac and Linux; on Windows I suggest using the Anaconda prompt, which comes with the Anaconda installation from step 2 of the main requirements, or alternatively cmder, the only command-line GUI on Windows that I have found to work the way I expect). Call Python with two arguments, where the Python script is located and where the .bz2 file is accessible, i.e. python.exe H:\Documents\wikidata.py "\your_local_directory\wikidata\latest-all.json.bz2"
- In case you are not using the Anaconda prompt, you will need to change your directory to where python.exe is installed and run the above command from there (on Mac and Linux you of course don’t need to change to Python’s installation directory; it suffices to call python, without .exe, and put the dump path after it)
- Let the script run (it might take from a few hours to a few days, since there are 57 million items in the dump, depending on the number of properties you extract and how frequently they appear in items)
- It will export CSV files in the “extracted” folder that you can use
- While running, the script prints the ID of the item currently being processed, and once an output file is exported, it prints CSV exported
- It generates a CSV for every 5,000 items (so as not to lose progress in case something goes wrong, and to keep the output files small/manageable). When the process finishes (and in case the number of items processed was not divisible by 5,000), it exports a final CSV containing the remaining results, named “final_csv_till_…”, and prints the message All items finished, final CSV exported (a short sketch of how these CSV chunks could be combined afterwards follows the script below)
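To make the pydash calls mentioned in the steps above more concrete, here is a minimal, self-contained sketch. It uses a made-up record (the same trimmed shape as shown in the requirements section), not the real dump, so you can run it on its own to see how pydash.has and pydash.get walk the nested structure:

import pydash

# a made-up, trimmed record for illustration only
record = {
    "id": "Q12345",
    "labels": {"en": {"language": "en", "value": "Example University"}},
    "claims": {
        "P625": [
            {"mainsnak": {"datavalue": {"value": {"latitude": 52.5, "longitude": 13.4}}}}
        ]
    },
}

# True only if the item has at least one P625 (coordinate) claim
print(pydash.has(record, 'claims.P625'))  # True

# [0] picks the first value of a property that may hold several
print(pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.latitude'))  # 52.5

# missing paths return None (or a default you provide) instead of raising an error
print(pydash.get(record, 'labels.de.value'))  # None
print(pydash.get(record, 'labels.de.value', default='no German label'))  # no German label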
Sample Python script
My sample Python script, which you can either use following the steps described above or modify as you wish and run on the dump file. It is a script I have adapted and modified with others’ help (thanks to Roland, Arno and Otmane) from here
#!/usr/bin/env python3
"""Get Wikidata dump records as a JSON stream (one JSON object per line)"""
# Modified script taken from this link: "https://www.reddit.com/r/LanguageTechnology/comments/7wc2oi/does_anyone_know_a_good_python_library_code/dtzsh2j/"

import bz2
import json

import pandas as pd
import pydash

i = 0
# an empty dataframe which will hold the extracted item information;
# modify the columns to match the data you intend to extract
df_record_all = pd.DataFrame(columns=['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc'])


def wikidata(filename):
    """Stream the bz2 dump without fully decompressing it and yield one parsed item per line."""
    with bz2.open(filename, mode='rt') as f:
        f.read(2)  # skip the first two characters: "[\n" (the dump is one large JSON array)
        for line in f:
            try:
                yield json.loads(line.rstrip(',\n'))
            except json.decoder.JSONDecodeError:
                continue


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description=__doc__
    )
    parser.add_argument(
        'dumpfile',
        help=(
            'a Wikidata dumpfile from: '
            'https://dumps.wikimedia.org/wikidatawiki/entities/'
            'latest-all.json.bz2'
        )
    )
    args = parser.parse_args()

    for record in wikidata(args.dumpfile):
        # only extract items with geographical coordinates (P625)
        if pydash.has(record, 'claims.P625'):
            print('i = ' + str(i) + ' item ' + record['id'] + ' started!' + '\n')
            latitude = pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.latitude')
            longitude = pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.longitude')
            english_label = pydash.get(record, 'labels.en.value')
            item_id = pydash.get(record, 'id')
            item_type = pydash.get(record, 'type')
            english_desc = pydash.get(record, 'descriptions.en.value')
            df_record = pd.DataFrame({'id': item_id, 'type': item_type, 'english_label': english_label, 'longitude': longitude, 'latitude': latitude, 'english_desc': english_desc}, index=[i])
            # pd.concat replaces DataFrame.append, which recent pandas versions no longer provide
            df_record_all = pd.concat([df_record_all, df_record], ignore_index=True)
            i += 1
            print(i)
            if (i % 5000 == 0):
                # write out the current batch so progress is not lost and files stay manageable
                df_record_all.to_csv(path_or_buf='\\wikidata\\extracted\\till_' + record['id'] + '_item.csv')
                print('i = ' + str(i) + ' item ' + record['id'] + ' Done!')
                print('CSV exported')
                df_record_all = pd.DataFrame(columns=['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc'])
        else:
            continue

    # export whatever is left after the last full batch of 5000 items
    df_record_all.to_csv(path_or_buf='\\wikidata\\extracted\\final_csv_till_' + record['id'] + '_item.csv')
    print('i = ' + str(i) + ' item ' + record['id'] + ' Done!')
    print('All items finished, final CSV exported!')
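Since the script writes its results in batches, you will end up with many CSV files in the extracted folder. As a possible follow-up (not part of the script above), here is a minimal sketch of how they could be combined into a single table with pandas, assuming the output files live under \wikidata\extracted\ as in the paths used above:

import glob

import pandas as pd

# collect all exported chunks ("till_..._item.csv" and "final_csv_till_..._item.csv")
csv_files = glob.glob(r'\wikidata\extracted\*_item.csv')

# read every chunk (the first column is the saved dataframe index) and stack them into one table
df_all = pd.concat((pd.read_csv(f, index_col=0) for f in csv_files), ignore_index=True)
df_all.to_csv(r'\wikidata\extracted\all_items_combined.csv', index=False)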