Image suggestions

Don’t leave Wikipedia articles and sections without images: here’s the Image suggestions data pipeline.
Friends call me ALIS (pronounced Alice) and SLIS (pronounced slice).
Can’t wait!

You need access to one of the Wikimedia Foundation’s analytics clients, also known as stat boxes. Then try out the first step:
me@my_box:~$ ssh stat1008.eqiad.wmnet # Or pick another one
me@stat1008:~$ export http_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ export https_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/structured-data/image-suggestions.git is
me@stat1008:~$ cd is
me@stat1008:~/is$ conda-analytics-clone MY_ENV
me@stat1008:~/is$ source conda-analytics-activate MY_ENV
(MY_ENV) me@stat1008:~/is$ conda env update -n MY_ENV -f conda-environment.yaml
(MY_ENV) me@stat1008:~/is$ python image_suggestions/commonswiki_file.py ME MY_WEEKLY_SNAPSHOT MY_PREVIOUS_WEEKLY_SNAPSHOT
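For instance, assuming the first argument is your shell username and that weekly snapshots are identified by YYYY-MM-DD dates (both assumptions; check the data lake for the actual snapshot values), a run could look like:

(MY_ENV) me@stat1008:~/is$ python image_suggestions/commonswiki_file.py jdoe 2024-01-01 2023-12-25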
Get your hands dirty
Install the development environment:
me@stat1008:~/is$ conda-analytics-clone MY_DEV_ENV
me@stat1008:~/is$ source conda-analytics-activate MY_DEV_ENV
(MY_DEV_ENV) me@stat1008:~/is$ conda env update -n MY_DEV_ENV -f dev-conda-environment.yaml
Test
(MY_DEV_ENV) me@stat1008:~/is$ tox -e pytest
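tox forwards anything after -- to the underlying command, so if the pytest environment is configured with posargs (an assumption about this repo’s tox.ini), you can target a subset of tests, e.g. by pytest keyword:

(MY_DEV_ENV) me@stat1008:~/is$ tox -e pytest -- -k commonswiki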
Lint
(MY_DEV_ENV) me@stat1008:~/is$ tox -e lint
Docs
(MY_DEV_ENV) me@stat1008:~/is$ sphinx-build docs/ docs/_build/
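To browse the built docs from your local machine, one option is Python’s built-in HTTP server plus an SSH tunnel; port 8000 below is an arbitrary free port, not something the repo prescribes:

(MY_DEV_ENV) me@stat1008:~/is$ python -m http.server 8000 --directory docs/_build/
me@my_box:~$ ssh -N stat1008.eqiad.wmnet -L 8000:localhost:8000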
Trigger an Airflow test run

Follow this walkthrough to simulate a production execution of the pipeline on your stat box. Inspired by this snippet.
Build your artifact

1. On the repository’s GitLab page, pick the branch you want to test from the drop-down menu
2. Click on the pipeline status button; it should be a green tick
3. Click on the play button next to publish_conda_env and wait until it’s done
4. On the left sidebar, go to Packages and registries > Package Registry
5. Click on the first item in the list, then copy the Asset URL. It should look something like https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/package_files/1218/download
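If you prefer the terminal over the UI, the Package Registry is also queryable through GitLab’s packages API; this is a sketch, not part of the pipeline’s tooling (note that the project path must be URL-encoded):

me@stat1008:~$ curl -s "https://gitlab.wikimedia.org/api/v4/projects/repos%2Fstructured-data%2Fimage-suggestions/packages"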
Get your artifact ready
me@stat1008:~$ mkdir artifacts
me@stat1008:~$ cd artifacts
me@stat1008:~$ wget -O MY_ARTIFACT MY_COPIED_ASSET_URL
me@stat1008:~$ hdfs dfs -mkdir artifacts
me@stat1008:~$ hdfs dfs -copyFromLocal MY_ARTIFACT artifacts
me@stat1008:~$ hdfs dfs -chmod -R o+rx artifacts
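Before moving on, it’s worth double-checking that the artifact landed where the DAG will look for it:

me@stat1008:~$ hdfs dfs -ls artifacts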
Create your test Hive DB
me@stat1008:~$ sudo -u analytics-privatedata hive
hive (default)> create database MY_TEST_HIVE_DB; exit;
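Standard Hive statements work for verifying the database, and for cleaning up once you’re done testing (Hive stores database names lowercased):

hive (default)> show databases like 'my_test_hive_db';
hive (default)> drop database MY_TEST_HIVE_DB cascade; -- after testing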
Spin up an Airflow instance
On your stat box:
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
me@stat1008:~$ cd airflow-dags
me@stat1008:~$ sudo -u analytics-privatedata rm -fr /tmp/MY_AIRFLOW_HOME # If you've previously run the next command
me@stat1008:~$ sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/MY_AIRFLOW_HOME -p MY_PORT platform_eng
On your local box:
me@my_box:~$ ssh -t -N stat1008.eqiad.wmnet -L MY_PORT:stat1008.eqiad.wmnet:MY_PORT
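Leave the tunnel running in the foreground; from another local shell you can sanity-check that the Airflow web UI answers:

me@my_box:~$ curl -I http://localhost:MY_PORT/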
Trigger the DAG run

1. Go to http://localhost:MY_PORT/ in your browser
2. On the top bar, go to Admin > Variables
3. Click on the middle (Edit record) button next to the platform_eng/dags/image_suggestions_dag.py Key
4. In the Val field, update "conda_env": "hdfs://analytics-hadoop/user/ME/artifacts/MY_ARTIFACT" and "hive_db": "MY_TEST_HIVE_DB" (see the sketch after this list)
5. Add any other relevant DAG properties
6. Click on the Save button
7. On the top bar, go to DAGs and click on the image_suggestions slider. This should trigger an automatic DAG run
8. Click on image_suggestions
9. You’re all set! Don’t forget to manually fail the hive_to_cassandra tasks:
   - Click on the square next to the first hive_to_cassandra task
   - Click on the Mark state as… blue button > failed > Downstream > Mark as failed
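For reference, after steps 4 and 5 the Val field might look like the sketch below. This assumes the variable holds a JSON object; keep whatever other keys are already in there, and remember that ME, MY_ARTIFACT, and MY_TEST_HIVE_DB are placeholders:

{
    "conda_env": "hdfs://analytics-hadoop/user/ME/artifacts/MY_ARTIFACT",
    "hive_db": "MY_TEST_HIVE_DB"
}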
Release

1. On the left sidebar, go to CI/CD > Pipelines
2. Click on the play button, select trigger_release
3. If the job went fine, you’ll find a new artifact in the Package Registry

We follow Data Engineering’s workflow_utils:
- the main branch is on a .dev release
- releases are made by removing the .dev suffix and committing a tag (for example, 0.5.0.dev would be released as 0.5.0)
Deploy

1. On the left sidebar, go to CI/CD > Pipelines
2. Click on the play button and select bump_on_airflow_dags. This will create a merge request at airflow-dags
3. Double-check it and merge
4. Deploy the DAGs:
me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/platform_eng/
me@deploy1002:~$ git pull
me@deploy1002:~$ scap deploy
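scap deploy also accepts a free-form message that ends up in the deploy log, which helps others see what changed; the wording here is just an example:

me@deploy1002:~$ scap deploy "image-suggestions: new release"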
See the docs for more details.