Image suggestions#

Don’t leave Wikipedia articles and sections without images: here’s the Image suggestions data pipeline.

Friends call me ALIS (pronounced Alice) and SLIS (pronounced slice).

Can’t wait!#

You need access to a Wikimedia Foundation analytics client, also known as a stat box. Then, try out the first step:

me@my_box:~$ ssh stat1008.eqiad.wmnet  # Or pick another one
me@stat1008:~$ export http_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ export https_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/structured-data/image-suggestions.git is
me@stat1008:~$ cd is
me@stat1008:~/is$ conda-analytics-clone MY_ENV
me@stat1008:~/is$ source conda-analytics-activate MY_ENV
(MY_ENV) me@stat1008:~/is$ conda env update -n MY_ENV -f conda-environment.yaml
(MY_ENV) me@stat1008:~/is$ python image_suggestions/commonswiki_file.py ME MY_WEEKLY_SNAPSHOT MY_PREVIOUS_WEEKLY_SNAPSHOT
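
The pipeline talks to the analytics Hadoop cluster, so if the run fails with authentication or permission errors, first check that you have a valid Kerberos ticket:

me@stat1008:~$ klist   # Any valid ticket?
me@stat1008:~$ kinit   # If not, get one (it will prompt for your Kerberos password)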

Get your hands dirty#

Install the development environment:

me@stat1008:~/is$ conda-analytics-clone MY_DEV_ENV
me@stat1008:~/is$ source conda-analytics-activate MY_DEV_ENV
(MY_DEV_ENV) me@stat1008:~/is$ conda env update -n MY_DEV_ENV -f dev-conda-environment.yaml
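
The test and lint environments described below can also be run in a single invocation, for example:

(MY_DEV_ENV) me@stat1008:~/is$ tox -e pytest,lint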

Test#

(MY_DEV_ENV) me@stat1008:~/is$ tox -e pytest

Lint#

(MY_DEV_ENV) me@stat1008:~/is$ tox -e lint

Docs#

(MY_DEV_ENV) me@stat1008:~/is$ sphinx-build docs/ docs/_build/
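
The HTML ends up on the stat box, so one way to browse it is a throwaway HTTP server plus an SSH tunnel from your local machine (port 8000 is just an example, pick any free one):

(MY_DEV_ENV) me@stat1008:~/is$ python -m http.server 8000 --bind 127.0.0.1 --directory docs/_build/
me@my_box:~$ ssh -N stat1008.eqiad.wmnet -L 8000:localhost:8000  # Then open http://localhost:8000/ in your browser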

Trigger an Airflow test run#

Follow this walkthrough to simulate a production run of the pipeline on your stat box. Inspired by this snippet.

Build your artifact#

  1. Pick a branch you want to test from the drop-down menu

  2. Click on the pipeline status button; it should be a green tick

  3. Click on the play button next to publish_conda_env and wait until it's done

  4. On the left sidebar, go to Packages and registries > Package Registry

  5. Click on the first item in the list, then copy the Asset URL. It should be something like https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/package_files/1218/download
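
If you'd rather stay in the terminal, GitLab's Packages API exposes the same list; a minimal sketch from your local box, assuming the project is public and using the URL-encoded project path:

me@my_box:~$ curl -s "https://gitlab.wikimedia.org/api/v4/projects/repos%2Fstructured-data%2Fimage-suggestions/packages" | python -m json.tool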

Get your artifact ready#

me@stat1008:~$ mkdir artifacts
me@stat1008:~$ cd artifacts
me@stat1008:~/artifacts$ wget -O MY_ARTIFACT MY_COPIED_ASSET_URL
me@stat1008:~/artifacts$ hdfs dfs -mkdir artifacts
me@stat1008:~/artifacts$ hdfs dfs -copyFromLocal MY_ARTIFACT artifacts
me@stat1008:~/artifacts$ hdfs dfs -chmod -R o+rx artifacts
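
If wget can't reach GitLab, re-export the webproxy variables from the quick-start. Once copied, you can double-check that the artifact landed in HDFS and is readable by others:

me@stat1008:~/artifacts$ hdfs dfs -ls artifacts  # MY_ARTIFACT should be listed and world-readable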

Create your test Hive DB#

me@stat1008:~$ sudo -u analytics-privatedata hive
hive (default)> create database MY_TEST_HIVE_DB; exit;
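
To confirm the database exists without opening an interactive session:

me@stat1008:~$ sudo -u analytics-privatedata hive -e 'SHOW DATABASES' | grep -i MY_TEST_HIVE_DB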

Spin up an Airflow instance#

On your stat box:

me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
me@stat1008:~$ cd airflow-dags
me@stat1008:~$ sudo -u analytics-privatedata rm -fr /tmp/MY_AIRFLOW_HOME  # If you've previously run the next command
me@stat1008:~$ sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/MY_AIRFLOW_HOME -p MY_PORT platform_eng

On your local box:

me@my_box:~$ ssh -t -N stat1008.eqiad.wmnet -L MY_PORT:stat1008.eqiad.wmnet:MY_PORT
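
The tunnel command blocks, so from another local terminal you can quickly check that the Airflow web server is reachable before opening the browser:

me@my_box:~$ curl -sI http://localhost:MY_PORT/ | head -n 1  # Expect a 200 or a redirect to the Airflow home page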

Trigger the DAG run#

  1. Go to http://localhost:MY_PORT/ in your browser

  2. On the top bar, go to Admin > Variables

  3. Click on the middle button (Edit record) next to the platform_eng/dags/image_suggestions_dag.py Key

  4. In the Val field, set "conda_env" to "hdfs://analytics-hadoop/user/ME/artifacts/MY_ARTIFACT" and "hive_db" to "MY_TEST_HIVE_DB"

  5. Add any other relevant DAG properties

  6. Click on the Save button

  7. On the top bar, go to DAGs and click on the image_suggestions slider. This should trigger an automatic DAG run

  8. Click on image_suggestions

You’re all set! Don’t forget to manually fail the hive_to_cassandra tasks:

  1. Click on the square next to the first hive_to_cassandra task

  2. Click on the blue Mark state as… button > failed > Downstream > Mark as failed
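
While the run progresses, the Spark jobs launched by the DAG show up in YARN; if you want to keep an eye on them from the stat box, something along these lines works (assuming a valid Kerberos ticket; the grep filter is just an example, the actual application names may differ):

me@stat1008:~$ yarn application -list 2>/dev/null | grep -i image_suggestions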

Release#

  1. On the left sidebar, go to CI/CD > Pipelines

  2. Click on the play button and select trigger_release

  3. If the job succeeds, you'll find a new artifact in the Package Registry

We follow Data Engineering's workflow_utils:

  - the main branch is on a .dev release

  - releases are made by removing the .dev suffix and committing a tag

Deploy#

  1. On the left sidebar, go to CI/CD > Pipelines

  2. Click on the play button and select bump_on_airflow_dags. This will create a merge request at airflow-dags

  3. Double-check it and merge

  4. Deploy the DAGs:

me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/platform_eng/
me@deploy1002:~$ git pull
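me@deploy1002:~$ git log --oneline -3  # Optional sanity check: the bump_on_airflow_dags merge should be the latest commit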
me@deploy1002:~$ scap deploy

See the docs for more details.

API documentation#