Commit 444e6c4

Merge pull request #927 from microsoft/staging
Staging to master
2 parents cadf54f + 56f80b6 commit 444e6c4

11 files changed

Lines changed: 196 additions & 142 deletions

README.md

Lines changed: 20 additions & 33 deletions
@@ -1,19 +1,21 @@
11
# Recommenders
22

3+
[![Documentation Status](https://readthedocs.org/projects/microsoft-recommenders/badge/?version=latest)](https://microsoft-recommenders.readthedocs.io/en/latest/?badge=latest)
4+
35
This repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:
4-
- [Prepare Data](notebooks/01_prepare_data/README.md): Preparing and loading data for each recommender algorithm
5-
- [Model](notebooks/02_model/README.md): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) or eXtreme Deep Factorization Machines ([xDeepFM](https://arxiv.org/abs/1803.05170)).
6-
- [Evaluate](notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
6+
- [Prepare Data](notebooks/01_prepare_data): Preparing and loading data for each recommender algorithm
7+
- [Model](notebooks/02_model): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) or eXtreme Deep Factorization Machines ([xDeepFM](https://arxiv.org/abs/1803.05170)).
8+
- [Evaluate](notebooks/03_evaluate): Evaluating algorithms with offline metrics
79
- [Model Select and Optimize](notebooks/04_model_select_and_optimize): Tuning and optimizing hyperparameters for recommender models
8-
- [Operationalize](notebooks/05_operationalize/README.md): Operationalizing models in a production environment on Azure
10+
- [Operationalize](notebooks/05_operationalize): Operationalizing models in a production environment on Azure
911

1012
Several utilities are provided in [reco_utils](reco_utils) to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are included for self-study and customization in your own applications. See the [reco_utils documentation](https://readthedocs.org/projects/microsoft-recommenders/).
1113

1214

1315
For a more detailed overview of the repository, please see the documents at the [wiki page](https://github.com/microsoft/recommenders/wiki/Documents-and-Presentations).
1416

1517
## Getting Started
16-
Please see the [setup guide](SETUP.md) for more details on setting up your machine locally, on Spark, or on [Azure Databricks](SETUP.md#setup-guide-for-azure-databricks).
18+
Please see the [setup guide](SETUP.md) for more details on setting up your machine locally, on a [data science virtual machine (DSVM)](https://azure.microsoft.com/en-gb/services/virtual-machines/data-science-virtual-machines/) or on [Azure Databricks](SETUP.md#setup-guide-for-azure-databricks).
1719

1820
To setup on your local machine:
1921
1. Install Anaconda with Python >= 3.6. [Miniconda](https://conda.io/miniconda.html) is a quick way to get started.
@@ -35,27 +37,11 @@ To setup on your local machine:
3537
```
3638
5. Start the Jupyter notebook server
3739
```
38-
cd notebooks
3940
jupyter notebook
4041
```
41-
6. Run the [SAR Python CPU MovieLens](notebooks/00_quick_start/sar_movielens.ipynb) notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".
42-
43-
**NOTE** - The [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) notebooks require a PySpark environment to run. Please follow the steps in the [setup guide](SETUP.md#dependencies-setup) to run these notebooks in a PySpark environment.
44-
45-
## Install this repository via PIP
46-
A [setup.py](reco_utils/setup.py) file is provided in order to simplify the installation of this utilities in this repo from the main directory.
47-
This still requires the conda environment to be installed as described above. Once the necessary dependencies are installed you can use the following command to install reco_utils as it's own python package.
48-
49-
pip install -e reco_utils
50-
51-
It is also possible to install directly from Github. Or from a specific branch as well.
52-
53-
pip install -e git+https://github.com/microsoft/recommenders/#egg=pkg\&subdirectory=reco_utils
54-
pip install -e git+https://github.com/microsoft/recommenders/@staging#egg=pkg\&subdirectory=reco_utils
55-
56-
57-
**NOTE** - The pip installation does not install any of the necessary package dependencies, it is expected that conda will be used as shown above to setup the environment for the utilities being used.
42+
6. Run the [SAR Python CPU MovieLens](notebooks/00_quick_start/sar_movielens.ipynb) notebook under the `00_quick_start` folder. Make sure to change the kernel to "Python (reco)".
5843
44+
**NOTE** - The [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) notebooks require a PySpark environment to run. Please follow the steps in the [setup guide](SETUP.md#dependencies-setup) to run these notebooks in a PySpark environment. For the deep learning algorithms, it is recommended to use a GPU machine.
5945
6046
## Algorithms
6147
@@ -90,31 +76,32 @@ We provide a [benchmark notebook](benchmarks/movielens.ipynb) to illustrate how
9076
| [NCF](notebooks/02_model/ncf_deep_dive.ipynb) | 0.107720 | 0.396118 | 0.347296 | 0.180775 | N/A | N/A | N/A | N/A |
9177
| [FastAI](notebooks/00_quick_start/fastai_movielens.ipynb) | 0.025503 | 0.147866 | 0.130329 | 0.053824 | 0.943084 | 0.744337 | 0.285308 | 0.287671 |
9278
93-
9479
## Contributing
95-
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
9680
81+
This project welcomes contributions and suggestions. Before contributing, please see our [contribution guidelines](CONTRIBUTING.md).
9782
9883
## Build Status
9984
100-
| Build Type | Branch | Status | | Branch | Status |
101-
| --- | --- | --- | --- | --- | --- |
85+
These tests are the nightly builds, which compute the smoke and integration tests. `master` is our main branch and `staging` is our development branch. We use `pytest` for testing python utilities in [reco_utils](reco_utils) and `papermill` for the [notebooks](notebooks). For more information about the testing pipelines, please see the [test documentation](tests/README.md).
86+
87+
### DSVM Build Status
88+
89+
The following tests run on a Windows and Linux DSVM daily. These machines run 24/7.
90+
91+
| Build Type | Branch | Status | | Branch | Status |
92+
| --- | --- | --- | --- | --- | --- |
10293
| **Linux CPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4792) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_staging?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4594) |
10394
| **Linux GPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu?branchName=master)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4997) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_staging?branchName=staging)](https://msdata.visualstudio.com/DefaultCollection/AlgorithmsAndDataScience/_build/latest?definitionId=4998) |
10495
| **Linux Spark** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=4804) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/Recommenders/nightly_spark_staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=5186) |
10596
| **Windows CPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_win?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6743) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_staging_win?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6752) |
10697
| **Windows GPU** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_win?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6756) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_gpu_staging_win?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6761) |
10798
| **Windows Spark** | master | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_win?branchName=master)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6757) | | staging | [![Status](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_apis/build/status/nightly_spark_staging_win?branchName=staging)](https://msdata.visualstudio.com/AlgorithmsAndDataScience/_build/latest?definitionId=6754) |
10899
109-
## AzureML Build Status
100+
### AzureML Build Status
110101
111-
These DevOps pipelines run the existing tests on AzureML.
102+
The following tests run on an AzureML [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-compute-target). AzureML allows to programmatically start a virtual machine, execute the tests, gather the results in [Azure DevOps](https://azure.microsoft.com/en-gb/services/devops/) and shut down the machine.
112103
113104
| Build Type | Branch | Status | | Branch | Status |
114105
| --- | --- | --- | --- | --- | --- |
115106
| **nightly_cpu_tests** | master | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/nightly_cpu_tests?branchName=master)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=25&branchName=master) | | Staging | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/nightly_cpu_tests?branchName=staging)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=25&branchName=staging) |
116107
| **nightly_gpu_tests** | master | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/bp-nightly_gpu_tests?branchName=master)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=5&branchName=master) | | Staging | [![Build Status](https://dev.azure.com/best-practices/recommenders/_apis/build/status/bp-nightly_gpu_tests?branchName=staging)](https://dev.azure.com/best-practices/recommenders/_build/latest?definitionId=5&branchName=staging) |
117-
118-
119-
**NOTE** - these tests are the nightly builds, which compute the smoke and integration tests. Master is our main branch and staging is our development branch. We use `pytest` for testing python utilities in [reco_utils](reco_utils) and `papermill` for the [notebooks](notebooks). For more information about the testing pipelines, please see the [test documentation](tests/README.md).
120-

SETUP.md

Lines changed: 18 additions & 2 deletions
@@ -19,7 +19,8 @@ This document describes how to setup all the dependencies to run the notebooks i
1919
* [Requirements of Azure Databricks](#requirements-of-azure-databricks)
2020
* [Repository installation](#repository-installation)
2121
* [Troubleshooting Installation on Azure Databricks](#Troubleshooting-Installation-on-Azure-Databricks)
22-
* [Prepare Azure Databricks for Operationalization](#prepare-azure-databricks-for-operationalization)
22+
* [Prepare Azure Databricks for Operationalization](#prepare-azure-databricks-for-operationalization)
23+
* [Install the utilities via PIP](#install-the-utilities-via-pip)
2324
* [Setup guide for Docker](#setup-guide-for-docker)
2425

2526
## Compute environments
@@ -270,7 +271,7 @@ import reco_utils
270271

271272
* For the [reco_utils](reco_utils) import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder, if you zip directly above the Recommenders folder, it won't work.
272273

273-
## Prepare Azure Databricks for Operationalization
274+
### Prepare Azure Databricks for Operationalization
274275

275276
This repository includes an end-to-end example notebook that uses Azure Databricks to estimate a recommendation model using matrix factorization with Alternating Least Squares, writes pre-computed recommendations to Azure Cosmos DB, and then creates a real-time scoring service that retrieves the recommendations from Cosmos DB. In order to execute that [notebook](notebooks/05_operationalize/als_movie_o16n.ipynb), you must install the Recommenders repository as a library (as described above), **AND** you must also install some additional dependencies. With the *Quick install* method, you just need to pass an additional option to the [installation script](scripts/databricks_install.py).
276277

@@ -313,6 +314,21 @@ Additionally, you must install the [spark-cosmosdb connector](https://docs.datab
313314

314315
</details>
315316

317+
## Install the utilities via PIP
318+
319+
A [setup.py](reco_utils/setup.py) file is provided in order to simplify the installation of the utilities in this repo from the main directory.
320+
321+
This still requires the conda environment to be installed as described above. Once the necessary dependencies are installed, you can use the following command to install `reco_utils` as a python package.
322+
323+
pip install -e reco_utils
324+
325+
It is also possible to install directly from Github. Or from a specific branch as well.
326+
327+
pip install -e git+https://github.com/microsoft/recommenders/#egg=pkg\&subdirectory=reco_utils
328+
pip install -e git+https://github.com/microsoft/recommenders/@staging#egg=pkg\&subdirectory=reco_utils
329+
330+
**NOTE** - The pip installation does not install any of the necessary package dependencies, it is expected that conda will be used as shown above to setup the environment for the utilities being used.
331+
316332
## Setup guide for Docker
317333

318334
A [Dockerfile](docker/Dockerfile) is provided to build images of the repository to simplify setup for different environments. You will need [Docker Engine](https://docs.docker.com/install/) installed on your system.

notebooks/01_prepare_data/README.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ data preparation tasks witnessed in recommendation system development.
88
| --- | --- |
99
| [data_split](data_split.ipynb) | Details on splitting data (randomly, chronologically, etc). |
1010
| [data_transform](data_transform.ipynb) | Guidance on how to transform (implicit / explicit) data for building collaborative filtering typed recommender. |
11-
| [wikidata knowledge graph](wikidata_KG.ipynb) | Details on how to create a knowledge graph using Wikidata |
11+
| [wikidata knowledge graph](wikidata_knowledge_graph.ipynb) | Details on how to create a knowledge graph using Wikidata |
1212

1313
### Data split
1414

notebooks/01_prepare_data/wikidata_KG.ipynb renamed to notebooks/01_prepare_data/wikidata_knowledge_graph.ipynb

Lines changed: 11 additions & 15 deletions
@@ -5,7 +5,7 @@
55
"metadata": {},
66
"source": [
77
"## Wikidata Knowledge Graph Extraction\n",
8-
"Many recommendation algorithms (DKN, RippleNet, KGCN) use Knowledge Graphs as an external source of information. We found that one of the bottlenecks to benchmark current algorithms like DKN, RippleNet or KGCN is that they used Microsoft Satori. As Satori is not open source, it's not possible to replicate the results found in the papers. The solution is using other open source KGs.\n",
8+
"Many recommendation algorithms (DKN, RippleNet, KGCN) use Knowledge Graphs (KGs) as an external source of information. We found that one of the bottlenecks to benchmark current algorithms like DKN, RippleNet or KGCN is that they used Microsoft Satori. As Satori is not open source, it's not possible to replicate the results found in the papers. The solution is using other open source KGs.\n",
99
"\n",
1010
"The goal of this notebook is to provide examples of how to interact with Wikipedia queries and Wikidata to extract a Knowledge Graph that can be used with the mentioned algorithms.\n",
1111
"\n",
@@ -24,7 +24,8 @@
2424
"name": "stdout",
2525
"output_type": "stream",
2626
"text": [
27-
"System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]\n"
27+
"System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
28+
"[GCC 7.3.0]\n"
2829
]
2930
}
3031
],
@@ -34,19 +35,17 @@
3435
"sys.path.append(\"../../\")\n",
3536
"print(\"System version: {}\".format(sys.version))\n",
3637
"\n",
38+
"import papermill as pm\n",
3739
"import pandas as pd\n",
40+
"import networkx as nx\n",
41+
"import matplotlib.pyplot as plt\n",
42+
"from reco_utils.dataset import movielens\n",
43+
"\n",
3844
"from reco_utils.dataset.wikidata import (search_wikidata, \n",
3945
" find_wikidata_id, \n",
4046
" query_entity_links, \n",
4147
" read_linked_entities,\n",
42-
" query_entity_description)\n",
43-
"\n",
44-
"import networkx as nx\n",
45-
"import matplotlib.pyplot as plt\n",
46-
"from tqdm import tqdm\n",
47-
"\n",
48-
"from reco_utils.dataset import movielens\n",
49-
"from reco_utils.common.notebook_utils import is_jupyter"
48+
" query_entity_description)\n"
5049
]
5150
},
5251
{
@@ -548,11 +547,8 @@
548547
}
549548
],
550549
"source": [
551-
"# Record results with papermill for tests - ignore this cell\n",
552-
"if is_jupyter():\n",
553-
" # Record results with papermill for unit-tests\n",
554-
" import papermill as pm\n",
555-
" pm.record(\"length_result\", number_movies)"
550+
"# Record results with papermill for unit-tests\n",
551+
"pm.record(\"length_result\", number_movies)"
556552
]
557553
},
558554
{

reco_utils/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
22
# Licensed under the MIT License.
33

44
__title__ = "Microsoft Recommenders"
5-
__version__ = "2019.06"
5+
__version__ = "2019.09"
66
__author__ = "RecoDev Team at Microsoft"
77
__license__ = "MIT"
88
__copyright__ = "Copyright 2018-present Microsoft Corporation"

reco_utils/dataset/wikidata.py

Lines changed: 8 additions & 10 deletions
@@ -3,7 +3,9 @@
33

44
import pandas as pd
55
import requests
6+
import logging
67

8+
logger = logging.getLogger(__name__)
79

810
API_URL_WIKIPEDIA = "https://en.wikipedia.org/w/api.php"
911
API_URL_WIKIDATA = "https://query.wikidata.org/sparql"
@@ -57,8 +59,8 @@ def find_wikidata_id(name, limit=1, session=None):
5759
response = session.get(API_URL_WIKIPEDIA, params=params)
5860
page_id = response.json()["query"]["search"][0]["pageid"]
5961
except Exception as e:
60-
# TODO: log exception
61-
# print(e)
62+
# TODO: distinguish between connection error and entity not found
63+
logger.error("ENTITY NOT FOUND")
6264
return "entityNotFound"
6365

6466
params = dict(
@@ -75,8 +77,8 @@ def find_wikidata_id(name, limit=1, session=None):
7577
"wikibase_item"
7678
]
7779
except Exception as e:
78-
# TODO: log exception
79-
# print(e)
80+
# TODO: distinguish between connection error and entity not found
81+
logger.error("ENTITY NOT FOUND")
8082
return "entityNotFound"
8183

8284
return entity_id
@@ -133,9 +135,7 @@ def query_entity_links(entity_id, session=None):
133135
API_URL_WIKIDATA, params=dict(query=query, format="json")
134136
).json()
135137
except Exception as e:
136-
# TODO log exception
137-
# print(e)
138-
# print("Entity ID not Found in Wikidata")
138+
logger.error("ENTITY NOT FOUND")
139139
return {}
140140

141141
return data
@@ -195,9 +195,7 @@ def query_entity_description(entity_id, session=None):
195195
r = session.get(API_URL_WIKIDATA, params=dict(query=query, format="json"))
196196
description = r.json()["results"]["bindings"][0]["o"]["value"]
197197
except Exception as e:
198-
# TODO: log exception
199-
# print(e)
200-
# print("Description not found")
198+
logger.error("DESCRIPTION NOT FOUND")
201199
return "descriptionNotFound"
202200

203201
return description
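
Review note: the new `TODO: distinguish between connection error and entity not found` comments in `find_wikidata_id` could be addressed roughly as sketched below. This is a minimal illustration, not the code in this commit. The helper name `find_page_id`, the `"connectionError"` return value, and the exact search parameters are assumptions made for the example. It relies on the fact that `requests.exceptions.RequestException` derives from `IOError` (an alias of `OSError`), so network failures can be caught separately from a missing key or an empty result list.

```python
import logging

logger = logging.getLogger(__name__)

API_URL_WIKIPEDIA = "https://en.wikipedia.org/w/api.php"


def find_page_id(name, session, limit=1):
    """Hypothetical helper: look up a Wikipedia page id for `name`.

    `session` is any object with a requests-style `get(url, params=...)`
    method, for example a `requests.Session`.
    """
    # Assumed search parameters; the real function may pass different ones.
    params = dict(
        action="query",
        list="search",
        srsearch=name,
        srlimit=limit,
        format="json",
    )
    try:
        response = session.get(API_URL_WIKIPEDIA, params=params)
        return response.json()["query"]["search"][0]["pageid"]
    except OSError as e:
        # requests' network errors subclass IOError/OSError.
        logger.error("CONNECTION ERROR: %s", e)
        return "connectionError"  # hypothetical sentinel value
    except (KeyError, IndexError):
        # The response parsed, but contained no matching search result.
        logger.error("ENTITY NOT FOUND: %s", name)
        return "entityNotFound"
```

Because the module now logs through `logging.getLogger(__name__)`, a caller would only see these messages after configuring logging, for instance with `logging.basicConfig(level=logging.ERROR)`.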
