Pronounced "surplus", as it's simply better, if not the best!
Simple Algorithm for Recommendation (SAR) is a neighborhood based algorithm for personalized recommendations based on user transaction history. SAR recommends items that are most similar to the ones that the user already has an existing affinity for. Two items are similar if the users that interacted with one item are also likely to have interacted with the other. A user has an affinity to an item if they have interacted with it in the past.
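As a minimal illustration of the similarity notion above (plain Python, not the sarplus implementation), the Jaccard similarity between two items can be computed from the sets of users who interacted with each:

```python
# Toy interaction data: (user_id, item_id) pairs.
interactions = [(1, 1), (1, 2), (2, 1), (3, 1), (3, 3)]

# Collect the set of users that interacted with each item.
users_by_item = {}
for user, item in interactions:
    users_by_item.setdefault(item, set()).add(user)

def jaccard(item_a, item_b):
    """Jaccard similarity: |co-occurring users| / |users of either item|."""
    a, b = users_by_item[item_a], users_by_item[item_b]
    return len(a & b) / len(a | b)

print(jaccard(1, 2))  # items 1 and 2 share user 1 out of {1, 2, 3} -> 1/3
```

Two items that share many interacting users score close to 1; items with no co-occurring users score 0.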
SARplus is an efficient implementation of this algorithm for Spark.
Features:

| # Users | # Items | # Ratings | Runtime | Environment | Dataset |
|---------|---------|-----------|---------|-------------|---------|
| 2.5mio | 35k | 100mio | 1.3h | Databricks, 8 workers, Azure Standard DS3 v2 (4 core machines) | |
There are a couple of key optimizations under the hood; DEVELOPMENT.md covers them in detail.

Usage:
```python
from pysarplus import SARPlus

# spark dataframe with user/item/rating (and optional timestamp) tuples
train_df = spark.createDataFrame([
    (1, 1, 1),
    (1, 2, 1),
    (2, 1, 1),
    (3, 1, 1),
    (3, 3, 1)],
    ['user_id', 'item_id', 'rating'])

# spark dataframe with user/item/rating tuples to score
test_df = spark.createDataFrame([
    (1, 1, 1),
    (3, 3, 1)],
    ['user_id', 'item_id', 'rating'])

model = SARPlus(spark, col_user='user_id', col_item='item_id',
                col_rating='rating', col_timestamp='timestamp')
model.fit(train_df, similarity_type='jaccard')

model.recommend_k_items(test_df, 'sarplus_cache', top_k=3).show()

# For Databricks
# model.recommend_k_items(test_df, 'dbfs:/mnt/sarpluscache', top_k=3).show()
```
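Conceptually, SAR scores candidate items by combining a user's affinity vector with the item-to-item similarity matrix. A hedged NumPy sketch of that scoring step (the matrices here are illustrative toy values, not the sarplus internals):

```python
import numpy as np

# Item-to-item similarity matrix S (items x items), e.g. Jaccard similarities
# for items 1-3 from the toy training data above.
S = np.array([
    [1.0, 1/3, 1/3],
    [1/3, 1.0, 0.0],
    [1/3, 0.0, 1.0],
])

# User 1's affinity vector over items (from the ratings above:
# the user interacted with items 1 and 2, not item 3).
affinity = np.array([1.0, 1.0, 0.0])

# Candidate scores: affinity . S; sort descending for the top-k items.
scores = affinity @ S          # [4/3, 4/3, 1/3]
top_k = np.argsort(-scores)[:3]
print(scores, top_k)
```

Items the user already interacted with can be filtered from `top_k` before presenting recommendations.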
If you run this from a Jupyter notebook, insert the following cell prior to the code above:
```python
import os

SUBMIT_ARGS = "--packages com.microsoft.sarplus:sarplus:0.5.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("sample")
    .master("local[*]")
    .config("spark.driver.memory", "1G")
    .config("spark.sql.shuffle.partitions", "1")
    .config("spark.sql.crossJoin.enabled", True)
    .config("spark.ui.enabled", False)
    .getOrCreate()
)
```
Install the Python package:

```bash
pip install pysarplus
```

For the PySpark shell, also load the Scala package and enable cross joins:

```bash
pyspark --packages com.microsoft.sarplus:sarplus:0.5.0 --conf spark.sql.crossJoin.enabled=true
```
You must set the crossJoin property to enable calculation of the similarity matrix (Clusters / \<Cluster\> / Configuration / Spark Config):

```
spark.sql.crossJoin.enabled true
spark.sql.sources.default parquet
spark.sql.legacy.createHiveTableByDefault true
```
This will install the C++, Python and Scala code on your cluster. You'll also have to mount shared storage:
1. Create an Azure Storage Blob:
   - Create a storage account
   - Create a container (e.g. sarpluscache)
2. Create a Databricks access token:
   - Navigate to User / User Settings
   - Generate a new token: enter 'sarplus'
3. Store the storage key as a secret using the Databricks CLI (see the Databricks CLI installation docs):

   ```bash
   databricks configure --token
   databricks secrets create-scope --scope all --initial-manage-principal users
   databricks secrets put --scope all --key sarpluscache
   ```

4. Run the mount code:

   ```python
   dbutils.fs.mount(
       source="wasbs://sarpluscache@<accountname>.blob.core.windows.net",
       mount_point="/mnt/sarpluscache",
       extra_configs={
           "fs.azure.account.key.<accountname>.blob.core.windows.net":
               dbutils.secrets.get(scope="all", key="sarpluscache")
       })
   ```
Disable annoying logging:

```python
import logging

logging.getLogger("py4j").setLevel(logging.ERROR)
```
See DEVELOPMENT.md for implementation details and development information.