The goal of this project is to create a recommender system for a company that sells medical products.
Import packages and read in the data file
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('PBL 5 Recommendation Data.csv',encoding='latin-1')
Analyze the dataset
df.head()
df.info()
df['Order_Items.product_id'].nunique()
Create a dataframe that groups the orders by product and sums the columns for each product
pop = df.groupby('Order_Items.product_id').sum()
pop.head()
pop['Orders.subtotal'].max()
pop[pop['Orders.subtotal']==pop['Orders.subtotal'].max()]
pop['Order_Items.qty'].max()
See which product sold the largest total quantity
pop[pop['Order_Items.qty']==pop['Order_Items.qty'].max()]
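The same lookup can be written with `idxmax`, which returns the index label of the largest value directly. This is a minimal sketch on made-up data: the column names mirror the notebook, but the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy order lines; values are invented, column names follow the notebook.
toy = pd.DataFrame({
    "Order_Items.product_id": [10, 10, 20, 30, 30, 30],
    "Order_Items.qty": [1, 2, 7, 1, 1, 1],
    "Orders.subtotal": [5.0, 10.0, 14.0, 30.0, 30.0, 30.0],
})

# Sum the numeric columns per product.
totals = toy.groupby("Order_Items.product_id").sum(numeric_only=True)

# idxmax gives the product id (the index label) holding the maximum.
top_qty_product = totals["Order_Items.qty"].idxmax()
top_revenue_product = totals["Orders.subtotal"].idxmax()
```

Note that `idxmax` returns only the first label when several products tie, whereas the boolean-mask form used in the notebook returns every tied row.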
See which company accounts for the highest total subtotal
comp = df.groupby('Orders.company').sum()
comp[comp['Orders.subtotal']==comp['Orders.subtotal'].max()]
For each product, make a list of the customers who bought it
df_list = df.groupby('Products.id')['Customers.id'].apply(list)
df_list.head()
df_list.index
Create a product-by-customer matrix of 0s and 1s that shows which customers bought which items (most entries are 0)
df_final = pd.DataFrame(columns = df['Customers.id'].unique(), index = df['Products.id'].unique())
df_final.head()
df_final.fillna(0, inplace = True)
df_final.head()
df_list.loc[1.0]
df_final.index
df_final.drop(float('nan'), inplace = True)
for i in df_final.index:
    for j in df_list.loc[i]:
        # a single .loc call avoids chained indexing, which may write to a copy
        df_final.loc[i, j] = 1
df_final.head()
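As an alternative to filling the matrix cell by cell, a 0/1 product-by-customer matrix can be built in one step with `pd.crosstab`. The sketch below uses a small invented frame (`toy`) with the notebook's column names; it is an illustration of the idea, not a drop-in replacement tested against the real data.

```python
import pandas as pd

# Toy purchases; customer 102 bought product 3 twice.
toy = pd.DataFrame({
    "Products.id": [1, 1, 2, 3, 3, 3],
    "Customers.id": [100, 101, 100, 102, 102, 103],
})

# crosstab counts purchases per (product, customer) pair;
# clip(upper=1) turns the counts into a bought / not-bought indicator.
purchase_matrix = pd.crosstab(toy["Products.id"], toy["Customers.id"]).clip(upper=1)
```

Because only products and customers that appear in the data become rows and columns, a matrix built this way has no all-zero rows or columns to remove afterwards.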
df_list.loc[1846]
df_list.loc[2310]
Let's find similarity to item 1846
df_final.loc[2310,699]
df_final.index
import numpy as np
no_buy = []
for i in df_final.columns:
    # a customer column that is all zeros means that customer bought nothing
    if (df_final.loc[:, i] == 0).all():
        no_buy.append(i)
df_final = df_final.drop(no_buy, axis = 1)
no_sell = []
for i in df_final.index:
    # a product row that is all zeros means that product was never sold
    if (df_final.loc[i] == 0).all():
        no_sell.append(i)
no_sell
df_final.shape
Look at how many customers each product has had in total
num_cust = df_final.sum(axis = 1)
num_cust
num_cust.loc[1846]
num_cust[num_cust==num_cust.max()]
Create a standardized (z-score) series of the number of customers per product, negated so that more popular products get smaller values
m = np.mean(num_cust)
s = np.std(num_cust)
stan_num_cust = num_cust.apply(lambda x: ((x-m)/s))
stan_num_cust = stan_num_cust*-1
stan_num_cust.sort_values()
stan_num_cust.min()
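To make the standardization concrete, here is a small worked sketch on invented customer counts. The names `toy_counts`, `z` and `popularity_score` are hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption made for this example.

```python
import pandas as pd

# Toy number of customers per product (index = product id).
toy_counts = pd.Series([1, 2, 3, 10], index=[11, 12, 13, 14])

# z-score: subtract the mean, divide by the standard deviation.
# ddof=0 (population standard deviation) is an assumption of this sketch.
z = (toy_counts - toy_counts.mean()) / toy_counts.std(ddof=0)

# Negate so that the most popular product gets the smallest score,
# matching a ranking where smaller is better.
popularity_score = -z
```

After negation, product 14 (ten customers) has the lowest score, so adding a fraction of this score to a distance nudges popular products toward the top of a ranking sorted in ascending order.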
Create the recommender system: a function that takes a product as a parameter and scores every other product by cosine distance between their customer vectors plus a small popularity term
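#recommender system by customer
#the smaller the number the better
from scipy import spatial

def similarity(bought, other):
    # cosine distance between the two products' customer vectors
    spa = spatial.distance.cosine(list(df_final.loc[bought]), list(df_final.loc[other]))
    # negated popularity z-score, scaled down so it only nudges the ranking
    pop = stan_num_cust.loc[other]/100
    return spa + pop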
similarity(1846,2310)
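For intuition about the distance term, the sketch below applies `scipy.spatial.distance.cosine` to three invented 0/1 customer vectors. The vectors `a`, `b` and `c` are assumptions for illustration; each position stands for one customer.

```python
import numpy as np
from scipy import spatial

# Each vector marks which of four customers bought a product.
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 0, 0])  # shares one customer with a
c = np.array([0, 0, 1, 1])  # shares no customers with a

# Cosine distance is 1 minus the cosine of the angle between the vectors.
d_ab = spatial.distance.cosine(a, b)  # partial overlap: 1 - 1/sqrt(2)
d_ac = spatial.distance.cosine(a, c)  # no overlap: 1
d_aa = spatial.distance.cosine(a, a)  # identical vectors: 0
```

Products with overlapping customers end up closer to 0, while products with disjoint customer sets sit at 1, which is why smaller scores indicate better recommendations.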
def rec_list(bought):
    rec = pd.DataFrame(columns = ['rec'],index = df_final.index)
    for i in rec.index:
        rec.loc[i,'rec']=similarity(bought,i)
    return(rec.sort_values(by = 'rec',ascending = True))
Test the recommender system on product 1846.
rec_list(1846)
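Scoring one product at a time in a Python loop can be slow on a matrix with thousands of rows. The sketch below computes the cosine distances from one product to all others with a single matrix product. The function name `rank_by_cosine` and the toy matrix are hypothetical, and the popularity term is deliberately left out to keep the example short.

```python
import numpy as np
import pandas as pd

def rank_by_cosine(matrix: pd.DataFrame, bought) -> pd.Series:
    """Rank every other row of a 0/1 matrix by cosine distance to `bought`."""
    values = matrix.to_numpy(dtype=float)
    target = matrix.loc[bought].to_numpy(dtype=float)
    # Product of vector lengths for each row paired with the target row.
    norms = np.linalg.norm(values, axis=1) * np.linalg.norm(target)
    # All-zero rows would divide by zero; they come out as NaN here.
    with np.errstate(divide="ignore", invalid="ignore"):
        cos_sim = values @ target / norms
    distance = pd.Series(1.0 - cos_sim, index=matrix.index, name="rec")
    # Drop the product itself and sort so the closest products come first.
    return distance.drop(bought).sort_values()

# Toy product-by-customer matrix (rows = products, columns = customers).
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]],
    index=[1, 2, 3],
    columns=[100, 101, 102, 103],
)
ranked = rank_by_cosine(toy_matrix, 1)
```

In this toy case product 2 ranks ahead of product 3, because it shares a customer with product 1 while product 3 shares none.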