The goal of this project is to create a recommender system for a company that sells medical products.

Import packages and read in the data file

In [78]:
import pandas as pd
import matplotlib.pyplot as plt
In [79]:
df = pd.read_csv('PBL 5 Recommendation Data.csv',encoding='latin-1')
C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (20,33,73,106,158) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Analyze the dataset

In [80]:
df.head()
Out[80]:
Customers.id Customers.fname Customers.lname Customers.company Customers.create_date Customers.status Customers.mailing Customers.reminders Customers.tax_exempt Customers.account_id ... Products.google_shopping_label Products.product_option Products.size Products.material Products.arm_style Products.leg_style Products.seat_size Products.family_id Products.saved_status Products.freight_cost
0 797 Christy Dill Company0 1426018724 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN PF61071 0.0 NaN
1 3 John Smith Company1 1386089139 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN PF02132 NaN NaN
2 3 John Smith Company1 1386089139 NaN NaN NaN NaN NaN ... NaN NaN 2 x Extra large Nitrile NaN NaN NaN PF00342 0.0 NaN
3 4 James Anderson NaN 1386780263 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN PF04970 NaN NaN
4 5 Abraham Pollak Company3 1386861599 0.0 0.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN PF03045 NaN NaN

5 rows × 181 columns

In [81]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4194 entries, 0 to 4193
Columns: 181 entries, Customers.id to Products.freight_cost
dtypes: float64(98), int64(10), object(73)
memory usage: 5.8+ MB
In [82]:
df['Order_Items.product_id'].nunique()
Out[82]:
1753

Create a dataframe which sorts the dataframe by the product

In [83]:
pop = df.groupby('Order_Items.product_id').sum()
In [84]:
pop.head()
Out[84]:
Customers.id Customers.create_date Customers.status Customers.mailing Customers.reminders Customers.tax_exempt Customers.account_id Customers.sales_rep Customers.rewards Customers.profile_id ... Products.hygienic Products.default_quantity Products.shipping_length Products.shipping_width Products.shipping_height Products.shipping_weight Products.google_shopping_label Products.leg_style Products.saved_status Products.freight_cost
Order_Items.product_id
1.0 3 1386089139 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 13.23 10.04 10.24 0.0 0.0 0.0 0.0 0.0
11.0 417 1415207733 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0
14.0 818 1426432777 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0
15.0 1293 1436896396 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0
19.0 1577 2851966010 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 4.60 1.20 3.60 0.0 0.0 0.0 0.0 0.0

5 rows × 107 columns

In [85]:
pop['Orders.subtotal'].max()
Out[85]:
13781.119999999997
In [86]:
pop[pop['Orders.subtotal']==pop['Orders.subtotal'].max()]
Out[86]:
Customers.id Customers.create_date Customers.status Customers.mailing Customers.reminders Customers.tax_exempt Customers.account_id Customers.sales_rep Customers.rewards Customers.profile_id ... Products.hygienic Products.default_quantity Products.shipping_length Products.shipping_width Products.shipping_height Products.shipping_weight Products.google_shopping_label Products.leg_style Products.saved_status Products.freight_cost
Order_Items.product_id
1846.0 21864 25825138924 0.0 16.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 387.0 270.0 180.0 0.0 0.0 0.0 0.0 0.0

1 rows × 107 columns

In [87]:
pop['Order_Items.qty'].max()
Out[87]:
396

See which product sold the most quantity of items

In [88]:
pop[pop['Order_Items.qty']==pop['Order_Items.qty'].max()]
Out[88]:
Customers.id Customers.create_date Customers.status Customers.mailing Customers.reminders Customers.tax_exempt Customers.account_id Customers.sales_rep Customers.rewards Customers.profile_id ... Products.hygienic Products.default_quantity Products.shipping_length Products.shipping_width Products.shipping_height Products.shipping_weight Products.google_shopping_label Products.leg_style Products.saved_status Products.freight_cost
Order_Items.product_id
1846.0 21864 25825138924 0.0 16.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 387.0 270.0 180.0 0.0 0.0 0.0 0.0 0.0

1 rows × 107 columns

See which company made the most

In [89]:
comp = df.groupby('Orders.company').sum()
In [90]:
comp[comp['Orders.subtotal']==comp['Orders.subtotal'].max()]
Out[90]:
Customers.id Customers.create_date Customers.status Customers.mailing Customers.reminders Customers.tax_exempt Customers.account_id Customers.sales_rep Customers.rewards Customers.profile_id ... Products.hygienic Products.default_quantity Products.shipping_length Products.shipping_width Products.shipping_height Products.shipping_weight Products.google_shopping_label Products.leg_style Products.saved_status Products.freight_cost
Orders.company
Company59 17015 21497767740 0.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 322.5 225.0 150.0 0.0 0.0 0.0 0.0 0.0

1 rows × 108 columns

Make a list which shows us which products were bought by which customers

In [91]:
df_list = df.groupby('Products.id')['Customers.id'].apply(list)
In [92]:
df_list.head()
Out[92]:
Products.id
1.0                  [3]
19.0          [620, 957]
20.0        [1175, 2722]
22.0              [2215]
30.0    [330, 371, 2454]
Name: Customers.id, dtype: object
In [93]:
df_list.index
Out[93]:
Float64Index([    1.0,    19.0,    20.0,    22.0,    30.0,    35.0,    62.0,
                 64.0,    65.0,    74.0,
              ...
              25170.0, 25269.0, 25356.0, 25527.0, 25558.0, 25612.0, 25694.0,
              25908.0, 25920.0, 26175.0],
             dtype='float64', name='Products.id', length=1710)

Create a sparse matrix in which shows which customers bought which items

In [94]:
df_final = pd.DataFrame(columns = df['Customers.id'].unique(), index = df['Products.id'].unique())
In [95]:
df_final.head()
Out[95]:
797 3 4 5 7 8 10 11 12 13 ... 3724 3725 3726 3728 3729 3730 3732 3733 3735 3736
2310.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
177.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
983.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
991.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 3054 columns

In [96]:
df_final.fillna(0, inplace = True)
In [97]:
df_final.head()
Out[97]:
797 3 4 5 7 8 10 11 12 13 ... 3724 3725 3726 3728 3729 3730 3732 3733 3735 3736
2310.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
177.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
983.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
991.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 3054 columns

In [98]:
df_list.loc[1.0]
Out[98]:
[3]
In [99]:
df_final.index
Out[99]:
Float64Index([ 2310.0,   177.0,     1.0,   983.0,   991.0,     nan,  1379.0,
                815.0,   795.0,  1430.0,
              ...
                251.0,   350.0, 13104.0,  5972.0,  5985.0, 14974.0, 16660.0,
              15649.0,  2707.0,   562.0],
             dtype='float64', length=1711)
In [100]:
df_final.drop(float('nan'), inplace = True)
In [101]:
for i in df_final.index:
    for j in df_list.loc[i]:
        df_final.loc[i][j]=1
In [102]:
df_final.head()
Out[102]:
797 3 4 5 7 8 10 11 12 13 ... 3724 3725 3726 3728 3729 3730 3732 3733 3735 3736
2310.0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
177.0 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1.0 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
983.0 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
991.0 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 3054 columns

In [103]:
df_list.loc[1846]
Out[103]:
[699,
 699,
 699,
 699,
 699,
 1305,
 1305,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 2239]
In [104]:
df_list.loc[2310]
Out[104]:
[797, 2072]

Let's find similarity to item 1846

In [105]:
df_final.loc[2310,699]
Out[105]:
0
In [106]:
df_final.index
Out[106]:
Float64Index([ 2310.0,   177.0,     1.0,   983.0,   991.0,  1379.0,   815.0,
                795.0,  1430.0,  1385.0,
              ...
                251.0,   350.0, 13104.0,  5972.0,  5985.0, 14974.0, 16660.0,
              15649.0,  2707.0,   562.0],
             dtype='float64', length=1710)
In [122]:
import numpy as np

no_buy = []
for i in df_final.columns:
    if list(df_final.loc[:,i])==list(np.zeros(1710)):
        no_buy.append(i)
        
df_final = df_final.drop(no_buy, axis = 1)
In [127]:
no_sell = []
for i in df_final.index:
    if list(df_final.loc[i])==list(np.zeros(2964)):
        no_sell.append(i)
In [128]:
no_sell
Out[128]:
[]
In [126]:
df_final.shape
Out[126]:
(1710, 2964)

Look at how many customers each product has had in total

In [199]:
num_cust = df_final.sum(axis = 1)
In [200]:
num_cust
Out[200]:
2310.0      2
177.0       1
1.0         1
983.0       4
991.0       1
1379.0      4
815.0       1
795.0       4
1430.0      2
1385.0      1
910.0      67
566.0       2
1274.0      1
1272.0      2
1411.0     24
1389.0      2
1384.0      4
65.0        1
1296.0      1
1614.0      6
1590.0      2
1022.0      7
100.0       5
439.0       1
2259.0      2
1104.0      2
303.0       3
1867.0     53
1617.0      3
1245.0      7
           ..
16643.0     1
16683.0     1
17848.0     1
14297.0     1
14298.0     1
15109.0     1
15376.0     1
13312.0     1
872.0       1
1500.0      1
15653.0     1
25107.0     1
15621.0     1
3862.0      1
1732.0      1
895.0       1
2013.0      1
7537.0      1
14296.0     1
16658.0     1
251.0       1
350.0       1
13104.0     1
5972.0      1
5985.0      1
14974.0     1
16660.0     1
15649.0     1
2707.0      1
562.0       1
Length: 1710, dtype: int64
In [201]:
num_cust.loc[1846]
Out[201]:
4
In [202]:
num_cust[num_cust==num_cust.max()]
Out[202]:
1842.0    81
dtype: int64

Create a series for a standardized series of the number of customers

In [ ]:
m = np.mean(num_cust)
s = np.std(num_cust)
In [204]:
stan_num_cust = num_cust.apply(lambda x: ((x-m)/s))
In [205]:
stan_num_cust = stan_num_cust*-1
In [211]:
stan_num_cust.sort_values()
Out[211]:
1842.0    -18.770578
911.0     -15.439387
910.0     -15.439387
1867.0    -12.108196
858.0     -11.632312
1862.0    -10.204659
2107.0     -6.635526
837.0      -5.445815
1411.0     -5.207873
2574.0     -4.731988
782.0      -4.731988
3018.0     -4.018162
1031.0     -3.780220
2328.0     -3.304335
17051.0    -3.066393
1841.0     -3.066393
993.0      -3.066393
416.0      -2.828451
857.0      -2.828451
909.0      -2.828451
2046.0     -2.828451
432.0      -2.828451
1414.0     -2.590509
1027.0     -2.590509
3314.0     -2.590509
4194.0     -2.352566
400.0      -2.352566
970.0      -2.114624
21950.0    -2.114624
13475.0    -2.114624
             ...    
5182.0      0.264798
5171.0      0.264798
7606.0      0.264798
7583.0      0.264798
734.0       0.264798
1610.0      0.264798
4678.0      0.264798
18327.0     0.264798
8262.0      0.264798
83.0        0.264798
10142.0     0.264798
10345.0     0.264798
13779.0     0.264798
17179.0     0.264798
2220.0      0.264798
4296.0      0.264798
4295.0      0.264798
4290.0      0.264798
4292.0      0.264798
17066.0     0.264798
4591.0      0.264798
9487.0      0.264798
9488.0      0.264798
9502.0      0.264798
9499.0      0.264798
9501.0      0.264798
9500.0      0.264798
16699.0     0.264798
1323.0      0.264798
562.0       0.264798
Length: 1710, dtype: float64
In [206]:
stan_num_cust.min()
Out[206]:
-18.77057752705752

Create the recommendar system by creating a function which takes a product as a parameter and outputs the most similar products based on spatial distance and popularity

In [213]:
#recommender system by customer
#the smaller the number the better
def similarity(bought,other):
    spa = spatial.distance.cosine(list(df_final.loc[bought]),list(df_final.loc[other]))
    pop = stan_num_cust.loc[other]/100
    return(spa+pop)
    
In [214]:
similarity(1846,2310)
Out[214]:
1.0002685546352197
In [215]:
def rec_list(bought):
    rec = pd.DataFrame(columns = ['rec'],index = df_final.index)
    for i in rec.index:
        rec.loc[i,'rec']=similarity(bought,i)
    return(rec.sort_values(by = 'rec',ascending = True))

Test the remmender system on product 1846.

In [216]:
rec_list(1846)
Out[216]:
rec
1846.0 -0.00449029
1825.0 0.502648
1842.0 0.812294
910.0 0.845606
911.0 0.845606
1867.0 0.878918
858.0 0.883677
1862.0 0.897953
2107.0 0.933645
837.0 0.945542
1411.0 0.947921
782.0 0.95268
2574.0 0.95268
3018.0 0.959818
1031.0 0.962198
2328.0 0.966957
993.0 0.969336
17051.0 0.969336
1841.0 0.969336
909.0 0.971715
432.0 0.971715
857.0 0.971715
416.0 0.971715
2046.0 0.971715
1414.0 0.974095
3314.0 0.974095
1027.0 0.974095
4194.0 0.976474
400.0 0.976474
13475.0 0.978854
... ...
3854.0 1.00265
5182.0 1.00265
5171.0 1.00265
7606.0 1.00265
7583.0 1.00265
734.0 1.00265
1610.0 1.00265
4678.0 1.00265
18327.0 1.00265
10345.0 1.00265
3226.0 1.00265
1323.0 1.00265
16699.0 1.00265
15561.0 1.00265
17179.0 1.00265
2220.0 1.00265
4296.0 1.00265
4295.0 1.00265
4290.0 1.00265
4292.0 1.00265
17066.0 1.00265
4591.0 1.00265
9487.0 1.00265
9488.0 1.00265
9502.0 1.00265
9499.0 1.00265
9501.0 1.00265
9500.0 1.00265
13779.0 1.00265
562.0 1.00265

1710 rows × 1 columns