The goal of this project is to create a recommender system for a company that sells medical products.

Import packages and read in the data file

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('PBL 5 Recommendation Data.csv',encoding='latin-1')

C:\User_Files\Lucy_Wan\Programming\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (20,33,73,106,158) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Analyze the dataset

df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4194 entries, 0 to 4193
Columns: 181 entries, Customers.id to Products.freight_cost
dtypes: float64(98), int64(10), object(73)
memory usage: 5.8+ MB

df['Order_Items.product_id'].nunique()

1753

Create a dataframe which sorts the dataframe by the product

pop = df.groupby('Order_Items.product_id').sum()

pop.head()

pop['Orders.subtotal'].max()

13781.119999999997

pop[pop['Orders.subtotal']==pop['Orders.subtotal'].max()]

pop['Order_Items.qty'].max()

396

See which product sold the most quantity of items

pop[pop['Order_Items.qty']==pop['Order_Items.qty'].max()]

See which company made the most

comp = df.groupby('Orders.company').sum()

comp[comp['Orders.subtotal']==comp['Orders.subtotal'].max()]

Make a list which shows us which products were bought by which customers

df_list = df.groupby('Products.id')['Customers.id'].apply(list)

df_list.head()

Products.id
1.0                  [3]
19.0          [620, 957]
20.0        [1175, 2722]
22.0              [2215]
30.0    [330, 371, 2454]
Name: Customers.id, dtype: object

df_list.index

Float64Index([    1.0,    19.0,    20.0,    22.0,    30.0,    35.0,    62.0,
                 64.0,    65.0,    74.0,
              ...
              25170.0, 25269.0, 25356.0, 25527.0, 25558.0, 25612.0, 25694.0,
              25908.0, 25920.0, 26175.0],
             dtype='float64', name='Products.id', length=1710)

Create a sparse matrix in which shows which customers bought which items

df_final = pd.DataFrame(columns = df['Customers.id'].unique(), index = df['Products.id'].unique())

df_final.head()

df_final.fillna(0, inplace = True)

df_final.head()

df_list.loc[1.0]

[3]

df_final.index

Float64Index([ 2310.0,   177.0,     1.0,   983.0,   991.0,     nan,  1379.0,
                815.0,   795.0,  1430.0,
              ...
                251.0,   350.0, 13104.0,  5972.0,  5985.0, 14974.0, 16660.0,
              15649.0,  2707.0,   562.0],
             dtype='float64', length=1711)

df_final.drop(float('nan'), inplace = True)

for i in df_final.index:
    for j in df_list.loc[i]:
        df_final.loc[i][j]=1

df_final.head()

df_list.loc[1846]

[699,
 699,
 699,
 699,
 699,
 1305,
 1305,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 1352,
 2239]

df_list.loc[2310]

[797, 2072]

Let's find similarity to item 1846

df_final.loc[2310,699]

0

df_final.index

Float64Index([ 2310.0,   177.0,     1.0,   983.0,   991.0,  1379.0,   815.0,
                795.0,  1430.0,  1385.0,
              ...
                251.0,   350.0, 13104.0,  5972.0,  5985.0, 14974.0, 16660.0,
              15649.0,  2707.0,   562.0],
             dtype='float64', length=1710)

import numpy as np

no_buy = []
for i in df_final.columns:
    if list(df_final.loc[:,i])==list(np.zeros(1710)):
        no_buy.append(i)
        
df_final = df_final.drop(no_buy, axis = 1)

no_sell = []
for i in df_final.index:
    if list(df_final.loc[i])==list(np.zeros(2964)):
        no_sell.append(i)

no_sell

[]

df_final.shape

(1710, 2964)

Look at how many customers each product has had in total

num_cust = df_final.sum(axis = 1)

num_cust

2310.0      2
177.0       1
1.0         1
983.0       4
991.0       1
1379.0      4
815.0       1
795.0       4
1430.0      2
1385.0      1
910.0      67
566.0       2
1274.0      1
1272.0      2
1411.0     24
1389.0      2
1384.0      4
65.0        1
1296.0      1
1614.0      6
1590.0      2
1022.0      7
100.0       5
439.0       1
2259.0      2
1104.0      2
303.0       3
1867.0     53
1617.0      3
1245.0      7
           ..
16643.0     1
16683.0     1
17848.0     1
14297.0     1
14298.0     1
15109.0     1
15376.0     1
13312.0     1
872.0       1
1500.0      1
15653.0     1
25107.0     1
15621.0     1
3862.0      1
1732.0      1
895.0       1
2013.0      1
7537.0      1
14296.0     1
16658.0     1
251.0       1
350.0       1
13104.0     1
5972.0      1
5985.0      1
14974.0     1
16660.0     1
15649.0     1
2707.0      1
562.0       1
Length: 1710, dtype: int64

num_cust.loc[1846]

4

num_cust[num_cust==num_cust.max()]

1842.0    81
dtype: int64

Create a series for a standardized series of the number of customers

m = np.mean(num_cust)
s = np.std(num_cust)

stan_num_cust = num_cust.apply(lambda x: ((x-m)/s))

stan_num_cust = stan_num_cust*-1

stan_num_cust.sort_values()

1842.0    -18.770578
911.0     -15.439387
910.0     -15.439387
1867.0    -12.108196
858.0     -11.632312
1862.0    -10.204659
2107.0     -6.635526
837.0      -5.445815
1411.0     -5.207873
2574.0     -4.731988
782.0      -4.731988
3018.0     -4.018162
1031.0     -3.780220
2328.0     -3.304335
17051.0    -3.066393
1841.0     -3.066393
993.0      -3.066393
416.0      -2.828451
857.0      -2.828451
909.0      -2.828451
2046.0     -2.828451
432.0      -2.828451
1414.0     -2.590509
1027.0     -2.590509
3314.0     -2.590509
4194.0     -2.352566
400.0      -2.352566
970.0      -2.114624
21950.0    -2.114624
13475.0    -2.114624
             ...    
5182.0      0.264798
5171.0      0.264798
7606.0      0.264798
7583.0      0.264798
734.0       0.264798
1610.0      0.264798
4678.0      0.264798
18327.0     0.264798
8262.0      0.264798
83.0        0.264798
10142.0     0.264798
10345.0     0.264798
13779.0     0.264798
17179.0     0.264798
2220.0      0.264798
4296.0      0.264798
4295.0      0.264798
4290.0      0.264798
4292.0      0.264798
17066.0     0.264798
4591.0      0.264798
9487.0      0.264798
9488.0      0.264798
9502.0      0.264798
9499.0      0.264798
9501.0      0.264798
9500.0      0.264798
16699.0     0.264798
1323.0      0.264798
562.0       0.264798
Length: 1710, dtype: float64

stan_num_cust.min()

-18.77057752705752

Create the recommendar system by creating a function which takes a product as a parameter and outputs the most similar products based on spatial distance and popularity

#recommender system by customer
#the smaller the number the better
def similarity(bought,other):
    spa = spatial.distance.cosine(list(df_final.loc[bought]),list(df_final.loc[other]))
    pop = stan_num_cust.loc[other]/100
    return(spa+pop)

similarity(1846,2310)

1.0002685546352197

def rec_list(bought):
    rec = pd.DataFrame(columns = ['rec'],index = df_final.index)
    for i in rec.index:
        rec.loc[i,'rec']=similarity(bought,i)
    return(rec.sort_values(by = 'rec',ascending = True))

Test the remmender system on product 1846.

rec_list(1846)

	Customers.id	Customers.fname	Customers.lname	Customers.company	Customers.create_date	Customers.status	Customers.mailing	Customers.reminders	Customers.tax_exempt	Customers.account_id	...	Products.google_shopping_label	Products.product_option	Products.size	Products.material	Products.arm_style	Products.leg_style	Products.seat_size	Products.family_id	Products.saved_status	Products.freight_cost
0	797	Christy	Dill	Company0	1426018724	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	PF61071	0.0	NaN
1	3	John	Smith	Company1	1386089139	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	PF02132	NaN	NaN
2	3	John	Smith	Company1	1386089139	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	2 x Extra large	Nitrile	NaN	NaN	NaN	PF00342	0.0	NaN
3	4	James	Anderson	NaN	1386780263	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	PF04970	NaN	NaN
4	5	Abraham	Pollak	Company3	1386861599	0.0	0.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	PF03045	NaN	NaN

	Customers.id	Customers.create_date	Customers.status	Customers.mailing	Customers.reminders	Customers.tax_exempt	Customers.account_id	Customers.sales_rep	Customers.rewards	Customers.profile_id	...	Products.hygienic	Products.default_quantity	Products.shipping_length	Products.shipping_width	Products.shipping_height	Products.shipping_weight	Products.google_shopping_label	Products.leg_style	Products.saved_status	Products.freight_cost
Order_Items.product_id
1846.0	21864	25825138924	0.0	16.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	387.0	270.0	180.0	0.0	0.0	0.0	0.0	0.0

	Customers.id	Customers.create_date	Customers.status	Customers.mailing	Customers.reminders	Customers.tax_exempt	Customers.account_id	Customers.sales_rep	Customers.rewards	Customers.profile_id	...	Products.hygienic	Products.default_quantity	Products.shipping_length	Products.shipping_width	Products.shipping_height	Products.shipping_weight	Products.google_shopping_label	Products.leg_style	Products.saved_status	Products.freight_cost
Order_Items.product_id
1846.0	21864	25825138924	0.0	16.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	387.0	270.0	180.0	0.0	0.0	0.0	0.0	0.0

	Customers.id	Customers.create_date	Customers.status	Customers.mailing	Customers.reminders	Customers.tax_exempt	Customers.account_id	Customers.sales_rep	Customers.rewards	Customers.profile_id	...	Products.hygienic	Products.default_quantity	Products.shipping_length	Products.shipping_width	Products.shipping_height	Products.shipping_weight	Products.google_shopping_label	Products.leg_style	Products.saved_status	Products.freight_cost
Orders.company
Company59	17015	21497767740	0.0	15.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	322.5	225.0	150.0	0.0	0.0	0.0	0.0	0.0

	797	3	4	5	7	8	10	11	12	13	...	3724	3725	3726	3728	3729	3730	3732	3733	3735	3736
2310.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
177.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
983.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
991.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	Customers.id	Customers.create_date	Customers.status	Customers.mailing	Customers.reminders	Customers.tax_exempt	Customers.account_id	Customers.sales_rep	Customers.rewards	Customers.profile_id	...	Products.hygienic	Products.default_quantity	Products.shipping_length	Products.shipping_width	Products.shipping_height	Products.shipping_weight	Products.google_shopping_label	Products.leg_style	Products.saved_status	Products.freight_cost
Order_Items.product_id
1.0	3	1386089139	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	13.23	10.04	10.24	0.0	0.0	0.0	0.0	0.0
11.0	417	1415207733	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.00	0.00	0.00	0.0	0.0	0.0	0.0	0.0
14.0	818	1426432777	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.00	0.00	0.00	0.0	0.0	0.0	0.0	0.0
15.0	1293	1436896396	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.00	0.00	0.00	0.0	0.0	0.0	0.0	0.0
19.0	1577	2851966010	0.0	2.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	4.60	1.20	3.60	0.0	0.0	0.0	0.0	0.0

	rec
1846.0	-0.00449029
1825.0	0.502648
1842.0	0.812294
910.0	0.845606
911.0	0.845606
1867.0	0.878918
858.0	0.883677
1862.0	0.897953
2107.0	0.933645
837.0	0.945542
1411.0	0.947921
782.0	0.95268
2574.0	0.95268
3018.0	0.959818
1031.0	0.962198
2328.0	0.966957
993.0	0.969336
17051.0	0.969336
1841.0	0.969336
909.0	0.971715
432.0	0.971715
857.0	0.971715
416.0	0.971715
2046.0	0.971715
1414.0	0.974095
3314.0	0.974095
1027.0	0.974095
4194.0	0.976474
400.0	0.976474
13475.0	0.978854
...	...
3854.0	1.00265
5182.0	1.00265
5171.0	1.00265
7606.0	1.00265
7583.0	1.00265
734.0	1.00265
1610.0	1.00265
4678.0	1.00265
18327.0	1.00265
10345.0	1.00265
3226.0	1.00265
1323.0	1.00265
16699.0	1.00265
15561.0	1.00265
17179.0	1.00265
2220.0	1.00265
4296.0	1.00265
4295.0	1.00265
4290.0	1.00265
4292.0	1.00265
17066.0	1.00265
4591.0	1.00265
9487.0	1.00265
9488.0	1.00265
9502.0	1.00265
9499.0	1.00265
9501.0	1.00265
9500.0	1.00265
13779.0	1.00265
562.0	1.00265