# -*- coding: utf-8 -*-
"""USL_Project_LearnerNotebook_FullCode+%282%29.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1fiFSNvhEChPaXsvtrfFivpq85-xTBntQ
# Unsupervised Learning: Trade&Ahead
**Marks: 60**
### Context
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, create wealth, and provide some tax benefits. Steady returns on investments compounded over a long period can grow far more than might seem possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can build for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns at lower risk by tempering potential losses when the market is down. It is easy to get lost in a sea of financial metrics when assessing the worth of a single stock, and repeating that analysis across a multitude of stocks to identify the right picks can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics as well as ones that exhibit minimal correlation. This helps investors analyze stocks across different market segments and protects against risks that could make the portfolio vulnerable to losses.
### Objective
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
### Data Dictionary
- Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
- Company: Name of the company
- GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
- GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
- Current Price: Current stock price in dollars
- Price Change: Percentage change in the stock price in 13 weeks
- Volatility: Standard deviation of the stock price over the past 13 weeks
- ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
- Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
- Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
- Net Income: Revenues minus expenses, interest, and taxes (in dollars)
- Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
- Estimated Shares Outstanding: Company's stock currently held by all its shareholders
- P/E Ratio: Ratio of the company's current stock price to the earnings per share
- P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
## Importing necessary libraries and data
"""
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
"""## Load dataset"""
#Connecting Google drive with Google colab
# Reading the data-set into Google colab
from google.colab import drive
drive.mount('/content/drive')
#Reading the "stock_data.csv" dataset into a dataframe (i.e.loading the data)
path="/content/drive/My Drive/stock_data.csv"
data = pd.read_csv(path)
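"""If this script is run outside Colab, the Drive mount above will not work; a minimal fallback (a sketch assuming a local copy of stock_data.csv is available in the working directory) is to read the file locally instead."""
import os
if not os.path.exists(path):  # the Drive path will not exist outside Colab
    data = pd.read_csv("stock_data.csv")  # assumed local copy of the same dataset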
"""## Data Overview
- Observations
- Sanity checks
- These can be achieved by doing the following:
1. Viewing the first and last few rows of the dataset
2. Checking the shape of the dataset
3. Identifying the data types to ensure that the data is stored in the correct format.
4. Getting the statistical summary for the variables.
5. Checking for missing values.
6. Checking for duplicates
### Viewing the first and last 5 rows of the dataset
"""
# returning the first 5 rows using the dataframe head method
data.head()
# returning the last 5 rows using dataframe tail method
data.tail()
"""#### Checking the shape of the dataset"""
#checking shape of the dataframe to find out the number of rows and columns using the dataframe shape command
print("There are", data.shape[0], 'rows and', data.shape[1], "columns.")
"""### Checking the columns data types for the dataset"""
# Using the dataframe info() method to print a concise summary of the DataFrame
data.info()
"""**Observation**
* The dataset contains 15 columns: four are of the object datatype (Ticker Symbol, Security, GICS Sector, and GICS Sub Industry), seven are of the float datatype (Current Price, Price Change, Volatility, Earnings Per Share, Estimated Shares Outstanding, P/E Ratio, and P/B Ratio), and four are of the integer datatype (ROE, Cash Ratio, Net Cash Flow, and Net Income).
* All of these dtypes are appropriate for their respective columns.
* Total memory usage is approximately 40.0+ KB.
### Creating a copy of original data
"""
# copying the data to another variable to avoid any changes to original data
df = data.copy()
"""### Getting the statistical summary for the variables."""
# checking the statistical summary of the data using describe command and transposing.
df.describe(include='all').T
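"""The observation below notes that gaps between the mean and median point to skewness; a quick numeric check (a sketch using pandas' built-in skew) makes this explicit."""
# skewness of each numeric column (positive values indicate a right skew, negative a left skew)
df.select_dtypes(include=np.number).skew().sort_values(ascending=False)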
"""**Observation**
* There are 340 observations in all.
* Differences between mean and median values indicate skewness in several columns.
* There are 11 GICS Sectors, with the Industrials sector being the most common.
* There are 104 GICS Sub Industries.
* 50% of the current stock prices are below 59.705 dollars, while the maximum current stock price is approximately 1275 dollars.
* 75% of the stocks have a 13-week price change below 10.69%.
* 75% of the stocks have a volatility (standard deviation of the stock price over the past 13 weeks) below 1.6956.
* 75% of the companies have a return on equity (ROE) of less than 27%.
* The median cash ratio is 47.0, indicating that a typical company holds a sizeable cash buffer against its current liabilities.
* 25% of the companies have a net cash flow below roughly -193.9 million dollars, i.e., they are paying out more for obligations and liabilities than they earn through operations.
* 75% of the companies have a net income below approximately 1.9 billion dollars, i.e., what remains after all expenses, including salaries, wages, cost of goods, and taxes.
* 75% of the companies have earnings per share below 4.62 dollars, so for the majority of these stocks an investor would earn no more than 4.62 dollars of profit per share held.
### Checking for missing values
"""
# Checking for missing values
df.isnull().sum()
"""**Observation**
There are no missing values in the dataset.
### Checking unique values
"""
df.nunique()
"""### Checking for duplicate values"""
# checking for duplicate values
df.duplicated().sum()
"""**Observation**
* There are no duplicate values in the dataset
## Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
**Questions**:
1. What does the distribution of stock prices look like?
2. The stocks of which economic sector have seen the maximum price increase on average?
3. How are the different variables correlated with each other?
4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
### Univariate analysis
"""
# creating a combined histogram and boxplot for a column using a helper function
def histobox_plot(df, column, figsize=(15, 10), kde=False, bins=None):
#plt.figure(figsize = (20,10))
# set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
sns.set(style="darkgrid")
# creating a figure composed of two matplotlib.Axes objects (ax_box and ax_hist)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)},figsize=figsize,)
# assigning a graph to each ax
    sns.boxplot(data=df, x=column, ax=ax_box, showmeans=True, color="violet")
    sns.histplot(data=df, x=column, ax=ax_hist)
    ax_hist.axvline(
        df[column].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist.axvline(
        df[column].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# Remove x axis name for the boxplot
ax_box.set(xlabel='')
for p in ax_hist.patches:
height = p.get_height() # get the height of each bar
# adding text to each bar
ax_hist.text(x = p.get_x()+(p.get_width()/2), # x-coordinate position of data label, padded to be in the middle of the bar
y = height+0.2, # y-coordinate position of data label, padded 0.2 above bar
s = '{:.0f}'.format(height), # data label, formatted to ignore decimals
ha = 'center') # sets horizontal alignment (ha) to center
"""**`Current Price`**"""
histobox_plot(df, 'Current Price')
"""**Observation**
* The average current stock price of the companies under study is 80.86 dollars, higher than the median of 59.705 dollars, indicating that the distribution is right-skewed; 49 of the 340 stocks are priced above twice the median value of all stocks.
* As expected, no stock is listed at less than 0 dollars.
* There are outliers.
**`Price Change`**
"""
histobox_plot(df, 'Price Change')
"""**Observation**
* The average price change of the companies under study is 4.078%, which is less than the median of 4.819%, indicating that the distribution is left-skewed.
* While most price changes cluster near the center, the distribution has extended tails on both the positive and negative sides: among the most extreme stocks, 13-week price changes range from a significant 47% decrease to a substantial 55% increase.
* There are outliers
**`Volatility`**
"""
histobox_plot(df, 'Volatility')
"""**Observation**
* The mean volatility of the companies under study is 1.526, which is higher than the median of 1.386, indicating that the distribution is right-skewed.
* There are outliers
**`ROE`**
"""
histobox_plot(df, 'ROE')
"""**Observation**
* The mean Return on Equity (ROE) of the companies under study is 39.60%, which is higher than the median of 15.0%, indicating that the distribution is right-skewed.
* No stock is listed with an ROE below 0. For instance, 24 stocks have an ROE below 5%, while 25 stocks have an ROE exceeding 100%.
* There are outliers
**`Cash Ratio`**
"""
histobox_plot(df, 'Cash Ratio')
"""**Observation**
* The mean cash ratio of the companies under study is 70.02, which is higher than the median of 47, indicating that the distribution is right-skewed.
* There are outliers
**`Net Cash Flow`**
"""
histobox_plot(df, 'Net Cash Flow')
"""**Observation**
* The mean net cash flow of the companies under study is 55,537,620.59 dollars, which is higher than the median of 2,098,000 dollars, indicating that the distribution is right-skewed.
* There are outliers
**`Net Income`**
"""
histobox_plot(df, 'Net Income')
"""**Observation**
* The mean net income of the companies under study is 1,494,384,602.94 dollars, which is higher than the median of 707,336,000 dollars, indicating that the distribution is right-skewed.
* There are outliers
**`Earnings Per Share`**
"""
histobox_plot(df, 'Earnings Per Share')
"""**Observation**
* The mean earnings per share of the companies under study is 2.77 dollars, which is less than the median of 2.90 dollars, indicating that the distribution is left-skewed.
* There are outliers
**`Estimated Shares Outstanding`**
"""
histobox_plot(df, 'Estimated Shares Outstanding')
"""**Observation**
* The mean estimated shares outstanding of the companies under study is 577,028,337.75 shares, which is higher than the median of 309,675,137.8 shares, indicating that the distribution is right-skewed.
* There are outliers
**`P/E Ratio`**
"""
histobox_plot(df, 'P/E Ratio')
"""**Observation**
* The mean P/E ratio of the companies under study is 32.6, which is greater than the median of 20.82, indicating that the distribution is right-skewed.
* There are outliers
**`P/B Ratio`**
"""
histobox_plot(df, 'P/B Ratio')
"""**Observation**
* The mean P/B ratio of the companies under study is -1.718, which is less than the median of -1.067, indicating that the distribution is left-skewed.
* The distribution of P/B ratios is mostly centered around 0 but has long positive and negative tails.
- For example, 175 of the 340 stocks fall below the 25th percentile or above the 75th percentile.
- Additionally, 31 of the stocks are outliers.
"""
# function to create labeled barplots
def labeled_barplot(df, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df[feature]) # length of the column
count = df[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=df,
x=feature,
palette="Paired",
order=df[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
"""**`GICS Sector`**"""
labeled_barplot(df, 'GICS Sector', perc=True)
"""**`GICS Sub Industry`**"""
labeled_barplot(df, 'GICS Sub Industry', perc=True)
"""### Bivariate Analysis"""
# Using a heatmap to check correlation between the numeric columns
plt.figure(figsize=(15, 7))
sns.heatmap(
df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
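"""To complement the heatmap, the strongest pairwise correlations can be listed programmatically (a small sketch over the numeric columns only)."""
# ranking the strongest absolute pairwise correlations shown in the heatmap above
corr = df.select_dtypes(include=np.number).corr().abs()
# keep only the upper triangle so each pair appears once and self-correlations are excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.unstack().dropna().sort_values(ascending=False).head(10))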
"""**Observation**
Some variables exhibit moderate correlations (around +/- 0.40) with one another.
* Volatility is negatively correlated with price change, implying that stocks whose volatility increased tended to see their prices fall.
* Net income is negatively correlated with volatility, suggesting that higher net income is associated with lower price volatility.
* Net income is positively correlated with earnings per share (EPS) and estimated shares outstanding.
* EPS is positively correlated with the current price, indicating that companies with higher EPS tend to have higher stock prices.
* EPS is inversely correlated with return on equity (ROE), so companies with very high ROE do not necessarily report high earnings per share.
**The stocks of which economic sector have seen the maximum price increase on average?**
"""
df.groupby('GICS Sector')['Price Change'].mean().sort_values()
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
for i in ax.containers:
ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
"""**Observation**
In the preceding period, stocks within the healthcare sector have demonstrated the most notable average price escalation.
**Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?**
"""
df.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)
plt.figure(figsize=(20,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
for i in ax.containers:
ax.bar_label(i,)
plt.show()
"""**Observation**
* The IT and Telecommunications sectors, characterized as relatively newer and less regulated industries, exhibit notably elevated average cash ratios compared to their peer sectors.
* In contrast, the Utilities sector, which operates within a highly regulated environment, records the most modest average cash ratios among all sectors.
**P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.**
"""
df.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
for i in ax.containers:
ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
"""**Observation**
* Energy enterprises maintain a substantial lead in average P/E ratios across all sectors, while telecommunications companies feature the lowest average P/E ratios.
**Volatility is responsible for the oscillations observed in stock prices. Stocks characterized by high volatility tend to experience more pronounced fluctuations, thereby increasing their investment risk. Now, let's examine the average variations in volatility within different economic sectors.**
"""
df.groupby('GICS Sector')['Volatility'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
for i in ax.containers:
ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
"""**Observation**
* Among the sectors, the Energy sector exhibits the highest level of volatility, closely trailed by the Materials sector. In contrast, the Utilities sector demonstrates the lowest level of volatility.
**The Earnings Per Share (EPS) metric plays a significant role in influencing stock investments. EPS represents the portion of a company's profit allocated to each outstanding share of its common stock. It serves as a crucial indicator of a company's financial health and performance, impacting investor decisions. Now, let's examine the average variations in Earnings Per Share within different economic sectors.**
"""
df.groupby('GICS Sector')['Earnings Per Share'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Earnings Per Share', ci=False)
for i in ax.containers:
ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()
"""**Observation**
* The Health Care sector has the highest average earnings per share, at about 4.54 dollars per share, so investments in this sector are associated with the largest per-share profits on average.
## Data Preprocessing
- Duplicate value check
- Missing value treatment
- Outlier check
- Feature engineering (if needed)
- Any other preprocessing steps (if needed)
### Outlier Check
- Plotting the boxplots of all numerical columns to check for outliers.
"""
plt.figure(figsize=(15, 12))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
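"""As a quick complement to the boxplots, the number of outliers per column can be counted with the same 1.5 * IQR rule that the boxplot whiskers use (a sketch)."""
# counting values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR for each numeric column
Q1 = df[numeric_columns].quantile(0.25)
Q3 = df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df[numeric_columns] < Q1 - 1.5 * IQR) | (df[numeric_columns] > Q3 + 1.5 * IQR)).sum()
print(outliers.sort_values(ascending=False))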
"""**Observation**
* Each of these variables contains at least a few outliers; nevertheless, these values do not seem implausible given the inherent nature of stock prices and historical expectations.
### Scaling
- Prior to engaging in the clustering process, it is imperative to preprocess and scale the data. This step ensures that each variable exerts a comparable influence on the clustering outcome, thus preventing any single variable from overshadowing others. By scaling the data, we create a balanced and equitable platform for the clustering analysis to unfold effectively.
"""
num_col = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio','Net Cash Flow','Net Income','Earnings Per Share','Estimated Shares Outstanding','P/E Ratio','P/B Ratio']
num_col
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
#create pairplot for scaled dataframe
sns.pairplot(subset_scaled_df,diag_kind='kde')
subset_scaled_df.head()
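"""A quick sanity check on the scaling: after standardization each column should have a mean of roughly 0 and a standard deviation of roughly 1."""
# verifying that the z-score scaling behaved as expected
subset_scaled_df.describe().loc[["mean", "std"]].round(3)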
"""## EDA
- It is a good idea to explore the data once again after manipulating it.
"""
"""## K-means Clustering
### Checking Elbow Plot
"""
k_means_df = subset_scaled_df.copy()
#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(subset_scaled_df)
prediction = model.predict(k_means_df)
distortion = (
sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
/ k_means_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
# fit a KMeans model and use the visualizer to indicate the optimal K value
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)
visualizer.show()
"""### Checking the silhouette scores"""
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((subset_scaled_df))
score = silhouette_score(k_means_df, preds)
sil_score.append(score)
print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)
visualizer.show()
# finding the optimal no. of clusters with silhouette coefficients,
# visualizing silhouette plots for each candidate number of clusters (from 14 down to 2)
for n_clusters in range(14, 1, -1):
    fig, ax = plt.subplots()
    visualizer = SilhouetteVisualizer(KMeans(n_clusters, random_state=1), ax=ax)
    visualizer.fit(k_means_df)
    visualizer.show()
"""**Observation**
* The value of K was chosen based on the elbow method and the average distortion, both of which point to 6 as the optimal number of clusters. Therefore, 6 was chosen as K for the k-means clustering.
* The silhouette analysis did not point to a clear optimal K: for no small K do all clusters extend past the average silhouette score, and although values of K greater than 10 do, they produce too many clusters to be practical.
"""
# final K-means model
kmeans = KMeans(n_clusters=6, random_state=1)
kmeans.fit(k_means_df)
# creating a copy of the original data
df1 = df.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_
"""### Cluster Profiling"""
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only=True)  # profile clusters on numeric columns
km_cluster_profile["count_in_each_segment"] = (
df1.groupby("KM_segments")["Security"].count().values
)
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
# printing the companies present in each cluster
for cl in df1["KM_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df1[df1["KM_segments"] == cl]["Security"].unique())
print()
#print number of stocks within each sector for all of the clusters
df1.groupby(["KM_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df1, x="KM_segments", y=variable)
plt.tight_layout(pad=2.0)
"""### KMeans Clusters
**Cluster 0**
- 276 stocks, comprising companies from the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Real Estate, Telecommunications Services, and Utilities sectors
- Companies within this cluster have:
- Low Volatility
- Low net income
- low cashflow
- low cash ratio
- low P/E Ratio
**Cluster 1**
- 3 stocks, comprised mostly of stocks within the Consumer Discretionary, Health Care, and Information Technology sectors
- Companies within this cluster have:
- low Volatility
- Highest current Price
- High price change
- lowest Return on Equity
- Low Net income
- low Cash Flow
**Cluster 2**
- 16 stocks from the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology, Real Estate, and Telecommunications Services sectors.
- Companies within this cluster have:
- Moderate Volatility
- Highest current Price
- Low price change
- low Return on Equity
- Low Net income
- low Cash Flow
**Cluster 3**
- 3 stocks, all from the Energy sector
- Companies within this cluster have:
- Highest Volatility
- low current Price
- Low price change
- Highest Return on Equity
- Lowest Net income
- low Cash Flow
**Cluster 4**
- 11 stocks, comprised of stocks from the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology, and Telecommunications Services sectors.
- Companies within this cluster have:
- low Volatility
- low current Price
- Moderate price change
- Low Return on Equity
- Highest Net income
- Cash flow ranging from low to high
**Cluster 5**
- 28 stocks, comprised of stocks from the Energy, Health Care, Industrials, Information Technology, and Materials sectors.
- Companies within this cluster have:
- High Volatility
- low current Price
- Low and negative price change
- Low Return on Equity
- low Net income
- low Cash Flow
## Hierarchical Clustering
### Computing Cophenetic Correlation
"""
hc_df = subset_scaled_df.copy()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
"""**Let's explore different linkage methods with Euclidean distance only.**"""
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hc_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
"""**Let's view the dendrograms for the different linkage methods with Euclidean distance.**
### Checking Dendrograms
"""
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(hc_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
compare.append([method, coph_corr])
"""**Observation**
- The highest cophenetic correlation is observed for the average and centroid linkage methods, yet the dendrogram associated with the average linkage method seems to yield more coherent clusters.
- The average linkage method suggests that the optimal number of clusters is 5.
"""
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
"""### Creating model using sklearn"""
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
# creating a copy of the original data
df2 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
"""### Cluster Profiling"""
hc_cluster_profile = df2.groupby("HC_segments").mean(numeric_only=True)  # profile clusters on numeric columns
hc_cluster_profile["count_in_each_segment"] = (
df2.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
# printing the companies present in each cluster
for cl in df2["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df2[df2["HC_segments"] == cl]["Security"].unique())
print()
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df2, x="HC_segments", y=variable)
plt.tight_layout(pad=2.0)
"""**Let us try using Ward linkage as it has more distinct and separated clusters (as seen from it's dendrogram before). 6 appears to be the appropriate number of clusters from the dendrogram for Ward linkage.**
### Creating Final model using sklearn
"""
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
# creating a copy of the original data
df2 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
"""### Cluster Profiling"""
hc_cluster_profile = df2.groupby("HC_segments").mean(numeric_only=True)  # profile clusters on numeric columns
hc_cluster_profile["count_in_each_segment"] = (
df2.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
# printing the companies present in each cluster
for cl in df2["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df2[df2["HC_segments"] == cl]["Security"].unique())
print()
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df2, x="HC_segments", y=variable)
plt.tight_layout(pad=2.0)
"""### Hierarchical Clusters
**Cluster 0**
- 11 stocks, comprised mostly of stocks within the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology, and Telecommunications Services sectors.
- Companies within this cluster have:
- Low Volatility
- Very low current Price
- Low Return on Equity
- Highest Net income
- Highest Cash Flow
- Highest Estimated Outstanding Shares
**Cluster 1**
- 7 stocks, comprised mostly of stocks within the Consumer Discretionary, Consumer Staples, Energy, Financials, and Industrials sectors
- Companies within this cluster have:
- Moderate Volatility
- Low current Price
- Highest Return on Equity
- Low Net income
- Moderate Cash Flow
- Low Estimated Outstanding Shares
**Cluster 2**
- 12 stocks from the Consumer Discretionary, Consumer Staples, Health Care, Information Technology, Real Estate, and Telecommunications Services sectors.
- Companies within this cluster have:
- Moderate Volatility
- Highest current Price
- Very low Return on Equity
- Low Net income
- low Cash Flow
**Cluster 3**
- 285 stocks, the vast majority of the dataset, spanning the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Real Estate, Telecommunications Services, and Utilities sectors.
- Companies within this cluster have:
- low Volatility
- low current Price
- Very low Return on Equity