Trade & Ahead.py

# -*- coding: utf-8 -*-
"""USL_Project_LearnerNotebook_FullCode+%282%29.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1fiFSNvhEChPaXsvtrfFivpq85-xTBntQ

# Unsupervised Learning: Trade&Ahead

**Marks: 60**

### Context

The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.


### Objective

Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

### Data Dictionary

- Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
- Company: Name of the company
- GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
- GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
- Current Price: Current stock price in dollars
- Price Change: Percentage change in the stock price in 13 weeks
- Volatility: Standard deviation of the stock price over the past 13 weeks
- ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
- Cash Ratio: The ratio of a  company's total reserves of cash and cash equivalents to its total current liabilities
- Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
- Net Income: Revenues minus expenses, interest, and taxes (in dollars)
- Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
- Estimated Shares Outstanding: Company's stock currently held by all its shareholders
- P/E Ratio: Ratio of the company's current stock price to the earnings per share
- P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)

## Importing necessary libraries and data
"""

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

"""## Load dataset"""

#Connecting Google drive with Google colab
# Reading the data-set into Google colab
from google.colab import drive
drive.mount('/content/drive')

#Reading the "stock_data.csv" dataset into a dataframe (i.e.loading the data)
path="/content/drive/My Drive/stock_data.csv"
data = pd.read_csv(path)

"""## Data Overview

- Observations
- Sanity checks
-These can be achieved by doing the following;

1.   Viewing the first and last few rows of the dataset
2.   Checking the shape of the dataset
3. Ensuring that the data is stored in the correct format, it's important to identify the data types.
4. Getting the statistical summary for the variables.
5. Checking for missing values.
6. Checking for duplicates

### View the first and last 5 few rows of the dataset
"""

# returning the first 5 rows using the dataframe head method
data.head()

# returning the last 5 rows using dataframe tail method
data.tail()

"""#### Checking the shape of the dataset"""

#checking shape of the dataframe to find out the number of rows and columns using the dataframe shape command
print("There are", data.shape[0], 'rows and', data.shape[1], "columns.")

"""### Checking the columns data types for the dataset"""

# Using the dataframe info() method to print a concise summary of the DataFrame
data.info()

"""**Observation**


*   The dataset contains 15 series (columns) of which four of the series is of the object datatype (Ticker Symbol,Security,GICS Sector and GICS Sub Industry), seven(7) of the series are of the float datatype (Current Price,Price Change, Volatility, Earnings Per Share, Estimated Shares Outstanding,P/E Ratio, and P/B Ratio), while four(4) of the series are of the integer datatype (ROE, Cash Ratio, Net Cash Flow and Net Income).
* All of these dtypes are appropriate for their respective columns

* Total memory usage is approximately 40.0+ KB.

### Creating a copy of original data
"""

# copying the data to another variable to avoid any changes to original data
df = data.copy()

"""### Getting the statistical summary for the variables."""

# checking the statistical summary of the data using describe command and transposing.
df.describe(include='all').T

"""**Observation**

* There are 340  observations present in all

* Differences between mean and median values indicate skewness in the data

*  There are 11 GICS Sectors of which most of the are of the industrial sector

*  There are 104 GICS Sub induustry
*  50% of the current stock prices are below 59.705 dollars, however, the  maximum current stock price is approximately 1275 dollars.
*  75% of the variations or change in price within the selected companies is not below 10.69 dollars.
*  75% of the volatility which is a measure of the standard deviation of the stock price for the past 13 weeks was found to be below 1.6956
*  75% of the companies has its return on equity (ROE) lass than 27%
*  50% of the companies has a favoured cash ratio of 47.0. This means that most of the companies has enough cash or cash equivalent to entirely pay off all their short term debts.
*  25% of the companies has a negative net cash flow of -193906500.0 dollars. This means that the companies are paying more for obligations and liabilities than what it earns through operations.
*  75% of the companies under study has a net income of 1899000000 dollars. This indicates that after all expenses in the companies including salaries, wages, cost of good or materials, and taxes. They seem to have enough to take home.
*  75% of the companies has an Earning per share below 4.62 dollars. This means that majority of the companies if invested upon will not give a profit greater than 4.62 dollars for each share bought by an investor.

### Checking for missing values
"""

# Checking for missing values
df.isnull().sum()

"""**Observation**
No null values present, therefore no missing values listed

###Checking unique values
"""

df.nunique()

"""### Checking for duplicate values"""

# checking for duplicate values
df.duplicated().sum()

"""**Observation**
* There are no duplicate values in the dataset

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**:

1. What does the distribution of stock prices look like?
2. The stocks of which economic sector have seen the maximum price increase on average?
3. How are the different variables correlated with each other?
4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

### Univariate analysis
"""

from matplotlib import patches
import random
#creating a histogram and Boxplot using function

def histobox_plot(df, column, figsize=(15, 10), kde=False, bins=None):
    #plt.figure(figsize = (20,10))
    # set a grey background (use sns.set_theme() if seaborn version 0.11.0 or above)
    sns.set(style="darkgrid")
    # creating a figure composed of two matplotlib.Axes objects (ax_box and ax_hist)
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)},figsize=figsize,)
    # assigning a graph to each ax
    sns.boxplot(df, x=column, ax=ax_box,showmeans=True, color="violet")
    sns.histplot(data=df, x=column, ax=ax_hist)
    ax_hist.axvline(
        data[column].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist.axvline(
        data[column].median(), color="black", linestyle="-"
    )  # Add median to the histogram
    # Remove x axis name for the boxplot
    ax_box.set(xlabel='')
    for p in ax_hist.patches:
      height = p.get_height() # get the height of each bar
      # adding text to each bar
      ax_hist.text(x = p.get_x()+(p.get_width()/2), # x-coordinate position of data label, padded to be in the middle of the bar
      y = height+0.2, # y-coordinate position of data label, padded 0.2 above bar
      s = '{:.0f}'.format(height), # data label, formatted to ignore decimals
      ha = 'center') # sets horizontal alignment (ha) to center

"""**`Current Price`**"""

histobox_plot(df, 'Current Price')

"""**Observation**
* The average current stock price of the companies under study is 80.86 dollars which is higher than the median 59.705 dollars indicating that the distribution is a right-skewed with 49 of the 340 stocks having twice the median value of all stocks
* As expected, no stock is listed at less of less than 0 dollars
* There are outliers

**`Price Change`**
"""

histobox_plot(df, 'Price Change')

"""**Observation**
* The average variation in price or price change of the companies under study is 4.078 dollars which is lesser than the median 4.819 dollars indicating that the distribution is a left-skewed
* While there is a tendency for the distribution to favor lower volatilities, it is important to note the presence of extended tails in both positive and negative price changes. Among the most volatile stocks, price fluctuations span from a significant 47% decrease to a substantial 55% increase over a 13-week period.
* There are outliers

**`Volatility`**
"""

histobox_plot(df, 'Volatility')

"""**Observation**
* The mean volatility of the companies under study is 1.526 which is higher than the median 1.386 dollars indicating that the distribution is a right-skewed
* There are outliers

**`ROE`**
"""

histobox_plot(df, 'ROE')

"""**Observation**
* The mean Return on Equity (ROE) of the companies under study is 39.60% which is higher than the median 15.0%  indicating that the distribution is a right-skewed.
* There are no stocks listed with values below 0 for either metric. For instance, there are 24 stocks listed with returns on equity below 5, while 25 stocks are listed with returns exceeding 100 percent.
* There are outliers

**`Cash Ratio`**
"""

histobox_plot(df, 'Cash Ratio')

"""**Observation**
* The mean cash ratio of the companies under study is 70.02  which is higher than the median 47 indicating that the distribution is a right-skewed
* There are outliers

**`Net Cash Flow`**
"""

histobox_plot(df, 'Net Cash Flow')

"""**Observation**
* The mean Net Cash Flow of the companies under study is 55537620.59 dollars which is higher than the median 2098000 dollars indicating that the distribution is a right-skewed
* There are outliers

**`Net Income`**
"""

histobox_plot(df, 'Net Income')

"""**Observation**
* The mean Net Income of the companies under study is 1494384602.94 dollars which is higher than the median 707336000 dollars indicating that the distribution is a right-skewed
* There are outliers

**`Earnings Per Share`**
"""

histobox_plot(df, 'Earnings Per Share')

"""**Observation**
* The mean Earnings per share of the companies under study is 2.77 dollars which is lesser than the median 2.90 dollars indicating that the distribution is a left-skewed
* There are outliers

**`Estimated Shares Outstanding`**
"""

histobox_plot(df, 'Estimated Shares Outstanding')

"""**Observation**
* The mean Estimated Shares Outstanding of the companies under study is 577028337.75 dollars which is higher than the median 309675137.8 dollars indicating that the distribution is a right-skewed
* There are outliers

**`P/E Ratio`**
"""

histobox_plot(df, 'P/E Ratio')

"""**Observation**
* The mean P/E ratio  of the companies under study is 32.6 which is greater than the median 20.82 dollars indicating that the distribution is a right-skewed
* There are outliers

**`P/B Ratio`**
"""

histobox_plot(df, 'P/B Ratio')

"""**Observation**
* The P/B of the companies under study is -1.718  which is lesser than the median -1.067 indicating that the distribution is a left-skewed
* The distribution for P/B ratios is mostly centered around 0 but with long positive and negative
    - For example, 175 of the 340 total stocks are shown to below the 25th percentile and above the 75th percentile and
    - Additionally, 31 of the stocks are outliers

"""

# function to create labeled barplots


def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        palette="Paired",
        order=df[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

"""**`GICS Sector`**"""

labeled_barplot(df, 'GICS Sector', perc=True)

"""**`GICS Sub Industry`**"""

labeled_barplot(df, 'GICS Sub Industry', perc=True)

"""### Bivariate Analysis"""

# #Using heatmap to check correlation between series
plt.figure(figsize=(15, 7))
sns.heatmap(
    df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

"""**Observation**
:Some variables exhibit moderate correlations (+/- .40) with one another.
* Volatility demonstrates a negative correlation with price change, implying that an increase in a stock's volatility is likely accompanied by a decrease in its price.
* Net income displays a negative correlation with volatility, suggesting that higher net income is associated with lower price volatility.
* Net income also exhibits a positive correlation with earnings per share (EPS) and estimated shares outstanding.
* EPS shows a positive correlation with the current price, indicating that an increase in a company's EPS is likely to lead to a corresponding rise in its stock price.
* EPS is inversely correlated with return on equity (ROE), implying that as a company generates more equity for shareholders, the subsequent net income will yield a comparatively lower return.

**The stocks of which economic sector have seen the maximum price increase on average?.**
"""

df.groupby('GICS Sector')['Price Change'].mean().sort_values()

plt.figure(figsize=(15,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()

"""**Observation**
In the preceding period, stocks within the healthcare sector have demonstrated the most notable average price escalation.

**Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?.**
"""

df.groupby('GICS Sector')['Cash Ratio'].mean().sort_values(ascending=False)

plt.figure(figsize=(20,8))
ax=sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.show()

"""**Observation**
* The IT and Telecommunications sectors, characterized as relatively newer and less regulated industries, exhibit notably elevated average cash ratios compared to their peer sectors.
* In contrast, the Utilities sector, which operates within a highly regulated environment, records the most modest average cash ratios among all sectors.

**P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.**
"""

df.groupby('GICS Sector')['P/E Ratio'].mean().sort_values(ascending=False)

plt.figure(figsize=(15,8))

ax=sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()

"""**Observation**
* Energy enterprises maintain a substantial lead in average P/E ratios across all sectors, while telecommunications companies feature the lowest average P/E ratios.

**Volatility is responsible for the oscillations observed in stock prices. Stocks characterized by high volatility tend to experience more pronounced fluctuations, thereby increasing their investment risk. Now, let's examine the average variations in volatility within different economic sectors.**
"""

df.groupby('GICS Sector')['Volatility'].mean().sort_values(ascending=False)

plt.figure(figsize=(15,8))

ax=sns.barplot(data=df, x='GICS Sector', y='Volatility', ci=False)
for i in ax.containers:
    ax.bar_label(i,)
plt.xticks(rotation=90)
plt.show()

"""**Observation**
* Among the sectors, the energy sector exhibits the highest level of volatility, closely trailed by the material sector. In contrast, the utilities sector demonstrates the lowest level of volatility.

**The Earnings Per Share (EPS) metric plays a significant role in influencing stock investments. EPS represents the portion of a company's profit allocated to each outstanding share of its common stock. It serves as a crucial indicator of a company's financial health and performance, impacting investor decisions Now, let's examine the average variations in Earnings Per Share within different economic sectors.**
"""

df.groupby('GICS Sector')['Earnings Per Share'].mean().sort_values(ascending=False)

plt.figure(figsize=(15,8))

ax=sns.barplot(data=df, x='GICS Sector', y='Earnings Per Share', ci=False)
for i in ax.containers:
    ax.bar_label(i,)

plt.xticks(rotation=90)
plt.show()

"""**Observation**
* The health sector was found to have the highest earnings per share, an investor who invest in the health sector is likely to have a profit of 4.54 dollars or above per share in invested.

## Data Preprocessing

- Duplicate value check
- Missing value treatment
- Outlier check
- Feature engineering (if needed)
- Any other preprocessing steps (if needed)

### Outlier Check

- Plotting the boxplots of all numerical columns to check for outliers.
"""

plt.figure(figsize=(15, 12))

numeric_columns = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

"""**Observation**
* Each of these variables contains a few or several outliers; nevertheless, these values do not seem implausible considering the inherent nature of stock prices and historical anticipations.

### Scaling

- Prior to engaging in the clustering process, it is imperative to preprocess and scale the data. This step ensures that each variable exerts a comparable influence on the clustering outcome, thus preventing any single variable from overshadowing others. By scaling the data, we create a balanced and equitable platform for the clustering analysis to unfold effectively.
"""

num_col = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio','Net Cash Flow','Net Income','Earnings Per Share','Estimated Shares Outstanding','P/E Ratio','P/B Ratio']

num_col

# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)

# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)

#create pairplot for scaled dataframe
sns.pairplot(subset_scaled_df,diag_kind='kde')

subset_scaled_df.head()

"""## EDA

- It is a good idea to explore the data once again after manipulating it.
"""


"""## K-means Clustering

### Checking Elbow Plot
"""

k_means_df = subset_scaled_df.copy()

#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()

#fit KMeans model and use visualizaer to indicate optimal K value
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)
visualizer.show()

"""### Checking the silhouette scores"""

sil_score = []
cluster_list =  range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict((subset_scaled_df))
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()

model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(14, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(13, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(12, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(11, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(10, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(9, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(8, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(2, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()

"""**Observation**
* TThe K value was chosen based on Elbow method and the distortion which has 6 as its optimal K. Therefore 6 was choosen as the optimal K value for the K-Mean clustering.
* The Silhouette could not give a good optimal K due to the fact that no K could cut through all the clusters and however k greater than 10 values seems to cut through all but that gives a lot of clustering.
"""

# final K-means model
kmeans = KMeans(n_clusters=6, random_state=1)
kmeans.fit(k_means_df)

# creating a copy of the original data
df1 = df.copy()

# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_

"""### Cluster Profiling"""

km_cluster_profile = df1.groupby("KM_segments").mean()

km_cluster_profile["count_in_each_segment"] = (
    df1.groupby("KM_segments")["Security"].count().values
)

km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)

## Complete the code to print the companies in each cluster
for cl in df1["KM_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df1[df1["KM_segments"] == cl]["Security"].unique())
    print()

#print number of stocks within each sector for all of the clusters
df1.groupby(["KM_segments", "GICS Sector"])['Security'].count()

plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df1, x="KM_segments", y=variable)

plt.tight_layout(pad=2.0)


"""### KMeans Clusters
**Cluster 0**
- 276  which comprised sectors from Consumer Discretionary,Consumer Staples,Energy,Financials,Health Care,Industrials,Information Technology,Materials,Real Estate,Telecommunications Services, and Utilities
- Companies within this cluster have:
    - Low Volatility
    - Low net income
    - low cashflow
    - low cash ratio
    - low P/E Ratio

**Cluster 1**
- 3 stocks, comprised mostly of stocks within the Consumer Discretionary, Health Care, and Information Technology
- Companies within this cluster have:
- low Volatility
- Highest current Price
- High price change
- lowest Return on Equity
- Low  Net income
- low Cash Flow


**Cluster 2**
- 16 stocks from companies like Consumer Discretionary,Consumer Staples, Energy,Financials,Health Care,Information Technology,Real Estate,Telecommunications Services.
- Companies within this cluster have:

- Moderate Volatility
- Highest current Price
- Low price change
- low Return on Equity
- Low  Net income
- low Cash Flow

**Cluster 3**
- 3 stocks, comprised  of stocks from the Energy sectors
- Companies within this cluster have:
  - Highest Volatility
- low current Price
- Low price change
- Highest Return on Equity
- Lowest  Net income
- low Cash Flow

**Cluster 4**
- 11 stocks, comprised  of stocks from the  Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology, and Telecommunications Services sectors.
- Companies within this cluster have:
- low Volatility
- low current Price
- Moderate price change
- Low Return on Equity
- Highest  Net income
- low and high Cash Flow

**Cluster 5**
- 28 stocks, comprised  of stocks from the Energy, Health Care, Industrials, Information Technology, and  Materials sectors.
- Companies within this cluster have:
- High  Volatility
- low current Price
- Low and negative price change
- Low Return on Equity
- low  Net income
- low  Cash Flow

## Hierarchical Clustering

### Computing Cophenetic Correlation
"""

hc_df = subset_scaled_df.copy()

# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)

"""**Let's explore different linkage methods with Euclidean distance only.**"""

# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)

"""**Let's view the dendrograms for the different linkage methods with Euclidean distance.**

### Checking Dendrograms
"""

# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

    compare.append([method, coph_corr])

"""**Observation**
- The highest cophenetic correlation is observed for the average and centroid linkage methods, yet the dendrogram associated with the average linkage method seems to yield more coherent clusters.
- The average linkage method suggests that the optimal number of clusters is 5.
"""

# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc

"""### Creating model using sklearn"""

HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)

# creating a copy of the original data
df2 = df.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_

"""### Cluster Profiling"""

hc_cluster_profile = df2.groupby("HC_segments").mean()

hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values
)

hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)

## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()

df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()

plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)

plt.tight_layout(pad=2.0)

"""**Let us try using Ward linkage as it has more distinct and separated clusters (as seen from it's dendrogram before). 6 appears to be the appropriate number of clusters from the dendrogram for Ward linkage.**

### Creating  Final model using sklearn
"""

HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)

# creating a copy of the original data
df2 = df.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_

"""### Cluster Profiling"""

hc_cluster_profile = df2.groupby("HC_segments").mean()

hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values
)

hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)

## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()

df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()

plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df2, x="HC_segments", y=variable)

plt.tight_layout(pad=2.0)

"""### Hierarchical Clusters

**Cluster 0**
 - 11 stocks, comprised mostly of stocks within theConsumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Information Technology,  and Telecommunications Services.

- Companies within this cluster have:
- Low Volatility
- Very low current Price
- Low Return on Equity
- Highest Net income
- Highest Cash Flow
- Highest Estimated Outstanding Shares

**Cluster 1**
- 7 stocks, comprised mostly of stocks within the Consumer Discretionary, Consumer Staples, Energy, Financials, and Industrials

- Companies within this cluster have:
- Moderate Volatility
- Low current Price
- Highest Return on Equity
- Low Net income
- Moderate Cash Flow
- Low Estimated Outstanding Shares

**Cluster 2**
- 12 stocks Consumer Discretionary, Consumer Staples, Health Care, Information Technology,Real Estate, and Telecommunications Services.
- Companies within this cluster have:
- Moderate Volatility
- Higest current Price
- Very low Return on Equity
- Low Net income
- low Cash Flow

**Cluster 3**
- 285 stocks, a vast majority of which are in the Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care,Industrials, Information Technology, Materials, Real Estate, Telecommunications Services, and  Utilities.
Companies within this cluster have:
- low Volatility
- low current Price
- Very low Return on Equity
- Low Net income
- low Cash Flow

**Cluster 4**
- 23 stocks, from Energy,Information Technology,Materials and,Consumer Discretionary sectors.
Companies within this cluster have:
- Highest Volatility
- low current Price
- low Return on Equity
- Very Low Net income
- low Cash Flow

**Cluster 5**
- 3 stocks, from Consumer Discretionary, Health Care, Information Technology Sectors.
Companies within this cluster have:
- High Volatility
- High current Price
- Highest price change
- lowest Return on Equity
- Low  Net income
- low Cash Flow

## K-means vs Hierarchical Clustering

You compare several things, like:
- Which clustering technique took less time for execution?
- Which clustering technique gave you more distinct clusters, or are they the same?
- How many observations are there in the similar clusters of both algorithms?
- How many clusters are obtained as the appropriate number of clusters from both algorithms?

You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.

**Which clustering technique took less time for execution?**
- Both the KMeans model and the Agglomerative Clustering model successfully fit the dataset in approximately 0.1 seconds.
**Which clustering technique gave you more distinct clusters, or are they the same? How many observations are there in the similar clusters of both algorithms?**
Both algorithms yield comparable clusters, featuring a dominant cluster encompassing a majority of the stocks, while the remaining five clusters consist of 2 to 28 stocks each.

**How many clusters are obtained as the appropriate number of clusters from both algorithms?**
In the case of both algorithms, the utilization of 6 clusters resulted in well-defined clusters, each containing an ample number of observations to effectively distinguish the representative "type" of stock within the cluster.

**Differences or similarities in the cluster profiles from both the clustering techniques**
- Both algorithms produced analogous clusters by considering the outliers present across the 11 variables.

## Actionable Insights and Recommendations

-

- Trade&Ahead should initially assess their clients' financial objectives, risk tolerance, and investment patterns before suggesting a cluster that aligns with these requirements as a potential stock portfolio.

- Nevertheless, a considerable number of these clusters, based on their constituent stock attributes, essentially serve as alternatives to standard indexes. Achieving these objectives could potentially be more straightforward using such indexes.

- Alternatively, Trade&Ahead could employ these clusters as a foundational step for conducting comprehensive financial statement analysis. This approach could specifically highlight stocks that deviate from the cluster "profile."

- If the process of handpicking individual stocks is integral to a client's investment approach, Trade&Ahead may then be well-positioned to identify stocks that are likely to outperform their counterparts (resulting in a buy recommendation) or underperform (prompting a sell recommendation).
"""