
Commit 3fbacbe

committed
porfolio_1
1 parent a827302 commit 3fbacbe

3 files changed

+717
-9
lines changed

Cleansing_Exploration/project2.qmd

+263-3
@@ -1,7 +1,7 @@
 ---
-title: "Client Report - [Insert Project Title]"
+title: "Client Report - Finding Relationships in Baseball"
 subtitle: "Course DS 250"
-author: "[STUDENT NAME]"
+author: "Brian Munoz"
 format:
   html:
     self-contained: true
@@ -25,4 +25,264 @@ execute:
 
 ---
 
-### Paste in a template
+
```{python}
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
```

### Baseball, a game of perspective

_This report shows why we should not limit ourselves to the most recent results. We look at the salaries of players who attended BYU-Idaho, at how players' batting averages change as they accumulate more at bats, and finally at how effectively two great teams use their resources and how this affects their number of wins._

## QUESTION|TASK 1

__Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.__

```{python}
#| label: Q1
#| code-summary: BYU-Idaho list of students
#| fig-align: center

# Connect to the Lahman baseball database (SQLite file in the working directory)
conn = sqlite3.connect('lahmansbaseballdb.sqlite')
cur = conn.cursor()

# Salary records for players whose college record lists BYU-Idaho ('idbyuid').
# Filtering on cp.schoolID keeps only BYU-Idaho rows in the schoolID column.
query = """
SELECT DISTINCT s.playerID, cp.schoolID, s.salary, s.yearID, s.teamID
FROM salaries s
JOIN collegeplaying cp ON s.playerID = cp.playerID
WHERE cp.schoolID = 'idbyuid'
ORDER BY s.salary DESC
"""

cur.execute(query)
results = cur.fetchall()

df = pd.DataFrame(results, columns=['playerID', 'schoolID', 'salary', 'yearID', 'teamID'])

print(df)
```
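
A join-and-filter query of this shape can be checked on a tiny in-memory database before running it against the full Lahman file. The sketch below is illustrative only: the tables are trimmed to the columns used here, and every row (player IDs, team IDs, salaries, and the `otherschool` ID) is made up. The helper name `salaries_for_school` is hypothetical.

```python
import sqlite3

# In-memory stand-in for the Lahman database; all rows are hypothetical.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE salaries (playerID TEXT, yearID INTEGER, teamID TEXT, salary REAL);
CREATE TABLE collegeplaying (playerID TEXT, schoolID TEXT, yearID INTEGER);
INSERT INTO salaries VALUES
    ('playera01', 2001, 'AAA', 500000),
    ('playera01', 2002, 'AAA', 750000),
    ('playerb01', 2001, 'BBB', 900000);
INSERT INTO collegeplaying VALUES
    ('playera01', 'idbyuid', 1995),
    ('playera01', 'idbyuid', 1996),
    ('playerb01', 'otherschool', 1994);
""")


def salaries_for_school(conn, school_id):
    """Return salary rows for players who attended school_id, highest first."""
    sql = """
    SELECT DISTINCT s.playerID, cp.schoolID, s.salary, s.yearID, s.teamID
    FROM salaries s
    JOIN collegeplaying cp ON s.playerID = cp.playerID
    WHERE cp.schoolID = ?
    ORDER BY s.salary DESC
    """
    # The ? placeholder lets sqlite3 bind the school ID safely.
    return conn.execute(sql, (school_id,)).fetchall()


rows = salaries_for_school(demo, "idbyuid")
for row in rows:
    print(row)
```

The toy player has two college seasons, so the join produces each salary twice; `DISTINCT` collapses those duplicates to one row per salary record.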

```{python}
#| label: Q1-chart
#| fig-align: center

# Average salary across all players, per season
query = """
SELECT yearID, AVG(salary) as avg_salary
FROM salaries
GROUP BY yearID
ORDER BY yearID
"""

cur.execute(query)
results = cur.fetchall()

df = pd.DataFrame(results, columns=['yearID', 'avg_salary'])

plt.figure(figsize=(15, 8))
bars = plt.bar(df['yearID'], df['avg_salary'], color='skyblue', alpha=0.7)

# Guide line
plt.plot(df['yearID'], df['avg_salary'], color='red', linewidth=2, marker='o')

plt.title('Average MLB Salary by Year', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Salary ($)', fontsize=12)
plt.xticks(rotation=45)

# Show the y axis as dollar amounts
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x:,.0f}'))

# Label each bar with its value, centered vertically inside the bar
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height/2,
             f'${height:,.0f}',
             ha='center', va='center', rotation=90, color='white', fontweight='bold')

plt.tight_layout()
plt.show()
```

## QUESTION|TASK 2

__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)__

##### A. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

_Some players had only one at bat in a season, which makes their batting average far higher than that of other players._

```{python}
#| label: Q2
#| code-summary: At least 1 at bat

# Batting average per player and season, shown as a percentage (hits / at bats * 100)
query = """
SELECT playerID, yearID,
       CAST(SUM(H) AS FLOAT) AS total_hits,
       CAST(SUM(AB) AS FLOAT) AS total_at_bats,
       (CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT)) * 100 AS batting_average_percentage
FROM batting
GROUP BY playerID, yearID
HAVING SUM(AB) >= 1  -- at least 1 at bat that year
ORDER BY batting_average_percentage DESC, playerID
LIMIT 5
"""

cur.execute(query)
results = cur.fetchall()

df = pd.DataFrame(results, columns=['playerID', 'yearID', 'total_hits', 'total_at_bats', 'batting_average'])

pd.set_option('display.float_format', '{:.2f}'.format)
print("\nFormatted results:")
print(df)
```

##### B. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.

_Now that we only include players with more at bats, the top batting averages drop sharply._

```{python}
#| label: Q2-chart
#| code-summary: At least 10 at bats
#| fig-align: center

# Same query as part A, restricted to seasons with at least 10 at bats
query = """
SELECT playerID, yearID,
       CAST(SUM(H) AS FLOAT) AS total_hits,
       CAST(SUM(AB) AS FLOAT) AS total_at_bats,
       (CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT)) * 100 AS batting_average_percentage
FROM batting
GROUP BY playerID, yearID
HAVING SUM(AB) >= 10  -- at least 10 at bats that year
ORDER BY batting_average_percentage DESC, playerID
LIMIT 5
"""

cur.execute(query)
results = cur.fetchall()

df = pd.DataFrame(results, columns=['playerID', 'yearID', 'total_hits', 'total_at_bats', 'batting_average'])

pd.set_option('display.float_format', '{:.2f}'.format)
print("\nFormatted results:")
print(df)
```

##### C. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

_Now we see players who not only performed well but also played a larger role on their teams. Comparing the three tables, the top batting averages fall as the required number of at bats rises. It is therefore normal to expect the highest averages to drop as players take part in more games._

```{python}
#| label: Q2-table
#| code-summary: At least 100 career at bats

# Career batting average: group by player only, so all seasons are combined
query = """
SELECT playerID,
       CAST(SUM(H) AS FLOAT) AS total_hits,
       CAST(SUM(AB) AS FLOAT) AS total_at_bats,
       (CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT)) * 100 AS batting_average_percentage
FROM batting
GROUP BY playerID
HAVING SUM(AB) >= 100  -- at least 100 career at bats
ORDER BY batting_average_percentage DESC, playerID
LIMIT 5
"""

cur.execute(query)
results = cur.fetchall()

df = pd.DataFrame(results, columns=['playerID', 'total_hits', 'total_at_bats', 'batting_average'])

pd.set_option('display.float_format', '{:.2f}'.format)
print("\nFormatted results:")
print(df)
```
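
The effect of the at-bat threshold is easy to see on a handful of made-up career lines. The sketch below is plain Python rather than SQL, and every player ID and count in it is hypothetical; it only illustrates how batting average (hits divided by at bats) behaves when small samples are filtered out. The helper name `top_averages` is an assumption for this example.

```python
# Hypothetical career lines: (playerID, hits, at_bats). Not real players.
careers = [
    ("onehit01", 1, 1),
    ("smallsa01", 6, 12),
    ("regular01", 180, 600),
    ("veteran01", 2900, 9500),
]


def top_averages(rows, min_at_bats, limit=5):
    """Batting average = hits / at bats, for players with enough at bats."""
    qualified = [
        (player, hits / at_bats)
        for player, hits, at_bats in rows
        if at_bats >= min_at_bats
    ]
    # Highest average first; ties broken alphabetically by playerID.
    qualified.sort(key=lambda item: (-item[1], item[0]))
    return qualified[:limit]


for threshold in (1, 10, 100):
    best = top_averages(careers, threshold)
    print(threshold, [(player, round(avg, 3)) for player, avg in best])
```

With a threshold of 1 the single-hit player leads with a perfect average; raising the threshold to 10 and then 100 removes the small samples, and the leading average falls each time.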

## QUESTION|TASK 3

__Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?__

_I compared the total salary and total wins of the Yankees and the White Sox over 25 seasons (1992-2016). Although the Yankees have the higher win count, the White Sox have shown great efficiency: they kept a strong win record while spending almost 50% less than the Yankees._

```{python}
#| label: Q3
#| code-summary: Yankees vs White Sox (25 years)

# Sum each team's payroll per season first, then join to teams, so that each
# season's wins are counted once instead of once per salary row.
query = """
SELECT t.name,
       ROUND(SUM(p.payroll) / 1000000, 2) AS team_total_salary,
       SUM(t.W) AS total_wins
FROM teams t
JOIN (
    SELECT teamID, yearID, SUM(salary) AS payroll
    FROM salaries
    GROUP BY teamID, yearID
) p ON t.teamID = p.teamID AND t.yearID = p.yearID
WHERE t.teamID IN ('NYA', 'CHA')
  AND t.name != 'New York Highlanders'
  AND t.yearID BETWEEN 1992 AND 2016  -- 25 seasons
GROUP BY t.name
"""

# Execute the query and load results into a DataFrame
df = pd.read_sql_query(query, conn)

# Create two subplots side by side
fig = make_subplots(rows=1, cols=2, subplot_titles=("Total Salary (Millions $)", "Total Wins"))

# Add bar chart for total salary
fig.add_trace(
    go.Bar(
        x=df['name'],
        y=df['team_total_salary'],
        name="Total Salary",
        text=df['team_total_salary'].apply(lambda x: f'{x:.2f}M'),
        textposition='inside',
        insidetextanchor='middle',
        marker_color='blue'
    ),
    row=1, col=1
)

# Add bar chart for total wins
fig.add_trace(
    go.Bar(
        x=df['name'],
        y=df['total_wins'],
        name="Total Wins",
        text=df['total_wins'].apply(lambda x: f'{x:.0f}'),
        textposition='inside',
        insidetextanchor='middle',
        marker_color='green'
    ),
    row=1, col=2
)

# Update layout
fig.update_layout(
    title={
        'text': "Yankees vs White Sox (1992-2016)",
        'x': 0.5,
        'xanchor': 'center'
    },
    showlegend=False,
    height=600,
    width=1000
)

# Show the plot
fig.show()
```
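
One way to put a number on "efficiency" is payroll per win. The sketch below is a minimal illustration on an in-memory SQLite database with invented teams, players, and amounts; the `cost_per_win` column is an assumption introduced for this example and is not part of the report's query. Salaries are summed per team and season before the join so that each season's wins are counted once rather than once per salary row.

```python
import sqlite3

# In-memory stand-in for the teams and salaries tables; all rows are hypothetical.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE teams (teamID TEXT, yearID INTEGER, name TEXT, W INTEGER);
CREATE TABLE salaries (teamID TEXT, yearID INTEGER, playerID TEXT, salary REAL);
INSERT INTO teams VALUES
    ('AAA', 2000, 'Team A', 90),
    ('BBB', 2000, 'Team B', 80);
INSERT INTO salaries VALUES
    ('AAA', 2000, 'p1', 6000000),
    ('AAA', 2000, 'p2', 3000000),
    ('BBB', 2000, 'p3', 4000000);
""")

sql = """
SELECT t.name,
       SUM(p.payroll) AS total_salary,
       SUM(t.W) AS total_wins,
       SUM(p.payroll) / SUM(t.W) AS cost_per_win
FROM teams t
JOIN (
    -- One payroll row per team and season keeps the join one-to-one.
    SELECT teamID, yearID, SUM(salary) AS payroll
    FROM salaries
    GROUP BY teamID, yearID
) p ON t.teamID = p.teamID AND t.yearID = p.yearID
GROUP BY t.name
ORDER BY t.name
"""

result = demo.execute(sql).fetchall()
for name, total_salary, total_wins, cost_per_win in result:
    print(f"{name}: {total_wins} wins, ${cost_per_win:,.0f} per win")
```

In this toy data Team A wins more games but pays twice as much per win as Team B, which is the kind of comparison the salary and wins charts are meant to support.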