### Filtering Text
`LIKE` Clause used to search for a pattern with a wildcard
- `%` wildcard used to match zero,one, or many characters
- `_` wildcard used to match a single character
``` SQL
SELECT name
FROM people
WHERE name LIKE `Ade%';
```
would return matches like Adel Karam, Adelaide Kane, Aden Young
```SQL
SELECT name
FROM people
WHERE name LIKE 'Ev_';
```
would return matches like Eve or Eva. However Eve Mendes would not be returned unless the clause is `Ev_ Mendes`.
`NOT LIKE` clause follows the same pattern, except filtering the opposite.
```SQL
SELECT name
FROM people
WHERE name LIKE`%r';
```
would find names ending in r.
```SQL
SELECT name
FROM people
WHERE name LIKE '__t%'
```
would select name where the third character is t.
`IN` Clause can help clean up multiple our clauses on the same columns
``` SQL
SELECT year
FROM films
WHERE year = 1920 OR year = 1930 OR year = 1940;
```
can become
``` SQL
SELECT year
FROM films
WHERE year IN (1920,1930,1940);
```
`COUNT` Clause will count all rows if using *, but with a column specified it will count all rows with non-null values.
**Key's** in Tables uniquely identify records within the table. They are often used to identify distinct records and create relationships in a database structure through joins.
`JOIN` General Structure:
```SQL
FROM left_table AS alias1
xxx JOIN right_table AS alias2
ON alias1.column = alias2.column
```
When joining on two identical column names:
```SQL
FROM left_table AS alias1
xxx JOIN right_table AS alias2
USING(columns_name)
```
Aliases can be used in scenarios where the column names are identical. If they are, without aliasing the names of the columns the code will throw an error.
When doing joins, it is often common to write out the join first then write the `SELECT` statement so that you can properly alias identical columns.
Relationships in SQL are key to joins
- One-to-many relationships (like an artist having many songs) where an object can be associated with many objects
- One-to_one relationships (like everything finger having a unique fingerprint) where an object is only associated with another object
- Many-to-many relationships (like languages in countries. German, French, and Dutch are all offical languages of Belgium, and German can be spoken in Germany and Belgium) where multiple objects are associated with multiple other objects.
#### Manipulating Text
For all caps or all lower case columns, if you want to just capitalize the first letter use the following
```SQL
SELECT vendor_id,
INITCAP(vendor_name) AS vendor_name
FROM vendors
LIMIT 3;
```
Apply lowercase to an entire column
```SQL
SELECT lower(bleh)
```
Apply uppercase to an entire column
```SQL
SELECT upper(bleh)
```
#### Replacing Null Values
to replace null values use the COALESCE statement
```SQL
SELECT name, city, COALESCE(state,'None') AS state
FROM vendors;
```
### CTEs (Common Tables Expressions)
Used to more easily write and maintain complex queries via increased readability and simplification.
```SQL
WITH overdue AS (
SELECT
reference,
date('2020-04-01') - due_date AS days_overdue
FROM invoices)
SELECT count(*)
FROM overdue
WHERE days_overdue > 60
```
### OUTER JOINS
`LEFT JOIN` contains all records in the left table, and those that match in the right. Returns null values when no match is found in the right table. Can be written as `LEFT OUTER JOIN`.
`RIGHT JOIN` does the same as `LEFT JOIN` except swapping the primary table. Any `RIGHT JOIN` can be wrriten as a `LEFT JOIN` so it much less common.
The Syntax for `LEFT/RIGHT JOIN` follow the same structure as `INNER JOIN` except replacing INNER with LEFT/RIGHT.
`FULL JOIN` Combines a Left and Right joins. It returns null values for where there wasn't a match in either the Right or Left tables. The Syntax matches both the LEFT/RIGHT/INNER join code but replacing the type by FULL.
**Self Joins** are tables joined with themselves. Used to compare parts of the same table (Ie lets say we have a table of prime ministers. We want to join the same table on continent to match all prime minsters with with prime ministers within their region.)
```SQL
SELECT
p1.country_code,
p1.size AS size2010,
p2.size AS size2015
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code
WHERE p1.year = 2010
-- Filter such that p1.year is always five years before p2.year
AND p1.year = p2.year - 5;
```
This query is a Self Join that is used to compare country populations within the same table in different years. In this example it joins on country code and compares the populations of each country from 2010 and 2015.
### Set Theory for SQL Joins
<img title="a title" alt="Alt text" src="Static/SetTheory.png">
Set operations do not require a column to join on. Instead, Set operations require the same number of columns and Sames field type for each columns. The columns are then stacked on top of eachother (why we need the same datatype) in order.
The resulting set will take the aliases and field names from the left table.
Helpful to write the syntax for each side of the set operations before writing the Set operation itself.
1. Union/Union All
- `UNION` Takes two tables as input and returns all records (except duplicates) from both
- `UNION ALL` Takes two tables as input and returns all records including duplicates
```SQL
SELECT *
FROM left_table
UNION/UNION ALL
SELECT *
FROM right_table;
```
2. Intersect
- `INTERSECT` takes two tables as input and only returns records in both tables.
```SQL
SELECT *
FROM left_table
INTERSECT
SELECT *
FROM right_table
```
Ex: Counties with prime ministers and persidents:
```SQL
SELECT country as intersect_country
FROM prime_ministers
INTERSECT
SELECT country
from presidents;
```
This table returns only records for countries with a president and prime minister
3. Except
### Subqueries
Calling joins will always add extra columns to the new set.
**Additive Joins** like `Inner Join` add columns with different names to the result set with their unique names. Fields with the same name get added with the same name. This can cause duplicate name columns (solved with aliasing).
**Semi Join** chooses records in the first table where a condition is met in the second. Will only return records in the table being filtered, therefore not including extra columns where both tables align (therefore not **Additive**)
```SQL
SELECT president, country, continent
FROM presidents
WHERE country IN
(SELECT country
FROM states
WHERE indep_year < 1880);
```
This query returns records in the presidents column where the county's independence year is less than 1880. Since the indep_year column is in the states tables, we can write a **subquiery** in the `WHERE` clause to filter on the presidents table.
**ANTI JOIN** Opposite of Semi Join. Chooses records in the left table where the column in the left does not match values in the column in the right table.
``` SQL
SELECT president, country, continent
FROM presidents
WHERE continent LIKE `%America`
AND country NOT IN
(SELECT country
FROM states
WHERE indep_year < 1880);
```
Countries in the Americas where indep_year is greater than or equal to 1880.
Most Common place for **Subqueires**
### Subquiries inside `SELECT`
```SQL
SELECT DISTINCT continent,
(SELECT COUNT(*)
FROM monatchs
WHERE states.continent = monarch.continent) AS monarch_count
FROM states;
```
The following statement will group by distinct continets in the states table and count how many monarchs from the monarchs table are associated.
This Query:
```SQL
SELECT countries.name AS country,
-- Subquery that provides the count of cities
(SELECT COUNT(DISTINCT name)
FROM cities
WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;
```
Is equivelent to this Left Join Group by:
```SQL
-- Find top nine countries with the most cities
SELECT countries.name AS country, COUNT(DISTINCT cities.name) AS cities_num
FROM countries
LEFT JOIN cities
ON countries.code = cities.country_code
GROUP BY country
-- Order by count of cities as cities_num
ORDER BY cities_num DESC, country ASC
LIMIT 9;
```
As they both count the number of cities that exist in the cities table, associated with the countries table.
### Subqueries inside FROM
```SQL
SELECT DISTINCT monarchs.continent, sub.most_recent
FROM monarchs,
(SELECT
continet,
MAX(indep_year) AS most_recent
FROM states
GROUP BY continent) AS sub
WHERE monarchs.continet = sub.continent
ORDER BY continet
```
This query will return continents with monarchs and the year the most recent country gained independence.
When linking two tables in the where clause, mae sure to use distinct on the identifyer column.
This subquery is defined as a temporary table that can be referenced in the outside query.
Subqueries inside `FROM`` can help select columns from multiple tables in a single query.
Say you are interested in determining the number of languages spoken for each country. You want to present this information alongside each country's local_name, which is a field only present in the countries table and not in the languages table. You'll use a subquery inside FROM to bring information from these two tables together!
```SQL
-- Select local_name and lang_num from appropriate tables
SELECT DISTINCT countries.local_name, sub.lang_num
FROM countries,
(SELECT code, COUNT(*) AS lang_num
FROM languages
GROUP BY code) AS sub
-- Where codes match
WHERE countries.code = sub.code
ORDER BY lang_num DESC;
```
### Subquery challenge
Suppose you're interested in analyzing inflation and unemployment rate for certain countries in 2015. You are not interested in countries with "Republic" or "Monarchy" as their form of government, but are interested in all other forms of government, such as emirate federations, socialist states, and commonwealths.
```SQL
-- Select relevant fields
SELECT economies.code, economies.inflation_rate, economies.unemployment_rate
FROM economies
WHERE year = 2015
AND code NOT IN
-- Subquery returning country codes filtered on gov_form
(SELECT code
FROM countries
WHERE gov_form LIKE '%Republic%' OR gov_form LIKE '%Monarchy%')
ORDER BY inflation_rate;
```
Using '%word%' as a wildcard expression will return all rows that **Contains** that word anywhere in the column being filtered.
Your task is to determine the top 10 capital cities in Europe and the Americas by city_perc, a metric you'll calculate. city_perc is a percentage that calculates the "proper" population in a city as a percentage of the total population in the wider metro area, as follows:
```SQL
-- Select fields from cities
SELECT name, country_code, city_proper_pop, metroarea_pop, city_proper_pop / metroarea_pop * 100 AS city_perc
FROM cities
-- Use subquery to filter city name
WHERE name IN
(SELECT capital
FROM countries
WHERE continent = 'Europe' OR continent LIKE '%America') AND
-- Add filter condition such that metroarea_pop does not have null values
metroarea_pop IS NOT NULL
ORDER BY city_perc DESC
-- Sort and limit the result
LIMIT 10;
```
### Detecting Duplicates
You can detect duplicates by using the `GROUP BY` and `HAVING` clauses
```SQL
SELECT
column1,
...
columnn
FROM table
GROUP BY
column1,...,columnn
HAVING
COUNT(*) > 1;
```
Also, the `ROW_NUMBER()` function determines where groups of column values are duplicates in a dataset. Must use the `OVER` clause which defines the window the function will be applied.
#### Enumerating duplciates rows
```SQL
SELECT
camis,
name,
boro,
inspection_date,
violation_code,
ROW_NUMBER() OVER(
PARTITION BY
camis,
name,
boro,
inspection_date,
violation_code
) - 1 AS duplicate
FROM
restraunt_inspection;
```
This query adds a column indicating the # of times a record is duplicated based on the order (ie the first record that is duplicate is assigned 1, second 2, etc), then subtracts one. This allows us to filter by this column = 0 to remove all duplicate entries.
#### Resolving impartial dupes
Let's say two records have all of the same identifiers, but there's a column called Score (ie test scores). Now lets say the same student has two records, but two different scores. If there is no documentation regarding why there could be dupes like this, then it is common to replace these from an aggregate function (Average,min,max).
EX: Detecting Duplicated parking tickets
```SQL
SELECT
summons_number,
-- Use ROW_NUMBER() to define duplicate window
ROW_NUMBER() OVER(
PARTITION BY
plate_id,
issue_date,
violation_time,
house_number,
street_name
-- Modify ROW_NUMBER() value to define duplicate column
) - 1 AS duplicate,
plate_id,
issue_date,
violation_time,
house_number,
street_name
FROM
parking_violation;
```
EX: Selecting Duplicate Parking tickets
```SQL
SELECT
-- Include all columns
*
FROM (
SELECT
summons_number,
ROW_NUMBER() OVER(
PARTITION BY
plate_id,
issue_date,
violation_time,
house_number,
street_name
) - 1 AS duplicate,
plate_id,
issue_date,
violation_time,
house_number,
street_name
FROM
parking_violation
) sub
WHERE
-- Only return records where duplicate is 1 or more
duplicate > 0;
```
EX: Return the duplicate records and keep only the lowest fee
```SQL
SELECT
-- Include SELECT list columns
summons_number,
MIN(fee) AS fee
FROM
parking_violation
GROUP BY
-- Define column for GROUP BY
summons_number
HAVING
-- Restrict to summons numbers with count greater than 1
COUNT(*) > 1;
```
### Inconsistent data
Given a set of business rules, finding inconsistencies that do not match the given situations guidlines.
For example: Restraunt inspection
- core corresponds to # of violatons
- A (0-13), B(14-27), C(28+)
- Scenarios for grades:
- A on initial inspection
- Re-inspection with A, B, or C.
You can write simple SQL queries that detect these inconsistencies above using where clauses.
Notes:
- Diversity of approaches
- Careful thought requires
- Domain knowledge
- which values are valid
- Reasons for duplication
- Appropriate fill-in values
EX: write a query that makes sure tickets were issued during the violation hours.
```SQL
SELECT
-- Specify return columns
summons_number,
violation_time,
from_hours_in_effect,
to_hours_in_effect
FROM
parking_violation
WHERE
-- Condition on values outside of the restricted range
violation_time NOT BETWEEN from_hours_in_effect AND to_hours_in_effect;
```
### Case Statement Subqueries
In this example, we use a subquery to fill in deleted information from another table.
The table we access the zip code we want to populate is nyc_zip_codes, and the table we want to populate is film_permit.
The film_permit table had the borough deleted, which is the column we want to populate from the nyc_zip_codes table.
```SQL
SELECT
event_id,
CASE
WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Manhattan') THEN 'Manhattan'
-- Match Brooklyn zip codes
WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Brooklyn') THEN 'Brooklyn'
-- Match Bronx zip codes
WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Bronx') THEN 'Bronx'
-- Match Queens zip codes
WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Queens') THEN 'Queens'
-- Match Staten Island zip codes
WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Staten Island') THEN 'Staten Island'
-- Use default for non-matching zip_code
ELSE NULL
END as borough
FROM
film_permit;
```
### Column Types and Contraints in SQL
Foreign key: value that exists in the referenced column, or null
Primary Key: unique, not null
Unique: values must all be different except for null
Check Contraints: conditions on the values
- column1 > 0
#### Data Types:
- Common
- Numeric
- Character
- Date/Time
- Boolean
- Special
- Arrays
- Monetary
- Binary
- Geometric
- Network Address
- XML
- JSON
- and more!
You can find the specific types in the entity relationship diagram.
Values can be converted temporarily using the `CAST` function (called casting)
```SQL
SELECT CAST (value AS new_type);
```
or alternatively
```SQL
SELECT value::new_type;
```
#### Extract Date From DateTime
```SQL
SELECT EXTRACT(MONTH from column)
```
### Case Statement
Contain a When,Then, and Else statement finished with End
```SQL
CASE WHEN x = 1 THEN 'a'
WHEN x = 2 THEN 'b'
ELSE 'Tie' END AS outcome
```
multiple conditions in WHEN statement
```SQL
CASE WHEN x = 1 OR x = 2 THEN 'a or b'
```
Filtering on the Case Statement
```SQL
CASE WHEN x = 1 THEN 'a'
WHEN x = 2 THEN 'b'
ELSE 'Tie' END AS outcome
FROM BLEH
WHERE CASE WHEN x = 1 THEN 'a'
WHEN x = 2 THEN 'b'
ELSE 'Tie' END IS NOT NULL;
```
Not including the else statement populates NULL, then we filter out the nulls in the WHERE statement.
Case statements with aggregate function
```SQL
SELECT
a,
COUNT(CASE WHEN x = 1 or x = 2 then id) AS home_win
FROM bleh
GROUP BY a
```
**Percentages with CASE and AVG**
```SQL
SELECT
a,
AVG(CASE WHEN x = 1 THEN 1
WHEN x != 2 THEN 0
END) AS pct
```
### Statistical Extermination Theory
The field of stats is the practice and study of collecting and analyzing data under two main branches.
1. **Descriptive / Summary Statistics**: describing or summarizing our data
2. **Inferential Statistics**: collect a sample of data, and apply the results to the population that the sample represents.
Stats require **specific measurable question** like is rock music more popular than jazz?
Can't be used to find out **why**.
**Numeric Data**
1. continuous data (stock prices)
2. interval/count data (number of cups of coffee)
**Categorical Data**
1. Nominal Data which is unordered (eye color)
2. Ordinal Data which is ordered (how strongly do you feel about this product)
**Measures of Center**
- Mean
- Median
- Mode
When data is not symmetrical (contains outliers) it is best to use the median over the mean.
**Measures of Spread**
Tells how much variety exists in our data.
- range = max - min
- variance = SUM(Distance - Mean)^2 of all data points divided by # of points
- standard deviation = sqrt(variance)
- Closer to 0 means more symmetrical
- quartiles
- split the data into four equal parts
- 0%,25%,50%,75%,100%
- 50% = median
- IQR = 3rd Quartile - 1st Quartile
- less affected by outliers than any other measures of spread
**Probability and Distributions**
What are the chances?
P(event) = # way event can happen / total # of possible outcomes
**Sampling with Replacement** is when a sample is put back into the available pool once selected. This is an example of independent probability.
**Independent Probability** two events are *independent* if the probability of the second event does not change based on the outcome of the first event.
**Ex:**
- Chances that the next order is greater than the mean?
- number of orders greater than the mean divided by the total # of orders.
**Probability Distribution** describes the probability of each possible outcome in a scenario
**Expected Value** the mean of the probability distribution.
- Multiplying each value with its probability.
Probability can also be calculated by graphing the probability distribution and finding the area under the curve at the given point.
When all outcomes have the same probability, it is called **Discrete Uniform Distribution**
**Law of Large Numbers States** that as the size of your sample increases, the sample mean will approach the **expected value**.
**Conditional Probability**
**Sampling without Replacement** is when a sample is not put back into the available pool once selected. This is an example of dependent probability as the probability of the second event is affected by the outcome of the first event.
**Conditional Probability** is used to calculate the probability of dependent events. The probability of one event is **conditional** on the outcome of another.
![[conditional_probability.png]]![[condit_proba_2.png]]
**Continuous Distributions**
1. Continuous Uniform Distribution
- Straight Line
2. Many Non-Continuous distributions
1. Bimodal Distribution (modeling soft cover vs hard cover book prices)
2. Normal Distribution (modeling age of retirement)
Regardless, all probability distribution's must have an area under the curve equal to 1.
**Normal Distribution**
![[normal_distribution.png]]
Curve never hits 0.
The normal distribution is described by its mean and standard deviation.
68% of the area falls within 1 STD of the mean and 95% 2 STD from the mean and 99.7% from 3 STD.
**A normal distribution is required for many statistical tests.**
We can interpret a distribution by its **skewness** and **kurtosis**
![[skewness.png]]
Positive skewed distribution can also be called right skewed, and negative left skewed.
**Kurtosis** is a way of describing the occurrence of extreme values in a distribution.
1. Positive/Leptokurtic Kurtosis
- Large peak around the mean and small STD
2. Mesokurtic Kurtosis
- Used to describe the normal distribution
3. Negative/Platykurtic Kurtosis
- Lower peak around the mean and high STD
![[Kurtosis.png]]
**Binomial Distribution**
- probability distribution of the **number of successes** in a sequence of **independent events**
A great example of this is flipping a coin, or any binary outcomes.
Described by **n** and **p**:
- **n**: total number of events
- **p** probability of success
![[binomial_distribution.png]]
**Expected Value** of a binomial distribution is equal to n x p
The binomial distribution requires independent events product binary outcomes.
**Central Limit Theorem** states the sampling distribution of a statistic becomes closer to the normal distribution as the size of the sample increases.
The distribution of a summary statistic (mean) is called a **Sampling Distribution**.
The CLT only applies when the **samples are random and independent**.
Therefore, taking the mean of the **Sampling Distribution** is an estimate of the summary statistic across the entire population.
Benefits of the CLT is that if you do not have the resources to collect all data for your population, you can provide and estimate. N>= 30 is the general rule of thumb to be able to apply the CLT.
**The Poisson Distribution**
The Poisson Process:
- The Average # of events in a period is known
- but the time or space between events is random
- Ex: number of animals adopted from an animal shelter per week.
The **Poisson Distribution** describes the probability of some # of events occurring over a fixed period of time.
- Ex: probability of 12 people arriving at a restaurant per hour
Described by lambda, which is equal to the **average number of events per time interval.**
- Ex: average number of patrons per hour = 20
Lambda is also the **expected value** of the distribution, and is the the value occurring at the distributions peak.
**Hypothesis Testing**
Used to compare populations.
Examples:
- Can a change in price lead to increased revenue?
- Will changing a website address result in increased traffic?
- Is a medication effective in the treatment of a health condition?
The **Null Hypothesis** is the start, where you assume no difference exists.
The **Alternative Hypothesis** can take many forms.
Workflow
1. Define the target populations
- Adult Women taking or not taking vitamin C supplements
2. Develop null and alternative hypothesis
- NH: Births are equally likely to be male or female in both populations.
- AH: More births are female among women taking vitamin C supplements.
3. Collect sample data
4. Perform statistical tests on the sample data
5. Draw conclusions on the populations
**Independent and Dependent Variables**
- Vitamin C is the independent
- Birth gender ratio is the dependent
**Experiments**
A subset of **Hypothesis Testing**. What effects does the treatment have on the response?
Ex: What is the effect of an advertisement on the number of products purchased?
- Treatment: advertisement
- Response: number of products purchases
**Controlled Experiments** participants are assigned to either the treatment group or the control group
- Groups should be comparable to avoid introducing bias.
**Randomization** is the gold standard for experiments where participants are assigned to each group randomly. This is called a **Randomized Controlled Trial**.
**Blinding** is also a gold standard, where participants do not know which group they're in (placebo).
**Double-blind randomized controlled trial** person administering the treatment/running the study doesn't know whether the treatment is real or a placebo.
**Randomized Controlled Trials vs. A/B Testing**
- Randomized controlled trial can have multiple treatment groups.
- Ie administering different amounts of a medication
- A/B testing only split evenly into two groups
- popular in marketing and engineering.
**Correlation**
Scatterplots are a great way to visualize the correlation between two variables.
**Pearson Correlation Coefficient**
- quantifies the strength of relationship between two variables.
- Number between -1 and 1
- Magnitude corresponds to strength
- +/- corresponds to direction
The **Pearson Correlation Coefficient** can only be used for linear relationships. Changes between variables are proportionate.
![[Pasted image 20231101092841.png]]
**Correlation** != **Causation**
**Confounding Variables** is not measured, but may affect the relationship between our variables.
**Interpreting Hypothesis Test Results**
When drawing conclusions in hypothesis testing, we use **p-value**.
**p**: probability of achieving this result, assuming the null hypothesis is true.
![[Screenshot 2023-11-01 at 9.35.00 AM.png]]
The p-value is the total area under the curve where our two sampling distributions overlap.
To reduce the risk of drawing a false conclusion, we set a probability threshold for rejecting the **null hypothesis**.
This is known as the **significance level**. This is decided before collecting data to minimize bias.
A typical threshold is 0.05, which is a 5% chance of wrongly concluding the alternative hypothesis.
if **p** <= **significance level** we reject the null hypothesis, and our results are **statistically significant**.
![[Screenshot 2023-11-01 at 9.38.55 AM.png]]
Sampling with replacement is often used, also known as **bootstrapping**, in hypothesis tests in order to arrive at a more accurate sampling distribution.
#### Subqueries
A query nested inside another query.
```SQL
SELECT column
FROM (SELECT column
FROM table) as subquery;
```
Can be placed in:
1. SELECT
2. FROM
3. WHERE
4. GROUP BY
Subqueries are the first thing processed in a SQL query, and only processed once.
**WHERE CLAUSE**
Used to filter results by something you would need to calculate separately.
Filtering on calculated field
```SQL
SELECT *
FROM table
WHERE value > (SELECT AVG(column) FROM table);
```
Filtering on a list
```SQL
SELECT *
FROM table
WHERE value IN (SELECT column FROM match WHERE column = 123)
```
**FROM Clause**
- Used to restructure and transform data
- transform data from long to wide before selecting.
- pre-filtering data
- Calculating Aggregates of Aggregates
Aggregate of Aggregate
```SQL
SELECT *
FROM (SELECT
column,
AVG(other_columns)
FROM table
GROUP BY column) AS subquery
```
To Consider:
- You can create multiple subqueries in the FROM statement
- Alias them
- Join Them
- You can join a subquery to a table in from
- include joining columns in both tables
```SQL
SELECT
-- Select country name and the count match IDs
c.name AS country_name,
COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT id, country_id
FROM match
-- Filter the subquery by matches with 10+ goals
WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;
```
**SELECT Clause***
Used to return a **single value**.
- Include aggregate values to compare to individual values
- Select value greater than the mean.
```SQL
SELECT
column,
(one + two) -
(SELECT AVG(one + two)
FROM table
WHERE column = '2023') AS diff_from_mean
FROM table
WHERE diff_from_mean > 0 AND column = '2023'
```
If you have filters placed in the subquery, make sure the main query also contains these filters. The subquery size cannot be greater than the main queries size.
**Best Practicies of Subqueries**
- Use whitespace and indents
- Annotate your queries
#### Correlated subqueries, Nested Queries, and Common Table Expressions
**Correlated Subquery**
- Uses values from the outer query to generate a result
- Re-run for every row generated in the final dataset
- Think of a for loop
**Example Comparison b/w Correlated Subquery and Subquery**
Which match stages tend to have a higher than average number of goals scored?
Subquery
```SQL
SELECT
s.stage
ROUND(s.avg_goals,2) AS avg_goals,
(SELECT AVG(home_goal + away_goal) FROM match
WHERE season = '2012/2013') AS overall_avg
FROM
(SELECT
stage,
AVG(home_goal + away_goal) AS avg_goals
FROM match
WHERE season = '2012/2013'
GROUP BY stage) AS s
WHERE
s.avg_goals > (SELECT AVG(home_goal + away_goal)
FROM match
WHERE season = '2012/2013')
```
Correlated Subquery
```SQL
SELECT
s.stage
ROUND(s.avg_goals,2) AS avg_goals,
(SELECT AVG(home_goal + away_goal) FROM match
WHERE season = '2012/2013') AS overall_avg
FROM
(SELECT
stage,
AVG(home_goal + away_goal) AS avg_goals
FROM match
WHERE season = '2012/2013'
GROUP BY stage) AS s
WHERE
s.avg_goals > (SELECT AVG(home_goal + away_goal
FROM match AS m
WHERE s.stage > m.stage)
```
The only difference being the last filter in the Where statement `s.stage > m.stage`. This filters the outer table's avg being greater than the subquery in where.
**Differences**
Simple subquery runs *independently* from the main query while Correlated Subqueries are *dependent*.
Simple subqueries are evaluated once, while correlated subqueries are evaluated once for each row in the dataset.
- Since correlated subqueries are evaluate in loops, **they can significantly slow down query time**.
**Simpler Example**
What is the average number of goals scored in each country?
Join Example
```SQL
SELECT
c.name AS country,
AVG(m.home_goal + m.away_goal)
FROM country as c
LEFT JOIN match as m
ON c.id = m.country_id
GROUP BY
country;
```
Correlated Subquery example
```SQL
SELECT
c.name AS country,
(SELECT
AVG(home_goal + away_goal)
FROM match AS m
WHERE
m.country_id = c.id) AS avg_goals
FROM country AS c
GROUP BY country;
```
Here we can see the correlated subquery references the main query and calculates the average for each country in the country dataset. Here, the correlated subquery can be used instead of the Left Join.
**Identifying outliers example**
Here we return rows where the total # of goals is three times the average total number of goals.
```SQL
SELECT
-- Select country ID, date, home, and away goals from match
main.country_id,
main.date,
main.home_goal,
main.away_goal
FROM match AS main
WHERE
-- Filter the main query by the subquery
(home_goal + away_goal) >
(SELECT AVG((sub.home_goal + sub.away_goal) * 3)
FROM match AS sub
-- Join the main query to the subquery in WHERE
WHERE main.country_id = sub.country_id);
```
**Filtering Example**
Identify the match that had the max number goals in each country and season.
```SQL
SELECT
-- Select country ID, date, home, and away goals from match
main.country_id,
main.date,
main.home_goal,
main.away_goal
FROM match AS main
WHERE
-- Filter for matches with the highest number of goals scored
(home_goal + away_goal) =
(SELECT MAX(sub.home_goal + sub.away_goal)
FROM match AS sub
WHERE main.country_id = sub.country_id
AND main.season = sub.season);
```
**Nested Subqueries**
Subqueries inside another subquery.
Sometimes data needs multiple layers of manipulation, this is where the subquery comes in.
```SQL
SELECT %% Main Query! %%
EXTRACT(MONTH FROM date) AS month
SUM(m.home_goal + m.away_goal) AS total_goals,
SUM(m.home_goal + m.away_goal) -
(SELECT AVG(goals) %% Outer Query! %%
FROM (SELECT %% Inner Query! %%
EXTRACT(MONTH FROM date) AS month,
SUM(home_goal + away_goal) AS goals
FROM match
GROUP BY month) AS s) AS diff
FROM match AS m
GROUP BY monthl
```
This query returns how the average monthly goals differs from the total monthly average goals.
Nested subqueries can be correlated or uncorrelated
- Or a combination!
- Can reference info from the outer query or the main query
In the real world, you will probably find that nesting multiple subqueries is a task you don't have to perform often. In some cases, however, you may find yourself struggling to properly group by the column you want, or to calculate information requiring multiple mathematical transformations (i.e., an `AVG` of a `COUNT`).
**Common Table Expression CTEs**
CTEs are declared before the main query, and are used to help make complex subqueries more readable!
```SQL
WITH cte AS (
SELECT col1, col2,
FROM table
)
SELECT
AVG(col1) AS avg_col
FROM cte;
```
Can be referenced like any other table in the database.
You can declare multiple CTEs by placing commas
```SQL
WITH cte1 AS (
SELECT col1,
FROM table
),
cte2 AS (
SELECT col2,
FROM table
)
```
Benefits of CTEs over subqueries
- Executed once
- CTE stored in memory
- Improved query performance
- Improving organization of queries
- Referencing other CTEs
- Reference itself (`SELF JOIN`) known as a **Recursive CTE**
**Differentiating Joins, Subqueries, and CTEs**
Joins
- Combine 2+ tables
- Simple operations/aggregations
Correlated Subquery
- Match subqueries and tables
- Avoid limits of joins
- High processing time
Multiple/Nested Subqueries
- Multi-step transformations
- Improve accuracy and reproducibility
CTEs
- Organize subqueries
- Can reference other subqueries
Which to use?
- Depends on your database/question
- The technique that best allows you to:
- Use and reuse your queries
- Generate clear and accurate results
**Comparing Results**
Q: How do we get away team and home team names in the final query?
Joins
```SQL
SELECT
m.date,
-- Get the home and away team names
home.hometeam,
away.awayteam,
m.home_goal,
m.away_goal
FROM match AS m
-- Join the home subquery to the match table
LEFT JOIN (
SELECT match.id, team.team_long_name AS hometeam
FROM match
LEFT JOIN team
ON match.hometeam_id = team.team_api_id) AS home
ON home.id = m.id
-- Join the away subquery to the match table
LEFT JOIN (
SELECT match.id, team.team_long_name AS awayteam
FROM match
LEFT JOIN team
-- Get the away team ID in the subquery
ON match.awayteam_id = team.team_api_id) AS away
ON away.id = m.id;
```
Correlated Subquery
```SQL
SELECT
m.date,
(SELECT team_long_name
FROM team AS t
WHERE t.team_api_id = m.hometeam_id) AS hometeam,
-- Connect the team to the match table
(SELECT team_long_name
FROM team AS t
WHERE t.team_api_id = m.awayteam_id) AS awayteam,
-- Select home and away goals
m.home_goal,
m.away_goal
FROM match AS m;
```
CTEs
```SQL
WITH home AS (
SELECT m.id, m.date,
t.team_long_name AS hometeam, m.home_goal
FROM match AS m
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id),
-- Declare and set up the away CTE
away AS (
SELECT m.id, m.date,
t.team_long_name AS awayteam, m.away_goal
FROM match AS m
LEFT JOIN team AS t
ON m.awayteam_id = t.team_api_id)
-- Select date, home_goal, and away_goal
SELECT
home.date,
home.hometeam,
away.awayteam,
home.home_goal,
away.away_goal
-- Join away and home on the id column
FROM home
INNER JOIN away
ON home.id = away.id;
```