Intermediate SQL - Portfolio & Notes

### Filtering Text `LIKE` Clause used to search for a pattern with a wildcard - `%` wildcard used to match zero,one, or many characters - `_` wildcard used to match a single character ``` SQL SELECT name FROM people WHERE name LIKE `Ade%'; ``` would return matches like Adel Karam, Adelaide Kane, Aden Young ```SQL SELECT name FROM people WHERE name LIKE 'Ev_'; ``` would return matches like Eve or Eva. However Eve Mendes would not be returned unless the clause is `Ev_ Mendes`. `NOT LIKE` clause follows the same pattern, except filtering the opposite. ```SQL SELECT name FROM people WHERE name LIKE`%r'; ``` would find names ending in r. ```SQL SELECT name FROM people WHERE name LIKE '__t%' ``` would select name where the third character is t. `IN` Clause can help clean up multiple our clauses on the same columns ``` SQL SELECT year FROM films WHERE year = 1920 OR year = 1930 OR year = 1940; ``` can become ``` SQL SELECT year FROM films WHERE year IN (1920,1930,1940); ``` `COUNT` Clause will count all rows if using *, but with a column specified it will count all rows with non-null values. **Key's** in Tables uniquely identify records within the table. They are often used to identify distinct records and create relationships in a database structure through joins. `JOIN` General Structure: ```SQL FROM left_table AS alias1 xxx JOIN right_table AS alias2 ON alias1.column = alias2.column ``` When joining on two identical column names: ```SQL FROM left_table AS alias1 xxx JOIN right_table AS alias2 USING(columns_name) ``` Aliases can be used in scenarios where the column names are identical. If they are, without aliasing the names of the columns the code will throw an error. When doing joins, it is often common to write out the join first then write the `SELECT` statement so that you can properly alias identical columns. Relationships in SQL are key to joins - One-to-many relationships (like an artist having many songs) where an object can be associated with many objects - One-to_one relationships (like everything finger having a unique fingerprint) where an object is only associated with another object - Many-to-many relationships (like languages in countries. German, French, and Dutch are all offical languages of Belgium, and German can be spoken in Germany and Belgium) where multiple objects are associated with multiple other objects. #### Manipulating Text For all caps or all lower case columns, if you want to just capitalize the first letter use the following ```SQL SELECT vendor_id, INITCAP(vendor_name) AS vendor_name FROM vendors LIMIT 3; ``` Apply lowercase to an entire column ```SQL SELECT lower(bleh) ``` Apply uppercase to an entire column ```SQL SELECT upper(bleh) ``` #### Replacing Null Values to replace null values use the COALESCE statement ```SQL SELECT name, city, COALESCE(state,'None') AS state FROM vendors; ``` ### CTEs (Common Tables Expressions) Used to more easily write and maintain complex queries via increased readability and simplification. ```SQL WITH overdue AS ( SELECT reference, date('2020-04-01') - due_date AS days_overdue FROM invoices) SELECT count(*) FROM overdue WHERE days_overdue > 60 ``` ### OUTER JOINS `LEFT JOIN` contains all records in the left table, and those that match in the right. Returns null values when no match is found in the right table. Can be written as `LEFT OUTER JOIN`. `RIGHT JOIN` does the same as `LEFT JOIN` except swapping the primary table. Any `RIGHT JOIN` can be wrriten as a `LEFT JOIN` so it much less common. The Syntax for `LEFT/RIGHT JOIN` follow the same structure as `INNER JOIN` except replacing INNER with LEFT/RIGHT. `FULL JOIN` Combines a Left and Right joins. It returns null values for where there wasn't a match in either the Right or Left tables. The Syntax matches both the LEFT/RIGHT/INNER join code but replacing the type by FULL. **Self Joins** are tables joined with themselves. Used to compare parts of the same table (Ie lets say we have a table of prime ministers. We want to join the same table on continent to match all prime minsters with with prime ministers within their region.) ```SQL SELECT p1.country_code, p1.size AS size2010, p2.size AS size2015 FROM populations AS p1 INNER JOIN populations AS p2 ON p1.country_code = p2.country_code WHERE p1.year = 2010 -- Filter such that p1.year is always five years before p2.year AND p1.year = p2.year - 5; ``` This query is a Self Join that is used to compare country populations within the same table in different years. In this example it joins on country code and compares the populations of each country from 2010 and 2015. ### Set Theory for SQL Joins <img title="a title" alt="Alt text" src="Static/SetTheory.png"> Set operations do not require a column to join on. Instead, Set operations require the same number of columns and Sames field type for each columns. The columns are then stacked on top of eachother (why we need the same datatype) in order. The resulting set will take the aliases and field names from the left table. Helpful to write the syntax for each side of the set operations before writing the Set operation itself. 1. Union/Union All - `UNION` Takes two tables as input and returns all records (except duplicates) from both - `UNION ALL` Takes two tables as input and returns all records including duplicates ```SQL SELECT * FROM left_table UNION/UNION ALL SELECT * FROM right_table; ``` 2. Intersect - `INTERSECT` takes two tables as input and only returns records in both tables. ```SQL SELECT * FROM left_table INTERSECT SELECT * FROM right_table ``` Ex: Counties with prime ministers and persidents: ```SQL SELECT country as intersect_country FROM prime_ministers INTERSECT SELECT country from presidents; ``` This table returns only records for countries with a president and prime minister 3. Except ### Subqueries Calling joins will always add extra columns to the new set. **Additive Joins** like `Inner Join` add columns with different names to the result set with their unique names. Fields with the same name get added with the same name. This can cause duplicate name columns (solved with aliasing). **Semi Join** chooses records in the first table where a condition is met in the second. Will only return records in the table being filtered, therefore not including extra columns where both tables align (therefore not **Additive**) ```SQL SELECT president, country, continent FROM presidents WHERE country IN (SELECT country FROM states WHERE indep_year < 1880); ``` This query returns records in the presidents column where the county's independence year is less than 1880. Since the indep_year column is in the states tables, we can write a **subquiery** in the `WHERE` clause to filter on the presidents table. **ANTI JOIN** Opposite of Semi Join. Chooses records in the left table where the column in the left does not match values in the column in the right table. ``` SQL SELECT president, country, continent FROM presidents WHERE continent LIKE `%America` AND country NOT IN (SELECT country FROM states WHERE indep_year < 1880); ``` Countries in the Americas where indep_year is greater than or equal to 1880. Most Common place for **Subqueires** ### Subquiries inside `SELECT` ```SQL SELECT DISTINCT continent, (SELECT COUNT(*) FROM monatchs WHERE states.continent = monarch.continent) AS monarch_count FROM states; ``` The following statement will group by distinct continets in the states table and count how many monarchs from the monarchs table are associated. This Query: ```SQL SELECT countries.name AS country, -- Subquery that provides the count of cities (SELECT COUNT(DISTINCT name) FROM cities WHERE countries.code = cities.country_code) AS cities_num FROM countries ORDER BY cities_num DESC, country LIMIT 9; ``` Is equivelent to this Left Join Group by: ```SQL -- Find top nine countries with the most cities SELECT countries.name AS country, COUNT(DISTINCT cities.name) AS cities_num FROM countries LEFT JOIN cities ON countries.code = cities.country_code GROUP BY country -- Order by count of cities as cities_num ORDER BY cities_num DESC, country ASC LIMIT 9; ``` As they both count the number of cities that exist in the cities table, associated with the countries table. ### Subqueries inside FROM ```SQL SELECT DISTINCT monarchs.continent, sub.most_recent FROM monarchs, (SELECT continet, MAX(indep_year) AS most_recent FROM states GROUP BY continent) AS sub WHERE monarchs.continet = sub.continent ORDER BY continet ``` This query will return continents with monarchs and the year the most recent country gained independence. When linking two tables in the where clause, mae sure to use distinct on the identifyer column. This subquery is defined as a temporary table that can be referenced in the outside query. Subqueries inside `FROM`` can help select columns from multiple tables in a single query. Say you are interested in determining the number of languages spoken for each country. You want to present this information alongside each country's local_name, which is a field only present in the countries table and not in the languages table. You'll use a subquery inside FROM to bring information from these two tables together! ```SQL -- Select local_name and lang_num from appropriate tables SELECT DISTINCT countries.local_name, sub.lang_num FROM countries, (SELECT code, COUNT(*) AS lang_num FROM languages GROUP BY code) AS sub -- Where codes match WHERE countries.code = sub.code ORDER BY lang_num DESC; ``` ### Subquery challenge Suppose you're interested in analyzing inflation and unemployment rate for certain countries in 2015. You are not interested in countries with "Republic" or "Monarchy" as their form of government, but are interested in all other forms of government, such as emirate federations, socialist states, and commonwealths. ```SQL -- Select relevant fields SELECT economies.code, economies.inflation_rate, economies.unemployment_rate FROM economies WHERE year = 2015 AND code NOT IN -- Subquery returning country codes filtered on gov_form (SELECT code FROM countries WHERE gov_form LIKE '%Republic%' OR gov_form LIKE '%Monarchy%') ORDER BY inflation_rate; ``` Using '%word%' as a wildcard expression will return all rows that **Contains** that word anywhere in the column being filtered. Your task is to determine the top 10 capital cities in Europe and the Americas by city_perc, a metric you'll calculate. city_perc is a percentage that calculates the "proper" population in a city as a percentage of the total population in the wider metro area, as follows: ```SQL -- Select fields from cities SELECT name, country_code, city_proper_pop, metroarea_pop, city_proper_pop / metroarea_pop * 100 AS city_perc FROM cities -- Use subquery to filter city name WHERE name IN (SELECT capital FROM countries WHERE continent = 'Europe' OR continent LIKE '%America') AND -- Add filter condition such that metroarea_pop does not have null values metroarea_pop IS NOT NULL ORDER BY city_perc DESC -- Sort and limit the result LIMIT 10; ``` ### Detecting Duplicates You can detect duplicates by using the `GROUP BY` and `HAVING` clauses ```SQL SELECT column1, ... columnn FROM table GROUP BY column1,...,columnn HAVING COUNT(*) > 1; ``` Also, the `ROW_NUMBER()` function determines where groups of column values are duplicates in a dataset. Must use the `OVER` clause which defines the window the function will be applied. #### Enumerating duplciates rows ```SQL SELECT camis, name, boro, inspection_date, violation_code, ROW_NUMBER() OVER( PARTITION BY camis, name, boro, inspection_date, violation_code ) - 1 AS duplicate FROM restraunt_inspection; ``` This query adds a column indicating the # of times a record is duplicated based on the order (ie the first record that is duplicate is assigned 1, second 2, etc), then subtracts one. This allows us to filter by this column = 0 to remove all duplicate entries. #### Resolving impartial dupes Let's say two records have all of the same identifiers, but there's a column called Score (ie test scores). Now lets say the same student has two records, but two different scores. If there is no documentation regarding why there could be dupes like this, then it is common to replace these from an aggregate function (Average,min,max). EX: Detecting Duplicated parking tickets ```SQL SELECT summons_number, -- Use ROW_NUMBER() to define duplicate window ROW_NUMBER() OVER( PARTITION BY plate_id, issue_date, violation_time, house_number, street_name -- Modify ROW_NUMBER() value to define duplicate column ) - 1 AS duplicate, plate_id, issue_date, violation_time, house_number, street_name FROM parking_violation; ``` EX: Selecting Duplicate Parking tickets ```SQL SELECT -- Include all columns * FROM ( SELECT summons_number, ROW_NUMBER() OVER( PARTITION BY plate_id, issue_date, violation_time, house_number, street_name ) - 1 AS duplicate, plate_id, issue_date, violation_time, house_number, street_name FROM parking_violation ) sub WHERE -- Only return records where duplicate is 1 or more duplicate > 0; ``` EX: Return the duplicate records and keep only the lowest fee ```SQL SELECT -- Include SELECT list columns summons_number, MIN(fee) AS fee FROM parking_violation GROUP BY -- Define column for GROUP BY summons_number HAVING -- Restrict to summons numbers with count greater than 1 COUNT(*) > 1; ``` ### Inconsistent data Given a set of business rules, finding inconsistencies that do not match the given situations guidlines. For example: Restraunt inspection - core corresponds to # of violatons - A (0-13), B(14-27), C(28+) - Scenarios for grades: - A on initial inspection - Re-inspection with A, B, or C. You can write simple SQL queries that detect these inconsistencies above using where clauses. Notes: - Diversity of approaches - Careful thought requires - Domain knowledge - which values are valid - Reasons for duplication - Appropriate fill-in values EX: write a query that makes sure tickets were issued during the violation hours. ```SQL SELECT -- Specify return columns summons_number, violation_time, from_hours_in_effect, to_hours_in_effect FROM parking_violation WHERE -- Condition on values outside of the restricted range violation_time NOT BETWEEN from_hours_in_effect AND to_hours_in_effect; ``` ### Case Statement Subqueries In this example, we use a subquery to fill in deleted information from another table. The table we access the zip code we want to populate is nyc_zip_codes, and the table we want to populate is film_permit. The film_permit table had the borough deleted, which is the column we want to populate from the nyc_zip_codes table. ```SQL SELECT event_id, CASE WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Manhattan') THEN 'Manhattan' -- Match Brooklyn zip codes WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Brooklyn') THEN 'Brooklyn' -- Match Bronx zip codes WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Bronx') THEN 'Bronx' -- Match Queens zip codes WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Queens') THEN 'Queens' -- Match Staten Island zip codes WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Staten Island') THEN 'Staten Island' -- Use default for non-matching zip_code ELSE NULL END as borough FROM film_permit; ``` ### Column Types and Contraints in SQL Foreign key: value that exists in the referenced column, or null Primary Key: unique, not null Unique: values must all be different except for null Check Contraints: conditions on the values - column1 > 0 #### Data Types: - Common - Numeric - Character - Date/Time - Boolean - Special - Arrays - Monetary - Binary - Geometric - Network Address - XML - JSON - and more! You can find the specific types in the entity relationship diagram. Values can be converted temporarily using the `CAST` function (called casting) ```SQL SELECT CAST (value AS new_type); ``` or alternatively ```SQL SELECT value::new_type; ``` #### Extract Date From DateTime ```SQL SELECT EXTRACT(MONTH from column) ``` ### Case Statement Contain a When,Then, and Else statement finished with End ```SQL CASE WHEN x = 1 THEN 'a' WHEN x = 2 THEN 'b' ELSE 'Tie' END AS outcome ``` multiple conditions in WHEN statement ```SQL CASE WHEN x = 1 OR x = 2 THEN 'a or b' ``` Filtering on the Case Statement ```SQL CASE WHEN x = 1 THEN 'a' WHEN x = 2 THEN 'b' ELSE 'Tie' END AS outcome FROM BLEH WHERE CASE WHEN x = 1 THEN 'a' WHEN x = 2 THEN 'b' ELSE 'Tie' END IS NOT NULL; ``` Not including the else statement populates NULL, then we filter out the nulls in the WHERE statement. Case statements with aggregate function ```SQL SELECT a, COUNT(CASE WHEN x = 1 or x = 2 then id) AS home_win FROM bleh GROUP BY a ``` **Percentages with CASE and AVG** ```SQL SELECT a, AVG(CASE WHEN x = 1 THEN 1 WHEN x != 2 THEN 0 END) AS pct ``` ### Statistical Extermination Theory The field of stats is the practice and study of collecting and analyzing data under two main branches. 1. **Descriptive / Summary Statistics**: describing or summarizing our data 2. **Inferential Statistics**: collect a sample of data, and apply the results to the population that the sample represents. Stats require **specific measurable question** like is rock music more popular than jazz? Can't be used to find out **why**. **Numeric Data** 1. continuous data (stock prices) 2. interval/count data (number of cups of coffee) **Categorical Data** 1. Nominal Data which is unordered (eye color) 2. Ordinal Data which is ordered (how strongly do you feel about this product) **Measures of Center** - Mean - Median - Mode When data is not symmetrical (contains outliers) it is best to use the median over the mean. **Measures of Spread** Tells how much variety exists in our data. - range = max - min - variance = SUM(Distance - Mean)^2 of all data points divided by # of points - standard deviation = sqrt(variance) - Closer to 0 means more symmetrical - quartiles - split the data into four equal parts - 0%,25%,50%,75%,100% - 50% = median - IQR = 3rd Quartile - 1st Quartile - less affected by outliers than any other measures of spread **Probability and Distributions** What are the chances? P(event) = # way event can happen / total # of possible outcomes **Sampling with Replacement** is when a sample is put back into the available pool once selected. This is an example of independent probability. **Independent Probability** two events are *independent* if the probability of the second event does not change based on the outcome of the first event. **Ex:** - Chances that the next order is greater than the mean? - number of orders greater than the mean divided by the total # of orders. **Probability Distribution** describes the probability of each possible outcome in a scenario **Expected Value** the mean of the probability distribution. - Multiplying each value with its probability. Probability can also be calculated by graphing the probability distribution and finding the area under the curve at the given point. When all outcomes have the same probability, it is called **Discrete Uniform Distribution** **Law of Large Numbers States** that as the size of your sample increases, the sample mean will approach the **expected value**. **Conditional Probability** **Sampling without Replacement** is when a sample is not put back into the available pool once selected. This is an example of dependent probability as the probability of the second event is affected by the outcome of the first event. **Conditional Probability** is used to calculate the probability of dependent events. The probability of one event is **conditional** on the outcome of another. ![[conditional_probability.png]]![[condit_proba_2.png]] **Continuous Distributions** 1. Continuous Uniform Distribution - Straight Line 2. Many Non-Continuous distributions 1. Bimodal Distribution (modeling soft cover vs hard cover book prices) 2. Normal Distribution (modeling age of retirement) Regardless, all probability distribution's must have an area under the curve equal to 1. **Normal Distribution** ![[normal_distribution.png]] Curve never hits 0. The normal distribution is described by its mean and standard deviation. 68% of the area falls within 1 STD of the mean and 95% 2 STD from the mean and 99.7% from 3 STD. **A normal distribution is required for many statistical tests.** We can interpret a distribution by its **skewness** and **kurtosis** ![[skewness.png]] Positive skewed distribution can also be called right skewed, and negative left skewed. **Kurtosis** is a way of describing the occurrence of extreme values in a distribution. 1. Positive/Leptokurtic Kurtosis - Large peak around the mean and small STD 2. Mesokurtic Kurtosis - Used to describe the normal distribution 3. Negative/Platykurtic Kurtosis - Lower peak around the mean and high STD ![[Kurtosis.png]] **Binomial Distribution** - probability distribution of the **number of successes** in a sequence of **independent events** A great example of this is flipping a coin, or any binary outcomes. Described by **n** and **p**: - **n**: total number of events - **p** probability of success ![[binomial_distribution.png]] **Expected Value** of a binomial distribution is equal to n x p The binomial distribution requires independent events product binary outcomes. **Central Limit Theorem** states the sampling distribution of a statistic becomes closer to the normal distribution as the size of the sample increases. The distribution of a summary statistic (mean) is called a **Sampling Distribution**. The CLT only applies when the **samples are random and independent**. Therefore, taking the mean of the **Sampling Distribution** is an estimate of the summary statistic across the entire population. Benefits of the CLT is that if you do not have the resources to collect all data for your population, you can provide and estimate. N>= 30 is the general rule of thumb to be able to apply the CLT. **The Poisson Distribution** The Poisson Process: - The Average # of events in a period is known - but the time or space between events is random - Ex: number of animals adopted from an animal shelter per week. The **Poisson Distribution** describes the probability of some # of events occurring over a fixed period of time. - Ex: probability of 12 people arriving at a restaurant per hour Described by lambda, which is equal to the **average number of events per time interval.** - Ex: average number of patrons per hour = 20 Lambda is also the **expected value** of the distribution, and is the the value occurring at the distributions peak. **Hypothesis Testing** Used to compare populations. Examples: - Can a change in price lead to increased revenue? - Will changing a website address result in increased traffic? - Is a medication effective in the treatment of a health condition? The **Null Hypothesis** is the start, where you assume no difference exists. The **Alternative Hypothesis** can take many forms. Workflow 1. Define the target populations - Adult Women taking or not taking vitamin C supplements 2. Develop null and alternative hypothesis - NH: Births are equally likely to be male or female in both populations. - AH: More births are female among women taking vitamin C supplements. 3. Collect sample data 4. Perform statistical tests on the sample data 5. Draw conclusions on the populations **Independent and Dependent Variables** - Vitamin C is the independent - Birth gender ratio is the dependent **Experiments** A subset of **Hypothesis Testing**. What effects does the treatment have on the response? Ex: What is the effect of an advertisement on the number of products purchased? - Treatment: advertisement - Response: number of products purchases **Controlled Experiments** participants are assigned to either the treatment group or the control group - Groups should be comparable to avoid introducing bias. **Randomization** is the gold standard for experiments where participants are assigned to each group randomly. This is called a **Randomized Controlled Trial**. **Blinding** is also a gold standard, where participants do not know which group they're in (placebo). **Double-blind randomized controlled trial** person administering the treatment/running the study doesn't know whether the treatment is real or a placebo. **Randomized Controlled Trials vs. A/B Testing** - Randomized controlled trial can have multiple treatment groups. - Ie administering different amounts of a medication - A/B testing only split evenly into two groups - popular in marketing and engineering. **Correlation** Scatterplots are a great way to visualize the correlation between two variables. **Pearson Correlation Coefficient** - quantifies the strength of relationship between two variables. - Number between -1 and 1 - Magnitude corresponds to strength - +/- corresponds to direction The **Pearson Correlation Coefficient** can only be used for linear relationships. Changes between variables are proportionate. ![[Pasted image 20231101092841.png]] **Correlation** != **Causation** **Confounding Variables** is not measured, but may affect the relationship between our variables. **Interpreting Hypothesis Test Results** When drawing conclusions in hypothesis testing, we use **p-value**. **p**: probability of achieving this result, assuming the null hypothesis is true. ![[Screenshot 2023-11-01 at 9.35.00 AM.png]] The p-value is the total area under the curve where our two sampling distributions overlap. To reduce the risk of drawing a false conclusion, we set a probability threshold for rejecting the **null hypothesis**. This is known as the **significance level**. This is decided before collecting data to minimize bias. A typical threshold is 0.05, which is a 5% chance of wrongly concluding the alternative hypothesis. if **p** <= **significance level** we reject the null hypothesis, and our results are **statistically significant**. ![[Screenshot 2023-11-01 at 9.38.55 AM.png]] Sampling with replacement is often used, also known as **bootstrapping**, in hypothesis tests in order to arrive at a more accurate sampling distribution. #### Subqueries A query nested inside another query. ```SQL SELECT column FROM (SELECT column FROM table) as subquery; ``` Can be placed in: 1. SELECT 2. FROM 3. WHERE 4. GROUP BY Subqueries are the first thing processed in a SQL query, and only processed once. **WHERE CLAUSE** Used to filter results by something you would need to calculate separately. Filtering on calculated field ```SQL SELECT * FROM table WHERE value > (SELECT AVG(column) FROM table); ``` Filtering on a list ```SQL SELECT * FROM table WHERE value IN (SELECT column FROM match WHERE column = 123) ``` **FROM Clause** - Used to restructure and transform data - transform data from long to wide before selecting. - pre-filtering data - Calculating Aggregates of Aggregates Aggregate of Aggregate ```SQL SELECT * FROM (SELECT column, AVG(other_columns) FROM table GROUP BY column) AS subquery ``` To Consider: - You can create multiple subqueries in the FROM statement - Alias them - Join Them - You can join a subquery to a table in from - include joining columns in both tables ```SQL SELECT -- Select country name and the count match IDs c.name AS country_name, COUNT(sub.id) AS matches FROM country AS c -- Inner join the subquery onto country -- Select the country id and match id columns INNER JOIN (SELECT id, country_id FROM match -- Filter the subquery by matches with 10+ goals WHERE (home_goal + away_goal) >= 10) AS sub ON c.id = sub.country_id GROUP BY country_name; ``` **SELECT Clause*** Used to return a **single value**. - Include aggregate values to compare to individual values - Select value greater than the mean. ```SQL SELECT column, (one + two) - (SELECT AVG(one + two) FROM table WHERE column = '2023') AS diff_from_mean FROM table WHERE diff_from_mean > 0 AND column = '2023' ``` If you have filters placed in the subquery, make sure the main query also contains these filters. The subquery size cannot be greater than the main queries size. **Best Practicies of Subqueries** - Use whitespace and indents - Annotate your queries #### Correlated subqueries, Nested Queries, and Common Table Expressions **Correlated Subquery** - Uses values from the outer query to generate a result - Re-run for every row generated in the final dataset - Think of a for loop **Example Comparison b/w Correlated Subquery and Subquery** Which match stages tend to have a higher than average number of goals scored? Subquery ```SQL SELECT s.stage ROUND(s.avg_goals,2) AS avg_goals, (SELECT AVG(home_goal + away_goal) FROM match WHERE season = '2012/2013') AS overall_avg FROM (SELECT stage, AVG(home_goal + away_goal) AS avg_goals FROM match WHERE season = '2012/2013' GROUP BY stage) AS s WHERE s.avg_goals > (SELECT AVG(home_goal + away_goal) FROM match WHERE season = '2012/2013') ``` Correlated Subquery ```SQL SELECT s.stage ROUND(s.avg_goals,2) AS avg_goals, (SELECT AVG(home_goal + away_goal) FROM match WHERE season = '2012/2013') AS overall_avg FROM (SELECT stage, AVG(home_goal + away_goal) AS avg_goals FROM match WHERE season = '2012/2013' GROUP BY stage) AS s WHERE s.avg_goals > (SELECT AVG(home_goal + away_goal FROM match AS m WHERE s.stage > m.stage) ``` The only difference being the last filter in the Where statement `s.stage > m.stage`. This filters the outer table's avg being greater than the subquery in where. **Differences** Simple subquery runs *independently* from the main query while Correlated Subqueries are *dependent*. Simple subqueries are evaluated once, while correlated subqueries are evaluated once for each row in the dataset. - Since correlated subqueries are evaluate in loops, **they can significantly slow down query time**. **Simpler Example** What is the average number of goals scored in each country? Join Example ```SQL SELECT c.name AS country, AVG(m.home_goal + m.away_goal) FROM country as c LEFT JOIN match as m ON c.id = m.country_id GROUP BY country; ``` Correlated Subquery example ```SQL SELECT c.name AS country, (SELECT AVG(home_goal + away_goal) FROM match AS m WHERE m.country_id = c.id) AS avg_goals FROM country AS c GROUP BY country; ``` Here we can see the correlated subquery references the main query and calculates the average for each country in the country dataset. Here, the correlated subquery can be used instead of the Left Join. **Identifying outliers example** Here we return rows where the total # of goals is three times the average total number of goals. ```SQL SELECT -- Select country ID, date, home, and away goals from match main.country_id, main.date, main.home_goal, main.away_goal FROM match AS main WHERE -- Filter the main query by the subquery (home_goal + away_goal) > (SELECT AVG((sub.home_goal + sub.away_goal) * 3) FROM match AS sub -- Join the main query to the subquery in WHERE WHERE main.country_id = sub.country_id); ``` **Filtering Example** Identify the match that had the max number goals in each country and season. ```SQL SELECT -- Select country ID, date, home, and away goals from match main.country_id, main.date, main.home_goal, main.away_goal FROM match AS main WHERE -- Filter for matches with the highest number of goals scored (home_goal + away_goal) = (SELECT MAX(sub.home_goal + sub.away_goal) FROM match AS sub WHERE main.country_id = sub.country_id AND main.season = sub.season); ``` **Nested Subqueries** Subqueries inside another subquery. Sometimes data needs multiple layers of manipulation, this is where the subquery comes in. ```SQL SELECT %% Main Query! %% EXTRACT(MONTH FROM date) AS month SUM(m.home_goal + m.away_goal) AS total_goals, SUM(m.home_goal + m.away_goal) - (SELECT AVG(goals) %% Outer Query! %% FROM (SELECT %% Inner Query! %% EXTRACT(MONTH FROM date) AS month, SUM(home_goal + away_goal) AS goals FROM match GROUP BY month) AS s) AS diff FROM match AS m GROUP BY monthl ``` This query returns how the average monthly goals differs from the total monthly average goals. Nested subqueries can be correlated or uncorrelated - Or a combination! - Can reference info from the outer query or the main query In the real world, you will probably find that nesting multiple subqueries is a task you don't have to perform often. In some cases, however, you may find yourself struggling to properly group by the column you want, or to calculate information requiring multiple mathematical transformations (i.e., an `AVG` of a `COUNT`). **Common Table Expression CTEs** CTEs are declared before the main query, and are used to help make complex subqueries more readable! ```SQL WITH cte AS ( SELECT col1, col2, FROM table ) SELECT AVG(col1) AS avg_col FROM cte; ``` Can be referenced like any other table in the database. You can declare multiple CTEs by placing commas ```SQL WITH cte1 AS ( SELECT col1, FROM table ), cte2 AS ( SELECT col2, FROM table ) ``` Benefits of CTEs over subqueries - Executed once - CTE stored in memory - Improved query performance - Improving organization of queries - Referencing other CTEs - Reference itself (`SELF JOIN`) known as a **Recursive CTE** **Differentiating Joins, Subqueries, and CTEs** Joins - Combine 2+ tables - Simple operations/aggregations Correlated Subquery - Match subqueries and tables - Avoid limits of joins - High processing time Multiple/Nested Subqueries - Multi-step transformations - Improve accuracy and reproducibility CTEs - Organize subqueries - Can reference other subqueries Which to use? - Depends on your database/question - The technique that best allows you to: - Use and reuse your queries - Generate clear and accurate results **Comparing Results** Q: How do we get away team and home team names in the final query? Joins ```SQL SELECT m.date, -- Get the home and away team names home.hometeam, away.awayteam, m.home_goal, m.away_goal FROM match AS m -- Join the home subquery to the match table LEFT JOIN ( SELECT match.id, team.team_long_name AS hometeam FROM match LEFT JOIN team ON match.hometeam_id = team.team_api_id) AS home ON home.id = m.id -- Join the away subquery to the match table LEFT JOIN ( SELECT match.id, team.team_long_name AS awayteam FROM match LEFT JOIN team -- Get the away team ID in the subquery ON match.awayteam_id = team.team_api_id) AS away ON away.id = m.id; ``` Correlated Subquery ```SQL SELECT m.date, (SELECT team_long_name FROM team AS t WHERE t.team_api_id = m.hometeam_id) AS hometeam, -- Connect the team to the match table (SELECT team_long_name FROM team AS t WHERE t.team_api_id = m.awayteam_id) AS awayteam, -- Select home and away goals m.home_goal, m.away_goal FROM match AS m; ``` CTEs ```SQL WITH home AS ( SELECT m.id, m.date, t.team_long_name AS hometeam, m.home_goal FROM match AS m LEFT JOIN team AS t ON m.hometeam_id = t.team_api_id), -- Declare and set up the away CTE away AS ( SELECT m.id, m.date, t.team_long_name AS awayteam, m.away_goal FROM match AS m LEFT JOIN team AS t ON m.awayteam_id = t.team_api_id) -- Select date, home_goal, and away_goal SELECT home.date, home.hometeam, away.awayteam, home.home_goal, away.away_goal -- Join away and home on the id column FROM home INNER JOIN away ON home.id = away.id; ```