### Matplotlib
#### Line & Scatter Plots
#matplotlib is a visualization package in Python.

```python
# Importing into Python
import matplotlib.pyplot as plt
```

```python
# Basic example with the default line plot of two variables
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()
```
![[Screenshot 2024-06-28 at 2.00.34 PM.png]]

```python
# Basic example with a scatter visualization
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
# Here we call plt.scatter instead of plt.plot
plt.scatter(year, pop)
plt.show()
```
![[Screenshot 2024-06-28 at 2.02.41 PM.png]]

**Scaling Variables** in #matplotlib is easy! Let's say we want to plot an axis on a logarithmic scale:
```python
import matplotlib.pyplot as plt
# Some plot (x and y are sample data for illustration)
x = [1, 10, 100, 1000]
y = [2, 20, 200, 2000]
plt.scatter(x, y)
# x-axis
plt.xscale('log')
# y-axis
plt.yscale('log')
plt.show()
```

Common values for scaling in the **pyplot** module include:
1. **linear**: the default, where data is spaced evenly in equal increments.
2. **log**: scales logarithmically. Useful when data spans many orders of magnitude.
3. **symlog**: symmetric logarithmic scale that is linear in a small range around zero and logarithmic beyond it, so both positive and negative values can be shown.
4. **logit**: scales the axis according to the logit transformation. Useful when dealing with probabilities (values strictly between 0 and 1) or logistic transformations.

#### Histogram
**Histograms** split continuous data into bins specified by the user. If the number of bins equals N, the dataset size, and the values are distinct and evenly spaced, each bin holds exactly one point, so the graph becomes a bar chart in which every bar has height 1. A **histogram** counts the number of data points in each bin and displays this count as a bar.
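The counting step behind a histogram can be checked without drawing anything. The sketch below is not part of the original notes: it assumes `np.histogram`, which returns the per-bin counts and the bin edges, and the variable names are chosen only for illustration.

```python
import numpy as np

# Illustrative data: the integers 1 through 100
values = np.arange(1, 101)

# Four equal-width bins: np.histogram returns the counts and the bin edges
counts, edges = np.histogram(values, bins=4)
print(counts)  # [25 25 25 25]
print(edges)   # five edges, running from 1 to 100

# As many bins as data points: every bin holds exactly one value
counts_n, _ = np.histogram(values, bins=len(values))
print(counts_n.min(), counts_n.max())  # 1 1
```

`plt.hist` draws these same counts as bars, so the numbers are a quick sanity check on a plot.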
```python
# A silly example
import matplotlib.pyplot as plt
values = range(1, 101)
plt.hist(values, 4)
# We expect to see 4 bars of equal height, 25 each
plt.show()
```
![[Screenshot 2024-06-28 at 2.30.27 PM.png]]

```python
# A random example
import matplotlib.pyplot as plt
import numpy as np
values = np.random.randint(0, 101, size=100)
plt.hist(values, 4)
# We expect N=100 random integers drawn uniformly from [0, 101), i.e. 0 through 100
plt.show()
```
![[Screenshot 2024-06-28 at 2.36.23 PM.png]]

To draw several graphs one after another, call `plt.clf()` to clear the current figure between them.

#### Customizing Plots
Some helpful steps for any graph:
1. Label the x and y axes
2. Add a chart title
3. Custom axis ticks for readability & clarity
4. Custom axis tick labels

Using our random example above:
```python
import matplotlib.pyplot as plt
import numpy as np
values = np.random.randint(0, 101, size=100)
plt.hist(values, 4)
plt.xlabel('Bucket') # 1
plt.ylabel('Count')
plt.title('Histogram') # 2
ticks = np.arange(0, 41, 10)
plt.yticks(ticks, [f'{t} Million' for t in ticks]) # 3, 4
plt.show()
```
![[Screenshot 2024-06-28 at 2.52.41 PM.png]]

**Adding Color!**
The `c` argument in `plt.scatter` accepts a single color or a sequence with one color per point. A dictionary cannot be passed directly, but it is a handy way to map each category to a color and build that sequence.

**Adding Grid Lines!**
Adding `plt.grid(True)` will add grid lines to the plot.

**Adding Call Outs!**
Using `plt.text(x, y, 'some text')` will create a call out on the graph at the data coordinates (x, y) with the value **some text**.

### Dictionaries & Pandas
#### Dictionaries
A #dictionary is a collection type that is organized in **key** / **value** pairs.

```python
# List Example
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]
# Accessing list values
numbers[letters.index('a')] # returns 1
```

```python
# Dictionary Example
# Creating a dictionary
x_dict = {'a': 1, 'b': 2, 'c': 3}
# Accessing dictionary values
x_dict['a'] # returns 1
```

The `keys()` method will return all **keys** of a dictionary.
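A minimal sketch of `keys()` on the dictionary from the example above. The companion methods `values()` and `items()` are standard dictionary methods added here for comparison; the variable names are illustrative.

```python
x_dict = {'a': 1, 'b': 2, 'c': 3}

# keys() returns a view of the keys; list() turns it into a plain list
keys = list(x_dict.keys())
# values() and items() are the matching views for values and key/value pairs
values = list(x_dict.values())
pairs = list(x_dict.items())

print(keys)    # ['a', 'b', 'c']
print(values)  # [1, 2, 3]
print(pairs)   # [('a', 1), ('b', 2), ('c', 3)]
```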
A #dictionary cannot contain duplicate keys. If a key appears more than once, the value given for its last occurrence overrides the earlier ones.
```python
my_dict = {'a': 1, 'b': 2, 'c': 3, 'a': 5}
print(my_dict) # prints {'a': 5, 'b': 2, 'c': 3}
```

**Keys** also have to be #immutable, meaning they cannot be changed. A #list is a #mutable object since it can change, so a list cannot be used as a key.

**Adding & Updating elements in a dictionary**:
```python
# adding 'new'
my_dict['new'] = 6
# changing 'new'
my_dict['new'] = -1
```

**Checking if a key already exists in a dictionary**:
```python
'new' in my_dict # returns True
```

**Deleting elements in a dictionary**:
```python
del my_dict['new']
'new' in my_dict # returns False now
```
![[Screenshot 2024-06-28 at 4.09.23 PM.png]]

#### Pandas
**Pandas** is built on **NumPy**. Where **NumPy** 2D arrays require a single data type, **Pandas** offers **DataFrames** that can store a different data type in each of their columns.

**DataFrame** from a Dictionary
```python
import pandas as pd
my_dict = {
    'column1': [1, 2, 3],
    'column2': [4, 5, 6],
    'column3': [7, 8, 9]
}
df = pd.DataFrame(my_dict)
```

A **DataFrame's** default index runs from 0 through N-1, where N is the number of rows. If you would like to specify your own values for the index:
```python
df.index = ['index1', 'index2', 'index3']
```
- Note, the size of the list passed above must match the number of rows in the dataset.

**DataFrame** from a `.csv` file
```python
# Using the same example above (pretend we saved it as a .csv)
import pandas as pd
df = pd.read_csv('mypath/df.csv', index_col=0) # the first column holds the index we defined above
```

Just like lists, **NumPy** arrays, and dictionaries, **DataFrames** can also be indexed using square brackets.

**Accessing a column in a DataFrame using square brackets**
```python
df['column1'] # returns a Series object
df[['column1', 'column2']] # returns a DataFrame
```
Single brackets return a Pandas **Series**, which is like a labeled 1D array; double brackets return a **DataFrame**.
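One way to see the difference between the two bracket forms is to inspect the type and shape of each result. This is a small sketch with a two-column DataFrame like the one above; the names `single` and `double` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [4, 5, 6]})

single = df['column1']    # single brackets: a Series (1D)
double = df[['column1']]  # double brackets: a one-column DataFrame (2D)

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```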
**Accessing rows in a DataFrame using square brackets**
```python
# Can only be done using slicing
df[1:4] # returns the 2nd through 4th rows (positions 1, 2, 3); the end of the slice is exclusive
```

Two more methods for accessing rows are:
1. **.loc** (label-based)
2. **.iloc** (integer position-based)

**.loc**
Given an NxM **DataFrame**
```python
# returns the row at index = 'index1' as a Series of length M (here 3)
df.loc['index1']
# returns the row at index = 'index1' as a 1xM DataFrame
df.loc[['index1']]
# returns our original DataFrame
df.loc[['index1', 'index2', 'index3']]
# returns our original DataFrame with only the first and second columns
df.loc[['index1', 'index2', 'index3'], ['column1', 'column2']]
# returns all rows in the DataFrame, but without column3
df.loc[:, ['column1', 'column2']]
```

**.iloc**
Given an NxM **DataFrame**
```python
# returns the row at position 0 as a Series of length M (here 3)
df.iloc[0]
# returns the row at position 0 as a 1xM DataFrame
df.iloc[[0]]
# returns our original DataFrame
df.iloc[[0, 1, 2]]
# returns our original DataFrame with only the first and second columns
df.iloc[[0, 1, 2], [0, 1]]
# returns all rows in the DataFrame, but without column3
df.iloc[:, [0, 1]]
```

### Logic, Control Flow and Filtering
#### Comparison Operators
```python
1 < 3 # True
2 > 3 # False
1 <= 3 # True
1 >= 1 # True
1 == 2 # False
1 != 2 # True
```

**Strings**, **integers**, and many other Python data types can use **comparison operators.** Ordering a string against a number raises a `TypeError`; **floats** and **integers**, however, can be compared with each other.
```python
# String
'abc' < 'acb' # True
# String and Numeric
'Hello' > 1 # TypeError
# Float and Integer
1.2 < 3 # True
```

#### Boolean Operators
An **and** statement is only true when the expressions on the left and right sides are both **True**.
```python
True and True # True
True and False # False
```

An **or** statement is true when the expression on the left, the right, or both is **True**.
```python
True or True # True
True or False # True
False or False # False
```

A **not** statement negates, or reverses, the boolean value of an expression.
```python
not False # True
not True # False
```

With #numpy-array
```python
import numpy as np
bmi = np.array([21.85, 20.97, 21.75, 24.75, 21.44]) # sample values for illustration
# To get the True/False arrays
np.logical_and(bmi > 21, bmi < 22)
np.logical_or(bmi > 21, bmi < 22)
np.logical_not(bmi > 21)
# To get the values in the array matching the conditions
bmi[np.logical_and(bmi > 21, bmi < 22)]
bmi[np.logical_or(bmi > 21, bmi < 22)]
bmi[np.logical_not(bmi > 21)]
```

#### if, elif, else
Building on the **comparison** and **boolean** operators above, we can use **if, elif, and else** statements to run code only when a condition holds.

**if**
```python
# General form:
# if condition:
#     expression

# Ex
z = 4
if z % 2 == 0:
    print('z is even') # This prints since 4 is divisible by 2.
```

**else**
```python
# General form:
# if condition:
#     expression
# else:
#     expression

# Ex
z = 5
if z % 2 == 0:
    print('z is even')
else:
    print('z is odd') # This prints since 5 is not divisible by 2.
```

**elif**
```python
# General form:
# if condition:
#     expression
# elif condition:
#     expression
# else:
#     expression

# Ex
z = 5
if z % 2 == 0:
    print('z is divisible by 2')
elif z % 5 == 0:
    print('z is divisible by 5') # This prints since 5 is divisible by 5.
else:
    print('z is neither divisible by 2 nor by 5')
```

#### Filtering pandas DataFrames
Combining DataFrame indexing with `[]`, `.loc[]`, or `.iloc[]` and the **comparison** and **boolean** operators above, we can filter DataFrames!
```python
# brics is a DataFrame with a numeric 'area' column,
# assumed here to sit at column position 1
# Using []
brics[brics['area'] > 8]
# Using .loc
brics.loc[brics.loc[:, 'area'] > 8, :]
# Using .iloc (needs a plain boolean array, not a Series)
brics.iloc[(brics.iloc[:, 1] > 8).to_numpy(), :]
```

Using **Boolean Operators**
```python
import numpy as np
# Using []
brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]
# Using .loc
brics.loc[np.logical_and(brics.loc[:, 'area'] > 8, brics.loc[:, 'area'] < 10)]
# Using .iloc (convert to a NumPy array first, since .iloc rejects a boolean Series)
area = brics.iloc[:, 1].to_numpy()
brics.iloc[np.logical_and(area > 8, area < 10)]
```

****
### Loops
#### while loop
Similar to the **if** statement, but instead of executing only once, it executes repeatedly while the condition is met.
```python
# General form:
# while condition:
#     expression

# Ex
error = 50.0
while error > 1: # Runs 3 times: 12.5, 3.125, 0.78125, then the condition fails
    error = error / 4
    print(error)
```

#### for loop
When working with collections such as lists, sets, or dictionaries and you need to perform an operation on each element, **for loops** are amazing!
```python
# General form:
# for var in seq:
#     expression

# Ex
fam = [11, 12, 36, 37]
for age in fam: # This loop will print out all 4 elements in fam
    print(age)

# Ex with index
for index, age in enumerate(fam):
    print(f'index {index}: {age}')
```

#### Looping Data Structures
**Dictionaries**
```python
# General form:
# for key, value in my_dict.items():
#     expression

# Ex
animals = {
    'dog': 'bark',
    'cow': 'moo'
}
for key, value in animals.items():
    print(f'{key} -- {value}')
```

**NumPy Array**
```python
import numpy as np
# Ex
heights = np.array([1, 2, 3, 4])
weights = np.array([1, 2, 3, 4])
measures = np.array([heights, weights]) # 2D array with 2 rows
for val in measures: # Prints twice, once per row
    print(val)

for val in np.nditer(measures): # Visits every element, printing 8 times
    print(val)
```

**pandas DataFrame**
```python
for lab, row in brics.iterrows():
    print(lab) # Prints the index label of the row
    print(row) # Prints the row details (a Series)
    print(row['capital']) # Prints the row's capital column value
    # Add column
    brics.loc[lab, 'name_length'] = len(row['country'])
```

The adding column example above is very inefficient, because it writes to the DataFrame one row at a time.
A better approach for calculating an entire DataFrame column from an existing one is **apply**, which calls a function on every element of the column.
```python
brics['name_length'] = brics['country'].apply(len)
```

****
### Case Study: Hacker Statistics
Hacker Statistics: run a simulation many times to generate a distribution of outcomes. From there we can calculate probabilities.
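A minimal sketch of the idea, using a scenario chosen purely for illustration and not taken from the case study itself: estimate the probability that two dice sum to 10 or more by simulating many rolls, then compare with the exact value 6/36. The seed and the number of simulations are arbitrary assumptions.

```python
import random

random.seed(42)  # fixed seed so reruns give the same estimate

n_simulations = 20_000
hits = 0
for _ in range(n_simulations):
    # One simulated outcome: the sum of two fair six-sided dice
    total = random.randint(1, 6) + random.randint(1, 6)
    if total >= 10:
        hits += 1

# Probability estimate = share of simulations in which the event occurred
estimate = hits / n_simulations
exact = 6 / 36  # six of the 36 equally likely outcomes sum to 10 or more
print(f'simulated: {estimate:.3f}, exact: {exact:.3f}')
```

The estimate gets closer to the exact value as `n_simulations` grows, which is why the approach relies on running many simulations.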