### Matplotlib
#### Line & Scatter Plots
#matplotlib is a visualization package for Python.
```python
# Importing into python
import matplotlib.pyplot as plt
```
```python
# Basic Example With Default Line Plot of Two Variables
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()
```
![[Screenshot 2024-06-28 at 2.00.34 PM.png]]
```python
# Basic Example With Scatter Visualization
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
# Here we define plt.scatter instead of plt.plot
plt.scatter(year, pop)
plt.show()
```
![[Screenshot 2024-06-28 at 2.02.41 PM.png]]
**Scaling Variables** in #matplotlib is easy! Let's say we want to plot an axis on a logarithmic scale:
```python
# Some Plot
plt.scatter(x, y)
# x-axis
plt.xscale('log')
# y-axis
plt.yscale('log')
```
Common values for scaling in the **pyplot** module include:
1. **linear**: default, where data is spaced evenly in equal increments.
2. **log**: scales logarithmically. Useful when data spans many orders of magnitude.
3. **symlog**: symmetrical logarithmic scale that can show both positive and negative data, staying linear in a small range around zero.
4. **logit**: scales the axis according to the logit transformation. Useful when dealing with probabilities (values between 0 and 1) or logistic transformations.
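A minimal sketch of **symlog** in action, assuming made-up data that spans negative and positive values. The `Agg` backend is an assumption so the snippet runs without a display; drop that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data on both sides of zero, spanning several orders of magnitude
x = np.array([-1000, -100, -10, 0, 10, 100, 1000])
y = x ** 3

plt.plot(x, y)
plt.xscale('symlog')  # log-like for large |x|, linear close to zero
plt.yscale('symlog')

# Read the scales back from the current axes to confirm they were applied
x_scale = plt.gca().get_xscale()
y_scale = plt.gca().get_yscale()
print(x_scale, y_scale)
plt.close()
```

A plain `'log'` scale would not work here, since zero and negative values have no logarithm.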
#### Histogram
**Histograms** split continuous data into bins; the number of bins is specified by the user. If the number of bins equals N, the dataset size, and the values are evenly spread, the graph becomes a bar chart where every bar has a height of 1.
A **histogram** counts the number of data points in each bin and displays this count as a bar.
```python
# A Silly example
import matplotlib.pyplot as plt
values = range(1, 101)
plt.hist(values, 4)
# We expect to see 4 bars of equal height, 25 each
plt.show()
```
![[Screenshot 2024-06-28 at 2.30.27 PM.png]]
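The bin counts behind the plot above can be checked without drawing anything. This is a small sketch using `np.histogram`, which returns the counts and the bin edges for the same input.

```python
import numpy as np

values = np.arange(1, 101)  # 1 through 100, same data as the example above

# counts: how many values fall in each bin; edges: the 5 boundaries of the 4 bins
counts, edges = np.histogram(values, bins=4)

print(counts)  # [25 25 25 25]
print(edges)   # boundaries running from 1 to 100
```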
```python
# A Random example
import matplotlib.pyplot as plt
import numpy as np
values = np.random.randint(0, 101, size=100)
plt.hist(values, 4)
# We expect a random distribution of values from [0,101) where N=100
plt.show()
```
![[Screenshot 2024-06-28 at 2.36.23 PM.png]]
To start a fresh graph instead of drawing on top of the previous one (i.e. resetting), clear the current figure with `plt.clf()`.
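A quick sketch of what `plt.clf()` does, reusing the year and population lists from earlier. The `Agg` backend is an assumption so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt

year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]

plt.plot(year, pop)
lines_before = len(plt.gca().lines)  # the line plot is on the axes

plt.clf()  # clear the current figure so the next plot starts empty

plt.scatter(year, pop)
lines_after = len(plt.gca().lines)            # the old line is gone
points_after = len(plt.gca().collections)     # scatter points are stored as a collection
print(lines_before, lines_after, points_after)
plt.close()
```

Without the `plt.clf()` call, the scatter points would be drawn on top of the line.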
#### Customizing Plots
Some helpful steps for any graph:
1. Label x and y axes
2. Add a chart title
3. Custom axis ticks for readability & clarity
4. Custom axis tick displays
Using our random example above:
```python
import matplotlib.pyplot as plt
import numpy as np
values = np.random.randint(0, 101, size=100)
plt.hist(values, 4)
plt.xlabel('Bucket') # 1
plt.ylabel('Count')
plt.title('Histogram') # 2
plt.yticks(np.arange(0, 41, 10),
           [s + ' Million' for s in np.arange(0, 41, 10).astype(str)]) # 3, 4
plt.show()
```
![[Screenshot 2024-06-28 at 2.52.41 PM.png]]
**Adding Color!** The `c` argument in `plt.scatter` takes a single color or a list of colors (one per point). A dictionary that maps each category to a color is a handy way to build that list.
**Adding Grid Lines!** Adding `plt.grid(True)` will add grid lines to the plot.
**Adding Call Outs!** Using `plt.text(x, y, 'some text')` will create a call out on the graph at (x, y) with the value **some text**.
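A short sketch combining the three customizations. The `era` grouping and the `color_map` dictionary are hypothetical, made up only to show the dictionary-to-color-list pattern; the `Agg` backend is an assumption so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt

year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]

# Hypothetical category per point, and a dictionary mapping category -> color
era = ['past', 'past', 'recent', 'recent']
color_map = {'past': 'blue', 'recent': 'red'}
colors = [color_map[e] for e in era]  # one color per point

plt.scatter(year, pop, c=colors)
plt.grid(True)                  # grid lines
plt.text(1990, 5.263, '1990')   # call out at the (1990, 5.263) point

n_texts = len(plt.gca().texts)  # number of call outs on the axes
print(colors, n_texts)
plt.close()  # in an interactive session, call plt.show() before closing
```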
### Dictionaries & Pandas
#### Dictionaries
A #dictionary is a collection type that is organized in **key** / **value** pairs.
```python
# List Example
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]
# Accessing list values
numbers[letters.index('a')] # returns 1
```
```python
# Dictionary Example
# Creating a dictionary
x_dict = {'a': 1, 'b': 2, 'c': 3}
# Accessing dictionary values
x_dict['a'] # returns 1
```
The `keys()` method will return all **keys** of a dictionary.
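A small sketch of `keys()` next to its siblings `values()` and `items()`, using the same `x_dict` as above. Wrapping each in `list()` is only there to make the contents easy to print.

```python
x_dict = {'a': 1, 'b': 2, 'c': 3}

keys = list(x_dict.keys())      # all keys
values = list(x_dict.values())  # all values
items = list(x_dict.items())    # (key, value) pairs

print(keys)    # ['a', 'b', 'c']
print(values)  # [1, 2, 3]
print(items)   # [('a', 1), ('b', 2), ('c', 3)]
```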
A #dictionary cannot contain duplicate keys. If a key is repeated, the last value given for that key overrides the earlier ones.
```python
my_dict = {'a':1, 'b':2, 'c':3, 'a': 5}
print(my_dict) # prints {'a': 5, 'b': 2, 'c': 3}
```
**Keys** also have to be #immutable, meaning they cannot be changed. A #list is a #mutable object since it can change, so it cannot be used as a key.
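A minimal sketch of the key rule: a tuple (immutable) works as a key, while a list (mutable) raises a `TypeError`. The `point_names` dictionary is a made-up example.

```python
# A tuple is immutable, so it can be a key
point_names = {(0, 0): 'origin'}
print(point_names[(0, 0)])  # origin

# A list is mutable, so using it as a key fails
try:
    bad_dict = {[0, 0]: 'origin'}
except TypeError as err:
    message = str(err)  # the message mentions an unhashable type
    print(message)
```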
**Adding & Updating elements to a dictionary**:
```python
# adding 'new'
my_dict['new'] = 6
# changing 'new'
my_dict['new'] = -1
```
**Checking if a key already exists in a dictionary**:
```python
'new' in my_dict # returns True
```
**Deleting elements in a dictionary**:
```python
del my_dict['new']
'new' in my_dict # returns False now
```
![[Screenshot 2024-06-28 at 4.09.23 PM.png]]
#### Pandas
**Pandas** is built on **NumPy**. Where a **NumPy** 2D array requires a single data type, **Pandas** offers **DataFrames**, which can store a different data type in each column.
**DataFrame** from a Dictionary
```python
import pandas as pd
my_dict = {
'column1': [1,2,3],
'column2': [4,5,6],
'column3': [7,8,9]
}
df = pd.DataFrame(my_dict)
```
A **DataFrame's** default index is 0 through N-1, where N is the number of rows. If you would like to specify your own index labels:
```python
df.index = ['index1', 'index2', 'index3']
```
- Note, the size of the list passed above must match the size of the dataset.
**DataFrame** from a `.csv` file
```python
# Using the same example above (pretend we saved as a .csv)
import pandas as pd
df = pd.read_csv('mypath/df.csv', index_col = 0) # since we defined an index above
```
Just like lists, **NumPy** arrays, and dictionaries, **DataFrames** can also be indexed using square brackets.
**Accessing a column in a DataFrame using square brackets**
```python
df['column1'] # returns a series object
df[['column1','column2']] # returns a DataFrame
```
Single brackets return a Pandas **Series**, which is like a labeled 1D array; double brackets return a **DataFrame**.
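A quick check of the two return types, rebuilding the same three-column DataFrame from the dictionary example.

```python
import pandas as pd

df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': [4, 5, 6],
    'column3': [7, 8, 9],
})

single = df['column1']    # single brackets
double = df[['column1']]  # double brackets

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```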
**Accessing rows in a DataFrame using square brackets**
```python
# With [], rows can only be selected by slicing
df[1:4] # returns the rows at positions 1 through 3 (the end of the slice is excluded)
```
Two more methods for accessing rows are using:
1. **.loc** (label-based)
2. **.iloc** (integer position-based)
**.loc** Given an NxM **DataFrame**
```python
# returns a Series of length M (3) holding the row at index = 'index1'
df.loc['index1']
# returns a 1xM (1x3) DataFrame holding the row at index = 'index1'
df.loc[['index1']]
# returns our original DataFrame
df.loc[['index1', 'index2', 'index3']]
# returns our original DataFrame with only the first column and second column
df.loc[['index1', 'index2', 'index3'], ['column1', 'column2']]
# returns all rows in the dataframe, but removing column3
df.loc[:, ['column1', 'column2']]
```
**.iloc** Given an NxM **DataFrame**
```python
# returns a Series of length M (3) holding the row at position 0
df.iloc[0]
# returns a 1xM (1x3) DataFrame holding the row at position 0
df.iloc[[0]]
# returns our original DataFrame
df.iloc[[0, 1, 2]]
# returns our original DataFrame with only the first column and second column
df.iloc[[0, 1, 2], [0, 1]]
# returns all rows in the dataframe, but removing column3
df.iloc[:, [0, 1]]
```
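A small sketch comparing `.loc` and `.iloc` on the same 3x3 DataFrame. One difference worth noting, which the examples above do not show: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'column1': [1, 2, 3], 'column2': [4, 5, 6], 'column3': [7, 8, 9]},
    index=['index1', 'index2', 'index3'],
)

row_series = df.loc['index1']   # Series of length 3
row_frame = df.loc[['index1']]  # DataFrame of shape (1, 3)

# The same row, reached by label and by position
same_row = df.loc['index1'].equals(df.iloc[0])

label_slice = df.loc['index1':'index2']  # end label included -> 2 rows
position_slice = df.iloc[0:1]            # end position excluded -> 1 row

print(row_series.shape, row_frame.shape, same_row)
print(label_slice.shape, position_slice.shape)
```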
### Logic, Control Flow and Filtering
#### Comparison Operators
```python
1 < 3 # True
2 > 3 # False
1 <= 3 # True
1 >= 1 # True
1 == 2 # False
1 != 2 # True
```
**Strings**, **integers**, and many other Python data types can use **comparison operators**. Ordering comparisons between a string and a number do not work, but comparing **floats** with **integers** works properly.
```python
# String
'abc' < 'acb' # True
# String and Numeric
'Hello' > 1 # TypeError
# Float and Integer
1.2 < 3 # True
```
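A minimal sketch of the mixed-type behavior. Ordering a string against a number raises a `TypeError`, while `==` between them simply evaluates to `False`.

```python
# Ordering comparison between str and int is not supported
try:
    'Hello' > 1
except TypeError as err:
    error_name = type(err).__name__
    print(error_name)  # TypeError

# Equality never raises here; different types are just not equal
is_equal = ('Hello' == 1)
print(is_equal)  # False

# Floats and integers compare by numeric value
mixed_numeric = 1.2 < 3
print(mixed_numeric)  # True
```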
#### Boolean Operators
An **and** expression is only true when the expressions on both the left and right sides are **True**.
```python
True and True # True
True and False # False
```
An **or** expression is true when the expression on the left, the right, or both is **True**.
```python
True or True # True
True or False # True
False or False # False
```
**Not** negates, or reverses, the boolean value of an expression.
```python
not False # True
not True # False
```
With #numpy-array
```python
import numpy as np
# bmi is assumed to be a NumPy array of BMI values defined earlier
# To get the True/False arrays
np.logical_and(bmi > 21, bmi < 22)
np.logical_or(bmi > 21, bmi < 22)
np.logical_not(bmi > 21)
# To get the values in the array matching the conditions
bmi[np.logical_and(bmi > 21, bmi < 22)]
bmi[np.logical_or(bmi > 21, bmi < 22)]
bmi[np.logical_not(bmi > 21)]
```
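A runnable sketch of the array version. The `bmi` values are hypothetical, made up for illustration. The `&` operator is shown as an equivalent element-wise alternative; it needs parentheses around each comparison.

```python
import numpy as np

# Hypothetical BMI values, for illustration only
bmi = np.array([21.85, 20.97, 21.75, 24.75, 21.44])

mask = np.logical_and(bmi > 21, bmi < 22)  # True/False per element
selected = bmi[mask]                       # values where the mask is True

# Element-wise & gives the same mask
mask_amp = (bmi > 21) & (bmi < 22)

print(mask)      # [ True False  True False  True]
print(selected)  # [21.85 21.75 21.44]
```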
#### if, elif, else
Building on the **comparison** and **boolean** operators above, we can use **if, elif, and else** statements to run code only when a condition holds.
**if**
```python
if condition:
    expression
# Ex
z = 4
if z % 2 == 0:
    print('z is even') # This prints since 4 is divisible by 2.
```
**else**
```python
if condition:
    expression
else:
    expression
# Ex
z = 5
if z % 2 == 0:
    print('z is even')
else:
    print('z is odd') # This prints since 5 is not divisible by 2.
```
**elif**
```python
if condition:
    expression
elif condition:
    expression
else:
    expression
# Ex
z = 5
if z % 2 == 0:
    print('z is divisible by 2')
elif z % 5 == 0:
    print('z is divisible by 5') # This prints since 5 is divisible by 5.
else:
    print('z is neither divisible by 2 nor by 5')
```
#### Filtering pandas DataFrames
Combining DataFrame indexing with `[]`, `.iloc[]`, or `.loc[]` from above with **comparison** and **boolean** operators, we can filter DataFrames!
```python
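# Using []
brics[brics['area'] > 8]
# Using .loc
brics.loc[brics.loc[:, 'area'] > 8, :]
# Using .iloc (assumes 'area' is the column at position 1)
# .iloc needs a plain boolean array rather than a Series, hence .to_numpy()
brics.iloc[(brics.iloc[:, 1] > 8).to_numpy(), :]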
```
Using **Boolean Operators**
```python
# Using []
brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]
# Using .loc
brics.loc[np.logical_and(brics.loc[:, 'area'] > 8, brics.loc[:, 'area'] < 10)]
# Using .iloc (needs a plain boolean array rather than a Series, hence .to_numpy())
brics.iloc[np.logical_and(brics.iloc[:, 1] > 8, brics.iloc[:, 1] < 10).to_numpy()]
```
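A self-contained sketch of the filters above. The `brics` DataFrame here is a hypothetical stand-in with made-up labels and values, built so that `area` sits at column position 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for brics; labels and numbers are illustrative only
brics = pd.DataFrame(
    {'country': ['A', 'B', 'C', 'D'],
     'area': [8.5, 17.1, 3.3, 9.6]},
    index=['AA', 'BB', 'CC', 'DD'],
)

# Single condition
big = brics[brics['area'] > 8]

# Two conditions, three equivalent spellings
mid = brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]
mid_amp = brics[(brics['area'] > 8) & (brics['area'] < 10)]
mask = ((brics.iloc[:, 1] > 8) & (brics.iloc[:, 1] < 10)).to_numpy()
mid_iloc = brics.iloc[mask]  # .iloc takes the boolean array, not the Series

print(list(big.index))  # ['AA', 'BB', 'DD']
print(list(mid.index))  # ['AA', 'DD']
```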
****
### Loops
#### while loop
Similar to the **if** statement, but instead of executing only once, it executes repeatedly for as long as the condition is met.
```python
while condition:
    expression
# Ex
error = 50.0
while error > 1: # Runs 3 times; after that error is 0.78125, so the condition fails
    error = error / 4
    print(error)
```
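The same loop with an iteration counter added, as a small sketch to confirm how many times the body runs. The `iterations` variable is an addition for illustration.

```python
error = 50.0
iterations = 0  # counter added for illustration

while error > 1:
    error = error / 4  # 12.5, then 3.125, then 0.78125
    iterations += 1

print(iterations)  # 3
print(error)       # 0.78125
```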
#### for loop
When working with lists, sets, or dictionaries and you need to perform an operation on each element, **for loops** are amazing!
```python
for var in seq:
    expression
# Ex
fam = [11, 12, 36, 37]
for age in fam: # This loop will print out all 4 elements in fam
    print(age)
# Ex with index
for index, age in enumerate(fam):
    print(f'index {index}: {age}')
```
#### Looping Data Structures
**Dictionaries**
```python
for var in seq:
    expression
# Ex
animals = {
    'dog': 'bark',
    'cow': 'moo'
}
for key, value in animals.items():
    print(f'{key} -- {value}')
```
**NumPy Array**
```python
for var in seq:
    expression
# Ex
import numpy as np
heights = np.array([1, 2, 3, 4])
weights = np.array([1, 2, 3, 4])
measures = np.array([heights, weights]) # 2D array with 2 rows
for val in measures: # Prints twice, once per row
    print(val)
for val in np.nditer(measures): # Visits every element, printing 8 times
    print(val)
```
**pandas DataFrame**
```python
for lab, row in brics.iterrows():
    print(lab) # Prints the index label of the row
    print(row) # Prints the row details
    print(row['capital']) # Prints the row's capital column value
    # Add column
    brics.loc[lab, 'name_length'] = len(row['country'])
```
The column-adding example above is very inefficient. A better approach for computing an entire DataFrame column by applying a function element-wise to another column is **apply**.
```python
brics['name_length'] = brics['country'].apply(len)
```
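A side-by-side sketch showing that the loop and `apply` produce the same column. The three-row `brics` here is a cut-down stand-in with only a `country` column, an assumption made to keep the example self-contained.

```python
import pandas as pd

# Cut-down stand-in for brics with only the column needed here
brics = pd.DataFrame({'country': ['Brazil', 'Russia', 'India']},
                     index=['BR', 'RU', 'IN'])

# Row-by-row version
looped = brics.copy()
for lab, row in looped.iterrows():
    looped.loc[lab, 'name_length'] = len(row['country'])

# apply version: one call over the whole column
brics['name_length'] = brics['country'].apply(len)

print(brics['name_length'].tolist())  # [6, 6, 5]
```

The looped column may end up stored as floats, but the values match.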
****
### Case Study: Hacker Statistics
Hacker Statistics: run many simulations to generate a distribution of outcomes. Then from there we can calculate probabilities.
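A minimal sketch of the idea, using a made-up question rather than the case study itself: what is the chance that two dice sum to 10 or more? The simulation count and seed are arbitrary choices.

```python
import random

random.seed(123)  # fixed seed so the run is repeatable

n_sims = 10_000
hits = 0
for _ in range(n_sims):
    # One simulated outcome: roll two six-sided dice
    total = random.randint(1, 6) + random.randint(1, 6)
    if total >= 10:
        hits += 1

estimate = hits / n_sims  # share of simulations that met the condition
exact = 6 / 36            # 6 of the 36 possible rolls sum to 10 or more

print(estimate, exact)
```

With more simulations the estimate should drift closer to the exact value.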