### Pandas

<ul>
<li>Pandas is a Python library used to analyse, clean, explore, and manipulate data.
<li>Pandas lets us analyse large datasets and draw conclusions using statistical methods. It also helps tidy up messy datasets, making them more readable and relevant for analysis.
<li>The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
<li>Pandas is usually imported under the pd alias:
    <br><b>import pandas as pd</b>	
</ul>


### DataFrames

In [3]:
# A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns. 

# Creating a simple Pandas DataFrame:
import pandas as pd # Pandas is usually imported under the pd alias
data = {
    "calories": [420, 380, 390],
    "duration":[50, 40, 45]
}

# load data into a DataFrame:
df = pd.DataFrame(data)
print("Duration of activity or exercise and calories burnt:\n\n",df)

# The shape property returns a tuple containing the number of rows and columns of the DataFrame:
df.shape


Duration of activity or exercise and calories burnt:

    calories  duration
0       420        50
1       380        40
2       390        45


(3, 2)
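
As a small extension of the cell above, the sketch below builds the same data with custom row labels and unpacks `shape`. The labels `day1` to `day3` and the name `df_labeled` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Same values as the example above; the row labels are hypothetical.
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# Passing index= replaces the default 0, 1, 2 row labels.
df_labeled = pd.DataFrame(data, index=["day1", "day2", "day3"])

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df_labeled.shape
print(n_rows, n_cols)

# columns and dtypes describe the column labels and their data types.
print(list(df_labeled.columns))
print(df_labeled.dtypes)
```

Changing the index does not change `shape`: it still reports three rows and two columns.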

### Data Selection: loc and iloc

In [5]:
# Pandas uses the loc attribute to return one or more specified row(s).
# loc is label-based indexing: it uses row and column labels to index into the DataFrame.
print("1st row (row 0):")
print(df.loc[0]) # row 0, 1st row 

1st row (row 0):
calories    420
duration     50
Name: 0, dtype: int64


In [7]:
print("1st row and 2nd row:")
print(df.loc[[0,1]]) # rows 0 and 1

1st row and 2nd row:
   calories  duration
0       420        50
1       380        40


In [20]:
print("Column duration:")
print(df['duration']) 

Column duration:
0    50
1    40
2    45
Name: duration, dtype: int64


In [9]:
print("Columns calories and duration:")
print(df[['calories','duration']]) 

Columns calories and duration:
   calories  duration
0       420        50
1       380        40
2       390        45


In [11]:
# printing some rows and columns.
print("1st row and 2nd row of columns calories and duration")
print(df.loc[[0,1], ['calories', 'duration']]) # rows 0 and 1, of the columns calories and duration

# iloc is integer-based indexing: it selects data by row and column positions.
print("\n1st row and 2nd row of the first and second columns:")
print(df.iloc[[0,1], [0,1]]) # rows 0 and 1, of the first and second column

1st row and 2nd row of columns calories and duration
   calories  duration
0       420        50
1       380        40

1st row and 2nd row of the first and second columns:
   calories  duration
0       420        50
1       380        40
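
With the default 0, 1, 2 index, `loc` and `iloc` appear to behave identically, which can hide the difference between them. The sketch below uses hypothetical string row labels (`mon`, `tue`, `wed`) so that labels and positions no longer coincide. The name `df_demo` is also an assumption for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["mon", "tue", "wed"],  # hypothetical row labels
)

# loc looks up by label, iloc by integer position; both reach the same cell here.
by_label = df_demo.loc["tue", "duration"]
by_position = df_demo.iloc[1, 1]
print(by_label, by_position)

# Slices differ: a loc slice includes the end label, an iloc slice excludes the stop position.
label_slice = df_demo.loc["mon":"tue"]   # rows "mon" and "tue"
position_slice = df_demo.iloc[0:1]       # row at position 0 only
print(len(label_slice), len(position_slice))
```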


### Reading Data: read_csv(), read_json()

In [13]:
# Loading data

# Load a CSV file into a DataFrame:
df1 = pd.read_csv('data/exercise_metrics.csv')

# Load the JSON file into a DataFrame:
df2 = pd.read_json('data/exercise_metrics.json')
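
The cell above depends on files under `data/`. To try the two readers without those files, the sketch below feeds them small in-memory strings through `io.StringIO`. The two sample rows are copied from the CSV output shown in the next section. The names `csv_text`, `json_text`, `df_csv` and `df_json` are hypothetical, and the JSON layout (column name mapped to index/value pairs) is an assumption about how such a file might be structured.

```python
import io
import pandas as pd

# A tiny CSV document held in memory instead of on disk.
csv_text = (
    "Duration,Pulse,Maxpulse,Calories\n"
    "60,110,130,409.1\n"
    "60,117,145,479.0\n"
)
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# A tiny JSON document in column-oriented form: {column: {row label: value}}.
json_text = '{"Duration": {"0": 60, "1": 60}, "Pulse": {"0": 110, "1": 117}}'
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```

Both functions accept a file path or a file-like object, so the same calls work unchanged once real files are available.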

### Data Selection and Viewing: head() and tail()

In [15]:
'''
Viewing:
The head() method returns the headers and a specified number of rows, starting from the top.
The tail() method returns the headers and a specified number of rows, starting from the bottom.
'''
# Get a quick overview by printing the first 10 rows of the CSV data (df1)
print(df1.head(10)) # default 5
print("\nRow 3:") 
print(df1.iloc[3])

# Print the last 5 rows of the JSON data (df2)
print("\n") # leave 1 blank line
print(df2.tail())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0

Row 3:
Duration     45.0
Pulse       109.0
Maxpulse    175.0
Calories    282.4
Name: 3, dtype: float64


     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
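
To see how the row-count argument of `head()` and `tail()` behaves without the exercise files, here is a minimal sketch on a hypothetical 12-row frame. The name `df_small` and its `value` column are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with 12 rows, labelled 0..11.
df_small = pd.DataFrame({"value": range(12)})

first_default = df_small.head()       # no argument: first 5 rows
first_three = df_small.head(3)        # first 3 rows
last_two = df_small.tail(2)           # last 2 rows
all_but_last_two = df_small.head(-2)  # negative n: every row except the last 2

print(len(first_default), len(first_three), len(last_two), len(all_but_last_two))
```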


### Data Selection: df[start:stop:step]

In [17]:
'''
df[start:stop:step]
start: The index of the first row to include.
stop: The index of the first row to exclude.
step: The increment between selected row positions (e.g., a step of 2 selects every second row).
'''
df1[1:10:2]  # try: df1[1:10:1]

   Duration  Pulse  Maxpulse  Calories
1        60    117       145     479.0
3        45    109       175     282.4
5        60    102       127     300.0
7        45    104       134     253.3
9        60     98       124     269.0
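
Row slicing with `df[start:stop:step]` follows the same rules as slicing a Python list. The sketch below, on a hypothetical 10-row frame named `df_rows`, shows that it matches the explicit positional form `iloc[start:stop:step]`.

```python
import pandas as pd

# Hypothetical frame with 10 rows, labelled 0..9.
df_rows = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]})

every_other = df_rows[1:10:2]          # positions 1, 3, 5, 7, 9
same_via_iloc = df_rows.iloc[1:10:2]   # explicit positional equivalent
reversed_rows = df_rows[::-1]          # negative step walks the rows backwards

print(every_other.index.tolist())
print(reversed_rows.index.tolist())
```

Writing the slice with `iloc` makes it explicit that rows are chosen by position, which tends to be easier to read when the index is not the default 0, 1, 2, ...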


### Correlations

In [8]:
# Correlations: Is there any linear relationship between the columns?
df1.corr()

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922717
Pulse    -0.155408  1.000000  0.786535  0.025121
Maxpulse  0.009403  0.786535  1.000000  0.203813
Calories  0.922717  0.025121  0.203813  1.000000
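
`corr()` computes the Pearson correlation coefficient by default. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none. The sketch below uses hypothetical columns (`x`, `y_up`, `y_down`) built to be perfectly linear, and checks one entry against a hand-written Pearson formula.

```python
import pandas as pd

# Hypothetical data: y_up rises with x, y_down falls with x.
df_corr = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y_up": [2, 4, 6, 8, 10],
    "y_down": [10, 8, 6, 4, 2],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = df_corr.corr()
print(corr_matrix)

# Correlation of a single pair of columns.
pair = df_corr["x"].corr(df_corr["y_up"])

# The same coefficient written out: covariance over the product of spreads.
x, y = df_corr["x"], df_corr["y_down"]
dx, dy = x - x.mean(), y - y.mean()
manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5
print(pair, manual)
```

In the exercise data above, the 0.922717 between Duration and Calories is the only strong relationship involving Calories.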
