Pandas is a popular Python data analysis tool.
It provides easy-to-use and highly efficient data structures.
These data structures deal with numeric or labeled data, stored in the form of tables.
Data Structures in Pandas
The two fundamental data structures used in pandas are:
Series: A 1-D labeled array.
Data Frame: A 2-D table, which can be viewed as two or more Series joined together.
Series is a 1-D array, holding data values of a single variable, captured from multiple observations.
A few examples are:
Height of each student, belonging to a Class 'C'.
Amount of daily rainfall received at Station 'X', in July 2017.
Total sales of a product 'P' in every quarter of 2016.
A Data Frame is two-dimensional and contains data of different parameters, captured from multiple observations.
Each observation is represented by a single row, and each parameter by a single column. Each column can hold a different data type.
A few examples are:
Height and Weight of all students, belonging to a Class 'C'.
Daily Rainfall received and Average Temperature of a location 'X', in the year 2017.
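As a minimal sketch of the two structures described above, the snippet below builds a Series of student heights and a Data Frame holding height and weight. The student names, column names and measurements are made-up illustrative values, not data from these notes.

```python
import pandas as pd

# Hypothetical measurements for three students of class 'C'
names = ["Asha", "Ben", "Chen"]
heights = pd.Series([152.0, 160.5, 148.2], index=names, name="height_cm")

# A DataFrame: one row per student (observation), one column per parameter
students = pd.DataFrame(
    {"height_cm": [152.0, 160.5, 148.2], "weight_kg": [45.1, 52.3, 41.0]},
    index=names,
)

print(heights.shape)   # (3,)
print(students.shape)  # (3, 2)

# Each column of a DataFrame is itself a Series
height_column = students["height_cm"]
print(type(height_column))
```

Selecting one column of the Data Frame gives back a Series, which is why a Data Frame can be thought of as several Series sharing one index.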
https://www.youtube.com/watch?v=CLoNO-XxNXU
Flexibility in Python
Pandas works well with large data sets, and it allows a lot of manipulation of the data.
Data saved in a CSV file can be loaded into a DataFrame, the object type used in pandas.
import pandas as pd
Using pd as the alias for pandas means we do not have to type the full name pandas again and again.
pandas needs to be installed first:
pip install pandas
Data Access refers to extracting data present in defined data structures.
Pandas provides indexers like loc and iloc to get data from a Series or a DataFrame (the older Panel structure was removed in pandas 1.0).
Accessing a Single Value
Individual elements can be accessed by specifying either index number or index value, inside the square brackets.
import pandas as pd
import numpy as np
z = np.arange(10, 16)
s = pd.Series(z, index=list('abcdef'))
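# Accessing the 3rd element of s by position.
s.iloc[2] # ---> Returns 12 (plain s[2] relies on a positional fallback that recent pandas versions deprecate)
# Accessing the 4th element of s by its index value.
s['d'] # ---> Returns 13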
It is also possible to access a single element by passing an index number or index value as an argument to the get method.
s.get(2) # ---> Returns 12 (positional lookup; deprecated for non-integer indexes in recent pandas versions)
s.get('d') # ---> Returns 13
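One practical difference between get and square brackets is how a missing label is handled. The sketch below rebuilds the same Series s and shows that get returns None, or a default of your choice, instead of raising an error. The label 'z' and the default -1 are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, 16), index=list("abcdef"))

present = s.get("d")        # existing label: the stored value
missing = s.get("z")        # label not in the index: None, no exception
fallback = s.get("z", -1)   # label not in the index: the given default

print(present, missing, fallback)
```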
=========================================================================
Accessing a Slice
A Series can be sliced in a way very similar to slicing a Python list.
Expression 1
s[1:4]
Output
b 11
c 12
d 13
dtype: int32
Expression 2
s['b':'e']
Output
b 11
c 12
d 13
e 14
dtype: int32
Elements corresponding to both the start and end index values are included when index values are used for slicing.
Accessing Data from a Data Frame
Pandas provides the .loc and .iloc indexers for selecting rows.
Using square brackets ([]) is also allowed, especially for selecting columns.
More details can be found in the video below.
How to Access Data using DataFrames with Pandas
https://www.youtube.com/watch?v=qYc58lb--Q4
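The selection styles mentioned above can be sketched on a small Data Frame. The row labels d1 to d4 and the temp and rain values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"temp": [28.5, 31.0, 25.4, 30.2], "rain": [110.0, 95.5, 130.2, 80.1]},
    index=["d1", "d2", "d3", "d4"],
)

row_by_label = df.loc["d2"]        # one row, selected by its index label
rows_by_pos = df.iloc[0:2]         # first two rows, selected by position
temp_col = df["temp"]              # one column, selected with []
cell = df.loc["d3", "rain"]        # a single value: row label, column label
hot = df.loc[df["temp"] > 30]      # rows that satisfy a condition

print(rows_by_pos)
print(hot)
```

loc works with labels and iloc with integer positions, so the two give the same rows only when the labels happen to match the positions.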
Knowing a Series
It is possible to understand a Series better by using the describe method.
The method provides details like mean, std, etc. about a series.
Example
import pandas as pd
import numpy as np
temp = pd.Series(28 + 10*np.random.randn(10))
print(temp.describe())
Output
count 10.000000
mean 30.335711
std 8.402697
min 10.874673
25% 27.431943
50% 31.286962
75% 35.148773
max 40.770861
dtype: float64
Knowing a DataFrame
Two methods, mainly info and describe, can be used to learn about the data present in a data frame.
import pandas as pd
import numpy as np
# Populate the data
df = pd.DataFrame({'temp': pd.Series(28 + 10*np.random.randn(10)),
                   'rain': pd.Series(100 + 50*np.random.randn(10)),
                   'location': list('AAAAABBBBB')})
df.info() # info prints the summary itself and returns None, so no print() is needed
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
location 10 non-null object
rain 10 non-null float64
temp 10 non-null float64
dtypes: float64(2), object(1)
memory usage: 320.0+ bytes
The describe method by default provides details of only numeric fields.
Example
print(df.describe())
Output
rain temp
count 10.000000 10.000000
mean 108.860520 28.631922
std 55.584867 4.866241
min 19.512179 21.327725
25% 56.911505 25.658738
50% 128.776209 29.564648
75% 156.972247 31.496084
max 164.159265 36.086240
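Because describe skips non-numeric columns by default, the location column is missing from the output above. A minimal sketch of including it, using the include argument of describe, is given below. The random generator and its seed are assumptions added here so the example is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, an assumption for repeatability
df = pd.DataFrame({
    "temp": 28 + 10 * rng.standard_normal(10),
    "rain": 100 + 50 * rng.standard_normal(10),
    "location": list("AAAAABBBBB"),
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # every column, text included

print(numeric_summary.columns.tolist())
print(full_summary.loc[["count", "unique", "freq"], "location"])
```

For a text column the summary reports count, unique, top and freq, while the numeric statistics such as mean are left empty for it.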
The example program explaining a Data Frame covers the following steps:
Creating 3 columns
Fetching some rows
Output for fetching the rows
Fetching a column
I/O with Pandas
Pandas provides support for reading/writing data from/to some sources.
For example, read_csv is used to read data from a CSV file and to_csv is used to write data to a CSV file.
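A small round trip through CSV can be sketched as follows. An in-memory text buffer is used here in place of a file on disk, which is an assumption made to keep the example self-contained; a file path can be passed to to_csv and read_csv in the same way. The column names and values are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"location": ["A", "B"], "rain": [101.5, 87.25]})

# Write the DataFrame as CSV text, leaving out the row index
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back into a new DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(df_back)
```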
https://pandas.pydata.org/docs/
https://www.youtube.com/c/Zenva/playlists
Each file format in pandas is handled in the same way.
I/O Methods
The following video shows the details of I/O methods used in pandas.
How to fetch data from websites using Pandas:
Reading Data from URL
Read HTML Page using Pandas | read_html() | Web-scrapping Tutorial
We import the requests library, seaborn (as sns), and pandas (as pd).
import requests
import seaborn as sns
import pandas as pd
The output shows the data extracted from the web page.
read_html pulled the entire table.
This method only works on tables; if the page has no tables, it won't work.
Taking a new URL from a different website:
df1 = pd.read_html(url)
df1[0] # read_html returns a list of DataFrames; [0] selects the first table
Pandas is a very powerful tool that lets us fetch data in different ways.
Reading Data from Databases
Pandas also supports reading data from Database tables.
The following video illustrates reading data from a table of a MySQL database.
Four things we learn about MySQL:
1. Reading data from the database
2. How to define a custom index using the data column
3. Reading data in chunks
4. Parametrized query
https://www.youtube.com/watch?v=yab4oWYypPA
The code is written in Visual Studio.
The first step is importing the dependencies and libraries.
Next, we create a connection to the SQL database.
We pass the query with which the data is fetched into Python:
df = pd.read_sql("SELECT * FROM COUNTRY ORDER BY CONTINENT, CODE", conn)
The DataFrame is displayed with:
df
We add continent and code as the columns.
The output displays continent and code.
Now we can read the data in chunks and see the output we receive.
The country name given is India.
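The four steps from the video can be sketched without a MySQL server. The example below is only an approximation under stated assumptions: an in-memory SQLite database stands in for MySQL, and the country table with its code, name and continent columns and its four rows is made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT, name TEXT, continent TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [("IND", "India", "Asia"), ("JPN", "Japan", "Asia"),
     ("FRA", "France", "Europe"), ("BRA", "Brazil", "South America")],
)
conn.commit()

# 1. Read a whole query result into a DataFrame
df = pd.read_sql("SELECT * FROM country ORDER BY continent, code", conn)

# 2. Use columns of the result as a custom index
indexed = pd.read_sql("SELECT * FROM country", conn,
                      index_col=["continent", "code"])

# 3. Read the result in chunks of at most 3 rows each
chunk_sizes = [len(chunk)
               for chunk in pd.read_sql("SELECT * FROM country", conn, chunksize=3)]

# 4. Parametrized query: SQLite uses ? as the placeholder,
#    MySQL drivers typically use %s instead
india = pd.read_sql("SELECT * FROM country WHERE name = ?", conn,
                    params=("India",))

conn.close()
print(df)
print(chunk_sizes)
print(india)
```

Passing the country name through params rather than pasting it into the SQL text lets the database driver handle quoting of the value.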
How to handle large datasets
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
Reading Data from JSON
pandas provides the utilities read_json and to_json to deal with JSON strings or files.
Consider the records in EmployeeRecords below for understanding conversion of a JSON string into a data frame.
Example
EmployeeRecords = [{'EmployeeID':451621,'EmployeeName':'Preeti Jain', 'DOJ':'30-Aug-2008'},
{'EmployeeID':123621, 'EmployeeName':'Ashok Kumar','DOJ':'25-Sep-2016'},
{'EmployeeID':451589, 'EmployeeName':'Johnty Rhodes','DOJ':'04-Nov-2016'}]
Example
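import json
from io import StringIO
emp_records_json_str = json.dumps(EmployeeRecords)
# Wrap the string in StringIO: recent pandas versions deprecate passing a literal JSON string
df = pd.read_json(StringIO(emp_records_json_str), orient='records', convert_dates=['DOJ'])
print(df)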
Output
DOJ EmployeeID EmployeeName
0 2008-08-30 451621 Preeti Jain
1 2016-09-25 123621 Ashok Kumar
2 2016-11-04 451589 Johnty Rhodes
The orient argument defines how data is organised in the JSON string.
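To see what orient changes, the sketch below writes the same small Data Frame with three different orient values. The id and name columns and their values are made up for illustration.

```python
import json

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})

records_json = df.to_json(orient="records")  # list of row dictionaries
columns_json = df.to_json(orient="columns")  # column -> {row label -> value}
index_json = df.to_json(orient="index")      # row label -> {column -> value}

# Parse the strings back with the json module to inspect their shape
print(json.loads(records_json))
print(json.loads(columns_json))
print(json.loads(index_json))
```

When reading, the orient passed to read_json should match the layout of the JSON being read, as with orient='records' in the employee example.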
Indexing
Indexing refers to labeling data elements of a Series or a Data Frame.
These labels can be utilized for selecting a portion of data from any of the defined data structures.
Indexing a Data Frame
A single-level index can be set on a data frame by passing a list of values either to the index attribute or to the index argument of the DataFrame function.
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5,2))
df.index = [ 'row_' + str(i) for i in range(1, 6) ]
df
Output
0 1
row_1 0.919754 0.063280
row_2 0.803853 0.758804
row_3 0.871375 0.428759
row_4 0.128372 0.416698
row_5 0.991222 0.546599
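The other route mentioned above, the index argument of DataFrame, can be sketched as follows, together with selection by the new labels. The column names x and y are assumptions added for readability.

```python
import numpy as np
import pandas as pd

labels = ["row_" + str(i) for i in range(1, 6)]

# Set the index at construction time instead of assigning df.index afterwards
df = pd.DataFrame(np.random.rand(5, 2), index=labels, columns=["x", "y"])

third_row = df.loc["row_3"]          # select one row by its label
middle = df.loc["row_2":"row_4"]     # label slices include both end points

print(third_row)
print(middle)
```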
DateTime Indexes
Pandas supports generating a range of dates, with methods like date_range and bdate_range.
https://www.youtube.com/watch?v=yCgJGsg0Xa4
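A short sketch of both functions is given below. The start date and the rainfall values are made up for illustration; 1 July 2017 was a Saturday, so the business-day range begins on the following Monday.

```python
import pandas as pd

# Five consecutive calendar days starting on 1 July 2017
days = pd.date_range(start="2017-07-01", periods=5, freq="D")

# Five business days (Monday to Friday) starting from the same date
business_days = pd.bdate_range(start="2017-07-01", periods=5)

# A date range is commonly used as the index of a Series
rain = pd.Series([0.0, 12.5, 3.1, 0.0, 7.8], index=days)
second_day = rain.loc[pd.Timestamp("2017-07-02")]

print(days)
print(business_days)
print(second_day)
```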
Hierarchical Indexing
In addition to single-level indexing, pandas supports multilevel or hierarchical indexing.
The video below illustrates creating two levels of index for a Data Frame.
https://www.youtube.com/watch?v=nE21ZlXiByY
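One possible way to build a two-level index, not necessarily the one used in the video, is MultiIndex.from_product. The level names location and year and the rain values below are made up for illustration.

```python
import pandas as pd

# Two index levels: every combination of location and year
index = pd.MultiIndex.from_product(
    [["A", "B"], [2016, 2017]], names=["location", "year"]
)
df = pd.DataFrame({"rain": [100.0, 120.0, 80.0, 95.0]}, index=index)

loc_a = df.loc["A"]                    # all rows of location A, indexed by year
one_value = df.loc[("B", 2017), "rain"]  # a single cell, addressed by both levels
by_year = df.xs(2016, level="year")    # all locations for the year 2016

print(df)
print(loc_a)
print(by_year)
```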