More Pandas - cpedrasa.github.io

For today’s blog assignment, we are supposed to write about something that we did not cover on the course. It took me a while to think about what I could possibly add since the data science course materials are truly comprehensive. So I thought it would be useful to provide more Pandas references on accessing data, saving data and the pandas.cut data manipulation example that we may not have covered in the sections in detail.

The Pandas sections provided numerous examples on accessing data and saving data using the following:

pd.read_csv() – to_csv
pd.read_excel() – to_excel
pd.read_json() – to_json

A reference to the full pandas documentation was provided but here are additional file formats that can be read and saved in pandas.

Data Description	Read	Write/Save
Feather Format	read_feather	to_feather
Msgpack	read_msgpack	to_msg_pack
Stata	read_stata	to_stata
SAS	read_sas
Python Pickle Format	read_pickle	to_pickle
SQL	read_sql	to_sql

The course provided examples to read from SQLite database but for those who would like to connect to the Microsoft SQL database, the following code worked for me.

import pandas as pd 
import pyodbc

server = 'server' 
db = 'database'  
Uid = 'uid'                        #Enter your credentials here
password ='pwd'

#Connection
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + db + ';Uid=;PWD=;Trusted_Connection=yes')

#query db
sql = ''' SELECT TOP 5* FROM Medications;'''
df = pd.read_sql(sql, conn) df.head()

Finally, a pandas data manipulation function that could be useful for changing a continuous variable to a categorical variable is the pandas function called cut. It could be a useful binning tool for our data exploration tasks. Please see the example of how pandas.cut could convert ages to groups of age ranges.

import pandas as pd
A= [8,34,86,90,3,31,44,77,38,16,45,35,32,12,42,34,34,48,51]

Age_Col = pd.DataFrame({'Age':A})
ages = ['0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-64','65-69','70-74','75-79','80-84','85-89','90-95']
Age_Col.head()

Age_Col['age_group'] = pd.cut(Age_Col.Age,range(0,100,5),right=False, labels=ages)
Age_Col[['Age','age_group']].head(10)

Out[ ]:

i	Age	age_group
0	8	5-9
1	34	30-34
2	86	85-89
3	90	90-95
4	3	0-4
5	31	30-34
6	44	40-44
7	77	75-79
8	38	35-39
9	16	15-19