Learning Pandas - cpedrasa.github.io

Eight weeks have passed since I started the self-paced data science program and the sections I have found challenging so far are the ones about pandas. I think I spent more than two weeks on those sections trying to learn the syntax for data analysis and manipulation, until I came across the Data Science Career Student Resource on Pandas Recipe for Success with links to tutorial videos. I am a visual learner and I found it really helpful to build up the muscle memory for python pandas by coding along and spending time practicing DataFrames manipulations and performing operations. Part of the reason I think that I spent more time learning pandas is because I currently have an issue at work where I still need to deal with flat file interfaces, and I wanted to explore time-saving tools to help with data analysis across flat files, table outputs from SQL, and cache databases. After learning pandas DataFrames, here are a few tips on how pandas has helped me to be more organized and to speed up my work.

I could quickly read files in various formats (i.e. csv, tab delimited, json, pipe-delimited files) without opening every single flat file and filtering columns to find specific information.

import pandas as pd 
df = pd.read_table(‘file.csv')
df = pd.read_table(r'file.txt', sep="|")
df[df['Insurance1'].str.startswith('MCARE') & df[‘ProcedureDescription’].str.startswith('Coronary artery bypass graft’)]

import json
with open('NSUH.json') as data_file:    
data = json.load(data_file)

I didn’t have to import multiple flat files into SQL in order to link the files/tables to perform data analysis and manipulations.

pd.concat([df1, df3], join=“inner")

Creating filters, pivots, and charts can be completed easily. There is no longer a need to create multiple versions of excel files to show the different filters, pivot tables, and charts that I have created for data exploration.

import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame()

df.plot.<TAB>
df.plot.area     df.plot.barh     df.plot.hist     df.plot.line     
df.plot.bar      df.plot.box       df.plot.pie.    df.plot.scatter

Pandas bridges the limitation I have on not having access to linked database servers by providing copies of the data frames from different source systems for analysis and manipulation.
```
from pandasql import sqldf
 Pysqldf = lambda q: sqldf(q, globals())
 q = """select Name from df order by PassengerId desc limit 10”""
```

import pandas as pd 
import pyodbc

    #Parameters
server = 'server'
db = 'database'
Uid = 'uid'
password ='pwd'

    #Connection
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + db + ';Uid=<UID>;PWD=<password>;Trusted_Connection=yes')

    #query db
sql = ''' SELECT TOP 5* 
                    FROM Medications;'''
    
df = pd.read_sql(sql, conn)
df.head()

It’s been a great learning experience so far and I look forward to learning more data science tools to help improve and speed up my work.