Paul

How I use Python pandas - a quick introduction

This is a quick introduction to how I use the Python pandas library: the code I write to load a CSV file into a pandas dataframe, the code to filter the data once it is in the dataframe, and finally the code to write the dataframe out to a CSV file. In case you are not familiar with pandas, it is an open-source data manipulation and analysis library that provides data structures and functions for working efficiently with structured data, such as tabular data from spreadsheets or databases. I think of a pandas dataframe as a spreadsheet in memory, with rows and columns. I use pandas for tasks such as data cleaning, transformation, exploration and analysis.


To use pandas you will first need to install it with pip.

pip install pandas

In your Python code, import pandas.

import pandas as pd

The following code loads a CSV file into a pandas dataframe, using the first row of the file as the column names.

file_path = "netflix_titles.csv"
df = pd.read_csv(file_path)

If you would like to see summary statistics for the numeric columns in your dataframe, use the describe() method. It outputs information such as the count, mean, standard deviation, minimum and maximum.

print(df.describe())

The above will output information as follows for each numeric column in your dataframe.

       release_year
count   8807.000000
mean    2014.180198
std        8.819312
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2019.000000
max     2021.000000

You can print the name of each column in your dataframe using a loop.

print("Column Headings:")

for column_name in df.columns:
    print(column_name)

The names of your columns will be output as follows.

Column Headings:

show_id

type

title
...

You can filter the data and assign the results to a new dataframe. Below, the dataframe df1 will contain only the rows where the column type has the value "Movie". To refer to an individual column in a dataframe you use square brackets, e.g. df["type"]. You can then use comparison operators such as "==" and "!=". Note that additional square brackets, df[...], surround the condition.

df1 = df[df["type"] == "Movie"]

If you would like to print a subset of the columns in a dataframe you can list the ones you want output, e.g. "type", "title". The column names are contained within two pairs of square brackets, df1[[...]].

print(df1[["type", "title"]])

More complicated filtering can be applied to dataframes using "and" logic. The dataframe df1 below will contain the rows where the column type has the value "Movie" and the column release_year is greater than 2014. The symbol & is used to indicate the "and" logic. Also note that the individual type and release_year conditions are each contained within parentheses "(...)". The parentheses are required because & binds more tightly than comparison operators such as == and >. Additional square brackets, df[...], surround the combined conditions.

df1 = df[
    (df["type"] == "Movie")
    & (df["release_year"] > 2014)
]

The following code prints a subset of the columns.

print(
    df1[
        [
            "type",
            "title",
            "release_year",
        ]
    ]
)

The filtering can also include "or" logic. The dataframe df1 below will contain the rows where either the type is "Movie" and the release_year is greater than 2014, or the rating is "TV-14" or "TV-PG". Note that the combined "and" (&) logic for type and release_year is surrounded by parentheses "(...)". The vertical bar | is used to indicate "or" logic, and the "or" condition is also contained within parentheses "(...)". The method isin(...) is used to filter on the values ["TV-14", "TV-PG"] in the rating column. Finally, additional square brackets, df[...], surround the combined conditions.

df1 = df[
    (
        (df["type"] == "Movie")
        & (df["release_year"] > 2014)
    )
    | (
        df["rating"].isin(
            ["TV-14", "TV-PG"]
        )
    )
]

If you would like to return a specific number of rows from a dataframe you can use the head(...) method, passing it the number of rows you require. The dataframe df1 below will be assigned just the first 2 rows of data from df.


df1 = df.head(2)
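
If you want to try the filtering and head(...) steps without the Netflix file, here is a minimal sketch on a tiny made-up dataframe. The dataframe toy, its four rows, and the variable names are my own assumptions for illustration and are not taken from netflix_titles.csv. Naming each condition first can make the combined filter easier to read.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from netflix_titles.csv.
toy = pd.DataFrame(
    {
        "type": ["Movie", "Movie", "TV Show", "TV Show"],
        "title": ["A", "B", "C", "D"],
        "release_year": [2010, 2018, 2020, 2012],
        "rating": ["PG", "R", "TV-14", "TV-MA"],
    }
)

# Each condition is a boolean Series with one True/False value per row.
is_movie = toy["type"] == "Movie"
is_recent = toy["release_year"] > 2014
is_tv_rating = toy["rating"].isin(["TV-14", "TV-PG"])

# "and" logic: both conditions must hold for a row to be kept.
recent_movies = toy[is_movie & is_recent]

# "or" logic: recent movies, or any row rated TV-14 or TV-PG.
recent_or_tv = toy[(is_movie & is_recent) | is_tv_rating]

# head(2) keeps the first two rows of the dataframe.
first_two = toy.head(2)

print(recent_movies["title"].tolist())
print(recent_or_tv["title"].tolist())
print(first_two["title"].tolist())
```

Because the conditions are stored in variables, no extra parentheses are needed around each comparison when they are combined with & and |.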

To output a dataframe to a CSV file you can use the method to_csv(...). The argument "netflix_data.csv" is the name of the file to create. The argument index=False tells pandas not to write the dataframe's index (the row labels) as an extra column, so the file contains only the dataframe's column data.


df1.to_csv(
    "netflix_data.csv", index=False
)
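
To see what index=False changes, here is a small sketch that compares the two outputs. It relies on the fact that to_csv(...) returns the CSV text as a string when no file name is given. The dataframe toy and its two rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"type": ["Movie", "TV Show"], "title": ["A", "B"]})

# With no file name, to_csv(...) returns the CSV text instead of writing a file.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index)     # first column holds the row labels 0, 1
print(without_index)  # only the type and title columns

# Reading the index-free text back gives the same columns and values.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```

With the default setting the header row starts with an empty column name for the index, which is usually not wanted when the file is going to be opened in a spreadsheet.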



