Comprehensive guide on importing tables from PostgreSQL as Pandas DataFrames
Sample data in PostgreSQL
Suppose we have the following users table in our PostgreSQL database called test_db:
 id | name  | age
----+-------+------
  1 | alex  |   20
  2 | cathy |   30
  3 | bob   | NULL
First, create the test_db database like so:
CREATE DATABASE test_db;
If you are using psql on the command line, use \c to connect to the test_db database:
\c test_db
Once connected, we can create the users table like so:
DROP TABLE IF EXISTS users;
CREATE TABLE users (
  id INT GENERATED ALWAYS AS IDENTITY,
  name TEXT NOT NULL,
  age INT,
  PRIMARY KEY(id)
);
INSERT INTO users (name, age) VALUES
  ('alex', 20),
  ('cathy', 30),
  ('bob', NULL);
Connecting to the database using sqlalchemy
To access our PostgreSQL server from Python, we must first establish a connection to it. We can do so using the sqlalchemy library:
from sqlalchemy import create_engine
The create_engine(~) method returns an Engine object that manages connections to the server. Note that no connection is actually opened until the engine is first used.
Connecting as root user
Now, to connect to the test_db database on the PostgreSQL server hosted on port 5432 on your local machine (localhost) as the user called postgres:
engine = create_engine('postgresql://postgres@localhost:5432/test_db')
PostgreSQL servers listen on port 5432 by default.
Troubleshooting for Mac users
To connect to PostgreSQL, sqlalchemy requires a database driver such as psycopg2. Use conda or pip to install this library:
conda install psycopg2

pip install psycopg2
When running the above code, you may encounter the following issue:
OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (::1),port 5432 failed: FATAL: role "postgres" does not exist
This can happen if you are using a Mac and used Homebrew to install your Postgres server. The fix is to create a user called postgres by invoking the createuser program that should already be installed.
For non-M1 Mac users, run the following in the terminal:
/usr/local/opt/postgres/bin/createuser -s postgres
For M1 Mac users, run the following in the terminal:
/opt/homebrew/bin/createuser -s postgres
Here, the -s stands for superuser; you are creating a new user called postgres with superuser privileges.
Connecting with password
Suppose you have a role named alex with login password '12345'. You can connect to the database test_db hosted on your local machine like so:
engine = create_engine('postgresql://alex:12345@localhost:5432/test_db')
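If the password contains special characters (e.g. @, / or :), embedding it directly in the URL string will break parsing. As a sketch (the credentials below are placeholders), sqlalchemy's URL.create helper builds the URL from individual components and escapes them for you:

```python
from sqlalchemy.engine import URL

# Build the URL from components; special characters in the
# (hypothetical) password are percent-encoded automatically.
url = URL.create(
    drivername="postgresql",
    username="alex",
    password="p@ss/word",  # hypothetical password with special characters
    host="localhost",
    port=5432,
    database="test_db",
)
print(url.render_as_string(hide_password=False))
```

The resulting URL object can be passed directly to create_engine(url) in place of the connection string.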
Connecting to remote database
To connect to a remote database (e.g. PostgreSQL managed by Heroku), you need the following database credentials:
- Host: (e.g. test_host)
- Port: 5432 (by convention)
- User: (e.g. alex)
- Password: (e.g. 12345)
- Database: (e.g. test_db)
This information should be provided by your PostgreSQL provider (e.g. Heroku or Supabase).
Now, to make the connection:
engine = create_engine('postgresql://alex:12345@test_host:5432/test_db')
Using Pandas built-in method to read Postgres tables as DataFrame
Pandas has a series of convenient methods to create DataFrames from Postgres tables:
- read_sql_table(~): reads an entire table as a DataFrame
- read_sql_query(~): reads the result of a SQL query (e.g. SELECT name FROM users;) as a DataFrame
- read_sql(~): you can either pass in a table name (which will call read_sql_table(~)) or a SQL query (which will call read_sql_query(~))
For each of these methods, you must pass the engine object defined above to the argument con.
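Since the snippets in this guide require a live PostgreSQL server, here is a self-contained sketch of all three methods against an in-memory SQLite engine holding the same users data (a stand-in chosen purely so the example runs on its own; with PostgreSQL you would pass the engine created earlier):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine standing in for the PostgreSQL engine
engine = create_engine("sqlite://")
pd.DataFrame(
    {"name": ["alex", "cathy", "bob"], "age": [20.0, 30.0, None]}
).to_sql("users", con=engine, index=False)

df_table = pd.read_sql_table("users", con=engine)                    # whole table
df_query = pd.read_sql_query("SELECT name FROM users;", con=engine)  # query result
df_either = pd.read_sql("users", con=engine)                         # dispatches to read_sql_table
```

All three calls take the engine via the con argument; the rest of this guide explores their individual options.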
Using read_sql_table(~)
Let's read the users table by using the read_sql_table(~) method:
import pandas as pd

pd.read_sql_table('users', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
To set the id column as the index for the DataFrame, set the index_col argument:
pd.read_sql_table('users', index_col='id', con=engine)
     name   age
id
1    alex  20.0
2   cathy  30.0
3     bob   NaN
Specifying columns to extract
As seen in the above results, all columns are read by default. To read specific columns instead, pass a list of column labels:
pd.read_sql_table('users', columns=['name'], con=engine)
    name
0   alex
1  cathy
2    bob
Here, we are only reading the name column.
Reading by chunk
The size of your Postgres tables can be quite big - to the point where the entire table cannot fit into memory. In such cases, you may wish to read the table in chunks using the chunksize argument:
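For example (sketched here against an in-memory SQLite engine holding the same users data, so the snippet runs without a live PostgreSQL server):

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine with the same users data as above
engine = create_engine("sqlite://")
pd.DataFrame(
    {"name": ["alex", "cathy", "bob"], "age": [20.0, 30.0, None]}
).to_sql("users", con=engine, index=False)

row_counts = []
for chunk in pd.read_sql_table("users", con=engine, chunksize=2):
    # each chunk is a DataFrame with at most 2 rows
    row_counts.append(len(chunk))
print(row_counts)  # [2, 1]
```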
Note the following:

- chunksize=2 means that we read 2 records from the table at a time
- when you specify chunksize, the return type of read_sql_table(~) is an iterator, which means that you can loop through it using for.
Using read_sql_query(~)
The first argument of the read_sql_query(~) method is a SQL query, and the method returns the result of the query as a Pandas DataFrame. Here's an example:
pd.read_sql_query('SELECT * FROM users;', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
We can modify our SQL query to retrieve only the data we are interested in. For instance, to obtain the name column where id is larger or equal to 2:
pd.read_sql_query('SELECT name FROM users WHERE id >= 2;', con=engine)
    name
0  cathy
1    bob
Arguments such as chunksize and index_col that we explored for read_sql_table(~) are also available for read_sql_query(~).
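For instance, index_col works the same way with a query (again sketched against an in-memory SQLite stand-in so it runs standalone):

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine mirroring the users table
engine = create_engine("sqlite://")
pd.DataFrame(
    {"id": [1, 2, 3], "name": ["alex", "cathy", "bob"]}
).to_sql("users", con=engine, index=False)

df = pd.read_sql_query("SELECT id, name FROM users;", con=engine, index_col="id")
print(df.index.name)  # id
```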
Unlike read_sql_table(~), the read_sql_query(~) method has a dtype argument which is a map indicating the desired data type of the columns of the returned DataFrame:
df = pd.read_sql_query('SELECT name, age FROM users;', con=engine)

df.dtypes
name     object
age     float64
dtype: object
By default, Pandas will respect the types indicated by the table's schema. By specifying dtype, we can modify the data types on the fly:
map_dtype = {
    'name': 'string',
    'age': 'f4'   # 4 bytes (32-bits)
}
df = pd.read_sql_query('SELECT name, age FROM users;', con=engine, dtype=map_dtype)

df.dtypes
name     string
age     float32
dtype: object
Here, we indicate that we want the name column to be of type string and the age column to be of type float32.
Using read_sql(~)
The read_sql(~) method is a hybrid of read_sql_table(~) and read_sql_query(~). You can either pass a table name:
pd.read_sql('users', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
Or you can pass a SQL query:
pd.read_sql('SELECT name, age FROM users;', con=engine)
    name   age
0   alex  20.0
1  cathy  30.0
2    bob   NaN
read_sql(~) has the same arguments as read_sql_table(~).