Running the apply method in parallel on a Pandas DataFrame
Pandas' apply(~) method runs on a single core, that is, a single thread performs the whole operation. If your machine has multiple cores, you can execute the apply(~) method in parallel instead.
To run apply(~) in parallel, use Dask, an easy-to-use library that performs Pandas operations in parallel by splitting the DataFrame into smaller partitions.
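To make the idea of partitioning concrete, here is a rough Pandas-only sketch of the split, apply, and combine pattern. It is not how Dask is implemented, and it processes the chunks one after another rather than in parallel. The names apply_in_chunks, row_func and df_small are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def apply_in_chunks(df, row_func, n_chunks=5):
    """Apply row_func row-wise to each chunk of df, then combine the results."""
    # Split the row positions into n_chunks groups of nearly equal size.
    position_groups = np.array_split(np.arange(len(df)), n_chunks)
    # Each chunk is handled independently; this independence is what a
    # parallel library can exploit by sending chunks to different workers.
    partial_results = [
        df.iloc[positions].apply(row_func, axis=1) for positions in position_groups
    ]
    # Concatenate the per-chunk results back in the original row order.
    return pd.concat(partial_results)


df_small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
chunked = apply_in_chunks(df_small, lambda row: row.sum() * 2, n_chunks=3)
direct = df_small.apply(lambda row: row.sum() * 2, axis=1)

# The chunked result should match a plain apply over the whole DataFrame.
print(np.allclose(chunked.to_numpy(), direct.to_numpy()))
```

Because each chunk depends only on its own rows, the per-chunk work can be distributed across cores, which is the part Dask takes care of.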
Consider the following Pandas DataFrame with one million rows:
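A minimal sketch of how such a DataFrame could be built, assuming a single float column A filled with uniform random values in the range [0, 5). The seed, the value range and the use of NumPy's default_rng are assumptions and are not stated in the text.

```python
import numpy as np
import pandas as pd

# Assumption: one float column "A" with values drawn uniformly from [0, 5).
# The seed is fixed only so that repeated runs produce the same data.
rng = np.random.default_rng(42)
df = pd.DataFrame({"A": rng.random(1_000_000) * 5})

print(df.shape)
print(df.head())
```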
To convert a Pandas DataFrame into a Dask DataFrame with 5 partitions:
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, npartitions=5)
    ddf

                         A
    npartitions=5
    0              float64
    200000             ...
    ...                ...
    800000             ...
    999999             ...
    Dask Name: from_pandas, 5 tasks
To run apply(~) in parallel:
    def foo(row, a, x=10):
        return (row.sum() + a) * x

    # axis=1 means that we are applying a function row-wise
    # meta='float' means that the type of the resulting Dask Series is float
    dask_series = ddf.apply(foo, axis=1, args=(10,), x=100, meta='float')
    ddf['B'] = dask_series
    # Convert Dask DataFrame back to Pandas DataFrame
    df_new = ddf.compute()
    df_new.head()

              A            B
    0  3.869780  1386.978024
    1  2.194392  1219.439220
    2  4.292990  1429.298960
    3  3.486840  1348.684015
    4  0.470887  1047.088674
In this example, Dask's apply(~) method takes 19 seconds to complete, while Pandas' apply(~) method takes 50 seconds. That is more than a 2x speedup!