Operators Module Description#
Brief description#
Most functions evaluated by QueryCompiler fall into one of several patterns
(Map, TreeReduce, Binary, Reduce, etc.), called core operators. The modin.core.dataframe.algebra
module provides templates for building such functions easily. These templates
are intended for use at the QueryCompiler level, since each built function accepts
and returns a QueryCompiler.
High-Level Module Overview#
Each template class implements a register method, which takes the functions to apply and
instantiates the related template. Functions passed to register are executed against
partitions that have been converted to pandas and preprocessed in a template-specific way,
so each function receives one of the pandas objects: pandas.DataFrame, pandas.Series
or pandas.core.groupby.DataFrameGroupBy.
Note
Currently, functions that are built in that way are supported only in a pandas storage format (i.e. can be used only in PandasQueryCompiler).
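The register pattern described above can be sketched with plain pandas. This is only an illustration: ToyCompiler and ToyMap are hypothetical names invented for this sketch and are not Modin classes. The point is the shape of the contract, namely that register receives a pandas-level function and returns a callable that accepts and returns a compiler-like object.

```python
import pandas as pd


class ToyCompiler:
    """Hypothetical stand-in for a query compiler: it only holds row partitions."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def to_pandas(self):
        # Stitch the row partitions back into one pandas frame.
        return pd.concat(self.partitions)


class ToyMap:
    """Hypothetical template: register turns a pandas function into a compiler-level one."""

    @classmethod
    def register(cls, function):
        def caller(compiler, *args, **kwargs):
            # Run the pandas-level function on every partition and wrap the
            # results again, so the built function takes and returns a compiler.
            new_parts = [function(part, *args, **kwargs) for part in compiler.partitions]
            return ToyCompiler(new_parts)

        return caller


frame = pd.DataFrame({"a": [1, -2, 3, -4], "b": [-5, 6, -7, 8]})
compiler = ToyCompiler([frame.iloc[:2], frame.iloc[2:]])

toy_abs = ToyMap.register(pd.DataFrame.abs)
result = toy_abs(compiler)
```

Here toy_abs plays the role of a built function: it never sees the whole frame, only individual pandas partitions.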
The algebra module provides templates for the following types of functions:
Map operator#
Uniformly applies a function argument to each partition in parallel. Note: the map function should not change the shape of the partitions.
- class modin.core.dataframe.algebra.map.Map#
Builder class for Map operator.
- classmethod apply(df, func, func_args=None, func_kwargs=None, **kwargs)#
Apply a function to a Modin DataFrame using the operators scheme.
- Parameters
df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
func (callable(pandas.DataFrame, *args, **kwargs) -> pandas.DataFrame) – A function to apply.
func_args (tuple, optional) – Positional arguments to pass to the func.
func_kwargs (dict, optional) – Keyword arguments to pass to the func.
**kwargs (dict) – Additional arguments to pass to cls.register().
- Return type
return_type
- classmethod register(function, *call_args, **call_kwds)#
Build Map operator that will be performed across each partition.
- Parameters
function (callable(pandas.DataFrame) -> pandas.DataFrame) – Function that will be applied to each partition. The function takes a pandas.DataFrame and returns a pandas.DataFrame of the same shape.
*call_args (args) – Args that will be passed to the returned function.
**call_kwds (kwargs) – Kwargs that will be passed to the returned function.
- Returns
Function that takes query compiler and executes map function.
- Return type
callable
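To make the shape constraint concrete, the following sketch imitates a Map over a block-partitioned frame using plain pandas. The 2x2 block layout and the names blocks and map_function are assumptions made for illustration; Modin chooses its own partitioning and runs the blocks in parallel, which this sequential sketch does not attempt.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24, dtype="float64").reshape(6, 4), columns=list("abcd"))

# Assumed layout: two row groups times two column groups.
row_groups = [slice(0, 3), slice(3, 6)]
col_groups = [slice(0, 2), slice(2, 4)]
blocks = [[frame.iloc[r, c] for c in col_groups] for r in row_groups]


def map_function(block):
    # Element-wise operation, so the block keeps its shape.
    return block * 2 + 1


mapped = [[map_function(block) for block in row] for row in blocks]

# Because no block changed shape, the blocks can be stitched back together
# exactly the way they were split.
result = pd.concat([pd.concat(row, axis=1) for row in mapped], axis=0)
```

If map_function dropped rows or columns in only some blocks, the final stitching step would no longer line up, which is why the shape must be preserved.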
Reduce operator#
Applies an argument function that reduces each column or row on the specified axis into a scalar, but requires knowledge about the whole axis. Be aware that providing this knowledge may be expensive because the execution engine has to concatenate partitions along the specified axis. Also note that the execution engine expects the reduce function to return a one-dimensional frame.
- class modin.core.dataframe.algebra.reduce.Reduce#
Builder class for Reduce operator.
- classmethod apply(df, func, axis=0, func_args=None, func_kwargs=None)#
Apply a reduction function to each row/column partition of the dataframe.
- Parameters
df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
func (callable(pandas.DataFrame, *args, **kwargs) -> Union[pandas.Series, pandas.DataFrame[1xN]]) – A function to apply.
axis (int, default: 0) – Whether to apply the function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the func.
func_kwargs (dict, optional) – Keyword arguments to pass to the func.
- Return type
modin.pandas.Series
- classmethod register(reduce_function, axis=None)#
Build Reduce operator that will be performed across rows/columns.
Used when func reduces the dimension of partitions, in contrast to Fold.
- Parameters
reduce_function (callable(pandas.DataFrame) -> pandas.Series) – Source function.
axis (int, optional) – Axis to apply function along.
- Returns
Function that takes query compiler and executes Reduce function.
- Return type
callable
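The median is a typical case where a reduction needs the whole axis, since the median of a column cannot be derived from the medians of its pieces. The sketch below imitates this with plain pandas; the block layout and variable names are assumptions for illustration, not Modin internals.

```python
import pandas as pd

frame = pd.DataFrame(
    {"a": [1, 9, 2, 8, 3], "b": [10, 20, 30, 40, 50], "c": [5, 5, 7, 1, 2]}
)

# Assumed layout: two row groups and two column groups.
blocks = [
    [frame.iloc[:2, :2], frame.iloc[:2, 2:]],
    [frame.iloc[2:, :2], frame.iloc[2:, 2:]],
]


def reduce_function(full_columns):
    # Needs every row of a column, returns a one-dimensional result.
    return full_columns.median()


pieces = []
for col_idx in range(2):
    # Concatenate the row groups first: this is the potentially expensive step.
    full_columns = pd.concat([row[col_idx] for row in blocks], axis=0)
    pieces.append(reduce_function(full_columns))

result = pd.concat(pieces)
```

Only the row groups have to be concatenated for axis=0; the column groups can still be processed independently of each other.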
TreeReduce operator#
Applies an argument function that reduces the specified axis into a scalar. First applies the map function to each partition in parallel, then concatenates the resulting partitions along the specified axis and applies the reduce function. In contrast with the Map function template, here you’re allowed to change the partition shape in the map phase. Note that the execution engine expects the reduce function to return a one-dimensional frame.
- class modin.core.dataframe.algebra.tree_reduce.TreeReduce#
Builder class for TreeReduce operator.
- classmethod apply(df, map_function, reduce_function, axis=0, func_args=None, func_kwargs=None)#
Apply a map-reduce function to the dataframe.
- Parameters
df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
map_function (callable(pandas.DataFrame, *args, **kwargs) -> pandas.DataFrame) – A map function to apply to every partition.
reduce_function (callable(pandas.DataFrame, *args, **kwargs) -> Union[pandas.Series, pandas.DataFrame[1xN]]) – A reduction function to apply to the results of the map functions.
axis (int, default: 0) – Whether to apply the reduce function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the funcs.
func_kwargs (dict, optional) – Keyword arguments to pass to the funcs.
- Return type
modin.pandas.Series
- classmethod register(map_function, reduce_function=None, axis=None)#
Build TreeReduce operator.
- Parameters
map_function (callable(pandas.DataFrame) -> pandas.DataFrame) – Source map function.
reduce_function (callable(pandas.DataFrame) -> pandas.Series, optional) – Source reduce function.
axis (int, optional) – Specifies axis to apply function along.
- Returns
Function that takes query compiler and executes passed functions with TreeReduce algorithm.
- Return type
callable
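The two phases can be imitated with plain pandas, here for a column-wise maximum. This is a sequential sketch under assumed row partitions; the names partitions, mapped and stacked are illustrative only, and Modin would run the map phase in parallel.

```python
import pandas as pd

frame = pd.DataFrame({"a": [4, 1, 7, 3], "b": [2, 9, 5, 6]})

# Assumed layout: two row partitions.
partitions = [frame.iloc[:2], frame.iloc[2:]]


def map_function(part):
    # Shrinks each partition to a single row; changing the shape is
    # allowed in the map phase of this pattern.
    return part.max().to_frame().T


def reduce_function(stacked):
    # Works on the small stacked frame of partial results.
    return stacked.max()


mapped = [map_function(part) for part in partitions]
stacked = pd.concat(mapped, axis=0)
result = reduce_function(stacked)
```

The reduce phase only sees one row per partition, so the concatenation here is far cheaper than concatenating the original partitions.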
Binary operator#
Applies an argument function that takes exactly two operands (the first is always a QueryCompiler). If both operands are query compilers, the execution engine broadcasts partitions of the right operand to the left.
Warning
To be able to do frame broadcasting, partitioning along the index axis of both frames must be equal; otherwise, the frames need to be aligned first. The execution engine does this automatically, but note that it requires repartitioning, which is a much more expensive operation than the binary function itself.
- class modin.core.dataframe.algebra.binary.Binary#
Builder class for Binary operator.
- classmethod apply(left, right, func, axis=0, func_args=None, func_kwargs=None, **kwargs)#
Apply a binary function row-wise or column-wise.
- Parameters
left (modin.pandas.DataFrame or modin.pandas.Series) – Left operand.
right (modin.pandas.DataFrame or modin.pandas.Series) – Right operand.
func (callable(pandas.DataFrame, pandas.DataFrame, *args, axis, **kwargs) -> pandas.DataFrame) – A binary function to apply to left and right.
axis (int, default: 0) – Whether to apply the function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the funcs.
func_kwargs (dict, optional) – Keyword arguments to pass to the funcs.
**kwargs (dict) – Additional arguments to pass to cls.register().
- Return type
The same type as df.
- classmethod register(func, join_type='outer', labels='replace', infer_dtypes=None)#
Build template binary operator.
- Parameters
func (callable(pandas.DataFrame, [pandas.DataFrame, list-like, scalar]) -> pandas.DataFrame) – Binary function to execute. Must accept at least two arguments.
join_type ({'left', 'right', 'outer', 'inner', None}, default: 'outer') – Type of join that will be used if indices of operands are not aligned.
labels ({"keep", "replace", "drop"}, default: "replace") – Whether to keep labels from the left Modin DataFrame, replace them with labels from the joined DataFrame, or drop them altogether so that they are computed lazily later.
infer_dtypes ({"common_cast", "float", "bool", None}, default: None) – How dtypes should be inferred.
- If “common_cast”, casts to the common dtype of the operand columns.
- If “float”, performs type casting by finding the common dtype. If the common dtype is any of the integer types, performs type casting to float. Used in case of truediv.
- If “bool”, the dtypes will be a boolean series with the same size as that of the operands.
- If None, does not infer new dtypes (they will be computed manually once accessed).
- Returns
Function that takes query compiler and executes binary operation.
- Return type
callable
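The alignment step mentioned in the warning above can be imitated with plain pandas. In this sketch, DataFrame.align with join="outer" stands in for the repartitioning the engine performs, and the row boundaries are an assumed partitioning chosen for illustration; none of this is Modin's actual code path.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

# Align the row labels first; an outer join keeps every label from both sides.
left_aligned, right_aligned = left.align(right, join="outer", axis=0)

# Assumed layout after alignment: both frames are cut at the same row
# boundaries, so matching partitions can be combined pairwise.
boundaries = [slice(0, 2), slice(2, 4)]
pieces = [left_aligned.iloc[rows] + right_aligned.iloc[rows] for rows in boundaries]

result = pd.concat(pieces)
```

Labels present on only one side produce missing values in the result, which matches what pandas does for left + right on unaligned frames.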
Fold operator#
Applies an argument function that requires knowledge of the whole axis. Be aware that providing this knowledge may be expensive because the execution engine has to concatenate partitions along the specified axis.
- class modin.core.dataframe.algebra.fold.Fold#
Builder class for Fold functions.
- classmethod apply(df, func, fold_axis=0, func_args=None, func_kwargs=None)#
Apply a Fold (full-axis) function to the dataframe.
- Parameters
df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
func (callable(pandas.DataFrame[NxM], *args, **kwargs) -> pandas.DataFrame[NxM]) – A function to apply to every partition. Note that the function shouldn’t change the shape of the dataframe.
fold_axis (int, default: 0) – Whether to apply the function across rows (fold_axis=0) or across columns (fold_axis=1).
func_args (tuple, optional) – Positional arguments to pass to the funcs.
func_kwargs (dict, optional) – Keyword arguments to pass to the funcs.
- Return type
the same type as df
- classmethod register(fold_function)#
Build Fold operator that will be performed across rows/columns.
- Parameters
fold_function (callable(pandas.DataFrame) -> pandas.DataFrame) – Function to apply across rows/columns.
- Returns
Function that takes query compiler and executes Fold function.
- Return type
callable
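A cumulative sum is a simple example of a full-axis function that keeps the frame shape. The sketch below imitates a Fold along axis 0 with plain pandas; the block layout and names are assumptions for illustration rather than Modin internals.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list("abc"))

# Assumed layout: two row groups and two column groups.
blocks = [
    [frame.iloc[:2, :2], frame.iloc[:2, 2:]],
    [frame.iloc[2:, :2], frame.iloc[2:, 2:]],
]


def fold_function(full_columns):
    # Needs all rows of each column, returns a frame of the same shape.
    return full_columns.cumsum()


folded = []
for col_idx in range(2):
    # Rebuild full columns by concatenating the row groups.
    full_columns = pd.concat([row[col_idx] for row in blocks], axis=0)
    folded.append(fold_function(full_columns))

result = pd.concat(folded, axis=1)
```

Unlike the Reduce sketch pattern, the output here has as many rows as the input, so the result is again a two-dimensional frame.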
GroupBy operator#
Evaluates GroupBy aggregation for the types of functions that can be executed via the TreeReduce approach.
To be able to form groups, the engine broadcasts the by partitions to each partition of the source frame.
- class modin.core.dataframe.algebra.groupby.GroupByReduce#
Builder class for GroupBy aggregation functions.
- ID_LEVEL_NAME#
Implementations may produce multiple temporary columns per source column in an intermediate phase. For these columns to be processed correctly in the Reduce phase, an implementation must store unique names for such temporary columns in the ID_LEVEL_NAME level. Duplicated names are not allowed.
- Type
str
- _GROUPBY_REDUCE_IMPL_FLAG#
Attribute indicating that a callable should be treated as an implementation for one of the TreeReduce phases rather than an arbitrary aggregation. Note: this attribute should be considered private.
- Type
str
- classmethod apply(df, map_func, reduce_func, by, groupby_kwargs=None, agg_args=None, agg_kwargs=None)#
Apply groupby aggregation function using map-reduce pattern.
- Parameters
df (modin.pandas.DataFrame or modin.pandas.Series) – A source DataFrame to group.
map_func (callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – A map function to apply to a groupby object in every partition.
reduce_func (callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – A reduction function to apply to the results of the map functions.
by (label or list of labels) – Columns of the df to group on.
groupby_kwargs (dict, optional) – Keyword arguments matching the signature of pandas.DataFrame.groupby.
agg_args (tuple, optional) – Positional arguments to pass to the funcs.
agg_kwargs (dict, optional) – Keyword arguments to pass to the funcs.
- Return type
The same type as df.
- classmethod register(map_func, reduce_func=None, **call_kwds)#
Build template GroupBy aggregation function.
The resulting function is applied in parallel via the TreeReduce algorithm.
- Parameters
map_func (str, dict or callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – Function to apply to the DataFrameGroupBy at the map phase. If a str is passed, it is treated as the name of a DataFrameGroupBy method.
reduce_func (str, dict or callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame, optional) – Function to apply to the DataFrameGroupBy at the reduce phase. If not specified, it is set to the same value as map_func.
**call_kwds (kwargs) – Kwargs that will be passed to the returned function.
- Returns
Function that takes query compiler and executes GroupBy aggregation with TreeReduce algorithm.
- Return type
callable
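A grouped sum shows why some aggregations fit the TreeReduce approach: partial sums per group can be summed again. The following is a plain-pandas sketch under assumed row partitions; the key column travels with each partition here, which is a simplification of the by broadcasting described above.

```python
import pandas as pd

frame = pd.DataFrame({"key": ["x", "y", "x", "y", "x"], "val": [1, 2, 3, 4, 5]})

# Assumed layout: two row partitions, each still carrying the "key" column.
partitions = [frame.iloc[:3], frame.iloc[3:]]


def map_func(grouped):
    # Partial aggregation inside one partition.
    return grouped.sum()


def reduce_func(grouped):
    # Combine the partial results that share the same group label.
    return grouped.sum()


mapped = [map_func(part.groupby("key")) for part in partitions]
stacked = pd.concat(mapped)

# After the map phase the group labels live in the index, so group on it.
result = reduce_func(stacked.groupby(level=0))
```

An aggregation such as a grouped median would not decompose this way, because partial medians cannot be combined into the overall median.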
Default-to-pandas operator#
Falls back to pandas for the passed function.
How to use UDFs with these operators#
Let’s walk through an example of using the algebra module to create your own functions.
Imagine you have a complex aggregation that can be expressed as a single query but
has no implementation in the pandas API. If you know how to implement this
aggregation efficiently on a distributed frame, you may want to use one of the patterns
described above (e.g. TreeReduce).
Let’s implement a function that counts non-NA values for each column or row
(pandas.DataFrame.count). First, we need to determine the function type.
The TreeReduce approach fits well: in the map phase, we count non-NA cells in each
partition in parallel, and in the reduce phase we sum the partial results.
To execute a TreeReduce function that does count + sum, you can use the operator’s .apply(...)
method, which takes and returns a DataFrame:
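from modin.core.dataframe.algebra import TreeReduce

# df is an existing modin.pandas.DataFrame
res_df = TreeReduce.apply(
    df,
    # map phase: count non-NA cells in each partition
    map_function=lambda df: df.count(),
    # reduce phase: sum the per-partition counts
    reduce_function=lambda df: df.sum(),
)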
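The count + sum decomposition itself can be checked with plain pandas, independently of Modin. In this sketch the two row partitions are an assumed layout, and each partial count is turned into a one-row frame so the partial results can be stacked; that conversion is a detail of the sketch, not something the operator requires of your functions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [np.nan, np.nan, 7.0, 8.0]}
)

# Assumed layout: two row partitions.
partitions = [frame.iloc[:2], frame.iloc[2:]]

map_function = lambda df: df.count()
reduce_function = lambda df: df.sum()

# Map phase: count non-NA cells per partition, kept as a one-row frame.
mapped = [map_function(part).to_frame().T for part in partitions]

# Reduce phase: stack the partial counts and sum them per column.
stacked = pd.concat(mapped)
result = reduce_function(stacked)
```

The result matches frame.count(), which is what makes count a good fit for the TreeReduce pattern.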
If you’re going to use your custom-defined function quite often, you may want to wrap it into a separate function or assign it as a DataFrame’s method:
import modin.pandas as pd
from modin.core.dataframe.algebra import TreeReduce

def count_func(self):
    return TreeReduce.apply(
        self,
        map_function=lambda df: df.count(),
        reduce_function=lambda df: df.sum(),
    )

# you can then use the function as is
res = count_func(df)

# or assign it to the DataFrame's class and use it as a method
pd.DataFrame.count_custom = count_func
res = df.count_custom()
Many pandas API functions can be implemented this way, so if you find that one of your favorite functions still defaults to pandas and you decide to contribute its implementation to Modin, you may use this example as a reference.