Operators Module Description#

Brief description#

Most of the functions that are evaluated by QueryCompiler can be categorized into one of the patterns: Map, TreeReduce, Binary, Reduce, etc., called core operators. The modin.core.dataframe.algebra module provides templates to easily build such types of functions. These templates are supposed to be used at the QueryCompiler level since each built function accepts and returns QueryCompiler.

High-Level Module Overview#

Each template class implements a register method, which takes functions to apply and instantiate the related template. Functions that are passed to register will be executed against converted to pandas and preprocessed in a template-specific way partition, so the function would take one of the pandas object: pandas.DataFrame, pandas.Series or pandas.DataFrameGroupbyObject.

Note

Currently, functions that are built in that way are supported only in a pandas storage format (i.e. can be used only in PandasQueryCompiler).

Algebra module provides templates for this type of function:

Map operator#

Uniformly apply a function argument to each partition in parallel. Note: map function should not change the shape of the partitions.

class modin.core.dataframe.algebra.map.Map#

Builder class for Map operator.

classmethod apply(df, func, func_args=None, func_kwargs=None, **kwargs)#

Apply a function to a Modin DataFrame using the operators scheme.

Parameters

df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
func (callable(pandas.DataFrame, *args, **kwargs) -> pandas.DataFrame) – A function to apply.
func_args (tuple, optional) – Positional arguments to pass to the func.
func_kwargs (dict, optional) – Keyword arguments to pass to the func.
**kwargs (dict) – Aditional arguments to pass to the cls.register().

Return type

return_type

classmethod register(function, *call_args, **call_kwds)#

Build Map operator that will be performed across each partition.

Parameters

function (callable(pandas.DataFrame) -> pandas.DataFrame) – Function that will be applied to the each partition. Function takes pandas.DataFrame and returns pandas.DataFrame of the same shape.
*call_args (args) – Args that will be passed to the returned function.
**call_kwds (kwargs) – Kwargs that will be passed to the returned function.

Returns

Function that takes query compiler and executes map function.

Return type

callable

Reduce operator#

Applies an argument function that reduces each column or row on the specified axis into a scalar, but requires knowledge about the whole axis. Be aware that providing this knowledge may be expensive because the execution engine has to concatenate partitions along the specified axis. Also, note that the execution engine expects that the reduce function returns a one dimensional frame.

../../../../_images/reduce_evaluation.svg

class modin.core.dataframe.algebra.reduce.Reduce#

Builder class for Reduce operator.

classmethod apply(df, func, axis=0, func_args=None, func_kwargs=None)#

Apply a reduction function to each row/column partition of the dataframe.

Parameters

df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
func (callable(pandas.DataFrame, *args, **kwargs) -> Union[pandas.Series, pandas.DataFrame[1xN]]) – A function to apply.
axis (int, default: 0) – Whether to apply the function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the func.
func_kwargs (dict, optional) – Keyword arguments to pass to the func.

Return type

modin.pandas.Series

classmethod register(reduce_function, axis=None)#

Build Reduce operator that will be performed across rows/columns.

It’s used if func reduces the dimension of partitions in contrast to Fold.

Parameters

reduce_function (callable(pandas.DataFrame) -> pandas.Series) – Source function.
axis (int, optional) – Axis to apply function along.

Returns

Function that takes query compiler and executes Reduce function.

Return type

callable

TreeReduce operator#

Applies an argument function that reduces specified axis into a scalar. First applies map function to each partition in parallel, then concatenates resulted partitions along the specified axis and applies reduce function. In contrast with Map function template, here you’re allowed to change partition shape in the map phase. Note that the execution engine expects that the reduce function returns a one dimensional frame.

class modin.core.dataframe.algebra.tree_reduce.TreeReduce#

Builder class for TreeReduce operator.

classmethod apply(df, map_function, reduce_function, axis=0, func_args=None, func_kwargs=None)#

Apply a map-reduce function to the dataframe.

Parameters

df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
map_function (callable(pandas.DataFrame, *args, **kwargs) -> pandas.DataFrame) – A map function to apply to every partition.
reduce_function (callable(pandas.DataFrame, *args, **kwargs) -> Union[pandas.Series, pandas.DataFrame[1xN]]) – A reduction function to apply to the results of the map functions.
axis (int, default: 0) – Whether to apply the reduce function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the funcs.
func_kwargs (dict, optional) – Keyword arguments to pass to the funcs.

Return type

modin.pandas.Series

classmethod register(map_function, reduce_function=None, axis=None)#

Build TreeReduce operator.

Parameters

map_function (callable(pandas.DataFrame) -> pandas.DataFrame) – Source map function.
reduce_function (callable(pandas.DataFrame) -> pandas.Series, optional) – Source reduce function.
axis (int, optional) – Specifies axis to apply function along.

Returns

Function that takes query compiler and executes passed functions with TreeReduce algorithm.

Return type

callable

Binary operator#

Applies an argument function, that takes exactly two operands (first is always QueryCompiler). If both operands are query compilers then the execution engine broadcasts partitions of the right operand to the left.

../../../../_images/binary_evaluation.svg

Warning

To be able to do frame broadcasting, partitioning along the index axis of both frames has to be equal, otherwise they need to be aligned first. The execution engine will do it automatically but note that this requires repartitioning, which is a much more expensive operation than the binary function itself.

class modin.core.dataframe.algebra.binary.Binary#

Builder class for Binary operator.

classmethod apply(left, right, func, axis=0, func_args=None, func_kwargs=None, **kwargs)#

Apply a binary function row-wise or column-wise.

Parameters

left (modin.pandas.DataFrame or modin.pandas.Series) – Left operand.
right (modin.pandas.DataFrame or modin.pandas.Series) – Right operand.
func (callable(pandas.DataFrame, pandas.DataFrame, *args, axis, **kwargs) -> pandas.DataFrame) – A binary function to apply left and right.
axis (int, default: 0) – Whether to apply the function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the funcs.
func_kwargs (dict, optional) – Keyword arguments to pass to the funcs.
**kwargs (dict) – Additional arguments to pass to the cls.register().

Return type

The same type as df.

classmethod register(func, join_type='outer', labels='replace', infer_dtypes=None)#

Build template binary operator.

Parameters

func (callable(pandas.DataFrame, [pandas.DataFrame, list-like, scalar]) -> pandas.DataFrame) – Binary function to execute. Have to be able to accept at least two arguments.
join_type ({'left', 'right', 'outer', 'inner', None}, default: 'outer') – Type of join that will be used if indices of operands are not aligned.
labels ({"keep", "replace", "drop"}, default: "replace") – Whether keep labels from left Modin DataFrame, replace them with labels from joined DataFrame or drop altogether to make them be computed lazily later.
infer_dtypes ({"common_cast", "float", "bool", None}, default: None) –
How dtypes should be inferred.
- If “common_cast”, casts to common dtype of operand columns.
- If “float”, performs type casting by finding common dtype. If the common dtype is any of the integer types, perform type casting to float. Used in case of truediv.
- If “bool”, dtypes would be a boolean series with same size as that of operands.
- If None, do not infer new dtypes (they will be computed manually once accessed).

Returns

Function that takes query compiler and executes binary operation.

Return type

callable

Fold operator#

Applies an argument function that requires knowledge of the whole axis. Be aware that providing this knowledge may be expensive because the execution engine has to concatenate partitions along the specified axis.

class modin.core.dataframe.algebra.fold.Fold#

Builder class for Fold functions.

classmethod apply(df, func, fold_axis=0, func_args=None, func_kwargs=None)#

Apply a Fold (full-axis) function to the dataframe.

Parameters

df (modin.pandas.DataFrame or modin.pandas.Series) – DataFrame object to apply the operator against.
func (callable(pandas.DataFrame[NxM], *args, **kwargs) -> pandas.DataFrame[NxM]) – A function to apply to every partition. Note that the function shouldn’t change the shape of the dataframe.
fold_axis (int, default: 0) – Whether to apply the function across rows (axis=0) or across columns (axis=1).
func_args (tuple, optional) – Positional arguments to pass to the funcs.
func_kwargs (dict, optional) – Keyword arguments to pass to the funcs.

Return type

the same type as df

classmethod register(fold_function)#

Build Fold operator that will be performed across rows/columns.

Parameters: fold_function (callable(pandas.DataFrame) -> pandas.DataFrame) – Function to apply across rows/columns.
Returns: Function that takes query compiler and executes Fold function.
Return type: callable

GroupBy operator#

Evaluates GroupBy aggregation for that type of functions that can be executed via TreeReduce approach. To be able to form groups engine broadcasts by partitions to each partition of the source frame.

class modin.core.dataframe.algebra.groupby.GroupByReduce#

Builder class for GroupBy aggregation functions.

ID_LEVEL_NAME#

It’s supposed that implementations may produce multiple temporary columns per one source column in an intermediate phase. In order for these columns to be processed accordingly at the Reduce phase, an implementation must store unique names for such temporary columns in the ID_LEVEL_NAME level. Duplicated names are not allowed.

Type: str

_GROUPBY_REDUCE_IMPL_FLAG#

Attribute indicating that a callable should be treated as an implementation for one of the TreeReduce phases rather than an arbitrary aggregation. Note: this attribute should be considered private.

Type: str

classmethod apply(df, map_func, reduce_func, by, groupby_kwargs=None, agg_args=None, agg_kwargs=None)#

Apply groupby aggregation function using map-reduce pattern.

Parameters

df (modin.pandas.DataFrame or modin.pandas.Series) – A source DataFrame to group.
map_func (callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – A map function to apply to a groupby object in every partition.
reduce_func (callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – A reduction function to apply to the results of the map functions.
by (label or list of labels) – Columns of the df to group on.
groupby_kwargs (dict, optional) – Keyword arguments matching the signature of pandas.DataFrame.groupby.
agg_args (tuple, optional) – Positional arguments to pass to the funcs.
agg_kwargs (dict, optional) – Keyword arguments to pass to the funcs.

Return type

The same type as df.

classmethod register(map_func, reduce_func=None, **call_kwds)#

Build template GroupBy aggregation function.

Resulted function is applied in parallel via TreeReduce algorithm.

Parameters

map_func (str, dict or callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – Function to apply to the GroupByObject at the map phase. If str was passed it will be treated as a DataFrameGroupBy’s method name.
reduce_func (str, dict or callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame, optional) – Function to apply to the DataFrameGroupBy at the reduce phase. If not specified will be set the same as ‘map_func’.
**call_kwds (kwargs) – Kwargs that will be passed to the returned function.

Returns

Function that takes query compiler and executes GroupBy aggregation with TreeReduce algorithm.

Return type

callable

Default-to-pandas operator#

Do fallback to pandas for passed function.

How to use UDFs with these operators#

Let’s examine an example of how to use the algebra module to create your own new functions.

Imagine you have a complex aggregation that can be implemented into a single query but doesn’t have any implementation in pandas API. If you know how to implement this aggregation efficiently in a distributed frame, you may want to use one of the above described patterns (e.g. TreeReduce).

Let’s implement a function that counts non-NA values for each column or row (pandas.DataFrame.count). First, we need to determine the function type. TreeReduce approach would be great: in a map phase, we’ll count non-NA cells in each partition in parallel and then just sum its results in the reduce phase.

To execute a TreeReduce function that does count + sum you can simply use the operator’s .apply(...) method that takes and outputs a DataFrame:

from modin.core.dataframe.algebra import TreeReduce

res_df = TreeReduce.apply(
    df,
    map_func=lambda df: df.count(),
    reduce_function=lambda df: df.sum()
)

If you’re going to use your custom-defined function quite often you may want to wrap it into a separate function or assign it as a DataFrame’s method:

import modin.pandas as pd

def count_func(self):
    return TreeReduce.apply(
        self,
        map_func=lambda df: df.count(),
        reduce_function=lambda df: df.sum()
    )

# you can then use the function as is
res = count_func(df)

# or assign it to the DataFrame's class and use it as a method
pd.DataFrame.count_custom = count_func
res = df.count_custom()

Many of the pandas API functions can be easily implemented this way, so if you find out that one of your favorite function is still defaulted to pandas and decide to contribute to Modin to add its implementation, you may use this example as a reference.