Experimental IO Module Description#

This module mostly stores experimental utilities and dispatcher classes for reading and writing files of different formats.

Submodules Description#

  • text - directory for storing all text file format dispatcher classes

    • format/feature specific dispatchers: csv_glob_dispatcher.py, custom_text_dispatcher.py.

  • sql - directory for storing the SQL dispatcher class

    • format/feature specific dispatchers: sql_dispatcher.py

  • pickle - directory for storing the Pickle dispatcher class

    • format/feature specific dispatchers: pickle_dispatcher.py

Public API#

Experimental IO functions implementations.

class modin.experimental.core.io.ExperimentalCSVGlobDispatcher#

Class contains utils for reading multiple .csv files simultaneously.

classmethod file_exists(file_path: str, storage_options=None) bool#

Check if the file_path is valid.

Parameters
  • file_path (str) – String representing a path.

  • storage_options (dict, optional) – Keyword argument from read_* functions.

Returns

True if the path is valid.

Return type

bool

classmethod get_path(file_path: str) list#

Return the path of the file(s).

Parameters

file_path (str) – String representing a path.

Returns

List of strings of absolute file paths.

Return type

list
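The path handling described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not Modin's implementation: `expand_glob` is a hypothetical helper, and the sketch assumes a local filesystem and a shell-style glob pattern.

```python
import glob
import os
import tempfile


def expand_glob(file_path: str) -> list:
    """Hypothetical helper: expand a glob pattern into sorted absolute paths."""
    return sorted(os.path.abspath(p) for p in glob.glob(file_path))


with tempfile.TemporaryDirectory() as tmp:
    # Create two CSV files and one unrelated file to show the filtering.
    for name in ("part_0.csv", "part_1.csv", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("a,b\n1,2\n")

    matched = expand_glob(os.path.join(tmp, "part_*.csv"))
    # Keep only the base names so the result is independent of the temp dir.
    matched_names = [os.path.basename(p) for p in matched]
```

Sorting the matches keeps the file order deterministic, which matters when the files are later concatenated into one frame.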

classmethod partitioned_file(files, fnames: List[str], num_partitions: int = None, nrows: int = None, skiprows: int = None, skip_header: int = None, quotechar: bytes = b'"', is_quoting: bool = True) List[List[Tuple[str, int, int]]]#

Compute chunk sizes in bytes for every partition.

Parameters
  • files (file or list of files) – File(s) to be partitioned.

  • fnames (str or list of str) – File name(s) to be partitioned.

  • num_partitions (int, optional) – Number of partitions to split a file into. If not specified, the value is taken from modin.config.NPartitions.get().

  • nrows (int, optional) – Number of rows of file to read.

  • skiprows (int, optional) – Specifies rows to skip.

  • skip_header (int, optional) – Specifies header rows to skip.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

Returns

List in which each element is a list of tuples. Each tuple contains the data file name of the chunk, the chunk start offset, and the chunk end offset within its corresponding file.

Return type

list

Notes

The logic would get really complicated if this method tried to reuse TextFileDispatcher.partitioned_file.
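To make the shape of the return value concrete, here is a simplified sketch of byte-range chunking across several files. It is an illustration under stated assumptions, not the Modin algorithm: `sketch_partitioned_files` is a hypothetical function, and it ignores nrows, skiprows, skip_header, and quoting, aligning chunk ends to newline bytes only.

```python
import os
import tempfile
from typing import List, Tuple


def sketch_partitioned_files(
    fnames: List[str], num_partitions: int
) -> List[List[Tuple[str, int, int]]]:
    """Hypothetical sketch: group line-aligned (name, start, end) chunks into partitions."""
    total = sum(os.path.getsize(name) for name in fnames)
    # Target number of bytes per partition (ceiling division, at least 1).
    target = max(1, -(-total // num_partitions))

    partitions, current, budget = [], [], target
    for fname in fnames:
        size = os.path.getsize(fname)
        with open(fname, "rb") as f:
            start = 0
            while start < size:
                # Jump ahead by the remaining budget, then finish the current line.
                f.seek(min(start + budget, size))
                f.readline()
                end = f.tell()
                current.append((fname, start, end))
                budget -= end - start
                start = end
                if budget <= 0:
                    # The partition is full: close it and start a new one.
                    partitions.append(current)
                    current, budget = [], target
    if current:
        partitions.append(current)
    return partitions


with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.csv": b"1,2\n3,4\n5,6\n", "b.csv": b"7,8\n9,10\n"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    chunks = sketch_partitioned_files(paths, num_partitions=2)
    # Replace full paths with base names for a stable, readable result.
    readable = [[(os.path.basename(n), s, e) for n, s, e in part] for part in chunks]
```

A single partition may hold chunks from more than one file, which is why the result is a list of lists rather than one list per file.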

class modin.experimental.core.io.ExperimentalCustomTextDispatcher#

Class handles utils for reading custom text files.

class modin.experimental.core.io.ExperimentalPickleDispatcher#

Class handles utils for reading pickle files.

classmethod write(qc, **kwargs)#

When * is in the filename, all partitions are written to their own separate file.

The filenames are determined as follows:

  • if * is in the filename, it is replaced by the ascending sequence 0, 1, 2, …

  • if * is not in the filename, the default implementation is used.

Example: with 4 partitions and input filename="partition*.pkl.gz", the filenames will be: partition0.pkl.gz, partition1.pkl.gz, partition2.pkl.gz, partition3.pkl.gz.

Parameters
  • qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_pickle_distributed on.

  • **kwargs (dict) – Parameters for pandas.to_pickle(**kwargs).
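The filename rule can be sketched in a few lines of plain Python. `sketch_partition_filenames` is a hypothetical helper, not part of Modin, and it assumes that without * the default implementation writes to the single given path.

```python
def sketch_partition_filenames(filepath: str, num_partitions: int) -> list:
    """Hypothetical helper: derive one output filename per partition."""
    if "*" not in filepath:
        # Assumption: without "*" the data goes to the one given path.
        return [filepath]
    # Replace "*" with the partition index: 0, 1, 2, ...
    return [filepath.replace("*", str(i)) for i in range(num_partitions)]


names = sketch_partition_filenames("partition*.pkl.gz", 4)
single = sketch_partition_filenames("frame.pkl", 4)
```

With four partitions, `names` reproduces the filenames from the example above.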

class modin.experimental.core.io.ExperimentalSQLDispatcher#

Class handles experimental utils for reading SQL queries or database tables.

classmethod preprocess_func()#

Prepare a function for transmission to remote workers.