pycarol.storage

class pycarol.storage.Storage(carol)[source]

Handle all Carol storage interactions.

Parameters

carol: class pycarol.Carol

delete(name)[source]

Delete a file in Carol Storage.

Parameters

name: str. Filename to be deleted.

Returns:

exists(name, storage_space='app')[source]

Check whether a file exists in Carol Storage.

Parameters

name: str. Filename.

Returns: bool
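
A common pattern is to guard delete with exists so that a missing file is not treated as an error. The sketch below is illustrative only: delete_if_exists is a hypothetical helper, not part of pycarol, and _FakeStorage is an in-memory stand-in that mirrors the two signatures documented above so the example runs without a Carol login.

```python
def delete_if_exists(storage, name):
    """Delete `name` only when it is present; return True if a delete was issued.

    `storage` is assumed to expose exists(name) and delete(name) as documented
    above. This helper is hypothetical and not part of pycarol.
    """
    if storage.exists(name):
        storage.delete(name)
        return True
    return False


class _FakeStorage:
    """In-memory stand-in (assumption) used only to exercise the helper."""

    def __init__(self, names):
        self._names = set(names)

    def exists(self, name, storage_space='app'):
        return name in self._names

    def delete(self, name):
        self._names.discard(name)


fake = _FakeStorage({'myfile.csv'})
first = delete_if_exists(fake, 'myfile.csv')   # file present, delete issued
second = delete_if_exists(fake, 'myfile.csv')  # already gone, nothing to do
```

With a real session, the same helper would presumably be called as delete_if_exists(Storage(Carol()), 'myfile.csv').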

files_storage_list(prefix='pipeline/', print_paths=False)[source]

Return all files in Carol Data Storage (CDS).

Parameters
  • prefix: str, default 'pipeline/'. Prefix of the folder used to filter the output.

  • print_paths: bool, default False. Print the tree structure of the files in CDS.

Returns: list of file paths.
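
To make the two parameters concrete, here is a small sketch of prefix filtering and tree-style printing over a plain list of paths. It is an assumption-laden illustration, not pycarol's implementation: filter_by_prefix, format_tree and the example paths are all hypothetical, and the exact layout that print_paths=True produces is not specified here.

```python
def filter_by_prefix(paths, prefix='pipeline/'):
    """Keep only the paths that start with `prefix` (hypothetical helper)."""
    return [p for p in paths if p.startswith(prefix)]


def format_tree(paths):
    """Render paths as an indented tree, one folder or file per line."""
    lines = []
    seen = set()
    for path in sorted(paths):
        parts = path.split('/')
        for depth in range(len(parts)):
            node = tuple(parts[:depth + 1])
            if node not in seen:  # emit each folder only once
                seen.add(node)
                lines.append('    ' * depth + parts[depth])
    return '\n'.join(lines)


# Hypothetical paths, standing in for what files_storage_list might return.
example_paths = [
    'pipeline/run1/output.parquet',
    'pipeline/run1/model.pkl',
    'storage/myfile.csv',
]
selected = filter_by_prefix(example_paths)
tree = format_tree(selected)
print(tree)
```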

load(name, format='pickle', parquet=False, cache=True, storage_space='app_storage', columns=None, chunk_size=None)[source]

Load file from CDS.

Parameters
  • name: str. Filename to load.

  • format: str. Possible values:

    1. pickle: It uses pickle.load to read the binary file.

    2. joblib: It uses joblib.load to read the binary file.

    3. file: It downloads the file and returns the local path to it.

  • parquet: bool, default False. It reads the file as parquet and returns a pandas DataFrame.

  • cache: bool, default True. Cache the file in the temp directory.

  • storage_space: str, default 'app_storage'. Internal storage space. Possible values:

    1. "app_storage": For raw storage.

    2. "golden": Data Model golden records.

    3. "staging": Staging records path.

    4. "staging_master": Staging records from Master.

    5. "staging_rejected": Staging records from Rejected.

    6. "view": Data Model View records.

    7. "app": App bucket.

    8. "golden_cds": CDS golden records.

    9. "staging_cds": Staging Intake.

  • columns: list, default None. Columns to fetch when using parquet=True.

  • chunk_size: int, default None. The size, in bytes, of each chunk of data when iterating. This must be a multiple of 256 KB per the API specification.
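
Because chunk_size must be a multiple of 256 KB, it can help to round a requested size before passing it in. The following is a minimal sketch; round_chunk_size and CHUNK_UNIT are hypothetical names, and it assumes 256 KB means 256 * 1024 bytes.

```python
CHUNK_UNIT = 256 * 1024  # 256 KB in bytes (assumes KB = 1024 bytes)


def round_chunk_size(requested_bytes):
    """Round a requested size up to the nearest multiple of 256 KB."""
    if requested_bytes <= 0:
        raise ValueError("chunk size must be a positive number of bytes")
    units = -(-requested_bytes // CHUNK_UNIT)  # ceiling division
    return units * CHUNK_UNIT


chunk_size = round_chunk_size(1_000_000)
print(chunk_size)
```

The result could then be passed as chunk_size=chunk_size to load or save.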

Usage:

Loading a file stored in CDS to a local path.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Loading an object.

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

my_dict = stg.load(name='myfile.json',  format='pickle')

It works for format='joblib' as well:

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)
my_dict = stg.load(name='myfile.json',  format='joblib')

Loading a pandas DataFrame.

import pandas as pd
from pycarol import Carol, Storage

login = Carol()
stg = Storage(login)

df = stg.load(name='myfile.parquet', parquet=True)

save(name, obj, format='pickle', parquet=False, cache=True, chunk_size=None, storage_space='app', storage_space_params=None)[source]

Save a file to CDS.

Parameters
  • name: str. Filename to be used when saving the obj.

  • obj: obj. It depends on the format parameter.

  • format: str. Possible values:

    1. pickle: It uses pickle.dump to save the binary file.

    2. joblib: It uses joblib.dump to save the binary file.

    3. file: It sends a local file directly to Carol.

  • parquet: bool, default False. It uses pandas.DataFrame.to_parquet to save. obj should be a pandas DataFrame.

  • cache: bool, default True. Cache the saved file in the temp directory.

  • chunk_size: int, default None. The size, in bytes, of each chunk of data when iterating. This must be a multiple of 256 KB per the API specification.

  • storage_space: str, default 'app'. Which bucket to use. Possible values:

    1. "golden": Data Model golden records.

    2. "staging": Staging records path.

    3. "staging_master": Staging records from Master.

    4. "staging_rejected": Staging records from Rejected.

    5. "view": Data Model View records.

    6. "app": App bucket.

    7. "golden_cds": CDS golden records.

    8. "staging_cds": Staging Intake.

  • storage_space_params: dict, default None. Parameters needed for the given storage_space.
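
As a rough picture of what format='pickle' involves, the sketch below performs only the local serialization step with the standard library: pickle.dump into a temporary file, then pickle.load to read it back. It does not upload anything to CDS, and the temporary file name is an assumption made for this illustration; how pycarol names or caches its temporary files is not covered here.

```python
import os
import pickle
import tempfile

my_dict = {"a": 1, "b": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical local file name, chosen only for this example.
    local_path = os.path.join(tmp_dir, 'myfile.pkl')

    # Serialize the object to a binary file.
    with open(local_path, 'wb') as fh:
        pickle.dump(my_dict, fh)

    # Read it back, mirroring the load side.
    with open(local_path, 'rb') as fh:
        restored = pickle.load(fh)

print(restored == my_dict)
```

Any object that pickle can serialize round-trips this way; objects that cannot be pickled would need format='file' or a different representation.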

Usage:

Saving a local file to CDS.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

stg.save(name='myfile.csv', obj='/local/file/.csv',  format='file')
# to load the file use:
path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Saving an object.

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict,  format='pickle')
# to load the file use:
my_dict = stg.load(name='myfile.json',  format='pickle')

It works for format='joblib' as well:

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict,  format='joblib')
# to load the file use:
my_dict = stg.load(name='myfile.json',  format='joblib')

Saving a pandas DataFrame.

import pandas as pd
from pycarol import Carol, Storage

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

login = Carol()
stg = Storage(login)

stg.save(name='myfile.parquet', obj=df, parquet=True)
# to load the file use:
df = stg.load(name='myfile.parquet', parquet=True)
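
The examples above use three modes: a local file path, an arbitrary Python object, and a pandas DataFrame. One possible way to pick the keyword arguments automatically is sketched below. save_options is a hypothetical helper, not part of pycarol, and its rules (anything with a to_parquet method is treated as a DataFrame, an existing file path is sent as a file, everything else is pickled) are assumptions made for illustration.

```python
import os


def save_options(obj):
    """Suggest keyword arguments for Storage.save based on the type of obj.

    Hypothetical helper: the dispatch rules below are assumptions.
    """
    if hasattr(obj, 'to_parquet'):
        # Duck-typed check for a pandas DataFrame.
        return {'parquet': True}
    if isinstance(obj, (str, os.PathLike)) and os.path.isfile(obj):
        # An existing local file is sent as-is.
        return {'format': 'file'}
    # Fall back to pickling arbitrary Python objects.
    return {'format': 'pickle'}


options = save_options({"a": 1, "b": 2})
print(options)
# With a real session this might be used as:
# stg.save(name='myfile.pkl', obj=my_dict, **save_options(my_dict))
```

The same options should be passed to load so that the file is read back with the mode it was saved in.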