pycarol.storage

class pycarol.storage.Storage(carol)[source]

Handle all Carol storage interactions.

Parameters: carol – class: pycarol.Carol

delete(name)[source]

Delete a file in Carol Storage.

Parameters: name – str Filename to be deleted.

Returns:

exists(name, storage_space='app')[source]

Check if a file exists in Carol Storage.

Parameters: name – str Filename

Returns: bool
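
Since exists returns a bool, it can guard a call to delete. The helper below is a minimal sketch, not part of pycarol: delete_if_exists is a hypothetical name, and stg is assumed to be any object exposing exists(name) and delete(name), such as a Storage instance.

```python
def delete_if_exists(stg, name):
    """Delete `name` from Carol Storage only when it is present.

    `stg` is assumed to expose exists(name) and delete(name), as the
    Storage class documented here does. Returns True if a delete was
    issued and False if the file was not found.
    """
    if stg.exists(name):
        stg.delete(name)
        return True
    return False
```

With stg = Storage(Carol()), a call would look like delete_if_exists(stg, 'myfile.csv').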

files_storage_list(prefix='pipeline/', print_paths=False)[source]

Return all files in Carol Data Storage (CDS).

Parameters

prefix – str, default pipeline/ Folder prefix used to filter the output.
print_paths – bool, default False Print the tree structure of the files in CDS.

Returns: list of file paths.
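
The returned list can be filtered further on the client side. The sketch below assumes the list contains plain str paths, which the description above suggests but does not state. list_files_with_suffix is a hypothetical helper, not a pycarol function.

```python
def list_files_with_suffix(stg, suffix, prefix='pipeline/'):
    """Return the CDS paths under `prefix` that end with `suffix`.

    `stg` is assumed to expose files_storage_list(prefix=...) and to
    return a list of str paths.
    """
    paths = stg.files_storage_list(prefix=prefix)
    return [p for p in paths if p.endswith(suffix)]
```

For example, list_files_with_suffix(stg, '.parquet') would keep only the parquet files under pipeline/.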

load(name, format='pickle', parquet=False, cache=True, storage_space='app_storage', columns=None, chunk_size=None)[source]

Load file from CDS.

Parameters

name – str. Filename to be loaded
format –
str Possible values:
1. pickle: It uses pickle.load to read the binary file.
2. joblib: It uses joblib.load to read the binary file.
3. file: It downloads the file and returns the local path.
parquet – bool default False It reads the file as parquet and returns a pandas DataFrame.
cache – bool default True Cache the downloaded file in the temp directory.
storage_space –
str default app_storage Internal Storage space.
1. "app_storage": For raw storage.
2. "golden": Data Model golden records.
3. "staging": Staging records path
4. "staging_master": Staging records from Master
5. "staging_rejected": Staging records from Rejected
6. "view": Data Model View records
7. "app": App bucket
8. "golden_cds": CDS golden records
9. "staging_cds": Staging Intake.
columns – list default None Columns to fetch when using parquet=True
chunk_size – int default None The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
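
The chunk_size constraint can be checked before calling load or save. The sketch below rounds a requested size up to the next valid multiple. It assumes that "256 KB" means 262,144 bytes (256 × 1024). round_chunk_size and CHUNK_UNIT are hypothetical names, not part of pycarol.

```python
CHUNK_UNIT = 256 * 1024  # 262,144 bytes; assumes "256 KB" means 256 KiB


def round_chunk_size(requested_bytes):
    """Round `requested_bytes` up to the nearest multiple of CHUNK_UNIT."""
    if requested_bytes <= 0:
        raise ValueError("chunk size must be positive")
    units = -(-requested_bytes // CHUNK_UNIT)  # ceiling division
    return units * CHUNK_UNIT
```

The result could then be passed as chunk_size, for example chunk_size=round_chunk_size(1_000_000).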

Usage:

Loading a file stored in CDS as a local file.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Loading an object.

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

my_dict = stg.load(name='myfile.json',  format='pickle')

It works for format='joblib' as well:

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)
my_dict = stg.load(name='myfile.json',  format='joblib')

Loading a pandas DataFrame

import pandas as pd
from pycarol import Carol, Storage

login = Carol()
stg = Storage(login)

df = stg.load(name='myfile.parquet', parquet=True)

save(name, obj, format='pickle', parquet=False, cache=True, chunk_size=None, storage_space='app', storage_space_params=None)[source]

Save a file to CDS.

Parameters

name – str. Filename to be used when saving the obj
obj – obj It depends on the format parameter.
format –
str Possible values:
1. pickle: It uses pickle.dump to save the binary file.
2. joblib: It uses joblib.dump to save the binary file.
3. file: It saves a local file sending it directly to Carol.
parquet – bool default False It uses pandas.DataFrame.to_parquet to save. obj should be a pandas DataFrame
cache – bool default True Cache the file saved in the temp directory.
chunk_size – int default None The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
storage_space –
str default app Which bucket to get. Possible values:
1. "golden": Data Model golden records.
2. "staging": Staging records path
3. "staging_master": Staging records from Master
4. "staging_rejected": Staging records from Rejected
5. "view": Data Model View records
6. "app": App bucket
7. "golden_cds": CDS golden records
8. "staging_cds": Staging Intake.
storage_space_params – dict default None Params needed for the given storage_space
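
Because format='pickle' is documented as using pickle.dump, an object can be checked locally for picklability before it is sent to CDS. The sketch below is a local-only round trip through a temporary file. It does not contact Carol, and pickle_round_trip is a hypothetical helper, not the library's implementation.

```python
import os
import pickle
import tempfile


def pickle_round_trip(obj):
    """Serialize `obj` with pickle.dump to a temp file, then read it back."""
    fd, path = tempfile.mkstemp(suffix='.pickle')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f)
        with open(path, 'rb') as f:
            return pickle.load(f)
    finally:
        os.remove(path)  # always clean up the temporary file
```

If pickle_round_trip(my_obj) raises, a save call with format='pickle' would be expected to fail for the same object.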

Usage:

Saving a local file in CDS.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

stg.save(name='myfile.csv', obj='/local/file/.csv', format='file')
# to load the file use:
path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Saving an object.

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict,  format='pickle')
# to load the file use:
my_dict = stg.load(name='myfile.json',  format='pickle')

It works for format='joblib' as well:

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict,  format='joblib')
# to load the file use:
my_dict = stg.load(name='myfile.json',  format='joblib')

Saving a pandas DataFrame

import pandas as pd
from pycarol import Carol, Storage

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

login = Carol()
stg = Storage(login)

stg.save(name='myfile.parquet', obj=df, parquet=True)
# to load the file use:
df = stg.load(name='myfile.parquet', parquet=True)