pycarol.storage

class pycarol.storage.Storage(carol)

Handle all Carol storage interactions.

Args:

carol: pycarol.Carol

delete(name)

Delete a file in Carol Storage.

Args:

name: str

Filename to be deleted.

Returns:

exists(name)

Check whether a file exists in Carol Storage.

Args:

name: str

Filename to check.

Returns: bool
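A common pattern is to call exists before delete, so that a missing file is handled explicitly. The sketch below is illustrative only: `delete_if_exists` and `InMemoryStorage` are hypothetical names, not part of pycarol. The stand-in class only mimics the `exists(name)` and `delete(name)` signatures documented here so the example runs without a Carol login. How `delete` behaves for a missing file is not stated above, which is why the check is made first.

```python
class InMemoryStorage:
    """Hypothetical stand-in exposing the exists/delete signatures documented above."""

    def __init__(self, names):
        self._names = set(names)

    def exists(self, name):
        return name in self._names

    def delete(self, name):
        self._names.discard(name)


def delete_if_exists(stg, name):
    """Delete `name` only when it is present; report whether a delete was issued."""
    if stg.exists(name):
        stg.delete(name)
        return True
    return False


stg = InMemoryStorage(["myfile.csv"])
first = delete_if_exists(stg, "myfile.csv")   # file present, delete issued
second = delete_if_exists(stg, "myfile.csv")  # already gone, nothing to do
```

With a real session, a `Storage(Carol())` instance would be passed in place of the stand-in.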

files_storage_list(prefix='pipeline/', print_paths=False)

Return all files in Carol Data Storage (CDS).

Args:

prefix: str, default pipeline/

Folder prefix used to filter the output.

print_paths: bool, default False

Print the tree structure of the files in CDS

Returns: list of file paths.
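To make the two parameters concrete, the sketch below shows what filtering by a prefix and printing a tree of paths could look like on a plain list of strings. It is not the pycarol implementation: `filter_by_prefix`, `format_tree`, and the sample paths are hypothetical and exist only for illustration.

```python
def filter_by_prefix(paths, prefix="pipeline/"):
    """Keep only the paths that start with `prefix`."""
    return [p for p in paths if p.startswith(prefix)]


def format_tree(paths):
    """Render paths as an indented tree, one line per folder or file."""
    lines = []
    seen = set()
    for path in sorted(paths):
        parts = path.split("/")
        for depth in range(len(parts)):
            node = tuple(parts[: depth + 1])
            if node not in seen:
                seen.add(node)
                lines.append("  " * depth + parts[depth])
    return "\n".join(lines)


# Hypothetical listing, for illustration only.
paths = ["pipeline/a/x.parquet", "pipeline/a/y.parquet", "other/z.csv"]
selected = filter_by_prefix(paths)
tree = format_tree(selected)
print(tree)
```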

load(name, format='pickle', parquet=False, cache=True, storage_space='app_storage', columns=None, chunk_size=None)

Args:

name: str.

Filename to be loaded.

format: str

Possible values:

pickle: It uses pickle.load to read the binary file.

joblib: It uses joblib.load to read the binary file.

file: It downloads the file and returns the local path.

parquet: bool default False

If True, the file is read as parquet and returned as a pandas DataFrame.

cache: bool default True

Cache the downloaded file in the temp directory.

storage_space: str default app_storage

Internal Storage space.

"app_storage": For raw storage.

"golden": Data Model golden records.

"staging": Staging records path.

"staging_master": Staging records from Master.

"staging_rejected": Staging records from Rejected.

"view": Data Model View records.

"app": App bucket.

"golden_cds": CDS golden records.

"staging_cds": Staging Intake.

columns: list default None

Columns to fetch when using parquet=True

chunk_size: int default None

The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
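Because storage_space is a free-form string, a typo may only surface once the request is made. The sketch below checks a value against the names listed above before calling load. `STORAGE_SPACES` and `check_storage_space` are hypothetical helpers, not part of pycarol, and the set simply mirrors this page, so it may lag behind the library.

```python
# Names copied from the storage_space list above.
STORAGE_SPACES = frozenset({
    "app_storage", "golden", "staging", "staging_master", "staging_rejected",
    "view", "app", "golden_cds", "staging_cds",
})


def check_storage_space(storage_space="app_storage"):
    """Return storage_space unchanged, or raise ValueError for an unknown name."""
    if storage_space not in STORAGE_SPACES:
        raise ValueError(
            f"unknown storage_space {storage_space!r}; "
            f"expected one of {sorted(STORAGE_SPACES)}"
        )
    return storage_space


space = check_storage_space("golden_cds")
```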

Usage:

Loading a file from CDS to a local path.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Loading an object.

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

my_dict = stg.load(name='myfile.json', format='pickle')

It works for format='joblib' as well:

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)
my_dict = stg.load(name='myfile.json', format='joblib')

Loading a pandas DataFrame.

import pandas as pd
from pycarol import Carol, Storage

login = Carol()
stg = Storage(login)

df = stg.load(name='myfile.parquet', parquet=True)

save(name, obj, format='pickle', parquet=False, cache=True, chunk_size=None)

Args:

name: str.

Filename to be used when saving the obj

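obj: object

Object to save. The expected type depends on the format parameter: a local file path for format='file', a pandas DataFrame when parquet=True, or any serializable object for pickle and joblib.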

format: str

Possible values:

pickle: It uses pickle.dump to save the binary file.

joblib: It uses joblib.dump to save the binary file.

file: It sends a local file directly to Carol; obj is the path to that file.

parquet: bool default False

It uses pandas.DataFrame.to_parquet to save. obj should be a pandas DataFrame.

cache: bool default True

Cache the file saved in the temp directory.

chunk_size: int default None

The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
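The 256 KB constraint on chunk_size can be checked, or a rough target rounded up, before calling save. The sketch below assumes 1 KB = 1024 bytes, so one unit is 262144 bytes; `CHUNK_UNIT`, `is_valid_chunk_size`, and `round_up_chunk_size` are hypothetical helpers, not part of pycarol.

```python
CHUNK_UNIT = 256 * 1024  # 256 KB in bytes, assuming 1 KB = 1024 bytes


def is_valid_chunk_size(chunk_size):
    """True for None (the default) or a positive multiple of 256 KB."""
    if chunk_size is None:
        return True
    return chunk_size > 0 and chunk_size % CHUNK_UNIT == 0


def round_up_chunk_size(n_bytes):
    """Smallest multiple of 256 KB that is >= n_bytes (at least one unit)."""
    units = max(1, -(-n_bytes // CHUNK_UNIT))  # ceiling division
    return units * CHUNK_UNIT


# Roughly 1 MB requested; rounded up to 4 units of 256 KB.
chunk_size = round_up_chunk_size(1_000_000)
```

The resulting value could then be passed as `chunk_size=chunk_size` to save or load.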

Usage:

Saving a local file to CDS.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

stg.save(name='myfile.csv', obj='/local/file/.csv', format='file')
# to load the file use:
path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Saving an object.

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict, format='pickle')
# to load the file use:
my_dict = stg.load(name='myfile.json', format='pickle')

It works for format='joblib' as well:

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict, format='joblib')
# to load the file use:
my_dict = stg.load(name='myfile.json', format='joblib')
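For intuition about what format='pickle' does to the object, the sketch below runs the serialization round trip locally with the standard pickle module and an in-memory buffer. It shows only the serialization step, not the upload to Carol, and is not the pycarol implementation. Note that a name such as 'myfile.json' does not make the stored bytes JSON: the content is pickle data, and pickle.load should only be used on data from a trusted source.

```python
import io
import pickle

my_dict = {"a": 1, "b": 2}

# Serialize to an in-memory buffer, as pickle.dump would to a file.
buffer = io.BytesIO()
pickle.dump(my_dict, buffer)

# Rewind and deserialize to get an equal, independent copy.
buffer.seek(0)
restored = pickle.load(buffer)
```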

Saving a pandas DataFrame.

import pandas as pd
from pycarol import Carol, Storage

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

login = Carol()
stg = Storage(login)

stg.save(name='myfile.parquet', obj=df, parquet=True)
# to load the file use:
df = stg.load(name='myfile.parquet', parquet=True)