pycarol.storage

class pycarol.storage.Storage(carol)[source]

Handle all Carol storage interactions.

Args:
carol: class: pycarol.Carol

delete(name)[source]

Delete a file in Carol Storage.

Args:

name: str
Filename to be deleted.

Returns:

exists(name)[source]

Check if a file exists in Carol Storage.

Args:

name: str
Filename

Returns: bool
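
exists and delete can be combined so that a delete is only issued for files that are present. The following is a minimal sketch, not part of pycarol: delete_if_exists is a hypothetical helper name, and it assumes only the exists(name) and delete(name) methods documented above.

```python
def delete_if_exists(storage, name):
    """Delete `name` from Carol Storage only when it is present.

    `storage` is assumed to be a pycarol Storage instance, or any object
    exposing the documented exists(name) and delete(name) methods.
    Returns True when a delete was issued, False when the file was absent.
    """
    if storage.exists(name):
        storage.delete(name)
        return True
    return False


# Hypothetical usage, assuming Carol credentials are configured:
# from pycarol import Carol, Storage
# stg = Storage(Carol())
# delete_if_exists(stg, 'myfile.csv')
```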

files_storage_list(prefix='pipeline/', print_paths=False)[source]

Return all files in Carol Data Storage (CDS).

Args:

prefix: str, default pipeline/
Prefix of the folder used to filter the output.
print_paths: bool, default False
Print the tree structure of the files in CDS.

Returns: list of file paths.
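
The returned list can be filtered further on the client side. The sketch below is an illustration only: list_files_with_suffix is a hypothetical helper, and it assumes that files_storage_list returns plain string paths.

```python
def list_files_with_suffix(storage, suffix, prefix='pipeline/'):
    """Return the sorted CDS paths under `prefix` that end with `suffix`.

    `storage` is assumed to be a pycarol Storage instance; the paths are
    assumed to be plain strings.
    """
    paths = storage.files_storage_list(prefix=prefix)
    return sorted(p for p in paths if p.endswith(suffix))


# Hypothetical usage, assuming Carol credentials are configured:
# from pycarol import Carol, Storage
# stg = Storage(Carol())
# parquet_files = list_files_with_suffix(stg, '.parquet')
```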

load(name, format='pickle', parquet=False, cache=True, storage_space='app_storage', columns=None, chunk_size=None)[source]

Load a file or object from Carol Storage.

Args:

name: str
Filename to be loaded.
format: str

Possible values:

  1. pickle: It uses pickle.load to read the binary file.
  2. joblib: It uses joblib.load to read the binary file.
  3. file: It downloads the file and returns the local path to it.
parquet: bool default False
If True, the file is read as parquet and returned as a pandas DataFrame.
cache: bool default True
Cache the loaded file in the temp directory.
storage_space: str default app_storage
Internal Storage space.
  1. “app_storage”: For raw storage.
  2. “golden”: Data Model golden records.
  3. “staging”: Staging records path
  4. “staging_master”: Staging records from Master
  5. “staging_rejected”: Staging records from Rejected
  6. “view”: Data Model View records
  7. “app”: App bucket
  8. “golden_cds”: CDS golden records
  9. “staging_cds”: Staging Intake.
columns: list default None
Columns to fetch when using parquet=True
chunk_size: int default None
The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
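
The storage_space values listed above can be checked on the client side before calling load. This is a minimal sketch, not part of pycarol: VALID_STORAGE_SPACES and check_storage_space are hypothetical names, and the set simply mirrors the nine values documented here.

```python
# Values copied from the storage_space list in this reference.
VALID_STORAGE_SPACES = frozenset({
    'app_storage', 'golden', 'staging', 'staging_master',
    'staging_rejected', 'view', 'app', 'golden_cds', 'staging_cds',
})


def check_storage_space(storage_space):
    """Return `storage_space` unchanged, or raise ValueError if unknown."""
    if storage_space not in VALID_STORAGE_SPACES:
        raise ValueError(
            f"Unknown storage_space {storage_space!r}; "
            f"expected one of {sorted(VALID_STORAGE_SPACES)}"
        )
    return storage_space


# Hypothetical usage with an existing Storage instance `stg`:
# df = stg.load(name='myfile.parquet', parquet=True,
#               storage_space=check_storage_space('app_storage'))
```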

Usage:

Loading a file stored in CDS to a local path.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Loading an object.

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

my_dict = stg.load(name='myfile.json', format='pickle')

It works for format='joblib' as well:

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)
my_dict = stg.load(name='myfile.json', format='joblib')

Loading a pandas DataFrame.

import pandas as pd
from pycarol import Carol, Storage

login = Carol()
stg = Storage(login)

df = stg.load(name='myfile.parquet', parquet=True)

save(name, obj, format='pickle', parquet=False, cache=True, chunk_size=None)[source]

Save a file or object to Carol Storage.

Args:

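name: str
Filename to be used when saving the obj.
obj: obj
Object to be saved. Its expected type depends on the other parameters: any serializable Python object for format pickle or joblib, the path to a local file for format file, or a pandas DataFrame when parquet=True.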
format: str

Possible values:

  1. pickle: It uses pickle.dump to save the binary file.
  2. joblib: It uses joblib.dump to save the binary file.
  3. file: It sends a local file directly to Carol. obj should be the path to that file.
parquet: bool default False
It uses pandas.DataFrame.to_parquet to save. obj should be a pandas DataFrame.
cache: bool default True
Cache the file saved in the temp directory.
chunk_size: int default None
The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
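
Because chunk_size must be a multiple of 256 KB, a requested size can be rounded up before it is passed to save or load. This is a minimal sketch under the assumption that 256 KB means 256 * 1024 bytes; CHUNK_MULTIPLE and round_chunk_size are hypothetical names, not part of pycarol.

```python
# Assumption: "256 KB" is interpreted as 256 * 1024 = 262144 bytes.
CHUNK_MULTIPLE = 256 * 1024


def round_chunk_size(n_bytes):
    """Round `n_bytes` up to the nearest multiple of CHUNK_MULTIPLE."""
    if n_bytes <= 0:
        raise ValueError("chunk size must be a positive number of bytes")
    # Ceiling division using integers only, then scale back to bytes.
    return -(-n_bytes // CHUNK_MULTIPLE) * CHUNK_MULTIPLE


# Hypothetical usage with an existing Storage instance `stg`:
# stg.save(name='myfile.csv', obj='/local/file/.csv', format='file',
#          chunk_size=round_chunk_size(1_000_000))
```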

Usage:

Saving a local file to CDS.

from pycarol import Carol, Storage
import pandas as pd
login = Carol()
stg = Storage(login)

stg.save(name='myfile.csv', obj='/local/file/.csv', format='file')
# to load the file use:
path = stg.load(name='myfile.csv', format='file')
pd.read_csv(path)

Saving an object.

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict, format='pickle')
# to load the file use:
my_dict = stg.load(name='myfile.json', format='pickle')

It works for format='joblib' as well:

my_dict = {"a":1, "b":2}

from pycarol import Carol, Storage
login = Carol()
stg = Storage(login)

stg.save(name='myfile.json', obj=my_dict, format='joblib')
# to load the file use:
my_dict = stg.load(name='myfile.json', format='joblib')

Saving a pandas DataFrame.

import pandas as pd
from pycarol import Carol, Storage

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

login = Carol()
stg = Storage(login)

stg.save(name='myfile.parquet', obj=df, parquet=True)
# to load the file use:
df = stg.load(name='myfile.parquet', parquet=True)