pycarol.storage
- class pycarol.storage.Storage(carol)
Handle all Carol storage interactions.
- Parameters
carol – class: pycarol.Carol
- delete(name)
Delete a file in Carol Storage.
- Parameters
name – str Filename to be deleted.
Returns:
- exists(name, storage_space='app')
Check if a file exists in Carol Storage.
- Parameters
name – str Filename
Returns: bool
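A common pattern is to call exists before delete so that a missing file is handled explicitly. The sketch below illustrates that guard without contacting Carol: InMemoryStorage and delete_if_exists are hypothetical names introduced for illustration, and the stand-in only mirrors the exists and delete method names documented here. How delete itself behaves on a missing file is not stated in this reference.

```python
class InMemoryStorage:
    """Hypothetical stand-in that mirrors the exists/delete method names."""

    def __init__(self, files):
        self._files = set(files)

    def exists(self, name, storage_space="app"):
        # storage_space is accepted only to match the documented signature.
        return name in self._files

    def delete(self, name):
        self._files.discard(name)


def delete_if_exists(storage, name):
    """Delete `name` only when the storage reports it; return True if deleted."""
    if storage.exists(name):
        storage.delete(name)
        return True
    return False


stg = InMemoryStorage({"model.pkl"})
first = delete_if_exists(stg, "model.pkl")   # file is present, so it is deleted
second = delete_if_exists(stg, "model.pkl")  # file is already gone
print(first, second)
```

Because delete_if_exists only relies on the two method names, the same helper should accept a real Storage instance, assuming its methods behave as described above.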
- files_storage_list(prefix='pipeline/', print_paths=False)
Return all files in Carol Data Storage (CDS).
- Parameters
prefix – str, default pipeline/ Prefix of the folder used to filter the output.
print_paths – bool, default False Print the tree structure of the files in CDS.
Returns: list of file paths.
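To make the prefix filter and the tree view concrete, here is a minimal sketch that works on a plain list of paths. The sample paths and the helpers build_tree and render_tree are hypothetical; the exact layout that print_paths=True produces is not specified in this reference, so the indentation below is only one possible rendering.

```python
def build_tree(paths):
    """Turn 'a/b/c' style paths into nested dictionaries."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree


def render_tree(tree, depth=0):
    """Return one indented line per folder or file, sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("    " * depth + name)
        lines.extend(render_tree(tree[name], depth + 1))
    return lines


# Hypothetical sample of what a list of file paths could look like.
all_paths = [
    "pipeline/run1/a.parquet",
    "pipeline/run1/b.parquet",
    "pipeline/run2/c.parquet",
    "other/x.csv",
]

# Keep only the paths under the requested prefix, as the prefix argument does.
pipeline_paths = [p for p in all_paths if p.startswith("pipeline/")]
tree_lines = render_tree(build_tree(pipeline_paths))
print("\n".join(tree_lines))
```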
- load(name, format='pickle', parquet=False, cache=True, storage_space='app_storage', columns=None, chunk_size=None)
Load file from CDS.
- Parameters
name – str. Filename to be loaded.
format –
str Possible values:
pickle: It uses pickle.load to read the binary file.
joblib: It uses joblib.load to read the binary file.
file: It downloads the file and returns its local path.
parquet – bool default False If True, the file is read as parquet and returned as a pandas DataFrame.
cache – bool default True Cache the downloaded file in the temp directory.
storage_space –
str default app_storage Internal Storage space.
"app_storage": For raw storage.
"golden": Data Model golden records.
"staging": Staging records path.
"staging_master": Staging records from Master.
"staging_rejected": Staging records from Rejected.
"view": Data Model View records.
"app": App bucket.
"golden_cds": CDS golden records.
"staging_cds": Staging Intake.
columns – list default None Columns to fetch when using parquet=True
chunk_size – int default None The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
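Since chunk_size must be a multiple of 256 KB, it can help to round a desired size before passing it on. The helper below is a hypothetical sketch, not part of pycarol, and it assumes that 256 KB means 256 * 1024 bytes.

```python
CHUNK_UNIT = 256 * 1024  # assumption: 256 KB taken as 262144 bytes


def round_chunk_size(requested_bytes):
    """Round a requested size up to the nearest positive multiple of 256 KB."""
    if requested_bytes <= 0:
        raise ValueError("chunk size must be a positive number of bytes")
    units = -(-requested_bytes // CHUNK_UNIT)  # ceiling division
    return units * CHUNK_UNIT


chunk_size = round_chunk_size(1_000_000)
print(chunk_size)  # 4 units of 262144 bytes = 1048576
```

The resulting value could then be passed as chunk_size to load or save.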
Usage:
Loading a file from CDS to a local path.
    from pycarol import Carol, Storage
    import pandas as pd

    login = Carol()
    stg = Storage(login)
    # format='file' returns the local path of the downloaded file
    path = stg.load(name='myfile.csv', format='file')
    pd.read_csv(path)
Loading an object.
    from pycarol import Carol, Storage

    login = Carol()
    stg = Storage(login)
    my_dict = stg.load(name='myfile.json', format='pickle')
It works for format='joblib' as well:
    from pycarol import Carol, Storage

    login = Carol()
    stg = Storage(login)
    my_dict = stg.load(name='myfile.json', format='joblib')
Loading a pandas DataFrame.
    import pandas as pd
    from pycarol import Carol, Storage

    login = Carol()
    stg = Storage(login)
    df = stg.load(name='myfile.parquet', parquet=True)
- save(name, obj, format='pickle', parquet=False, cache=True, chunk_size=None, storage_space='app', storage_space_params=None)
Save file to CDS
- Parameters
name – str. Filename to be used when saving the obj
obj – obj It depends on the format parameter.
format –
str Possible values:
pickle: It uses pickle.dump to save the binary file.
joblib: It uses joblib.dump to save the binary file.
file: It saves a local file sending it directly to Carol.
parquet – bool default False It uses pandas.DataFrame.to_parquet to save. obj should be a pandas DataFrame
cache – bool default True Cache the file saved in the temp directory.
chunk_size – int default None The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
storage_space –
str default app Which bucket to get. Possible values:
"golden": Data Model golden records.
"staging": Staging records path.
"staging_master": Staging records from Master.
"staging_rejected": Staging records from Rejected.
"view": Data Model View records.
"app": App bucket.
"golden_cds": CDS golden records.
"staging_cds": Staging Intake.
storage_space_params – dict default None Params needed for the given storage_space.
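A typo in storage_space is easy to make, so a small check before calling save can fail early with a readable message. The helper below is a hypothetical sketch that only encodes the values listed for save in this reference; whether the library accepts other values is not stated here.

```python
# Values listed for save's storage_space parameter in this reference.
SAVE_STORAGE_SPACES = (
    "golden",
    "staging",
    "staging_master",
    "staging_rejected",
    "view",
    "app",
    "golden_cds",
    "staging_cds",
)


def check_storage_space(storage_space="app"):
    """Return storage_space unchanged, or raise if it is not a listed value."""
    if storage_space not in SAVE_STORAGE_SPACES:
        allowed = ", ".join(SAVE_STORAGE_SPACES)
        raise ValueError(
            f"unknown storage_space {storage_space!r}; expected one of: {allowed}"
        )
    return storage_space


space = check_storage_space("golden_cds")
print(space)
```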
Usage:
Saving a local file to CDS.
    from pycarol import Carol, Storage
    import pandas as pd

    login = Carol()
    stg = Storage(login)
    # obj is the path of the local file to upload
    stg.save(name='myfile.csv', obj='/local/file/.csv', format='file')

    # to load the file use:
    path = stg.load(name='myfile.csv', format='file')
    pd.read_csv(path)
Saving an object.
    from pycarol import Carol, Storage

    my_dict = {"a": 1, "b": 2}
    login = Carol()
    stg = Storage(login)
    stg.save(name='myfile.json', obj=my_dict, format='pickle')

    # to load the file use:
    my_dict = stg.load(name='myfile.json', format='pickle')
It works for format='joblib' as well:
    from pycarol import Carol, Storage

    my_dict = {"a": 1, "b": 2}
    login = Carol()
    stg = Storage(login)
    stg.save(name='myfile.json', obj=my_dict, format='joblib')

    # to load the file use:
    my_dict = stg.load(name='myfile.json', format='joblib')
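The format='pickle' option is described above as using pickle.dump to write the binary file. As a rough local illustration of that serialization step alone, with no Carol connection, the sketch below round-trips a dictionary through a temporary file. The temporary directory and the file name myfile.pkl are hypothetical and unrelated to where the library caches files.

```python
import os
import pickle
import tempfile

my_dict = {"a": 1, "b": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "myfile.pkl")

    # Serialize the object to a binary file.
    with open(local_path, "wb") as fh:
        pickle.dump(my_dict, fh)

    # Read the binary file back into a Python object.
    with open(local_path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == my_dict)
```

Note that the saved bytes are a pickle stream regardless of the file extension, so a name ending in .json does not make the content JSON.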
Saving a pandas DataFrame.
    import pandas as pd
    from pycarol import Carol, Storage

    d = {'col1': [1, 2], 'col2': [3, 4]}
    df = pd.DataFrame(data=d)
    login = Carol()
    stg = Storage(login)
    # pass the DataFrame itself as obj when parquet=True
    stg.save(name='myfile.parquet', obj=df, parquet=True)

    # to load the file use:
    df = stg.load(name='myfile.parquet', parquet=True)