pycarol.storage
-
class pycarol.storage.Storage(carol)
Handle all Carol storage interactions.
Args:
- carol: pycarol.Carol instance
-
delete(name)
Delete a file from Carol Storage.
Args:
- name: str
- Filename to be deleted.
Returns:
-
exists(name)
Check whether a file exists in Carol Storage.
Args:
- name: str
- Filename to check.
Returns: bool
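The two methods above combine naturally when a missing file should not be treated as an error. The following is a minimal sketch, not part of pycarol: `delete_if_exists` is a hypothetical helper, and it assumes only that `stg` exposes the `exists(name)` and `delete(name)` methods documented above.

```python
def delete_if_exists(stg, name):
    """Delete `name` from Carol Storage only if it is present.

    `stg` is expected to be a pycarol Storage instance, or any object that
    offers the same `exists` and `delete` methods (assumption).
    Returns True if a delete was issued, False if the file was not found.
    """
    if stg.exists(name):
        stg.delete(name)
        return True
    return False
```

With a live connection this would be called as `delete_if_exists(Storage(Carol()), 'myfile.csv')`.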
-
files_storage_list(prefix='pipeline/', print_paths=False)
Return all files in Carol Data Storage (CDS).
Args:
- prefix: str, default pipeline/
- Prefix of the folder used to filter the output.
- print_paths: bool, default False
- Print the tree structure of the files in CDS.
Returns: list of file paths.
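Because the method returns a plain list, further filtering can be done on the client side. The sketch below is illustrative only: `list_files_with_suffix` is a hypothetical helper, and it assumes the returned entries are path strings.

```python
def list_files_with_suffix(stg, suffix, prefix='pipeline/'):
    """Return the sorted CDS paths under `prefix` that end with `suffix`.

    Assumes `stg.files_storage_list` returns an iterable of path strings,
    as the "list of file paths" return description suggests.
    """
    paths = stg.files_storage_list(prefix=prefix, print_paths=False)
    return sorted(p for p in paths if p.endswith(suffix))
```

For example, `list_files_with_suffix(stg, '.parquet')` would keep only the parquet files under the default `pipeline/` prefix.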
-
load(name, format='pickle', parquet=False, cache=True, storage_space='app_storage', columns=None, chunk_size=None)
Args:
- name: str
- Filename to be loaded.
- format: str
Possible values:
- pickle: It uses pickle.load to read the binary file.
- joblib: It uses joblib.load to read the binary file.
- file: It downloads the file and returns the local path to it.
- parquet: bool default False
- It uses pandas.read_parquet to load the file into a pandas DataFrame.
- cache: bool default True
- Cache the loaded file in the temp directory.
- storage_space: str default app_storage
- Internal Storage space.
- “app_storage”: For raw storage.
- “golden”: Data Model golden records.
- “staging”: Staging records path
- “staging_master”: Staging records from Master
- “staging_rejected”: Staging records from Rejected
- “view”: Data Model View records
- “app”: App bucket
- “golden_cds”: CDS golden records
- “staging_cds”: Staging Intake.
- columns: list default None
- Columns to fetch when using parquet=True
- chunk_size: int default None
- The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
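The `chunk_size` constraint can be met by rounding a desired size to a valid multiple before passing it in. This is a minimal sketch under one assumption: "256 KB" is read as 256 * 1024 = 262,144 bytes. `align_chunk_size` is a hypothetical helper, not a pycarol function.

```python
# Assumed interpretation of "256 KB": 256 * 1024 bytes.
CHUNK_UNIT = 256 * 1024


def align_chunk_size(requested_bytes):
    """Round `requested_bytes` down to a multiple of 256 KB.

    At least one full unit is always returned, so small requests are
    rounded up to 262,144 bytes rather than down to zero.
    """
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    units = max(1, requested_bytes // CHUNK_UNIT)
    return units * CHUNK_UNIT
```

The result can then be passed as `chunk_size=align_chunk_size(1_000_000)`.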
Usage:
Loading a file from CDS as a local file.

    from pycarol import Carol, Storage
    import pandas as pd

    login = Carol()
    stg = Storage(login)
    # format='file' returns the local path of the downloaded file
    path = stg.load(name='myfile.csv', format='file')
    pd.read_csv(path)

Loading an object.

    from pycarol import Carol, Storage

    login = Carol()
    stg = Storage(login)
    my_dict = stg.load(name='myfile.json', format='pickle')

It works for format=joblib as well.

    from pycarol import Carol, Storage

    login = Carol()
    stg = Storage(login)
    my_dict = stg.load(name='myfile.json', format='joblib')

Loading a pandas DataFrame.

    import pandas as pd
    from pycarol import Carol, Storage

    login = Carol()
    stg = Storage(login)
    df = stg.load(name='myfile.parquet', parquet=True)
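When a file may not have been saved yet, `load` can be guarded with `exists`. The sketch below is a hypothetical helper (`load_or_default` is not part of pycarol) and assumes that `exists` checks the same location that `load` reads from with the given arguments.

```python
def load_or_default(stg, name, default=None, **load_kwargs):
    """Load `name` from Carol Storage, or return `default` if it is absent.

    Extra keyword arguments (for example format='joblib' or parquet=True)
    are forwarded unchanged to `stg.load`.
    """
    if not stg.exists(name):
        return default
    return stg.load(name=name, **load_kwargs)
```

For instance, `load_or_default(stg, 'myfile.json', default={}, format='pickle')` would return an empty dict on the first run.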
-
save(name, obj, format='pickle', parquet=False, cache=True, chunk_size=None, storage_space='app', storage_space_params=None)
Args:
- name: str
- Filename to be used when saving the obj.
- obj: object
- The object to save. Its expected type depends on the format parameter.
- format: str
Possible values:
- pickle: It uses pickle.dump to save the binary file.
- joblib: It uses joblib.dump to save the binary file.
- file: It saves a local file sending it directly to Carol.
- parquet: bool default False
- It uses pandas.DataFrame.to_parquet to save. obj should be a pandas DataFrame
- cache: bool default True
- Cache the file saved in the temp directory.
- chunk_size: int default None
- The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
- storage_space: str default app
Which bucket to save to. Possible values:
- “golden”: Data Model golden records.
- “staging”: Staging records path
- “staging_master”: Staging records from Master
- “staging_rejected”: Staging records from Rejected
- “view”: Data Model View records
- “app”: App bucket
- “golden_cds”: CDS golden records
- “staging_cds”: Staging Intake.
- storage_space_params: dict default None
- Params needed for the given storage_space
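A typo in `storage_space` is easier to catch before any upload starts. The sketch below checks a value against the options listed above for `save`; `check_storage_space` and `SAVE_STORAGE_SPACES` are hypothetical names, and the library itself may accept values that this list does not include.

```python
# Values listed for `storage_space` in the save() documentation above.
SAVE_STORAGE_SPACES = frozenset({
    "golden",
    "staging",
    "staging_master",
    "staging_rejected",
    "view",
    "app",
    "golden_cds",
    "staging_cds",
})


def check_storage_space(storage_space):
    """Return `storage_space` unchanged if it is one of the listed values."""
    if storage_space not in SAVE_STORAGE_SPACES:
        raise ValueError(
            f"unknown storage_space {storage_space!r}; "
            f"expected one of {sorted(SAVE_STORAGE_SPACES)}"
        )
    return storage_space
```

It can be used inline, for example `stg.save(name, obj, storage_space=check_storage_space('app'))`.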
Usage:
Saving a local file in CDS.

    from pycarol import Carol, Storage
    import pandas as pd

    login = Carol()
    stg = Storage(login)
    stg.save(name='myfile.csv', obj='/local/file/.csv', format='file')
    # to load the file use:
    path = stg.load(name='myfile.csv', format='file')
    pd.read_csv(path)

Saving an object.

    from pycarol import Carol, Storage

    my_dict = {"a": 1, "b": 2}
    login = Carol()
    stg = Storage(login)
    stg.save(name='myfile.json', obj=my_dict, format='pickle')
    # to load the file use:
    my_dict = stg.load(name='myfile.json', format='pickle')

It works for format=joblib as well.

    from pycarol import Carol, Storage

    my_dict = {"a": 1, "b": 2}
    login = Carol()
    stg = Storage(login)
    stg.save(name='myfile.json', obj=my_dict, format='joblib')
    # to load the file use:
    my_dict = stg.load(name='myfile.json', format='joblib')

Saving a pandas DataFrame.

    import pandas as pd
    from pycarol import Carol, Storage

    d = {'col1': [1, 2], 'col2': [3, 4]}
    df = pd.DataFrame(data=d)
    login = Carol()
    stg = Storage(login)
    # pass the DataFrame itself as obj
    stg.save(name='myfile.parquet', obj=df, parquet=True)
    # to load the file use:
    df = stg.load(name='myfile.parquet', parquet=True)
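The save-then-load pattern used in the examples above can be wrapped into a single round-trip check. This is a minimal sketch for plain Python objects that support `==` (dicts, lists, strings), not for DataFrames; `save_and_verify` is a hypothetical helper and relies on the default storage spaces, as the examples do.

```python
def save_and_verify(stg, name, obj, format='pickle'):
    """Save `obj` under `name`, load it back, and confirm both are equal.

    Intended for objects whose `==` returns a single bool. Raises
    ValueError if the reloaded object differs from the original.
    """
    stg.save(name=name, obj=obj, format=format)
    loaded = stg.load(name=name, format=format)
    if loaded != obj:
        raise ValueError(f"round trip mismatch for {name!r}")
    return loaded
```

For example, `save_and_verify(stg, 'myfile.json', {"a": 1, "b": 2})` returns the reloaded dict when the round trip succeeds.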