Some lesser-known features of Google Cloud Storage

When somebody says Google Cloud Storage, probably the first things that come to your mind are buckets, blobs, Nearline, Coldline, and the fact that you can set ACLs for buckets and objects in there... but let's have a look at some cool features which are not mentioned so often. All code examples are in this GitHub repository.

Transcoding

Transcoding is a feature that allows you to upload files compressed with gzip and have them served uncompressed when accessed. Practical use case: save $$$ by paying for compressed storage instead of full-size files, and by reducing the amount of transferred data.

To make this work, you need to upload the file gzip-compressed and set the Content-Encoding of the file in Storage to gzip. That's all. It is also good practice to set the Content-Type accordingly.
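If you don't yet have a gzip-compressed file at hand, here is a minimal sketch that produces one with Python's gzip module (test_file.txt is just an assumed input, matching the upload example below):

import gzip
import shutil

# compress test_file.txt into test_file.txt.gz, ready for upload
with open('test_file.txt', 'rb') as src, gzip.open('test_file.txt.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)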

So now, when you request the compressed file from Storage, it is automatically uncompressed in the response.

Let's see how this can be done in Python using the client library for Google Cloud Storage.

from settings import SERVICE_JSON_FILE, BUCKET_NAME
from google.cloud import storage

client = storage.Client.from_service_account_json(SERVICE_JSON_FILE)
bucket = storage.Bucket(client, BUCKET_NAME)

compressed_file = 'test_file.txt.gz'
# bigger chunk size (10 x 256 KB) for the upload
blob = bucket.blob(compressed_file, chunk_size=262144 * 10)
blob.content_encoding = 'gzip'  # tell Storage the object is stored gzip-compressed
blob.upload_from_filename(compressed_file, content_type='text/plain')
blob.make_public()
print(blob.public_url)

I am using a Service Account created in IAM & Admin with the role Storage Admin to access Cloud Storage with the client library. As I mentioned before, we need to set the Content-Encoding explicitly. I am also making the file public.

Now when I use wget to download the file from the public URL, the whole content of the file (uncompressed) is downloaded:

wget <public url>

It's interesting that the requests library downloads the file compressed and decompresses it afterwards. Since the Python library for Storage also uses requests, this method first downloads the file compressed and then decompresses it:

blob.download_to_filename('downloaded.txt')

To download the file compressed, you need to set the header Accept-Encoding: gzip. With wget, for example:

wget --header="Accept-Encoding: gzip" <public url>
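The same behaviour can be checked from Python with requests. A small sketch (the URL is the public URL printed above; note that reading from r.raw bypasses requests' automatic decompression):

import gzip
import requests

url = '<public url>'  # the blob's public URL printed above

# default behaviour: requests advertises gzip support and transparently
# decompresses the response, so r.content is already plain text
r = requests.get(url)
print(r.headers.get('Content-Encoding'))  # 'gzip'
print(r.content)

# reading from the raw stream skips the automatic decompression,
# so we get the gzip bytes as stored and have to decompress ourselves
r = requests.get(url, stream=True)
print(gzip.decompress(r.raw.read()))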

Browsers usually send such headers, so files are downloaded compressed and then uncompressed locally, which is another good use of this feature, i.e. serving static files.

Of course, compressing/decompressing makes sense for files which compress well, like text files. Images and other binary files, not so much, so keep that in mind. Everything depends on the particular case.
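To get a feel for which files are worth it, you can compare compressed and original sizes locally. A quick sketch (the file names are hypothetical examples; text usually compresses well, a JPEG barely at all):

import gzip

for name in ['notes.txt', 'photo.jpg']:  # hypothetical sample files
    with open(name, 'rb') as f:
        data = f.read()
    ratio = len(gzip.compress(data)) / len(data)
    print('{}: compressed to {:.0%} of original size'.format(name, ratio))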

A detailed description can be found here: https://cloud.google.com/storage/docs/transcoding


Object Lifecycle Management

Or, in plain English: "do something with objects in a bucket based on date and time". Use cases: archiving, deleting objects or object versions, moving objects to another storage class. Rules with actions on objects are defined per bucket. Currently two actions are supported:

  1. Delete (an object or some of an object's versions)
  2. SetStorageClass (change the storage class, e.g. to Nearline or Coldline)

Conditions for actions are:

  • Age (of the object)
  • CreatedBefore (based on a UTC date)
  • NumberOfNewerVersions (relevant for versioned objects; matches when an object has at least this many newer versions)
  • IsLive (relates to versioned objects; a live object is the current version, a non-live one is archived :))
  • MatchesStorageClass (matches when the object has a specific storage class set)

Setting lifecycle rules with the Python library looks like this:

from settings import SERVICE_JSON_FILE, BUCKET_NAME
from google.cloud import storage

client = storage.Client.from_service_account_json(SERVICE_JSON_FILE)
bucket = storage.Bucket(client, BUCKET_NAME)

rules = [
    {
        'action': {
            'type': 'SetStorageClass',
            'storageClass': 'COLDLINE'
        },
        'condition': {
            'age': 365
        }
    },

    {
        'action': {
            'type': 'Delete'
        },
        'condition': {
            'age': 365*10,
            'matchesStorageClass': ['COLDLINE']
        }
    }
]

bucket.lifecycle_rules = rules
bucket.update()

The rules set for the bucket in this example mean the following:

  1. When an object is older than 365 days (1 year), its storage class is set to Coldline (i.e. I hope I won't need it anymore, unless Armageddon happens)
  2. After 10 years, delete it, provided it is in Coldline
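To verify that the rules were stored, you can read them back from the bucket. A small sketch (reload() re-fetches the bucket's metadata from the API):

bucket.reload()  # re-fetch bucket metadata, including lifecycle rules
for rule in bucket.lifecycle_rules:
    print(rule)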

This feature is well suited for backups and similar use cases. The action doesn't take place immediately after a condition is reached (all conditions on an object need to be met for the action to be executed) but asynchronously, with some time lag. More info at this link: https://cloud.google.com/storage/docs/lifecycle


Object Versioning

Cloud Storage can keep versions of the same file. Versioning needs to be enabled at the bucket level. When versioning is enabled and a file with the same name is uploaded, a new version (or, in GCS terms, a generation) is created. As usual, this is best explained with an example.

from settings import SERVICE_JSON_FILE, BUCKET_NAME
from google.cloud import storage
from google.auth.transport.requests import AuthorizedSession
from pprint import pprint
from google.oauth2 import service_account

client = storage.Client.from_service_account_json(SERVICE_JSON_FILE)

credentials = service_account.Credentials.from_service_account_file(SERVICE_JSON_FILE)
credentials = credentials.with_scopes(['https://www.googleapis.com/auth/cloud-platform'])
authed_session = AuthorizedSession(credentials)

bucket = storage.Bucket(client, BUCKET_NAME)

# name of the file which will be created in Cloud Storage
FILENAME = 'test_versioned_file.txt'

# enable versioning for the bucket
bucket.versioning_enabled = True
bucket.update()

# upload different content three times under the same filename
blob = bucket.blob(FILENAME)
blob.upload_from_string('first version', content_type='text/plain')
blob.upload_from_string('second version', content_type='text/plain')
blob.upload_from_string('third version', content_type='text/plain')

# list all versions of objects in the bucket with a prefix equal to the filename
blob_versions_url = "https://www.googleapis.com/storage/v1/b/{bucket}/o?versions=true&prefix={filename}".format(bucket=BUCKET_NAME, filename=FILENAME)

# get versions
response = authed_session.get(blob_versions_url)
data = response.json()
versions = data['items']
pprint(versions)
# get first and latest generation
first_generation_url = versions[0]['mediaLink']
last_generation_url = versions[-1]['mediaLink']

resp = authed_session.get(first_generation_url)
pprint(resp.text)

resp = authed_session.get(last_generation_url)
pprint(resp.text)

print("current content ", blob.download_as_string())

The output of the program is:

[{'bucket': 'adventures-on-gcp',
  'contentType': 'text/plain',
  'crc32c': 'ZTopYw==',
  'etag': 'CL/os/GogtYCEAE=',
  'generation': '1504211601519679',
  'id': 'adventures-on-gcp/test_versioned_file.txt/1504211601519679',
  'kind': 'storage#object',
  'md5Hash': '6eI3FXDa7C57cPqk8PHquA==',
  'mediaLink': 'https://www.googleapis.com/download/storage/v1/b/adventures-on-gcp/o/test_versioned_file.txt?generation=1504211601519679&alt=media',
  'metageneration': '1',
  'name': 'test_versioned_file.txt',
  'selfLink': 'https://www.googleapis.com/storage/v1/b/adventures-on-gcp/o/test_versioned_file.txt',
  'size': '13',
  'storageClass': 'STANDARD',
  'timeCreated': '2017-08-31T20:33:21.440Z',
  'timeDeleted': '2017-08-31T20:33:21.894Z',
  'timeStorageClassUpdated': '2017-08-31T20:33:21.440Z',
  'updated': '2017-08-31T20:33:21.440Z'},
 {'bucket': 'adventures-on-gcp',
  'contentType': 'text/plain',
  'crc32c': 'M11WNQ==',
  'etag': 'CO7YyvGogtYCEAE=',
  'generation': '1504211601894510',
  'id': 'adventures-on-gcp/test_versioned_file.txt/1504211601894510',
  'kind': 'storage#object',
  'md5Hash': '8IS+N+2E6dDSoC1NS+WXRQ==',
  'mediaLink': 'https://www.googleapis.com/download/storage/v1/b/adventures-on-gcp/o/test_versioned_file.txt?generation=1504211601894510&alt=media',
  'metageneration': '1',
  'name': 'test_versioned_file.txt',
  'selfLink': 'https://www.googleapis.com/storage/v1/b/adventures-on-gcp/o/test_versioned_file.txt',
  'size': '14',
  'storageClass': 'STANDARD',
  'timeCreated': '2017-08-31T20:33:21.820Z',
  'timeDeleted': '2017-08-31T20:33:22.355Z',
  'timeStorageClassUpdated': '2017-08-31T20:33:21.820Z',
  'updated': '2017-08-31T20:33:21.820Z'},
 {'bucket': 'adventures-on-gcp',
  'contentType': 'text/plain',
  'crc32c': 'xXXOtw==',
  'etag': 'CKns5vGogtYCEAE=',
  'generation': '1504211602355753',
  'id': 'adventures-on-gcp/test_versioned_file.txt/1504211602355753',
  'kind': 'storage#object',
  'md5Hash': 'gE7JlW9qdJW8bWNO4NE6hw==',
  'mediaLink': 'https://www.googleapis.com/download/storage/v1/b/adventures-on-gcp/o/test_versioned_file.txt?generation=1504211602355753&alt=media',
  'metageneration': '1',
  'name': 'test_versioned_file.txt',
  'selfLink': 'https://www.googleapis.com/storage/v1/b/adventures-on-gcp/o/test_versioned_file.txt',
  'size': '13',
  'storageClass': 'STANDARD',
  'timeCreated': '2017-08-31T20:33:22.295Z',
  'timeStorageClassUpdated': '2017-08-31T20:33:22.295Z',
  'updated': '2017-08-31T20:33:22.295Z'}]
'first version'
'third version'
current content  b'third version'


Briefly, to explain the code: first I update the bucket so that it has versioning enabled, then I create a blob and upload three different contents under the same filename. Since the Python library for Cloud Storage currently doesn't support versioning, I am doing it with "raw" REST requests, using credentials from the service account file. Then I make a request to list all files in the bucket, using the prefix field as a filter to get only the versions of the related file, and I print info about the objects. Unless I missed something, there is currently no endpoint to get all versions of an object directly. I take the URLs of the first and latest generation of the file from the 'mediaLink' field, download both and print them. Finally, I download the current version of the file via the client library, which is the same as the latest generation downloaded before.
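Along the same lines, individual generations can be addressed directly with the generation query parameter. A minimal sketch reusing authed_session and the versions list from the example above:

# download a specific generation of the object
generation = versions[0]['generation']
object_url = "https://www.googleapis.com/storage/v1/b/{bucket}/o/{filename}".format(
    bucket=BUCKET_NAME, filename=FILENAME)
resp = authed_session.get(object_url, params={'generation': generation, 'alt': 'media'})
print(resp.text)

# delete just that generation; the other generations stay untouched
resp = authed_session.delete(object_url, params={'generation': generation})
print(resp.status_code)  # 204 on success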

The practical use is obvious: whenever you need to keep versions of the same file and make them available, this is the built-in solution. There are still some gotchas, so better read the more detailed info: https://cloud.google.com/storage/docs/object-versioning

