Access Azure blob storage from within an Azure ML experiment

Solution 1:

Bottom Line Up Front: Use HTTP instead of HTTPS for accessing Azure storage.

When declaring BlobService, pass in protocol='http' to force the service to communicate over HTTP. Note that your container must be configured to allow requests over HTTP (which it does by default).

client = BlobService(STORAGE_ACCOUNT, STORAGE_KEY, protocol="http")

History and credit:

I posted a query on this topic to @AzureHelps and they opened a ticket on the MSDN forums: https://social.msdn.microsoft.com/Forums/azure/en-US/46166b22-47ae-4808-ab87-402388dd7a5c/trouble-writing-blob-storage-file-in-azure-ml-experiment?forum=MachineLearning&prof=required

Sudarshan Raghunathan replied with the magic. Here are the steps to make it easy for everyone to duplicate my fix:

  1. Download azure.zip which provides the required libraries: https://azuremlpackagesupport.blob.core.windows.net/python/azure.zip
  2. Upload it as a Dataset to Azure ML Studio
  3. Connect it to the Zip input on an Execute Python Script module
  4. Write your script as you would normally, being sure to create your BlobService object with protocol='http'
  5. Run the Experiment - you should now be able to write to blob storage.

Some example code can be found here: https://gist.github.com/drdarshan/92fff2a12ad9946892df

The code I used is the following; unlike the gist example, it doesn't first write a CSV to the file system, but instead sends the JSON as a text stream.

from azure.storage.blob import BlobService

def azureml_main(dataframe1 = None, dataframe2 = None):
    account_name = 'mystorageaccount'
    account_key = 'p8kSy3FACx...redacted...ebz3plQ=='
    container_name = 'upload'
    json_output_file_name = 'testfromml.json'
    json_orient = 'records'  # Can be index, records, split, columns, values
    json_force_ascii = False

    # protocol='http' is the workaround described above
    blob_service = BlobService(account_name, account_key, protocol='http')

    # Serialize the DataFrame to JSON and upload it directly as a block blob,
    # without touching the file system
    blob_service.put_block_blob_from_text(
        container_name,
        json_output_file_name,
        dataframe1.to_json(orient=json_orient, force_ascii=json_force_ascii))

    # Return value must be a sequence of pandas.DataFrame
    return dataframe1,
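
For comparison, here is a minimal sketch of the file-system route the gist takes: writing a CSV locally, then uploading it with put_block_blob_from_path from the same legacy SDK. The account details and file name are placeholders:

from azure.storage.blob import BlobService

def azureml_main(dataframe1 = None, dataframe2 = None):
    account_name = 'mystorageaccount'       # placeholder
    account_key = '...your storage key...'  # placeholder
    container_name = 'upload'
    csv_file_name = 'testfromml.csv'        # illustrative name

    # Write the DataFrame to the sandbox file system first...
    dataframe1.to_csv(csv_file_name, index=False)

    # ...then upload that file as a block blob, again over HTTP
    blob_service = BlobService(account_name, account_key, protocol='http')
    blob_service.put_block_blob_from_path(container_name, csv_file_name, csv_file_name)

    return dataframe1,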

Some thoughts:

  1. I would prefer it if the azure Python libraries were included by default. Microsoft imports hundreds of third-party libraries into Azure ML as part of the Anaconda distribution; they should also include those necessary to work with Azure. We're in Azure, we've committed to Azure. Embrace it.
  2. I don't like that I have to use HTTP, instead of HTTPS. Granted, this is internal Azure communication, so it's likely no big deal. However, most of the documentation suggests the use of SSL / HTTPS when working with blob storage, so I'd prefer to be able to do that.
  3. I still get random timeout errors in the Experiment. Sometimes the Python code executes in milliseconds; other times it runs for 60 seconds or more and then times out. This makes running it in an experiment very frustrating at times. However, when published as a Web Service I do not seem to have this problem.
  4. I would prefer that the experience of my local code more closely matched Azure ML. Locally, I can use HTTPS and never time out. It's blazing fast and easy to write. But moving to an Azure ML experiment means some debugging, nearly every time.

Huge props to Dan, Peter and Sudarshan, all from Microsoft, for their help in resolving this. I very much appreciate it!

Solution 2:

You are going down the correct path. The Execute Python Script module is meant for custom needs just like this. Your real issue is how to import existing Python script modules. The complete directions can be found here, but I will summarize for SO.

You will want to take the Azure Python SDK and zip it up, upload, then import into your module. I can look into why this is not there by default...

https://azure.microsoft.com/en-us/documentation/articles/machine-learning-execute-python-scripts/

Importing existing Python script modules

A common use-case for many data scientists is to incorporate existing Python scripts into Azure Machine Learning experiments. Instead of concatenating and pasting all the code into a single script box, the Execute Python Script module accepts a third input port to which a zip file that contains the Python modules can be connected. The file is then unzipped by the execution framework at runtime and the contents are added to the library path of the Python interpreter. The azureml_main entry point function can then import these modules directly.

As an example, consider the file Hello.py containing a simple “Hello, World” function.

Figure 4. User-defined function.
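
The screenshot of the file is not reproduced here; a plausible reconstruction of Hello.py, consistent with the print_hello function referenced in Figure 7, is:

def print_hello():
    # A trivial function whose output proves the zip contents were imported
    print('Hello, World!')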

Next, we can create a file Hello.zip containing Hello.py:

Figure 5. Zip file containing user-defined Python code.
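
The screenshot is omitted; one way to produce such an archive is from Python itself (any zip utility works equally well):

import zipfile

# Create Hello.zip with Hello.py at the archive root, since the execution
# framework unzips the bundle onto the interpreter's library path
with zipfile.ZipFile('Hello.zip', 'w') as archive:
    archive.write('Hello.py')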

Then, upload this as a dataset into Azure Machine Learning Studio. If we then create and run a simple experiment that uses the module:

Figure 6. Sample experiment with user-defined Python code uploaded as a zip file.

The module output shows that the zip file has been unpackaged and the function print_hello has indeed been run.

Figure 7. User-defined function in use inside the Execute Python Script module.
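
Putting the pieces together, a minimal sketch of what the script body might look like once Hello.zip is connected to the third input (assuming the no-argument print_hello reconstructed above):

from Hello import print_hello  # importable because the unzipped bundle is on the library path

def azureml_main(dataframe1 = None, dataframe2 = None):
    print_hello()  # the greeting appears in the module's output log
    return dataframe1,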

Solution 3:

As far as I know, you can use other packages via a zip file that you provide to the third input. The comments in the Python template script in Azure ML say:

If a zip file is connected to the third input port, it is unzipped under ".\Script Bundle". This directory is added to sys.path. Therefore, if your zip file contains a Python file mymodule.py, you can import it using: import mymodule

So you can package azure-storage-python as a zip file by clicking New, clicking Dataset, and then selecting From local file with the Zip file option to upload the ZIP file to your workspace.

For reference, you can find more information in the section How to Use Execute Python Script of the Execute Python Script documentation.