Loading Data
The easiest way to load data into a CAS server is by using the upload method on the CAS connection object. This method accepts a file path or URL that points to a file in one of several formats, including CSV, Excel, and SAS data sets. You can also pass a Pandas DataFrame object to the upload method to load the data from that DataFrame into a CAS table. We use the classic Iris data set in the following data loading example.
In [12]: out = conn.upload('https://raw.githubusercontent.com/' +
   ....:                   'pydata/pandas/master/pandas/tests/' +
   ....:                   'data/iris.csv')
In [13]: out
Out[13]:
[caslib]
'CASUSER(username)'
[tableName]
'IRIS'
[casTable]
CASTable('IRIS', caslib='CASUSER(username)')
+ Elapsed: 0.0629s, user: 0.037s, sys: 0.021s, mem: 48.4mb
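The same upload call can also be pointed at a local file. The following is a minimal sketch, not a definitive recipe: write_sample_csv and upload_csv are hypothetical helper names, the two data rows are illustrative, and the casout argument (a dict with name and replace keys) is an assumption that you should check against your installed SWAT version. The helper takes an already-open connection as a parameter, so the block itself runs without a CAS server.

```python
import csv
import tempfile


def write_sample_csv(rows, header):
    """Write rows to a temporary CSV file and return its path."""
    handle = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False
    )
    with handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return handle.name


def upload_csv(conn, path, table_name):
    """Upload a local CSV file and return the CASTable from the results.

    Assumption: conn is an open CAS connection whose upload method accepts
    a casout dict with 'name' and 'replace' keys.
    """
    out = conn.upload(path, casout=dict(name=table_name, replace=True))
    # The results expose the new table under the casTable key.
    return out.casTable


header = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Name"]
rows = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
]
sample_path = write_sample_csv(rows, header)

# With a live connection (not created here), usage would look like:
#   tbl = upload_csv(conn, sample_path, "iris_sample")
#   tbl.head()
```

Passing the connection in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real server.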
The output from the upload method is, again, a CASResults object. It contains the name of the created table, the CASLib that the table was created in, and a CASTable object that you can use to interact with the table on the server. CASTable objects have all of the same CAS action set and action methods as the connection that created them. They also include many of the methods that are defined by Pandas DataFrames, so you can operate on them as if they were local DataFrames. However, no data is retrieved from the server until you explicitly fetch it or call a method that returns data from the table (such as head or tail). Until then, all operations are simply combined on the client side, essentially creating a client-side view.
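The deferred behavior can be pictured with a small toy model. This sketch is not SWAT's implementation: FakeServer and LazyView are hypothetical names, the rows are illustrative, and the "server" is just a list in memory. It only shows the idea that operations accumulate on the client and a request is made when data is actually asked for.

```python
class FakeServer:
    """Hypothetical stand-in for a remote table; counts incoming requests."""

    def __init__(self, rows):
        self.rows = rows
        self.requests = 0

    def fetch(self, columns, predicate, limit):
        self.requests += 1
        selected = [r for r in self.rows if predicate is None or predicate(r)]
        if columns is not None:
            selected = [{c: r[c] for c in columns} for r in selected]
        return selected[:limit]


class LazyView:
    """Client-side view: records operations, fetches only in head()."""

    def __init__(self, server, columns=None, predicate=None):
        self._server = server
        self._columns = columns
        self._predicate = predicate

    def select(self, *columns):
        # No server traffic: just remember which columns are wanted.
        return LazyView(self._server, list(columns), self._predicate)

    def where(self, predicate):
        # No server traffic: combine the new filter with any earlier one.
        previous = self._predicate
        if previous is None:
            combined = predicate
        else:
            combined = lambda row: previous(row) and predicate(row)
        return LazyView(self._server, self._columns, combined)

    def head(self, n=5):
        # The only method here that actually contacts the server.
        return self._server.fetch(self._columns, self._predicate, n)


server = FakeServer([
    {"SepalLength": 5.1, "Name": "Iris-setosa"},
    {"SepalLength": 7.0, "Name": "Iris-versicolor"},
    {"SepalLength": 6.3, "Name": "Iris-virginica"},
])

view = LazyView(server).where(lambda r: r["SepalLength"] > 6.0).select("Name")
requests_before_head = server.requests  # nothing has been fetched yet
long_sepal_names = view.head()
print(requests_before_head, server.requests, long_sepal_names)
```

Each chained call returns a new view object rather than mutating the old one, which is one simple way to let several views share the same underlying table.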
We can use actions such as tableinfo and columninfo to access general information about the table itself and its columns.
# Store CASTable object in its own variable.
In [14]: iris = out.casTable
# Call the tableinfo action on the CASTable object.
In [15]: iris.tableinfo()
Out[15]:
[TableInfo]
   Name  Rows  Columns  Encoding  CreateTimeFormatted  \
0  IRIS   150        5     utf-8   01Nov2016:16:38:59
     ModTimeFormatted  JavaCharSet    CreateTime       ModTime  \
0  01Nov2016:16:38:59         UTF8  1.793638e+09  1.793638e+09
   Global  Repeated  View  SourceName  SourceCaslib  Compressed  \
0       0         0     0                                     0
    Creator  Modifier
0  username
+ Elapsed: 0.000856s, mem: 0.104mb
# Call the columninfo action on the CASTable.
In [16]: iris.columninfo()
Out[16]:
[ColumnInfo]
        Column  ID     Type  RawLength  FormattedLength  NFL  NFD
0  SepalLength   1   double          8               12    0    0
1   SepalWidth   2   double          8               12    0    0
2  PetalLength   3   double          8               12    0    0
3   PetalWidth   4   double          8               12    0    0
4         Name   5  varchar         15               15    0    0
+ Elapsed: 0.000727s, mem: 0.175mb
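In the tableinfo output, CreateTime and ModTime are raw numbers, while CreateTimeFormatted and ModTimeFormatted show the same instants as text. The sketch below assumes these numbers follow the usual SAS datetime convention of seconds counted from 1 January 1960, which is consistent with the values shown. The helper names are hypothetical and are not part of SWAT.

```python
from datetime import datetime, timedelta

# Assumption: SAS datetime values count seconds from this epoch.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_datetime_to_python(seconds):
    """Convert a SAS datetime value (seconds since 1960) to a datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)


def python_to_sas_datetime(moment):
    """Convert a naive datetime to a SAS datetime value in seconds."""
    return (moment - SAS_EPOCH).total_seconds()


# The formatted creation time from the tableinfo output: 01Nov2016:16:38:59
created = datetime(2016, 11, 1, 16, 38, 59)
created_seconds = python_to_sas_datetime(created)

# Formatted the way the table displays it, this matches the CreateTime column.
displayed = "{:.6e}".format(created_seconds)
print(created_seconds, displayed)  # 1793637539.0 1.793638e+09
```

Going the other way, sas_datetime_to_python(created_seconds) returns the original datetime, so a raw CreateTime value can be turned into a Python object for comparisons or arithmetic.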
Now that we have some data, let’s run some more interesting CAS actions on it.
Executing Actions on CAS Tables
The simple action set that comes with CAS contains some basic analytic actions. You can use either the help action or the IPython ? operator to view the available actions.
In [17]: conn.simple?
Type: Simple
String form: <swat.cas.actions.Simple object at 0x4582b10>
File: swat/cas/actions.py
Definition: conn.simple(self, *args, **kwargs)
Docstring :
Analytics
Actions
-------
simple.correlation : Generates a matrix of Pearson product-moment
                     correlation coefficients
simple.crosstab    : Performs one-way or two-way tabulations
simple.distinct    : Computes the distinct number of values of the
                     variables in the variable list
simple.freq        : Generates a frequency distribution for one or
                     more variables
simple.groupby     : Builds BY groups in terms of the variable value
                     combinations given the variables in the variable
                     list
simple.mdsummary   : Calculates multidimensional summaries of numeric
                     variables
simple.numrows     : Shows the number of rows in a Cloud Analytic
                     Services table
simple.paracoord   : Generates a parallel coordinates plot of the
                     variables in the variable list
simple.regression  : Performs a linear regression up to 3rd-order
                     polynomials
simple.summary     : Generates descriptive statistics of numeric
                     variables such as the sample mean, sample
                     variance, sample size, sum of squares, and so on
simple.topk        : Returns the top-K and bottom-K distinct values of
                     each variable included in the variable list based
                     on a user-specified ranking order
Let’s run the summary action on our CAS table.
In [18]: summ = iris.summary()
In [19]: summ
Out[19]:
[Summary]
Descriptive Statistics for IRIS
        Column  Min  Max      N  NMiss      Mean    Sum       Std  \
0  SepalLength  4.3  7.9  150.0    0.0  5.843333  876.5  0.828066
1   SepalWidth  2.0  4.4  150.0    0.0  3.054000  458.1  0.433594
2  PetalLength  1.0  6.9  150.0    0.0  3.758667  563.8  1.764420
3   PetalWidth  0.1  2.5  150.0    0.0  1.198667  179.8  0.763161
     StdErr       Var      USS         CSS         CV     TValue  \
0  0.067611  0.685694  5223.85  102.168333  14.171126  86.425375
1  0.035403  0.188004  1427.05   28.012600  14.197587  86.264297
2  0.144064  3.113179  2583.00  463.863733  46.942721  26.090198
3  0.062312  0.582414