
How-To Tutorials - Data


Oracle E-Business Suite: Entering and Reconciling Bank Statements

Packt
23 Aug 2011
4 min read
Oracle E-Business Suite 12 Financials Cookbook: take the hard work out of your daily interactions with E-Business Suite financials by using the 50+ recipes from this cookbook.

Entering bank statements

Bank statements are downloaded from the bank to a local directory. Once the file is received, the bank account balance and statement information can be loaded into the bank statement open interface tables, using the bank statement loader program or a custom loader program. The files can also be loaded automatically using an interface program or using the XML Gateway. Bank statements can also be entered manually. In this recipe, we will look at how to enter bank statements.

Getting ready

The bank statement shown next has been loaded into the open interface table. Let's review the transactions in the open interface:

1. Select the Cash Management responsibility.
2. Navigate to Bank Statements | Bank Statement Interface Lines.
3. Select 95-6891-3074 in the Account field.
4. Click on the Lines button to view the transactions in the interface tables.

How to do it...

Let's list the steps required to automatically enter the bank statements from the import and AutoReconciliation program:

1. Select the Cash Management responsibility.
2. Navigate to Other | Programs | Run, or select View | Requests from the menu.
3. Click on the Submit a New Request button.
4. Select Single Request from the Options, and click on the OK button.
5. In the Submit Request form, select Bank Statement Import & AutoReconciliation from the list of values. Please note that we could run the Bank Statement Import program to run only the import.
6. Select the Parameters field, select Kings Cross as the Bank Branch Name, select 95-6891-3074 as the Bank Account Number, and select 20110314-0001 as the parameter for both the Statement Number From and the Statement Number To fields. Accept the default values for the remaining fields, and click on the OK button. We can schedule the program to run periodically, for example, every day.
7. Click on the Submit button to submit the request.

Let's review the imported bank statements:

1. Navigate to Bank Statement | Bank Statements and Reconciliation. The imported statement is displayed.
2. Click on the Review button.
3. In the Bank Statement window, select the Lines button. The imported lines are displayed.

How it works...

Bank statements can be imported automatically, using a SQL*Loader script against the bank file to populate the bank statement open interface. The bank statement information is then imported into the Bank Statement windows using the Bank Statement Import program.

There's more...

Now, let's look at how to enter statements manually.

Entering bank statements manually

Let's enter the bank statement for the 15th of March manually. The lines on the statement are as follows:

- Payment of 213.80.
- Receipt of 3,389.89 from A.C. Networks.
- Credit of 7,500.00 for Non Sufficient Funds for the receipt from Advantage Corp.
- Bank Transfer payment of 1,000.00.

1. Select the Cash Management responsibility.
2. Navigate to Bank Statement | Bank Statements and Reconciliation.
3. In the Reconcile Bank Statements window, click on the New button.
4. In the Account Number field, enter 95-6891-3074; the other details are entered automatically.
5. In the Date field, enter 15-MAR-2011. In the Statement Number field, enter 20110314-0002.
6. In the Control Totals region, let's enter control totals based on our bank statement. The Opening Balance of 125,727.21 is entered based on the previous opening balance. In the Receipts field, enter 3,389.89 and 1 in the Lines field. In the Payments field, enter 8,713.80 and 3 in the Lines field. The Closing Balance of 98,495.56 is entered automatically.

Let's enter the bank statement lines:

1. Click on the Lines button.
2. In the Bank Statements Lines form, enter 1 in the Line field. Select Payment as the Type. Enter 100 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 213.80.
3. Select the next line, and enter 2 in the Line field. Select Receipt as the Type. Enter 200 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 3,389.89. Select the Reference tab, and enter A.C. Networks.
4. Select the next line, and enter 3 in the Line field. Select NSF as the Type. Enter 500 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 7,500.00. Select the Reference tab, and enter Advantage Corp.
5. Select the next line, and enter 4 in the Line field. Select Payment as the Type. Enter 140 as the code. In the Transaction Date field, enter 15-MAR-2011. In the Amount field, enter 1,000.00.
6. Save the record.
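The recipe assumes the statement file has already been loaded into the open interface. As a rough, hedged illustration of what the custom loader route can look like, the following SQL*Loader control file sketch maps a simple comma-separated bank file onto the statement lines open interface table; the file name, column list, and date format here are assumptions and must be adapted to your bank's file layout and verified against the open interface definition for your E-Business Suite release.

-- load_bank_stmt.ctl : illustrative only; verify table and column names
-- against the Cash Management open interface in your release
LOAD DATA
INFILE 'bank_stmt_20110315.dat'
APPEND
INTO TABLE ce_statement_lines_interface
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
  bank_account_num,
  statement_number,
  line_number,
  trx_date  DATE "DD-MON-YYYY",
  trx_code,
  amount
)

The control file would then be run with sqlldr (supplying your database credentials), for example sqlldr control=load_bank_stmt.ctl, before submitting the Bank Statement Import program as described above.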


How to format and publish code using R Markdown

Savia Lobo
29 Dec 2017
6 min read
Note: This article is an excerpt from Practical Business Intelligence, written by Ahmed Sherif. The book is a complete guide to implementing business intelligence with the help of powerful tools such as D3.js, R, Tableau, QlikView, and Python. It starts off by preparing you for data analytics and then moves on to teach you a range of techniques to fetch important information from various databases.

Today you will explore how to use R Markdown, a format that allows reproducible reports with embedded R code to be published as slideshows, Word documents, PDF files, and HTML web pages.

Getting started with R Markdown

R Markdown documents have the .Rmd extension and are created by selecting R Markdown from the menu bar of RStudio. If this is the first time you are creating an R Markdown report, you may be prompted to install some additional packages for R Markdown to work. Once the packages are installed, we can define a title, author, and default output format; for our purposes we will use HTML output. The new document opens with a default template.

R Markdown features and components

We can go ahead and delete everything below line 7 of the default template, as we will create our own template with our embedded code and formatting. Header levels are generated using # in front of a title: the largest font size has a single #, and each subsequent # decreases the header level font. Whenever we wish to embed actual R code into the report, we can include it inside a code chunk by clicking on the insert-chunk icon in the editor toolbar. Once that icon is selected, a shaded region is created between two ``` markers where R code can be written exactly as it is in RStudio. The first header generated will be for the results, and the subsequent header will indicate the libraries used to generate the report. This can be generated using the following script:

# Results
###### Libraries used are RODBC, plotly, and forecast

Executing R code inside of R Markdown

The next step is to run the actual R code inside of a chunk snippet that calls the required libraries needed to generate the report. This can be generated using the following script:

```{r}
# We will not see the actual libraries loaded
# as it is not necessary for the end user
library('RODBC')
library('plotly')
library('forecast')
```

We can then click on the Knit HTML icon on the menu bar to generate a preview of our code results in R Markdown. Unfortunately, this output of library information is not useful to the end user.

Exporting tips for R Markdown

The report output includes all the messages and potential warnings that are the result of calling a package. This is not information that is useful to the report consumer.
Fortunately for R developers, these types of messages can be concealed by tweaking the R chunk snippets to include the following logic in their script:

```{r echo = FALSE, results = 'hide', message = FALSE}
```

We can continue embedding R code into our report to run queries against the SQL Server database and produce summary data of the dataframe, as well as the three main plots: the time series plot, observed versus fitted smoothing, and Holt-Winters forecasting:

###### Connectivity to Data Source is through ODBC

```{r echo = FALSE, results = 'hide', message = FALSE}
connection_SQLBI <- odbcConnect("SQLBI")
#Get Connection Details
connection_SQLBI

##query fetching begin##
SQL_Query_1 <- sqlQuery(connection_SQLBI,
  'SELECT [WeekInYear], [DiscountCode]
   FROM [AdventureWorks2014].[dbo].[DiscountCodebyWeek]')
##query fetching end##

#begin table manipulation
colnames(SQL_Query_1) <- c("Week", "Discount")
SQL_Query_1$Weeks <- as.numeric(SQL_Query_1$Week)
SQL_Query_1 <- SQL_Query_1[,-1]    #removes first column
SQL_Query_1 <- SQL_Query_1[c(2,1)] #reverses columns 1 and 2
#end table manipulation
```

### Preview of First 6 rows of data

```{r echo = FALSE, message = FALSE}
head(SQL_Query_1)
```

### Summary of Table Observations

```{r echo = FALSE, message = FALSE}
str(SQL_Query_1)
```

### Time Series and Forecast Plots

```{r echo = FALSE, message = FALSE}
Query1_TS <- ts(SQL_Query_1$Discount)
par(mfrow=c(3,1))
plot.ts(Query1_TS, xlab = 'Week (1-52)', ylab = 'Discount',
        main = 'Time Series of Discount Code by Week')
discountforecasts <- HoltWinters(Query1_TS, beta=FALSE, gamma=FALSE)
plot(discountforecasts)
discountforecasts_8periods <- forecast.HoltWinters(discountforecasts, h=8)
plot.forecast(discountforecasts_8periods, ylab='Discount', xlab = 'Weeks (1-60)',
              main = 'Forecasting 8 periods')
```

The final output

Before publishing the output with the results, R Markdown offers the developer opportunities to prettify the end product. One effect I like to add to a report is a logo of some kind. This can be done by applying the following code to any line in R Markdown:

![](http://website.com/logo.jpg)  # image is on a website
![](images/logo.jpg)              # image is locally on your machine

The first option adds an image from a website, and the second option adds an image stored locally. For my purposes, I will add a PacktPub logo right above the Results section in the R Markdown document. To learn more about customizing an R Markdown document, visit http://rmarkdown.rstudio.com/authoring_basics.html.

Once we are ready to preview the results of the R Markdown output, we can once again select the Knit to HTML button on the menu. As can be seen in the final output, even though the R code is embedded within the R Markdown document, we can suppress the unnecessary technical output and reveal the relevant tables, fields, and charts that will provide the most benefit to end users and report consumers. If you have enjoyed reading this article and want to develop the ability to think along the right lines and use more than one tool to perform analysis depending on the needs of your business, do check out Practical Business Intelligence.
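As a side note on the chunk options used above: rather than repeating echo = FALSE, results = 'hide', message = FALSE in every chunk, knitr also lets these options be set once for the whole document. A minimal sketch of such a setup chunk (the chunk label setup is just a convention) placed near the top of the .Rmd file might look like this:

```{r setup, include = FALSE}
# Apply these defaults to every chunk that follows so that code,
# messages, and warnings stay hidden from the report consumer
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

Individual chunks can still override these defaults where, for example, a particular block of code should remain visible.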


An Introduction to Kibana

Packt
28 Oct 2015
28 min read
In this article, Yuvraj Gupta, author of the book Kibana Essentials, explains that Kibana is a tool that is part of the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It is built and developed by Elastic. Kibana is a visualization platform that is built on top of Elasticsearch and leverages its functionalities.

The flow of data through the stack is straightforward: Logstash is used to push data directly into Elasticsearch. This data is not limited to log data; it can be any type of data. Elasticsearch stores the data that comes as input from Logstash, and Kibana uses the data stored in Elasticsearch to provide visualizations. So, Logstash provides an input stream of data to Elasticsearch, from which Kibana accesses the data and uses it to create visualizations. Kibana acts as an over-the-top layer of Elasticsearch, providing beautiful visualizations for data (structured or nonstructured) stored in it.

Kibana is an open source analytics product used to search, view, and analyze data. It provides various types of visualizations, such as tables, charts, maps, and histograms, and a web-based interface that can easily handle a large amount of data. It makes it easy to create dashboards and to query data in real time. Dashboards are simply an interface to the underlying JSON documents; they are used for saving, templating, and exporting, and they are simple to set up and use, which lets us play with data stored in Elasticsearch in minutes without requiring any coding.

Kibana is an Apache-licensed product that aims to provide a flexible interface combined with the powerful searching capabilities of Elasticsearch. It requires a web server (included in the Kibana 4 package) and any modern web browser, that is, a browser that supports industry standards and renders pages consistently, to work. It connects to Elasticsearch using the REST API and helps visualize data in real time with the use of dashboards to provide real-time insights. As Kibana uses the functionalities of Elasticsearch, it is easier to learn Kibana by understanding the core functionalities of Elasticsearch.

In this article, we are going to take a look at the following topics:

- The basic concepts of Elasticsearch
- Installation of Java
- Installation of Elasticsearch
- Installation of Kibana
- Importing a JSON file into Elasticsearch

Understanding Elasticsearch

Elasticsearch is a search server built on top of Lucene (licensed under Apache), which is completely written in Java. It supports distributed search in a multitenant environment. It is a scalable search engine that allows machines to be added easily. It provides a full-text search engine combined with a RESTful web interface and JSON documents. Elasticsearch harnesses the functionality of the Lucene Java libraries, adding proper APIs, scalability, and flexibility on top of the Lucene full-text search library. All querying done using Elasticsearch, that is, searching text, matching text, creating indexes, and so on, is implemented by Apache Lucene. Without Elasticsearch Shield or some other proxy mechanism in place, any user with access to the Elasticsearch API can view all the data stored in the cluster.
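To illustrate that last point: once Elasticsearch is running (installation is covered below), anyone who can reach port 9200 can query the cluster. For example, the following request, which assumes a default local installation, lists every index in the cluster along with its document count and size:

curl -XGET 'http://localhost:9200/_cat/indices?v'

This is exactly why production clusters are normally protected with Shield or a reverse proxy, as noted above.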
The basic concepts of Elasticsearch

Let's explore some of the basic concepts of Elasticsearch:

Field: This is the smallest single unit of data stored in Elasticsearch. It is similar to a column in a traditional relational database. Every document contains key-value pairs, which are referred to as fields. A field can hold a single value, such as an integer (27) or a string ("Kibana"), or multiple values, such as an array [1, 2, 3, 4, 5]. The field type is responsible for specifying which type of data can be stored in a particular field, for example, integer, string, or date.

Document: This is the simplest unit of information stored in Elasticsearch. It is a collection of fields and is considered similar to a row of a table in a traditional relational database. A document can contain any type of entry, such as a document for a single restaurant, another document for a single cuisine, and yet another for a single order. Documents are expressed in JavaScript Object Notation (JSON), which is a language-independent data interchange format of key-value pairs. Every document that is stored in Elasticsearch is indexed, and every document contains a type and an ID. An example of a document is as follows:

{
  "name": "Yuvraj",
  "age": 22,
  "birthdate": "2015-07-27",
  "bank_balance": 10500.50,
  "interests": ["playing games", "movies", "travelling"],
  "movie": {"name": "Titanic", "genre": "Romance", "year": 1997}
}

In the preceding example, the document holds key-value pairs of different types:

- The name field is of the string type
- The age field is of the numeric type
- The birthdate field is of the date type
- The bank_balance field is of the float type
- The interests field contains an array
- The movie field contains an object (dictionary)

Type: This is similar to a table in a traditional relational database. It contains a list of fields, which is defined for every document. A type is a logical segregation of an index, whose interpretation/semantics entirely depends on you. For example, if you have data about the world and you put all of it into one index, you can define one type for continent-wise data, another type for country-wise data, and a third type for region-wise data. Types are used with a mapping API, which specifies the type of each field. An example of a type mapping is as follows:

{
  "user": {
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "integer" },
      "birthdate": { "type": "date" },
      "bank_balance": { "type": "float" },
      "interests": { "type": "string" },
      "movie": {
        "properties": {
          "name": { "type": "string" },
          "genre": { "type": "string" },
          "year": { "type": "integer" }
        }
      }
    }
  }
}

Now, let's take a look at the core data types specified in Elasticsearch:

- string: contains text, for example, "Kibana"
- integer: contains a 32-bit integer, for example, 7
- long: contains a 64-bit integer
- float: an IEEE float, for example, 2.7
- double: a double-precision float
- boolean: true or false
- date: a UTC date/time, for example, "2015-06-30T13:10:10"
- geo_point: a latitude/longitude pair

Index: This is a collection of documents (one or more than one). It is similar to a database in the analogy with traditional relational databases. For example, you can have an index for user information, another for transaction information, and another for product data. An index has a mapping, and this mapping is used to define multiple types. In other words, an index can contain single or multiple types.
An index is defined by a name, which is always used whenever referring to an index to perform search, update, and delete operations on documents. You can define as many indexes as you require. Indexes also act as logical namespaces that map documents to primary shards, each of which can have zero or more replica shards for replicating data. With respect to traditional databases, the basic analogy is similar to the following:

MySQL => Databases => Tables => Columns/Rows
Elasticsearch => Indexes => Types => Documents with Fields

You can store a single document or multiple documents within a type or index. As a document is within an index, it must also be assigned to a type within that index. Moreover, the maximum number of documents that you can store in a single shard is 2,147,483,519 (roughly 2.1 billion), which is close to Integer.MAX_VALUE.

ID: This is an identifier for a document. It is used to identify each document; if it is not defined, it is autogenerated for every document. The combination of index, type, and ID must be unique for each document.

Mapping: Mappings are similar to schemas in a traditional relational database. Every document in an index has a type. A mapping defines the fields, the data type for each field, and how each field should be handled by Elasticsearch. By default, a mapping is automatically generated whenever a document is indexed. If the default settings are to be overridden, then the mapping definition has to be provided explicitly.

Node: This is a running instance of Elasticsearch. Each node is part of a cluster. On a standalone machine, each Elasticsearch server instance corresponds to a node; multiple nodes can be started on a single standalone machine or in a single cluster. The node is responsible for storing data and contributes to the indexing and searching capabilities of a cluster. By default, whenever a node is started, it is identified and assigned a random Marvel Comics character name; you can change the configuration file to name nodes as per your requirement. A node also needs to be configured to join a cluster, which is identifiable by the cluster name. By default, all nodes join the elasticsearch cluster; that is, if any number of nodes are started up on a network/machine, they will automatically join the elasticsearch cluster.

Cluster: This is a collection of one or more nodes that share a single cluster name. Each cluster automatically chooses a master node, which is replaced if it fails; that is, if the master node fails, another random node will be chosen as the new master node, thus providing high availability. The cluster is responsible for holding all of the data stored and provides a unified view for search capabilities across all nodes. By default, the cluster name is elasticsearch, and it is the identifying parameter for all nodes in a cluster. All nodes, by default, join the elasticsearch cluster. While using a cluster in production, it is advisable to change the cluster name for ease of identification, but the default name can be used for any other purpose, such as development or testing. The Elasticsearch cluster contains single or multiple indexes, which contain single or multiple types. All types contain single or multiple documents, and every document contains single or multiple fields.

Sharding: This is an important concept of Elasticsearch for understanding how Elasticsearch allows scaling across nodes when handling a large amount of data, termed big data.
An index can store any amount of data, but if it exceeds its disk limit, searching becomes slow and is affected. For example, if the disk limit is 1 TB and an index contains a large number of documents, they may not fit completely within 1 TB on a single node. To counter such problems, Elasticsearch provides shards. These break the index into multiple pieces, and each shard acts as an independent index that is hosted on a node within a cluster. Elasticsearch is responsible for distributing shards among nodes. Sharding serves two purposes: allowing horizontal scaling of the content volume, and improving performance by providing parallel operations across the shards distributed over the nodes (single or multiple, depending on the number of nodes running). Elasticsearch moves shards among nodes when new nodes are added or a node fails.

There are two types of shards, as follows:

Primary shard: Every document is stored within a primary shard. By default, every index has five primary shards. This parameter is configurable and can be changed to define more or fewer shards as per the requirement. The number of primary shards has to be defined before the creation of an index; if no parameter is defined, five primary shards are created automatically. Whenever a document is indexed, it is usually written to a primary shard first, followed by its replicas. The number of primary shards defined for an index cannot be altered once the index is created.

Replica shard: Replica shards are an important feature of Elasticsearch. They help provide high availability across nodes in the cluster. By default, every primary shard has one replica shard, but every primary shard can have zero or more replica shards as required. In an environment where failure directly affects the enterprise, it is highly recommended to use a system that provides a failover mechanism to achieve high availability. To counter this problem, Elasticsearch provides a mechanism for creating single or multiple copies of an index's shards, termed replica shards or replicas. A replica shard is a full copy of the primary shard, and the number of replicas can be altered dynamically. Replicas serve two purposes: they provide high availability in the event of failure of a node or a primary shard (if a primary shard fails, a replica shard is automatically promoted to primary), and they increase performance by handling search requests in parallel. A replica shard is never kept on the same node as the primary shard from which it was copied.

Inverted index: This is also a very important concept in Elasticsearch. It is used to provide fast full-text search. Instead of scanning the text of every document, it searches an index. It creates an index that lists the unique words occurring in the documents, along with the list of documents in which each word occurs. For example, suppose we have three documents.
They have a text field that contains the following:

1. I am learning Kibana
2. Kibana is an amazing product
3. Kibana is easy to use

To create an inverted index, the text field of each document is broken into words (also known as terms), a list of unique terms is created, and for each term the documents in which it occurs are listed, as shown in this table:

Term      Doc 1  Doc 2  Doc 3
I         X
am        X
learning  X
Kibana    X      X      X
is               X      X
an               X
amazing          X
product          X
easy                    X
to                      X
use                     X

Now, if we search for "is Kibana", Elasticsearch will use the inverted index to display the results:

Term      Doc 1  Doc 2  Doc 3
is               X      X
Kibana    X      X      X

With inverted indexes, Elasticsearch uses the functionality of Lucene to provide fast full-text search results. An inverted index uses an index based on keywords (terms) instead of a document-based index.

REST API: This stands for Representational State Transfer. It is a stateless client-server protocol that uses HTTP requests to store, view, and delete data. It supports CRUD operations (short for Create, Read, Update, and Delete) over HTTP. It is used to communicate with Elasticsearch and has client implementations in virtually every language. Elasticsearch listens on port 9200 (by default), which is accessible from any web browser. Elasticsearch can also be communicated with directly from the command line using the curl command. cURL is a command-line tool used to send, view, or delete data using URL syntax, following the HTTP structure. A cURL request is similar to an HTTP request, and looks as follows:

curl -X <VERB> '<PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The terms marked within the <> tags are variables, which are defined as follows:

- VERB: This is used to provide an appropriate HTTP method, such as GET (to get data), POST, PUT (to store data), or DELETE (to delete data).
- PROTOCOL: This defines whether the HTTP or HTTPS protocol is used to send requests.
- HOSTNAME: This is used to define the hostname of a node in the Elasticsearch cluster. By default, the hostname of Elasticsearch is localhost.
- PORT: This is used to define the port on which Elasticsearch is running. By default, Elasticsearch runs on port 9200.
- PATH: This is used to define the index, type, and ID where the documents will be stored, searched, or deleted. It is specified as index/type/ID.
- QUERY_STRING: This is used to define any additional query parameters for searching data.
- BODY: This is used to define a JSON-encoded request body.

In order to put data into Elasticsearch, the following curl command is used:

curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }'

Here, testing is the name of the index, test is the name of the type within the index, and 1 is the ID number. To search for the data just stored, the following curl command is used:

curl -XGET 'http://localhost:9200/testing/_search?

The preceding commands are provided just to give you an overview of the format of the curl command.

Prerequisites for installing Kibana 4.1.1

The following pieces of software need to be installed before installing Kibana 4.1.1:

- Java 1.8u20+
- Elasticsearch v1.4.4+
- A modern web browser (IE 10+, Firefox, Chrome, Safari, and so on)

The installation process will be covered separately for Windows and Ubuntu so that both types of users are able to follow it easily.

Installation of Java

In this section, the JDK needs to be installed so as to access Elasticsearch.
Oracle Java 8 (update 20 onwards) will be installed, as it is the recommended version for Elasticsearch from version 1.4.4 onwards.

Installation of Java on Ubuntu 14.04

Install Java 8 using the terminal and the apt package manager in the following manner:

1. Add the Oracle Java Personal Package Archive (PPA) to the apt repository list:

sudo add-apt-repository -y ppa:webupd8team/java

In this case, we use a third-party repository; however, the WebUpd8 team is trusted to install Java. The PPA does not include any Java binaries; instead, it downloads Java directly from Oracle and installs it. You will initially be prompted for your password when running the sudo command (only when you are not logged in as root), and on successful addition of the repository, you will receive an OK message, which means that the repository has been imported.

2. Update the apt package database to include all the latest files under the packages:

sudo apt-get update

3. Install the latest version of Oracle Java 8:

sudo apt-get -y install oracle-java8-installer

During the installation, you will be prompted to accept the license agreement.

4. To check whether Java has been successfully installed, type the following command in the terminal:

java -version

If the Java version is displayed, Java has been installed successfully.

Installation of Java on Windows

We can install Java on Windows by going through the following steps:

1. Download the latest version of the Java JDK from the Oracle site at http://www.oracle.com/technetwork/java/javase/downloads/index.html. Click on the DOWNLOAD button for the JDK; you will be redirected to the download page. There, first click on the Accept License Agreement radio button, and then click on the Windows version to download the .exe file.

2. Double-click on the downloaded file and it will open as an installer. Click on Next, accept the license by reading it, and keep clicking on Next until it shows that the JDK has been installed successfully.

3. Now, to run Java on Windows, you need to set the JAVA path in the environment variable settings of Windows. First, open the properties of My Computer, select Advanced system settings, click on the Advanced tab, and then click on the Environment Variables button.

4. Under System variables, click on New and give the variable name as JAVA_HOME and the variable value as C:\Program Files\Java\jdk1.8.0_45 (check where the JDK has been installed on your system and provide the path corresponding to the installed version).

5. Then, double-click on the Path variable (under System variables) and move to the end of the textbox. Insert a semicolon if one is not already there, and add the location of the bin folder of the JDK, like this: %JAVA_HOME%\bin. Next, click on OK in all the open windows. Do not delete anything already within the Path variable textbox.

6. To check whether Java is installed, type the following command in Command Prompt:

java -version

If the Java version is displayed, Java has been installed successfully.

Installation of Elasticsearch

In this section, Elasticsearch, which is required to access Kibana, will be installed.
Elasticsearch v1.5.2 will be installed; this section covers the installation on Ubuntu and Windows separately.

Installation of Elasticsearch on Ubuntu 14.04

To install Elasticsearch on Ubuntu, perform the following steps:

1. Download Elasticsearch v1.5.2 as a .tar.gz file using the following command in the terminal:

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.tar.gz

curl may not be installed on Ubuntu by default. To use curl, install the curl package with the following command:

sudo apt-get -y install curl

2. Extract the downloaded .tar.gz file using this command:

tar -xvzf elasticsearch-1.5.2.tar.gz

This will extract the files and folders into the current working directory.

3. Navigate to the bin directory within the elasticsearch-1.5.2 directory:

cd elasticsearch-1.5.2/bin

4. Now run Elasticsearch to start the node and cluster, using the following command:

./elasticsearch

The Elasticsearch node starts and is given a random Marvel Comics character name. If this terminal is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down one node will not shut down Elasticsearch.

5. To verify the Elasticsearch installation, open http://localhost:9200 in your browser.

Installation of Elasticsearch on Windows

The installation on Windows follows steps similar to those for Ubuntu. To use curl commands on Windows, we will install GIT. GIT will also be used to import a sample JSON file into Elasticsearch using elasticdump, as described in the Importing a JSON file into Elasticsearch section.

Installation of GIT

To run curl commands on Windows, first download and install GIT by performing the following steps:

1. Download the GIT ZIP package from https://git-scm.com/download/win.
2. Double-click on the downloaded file, which will walk you through the installation process. Keep clicking on Next, leaving the default options, until the Finish button is clicked.
3. To validate the GIT installation, right-click on any folder; you should see GIT options, such as GIT Bash, in the context menu.

The following are the steps required to install Elasticsearch on Windows:

1. Open GIT Bash and enter the following command in the terminal:

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.zip

2. Extract the downloaded ZIP package either by unzipping it using WinRAR, 7-Zip, and so on (if you don't have any of these, download one of them), or by using the following command in GIT Bash:

unzip elasticsearch-1.5.2.zip

This will extract the files and folders into the directory.

3. Navigate through the extracted folder to reach the bin folder, and double-click on the elasticsearch.bat file to run Elasticsearch. The Elasticsearch node starts and is given a random Marvel Comics character name. Again, if this window is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down one node will not shut down Elasticsearch.

4. To verify the Elasticsearch installation, open http://localhost:9200 in your browser.

Installation of Kibana

In this section, Kibana will be installed.
We will install Kibana v4.1.1; this section covers the installation on Ubuntu and Windows separately.

Installation of Kibana on Ubuntu 14.04

To install Kibana on Ubuntu, follow these steps:

1. Download Kibana version 4.1.1 as a .tar.gz file using the following command in the terminal:

curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz

2. Extract the downloaded .tar.gz file using this command:

tar -xvzf kibana-4.1.1-linux-x64.tar.gz

The preceding command will extract the files and folders into the current working directory.

3. Navigate to the bin directory within the kibana-4.1.1-linux-x64 directory:

cd kibana-4.1.1-linux-x64/bin

4. Now run Kibana using the following command:

./kibana

Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, an error will be displayed after you run the preceding command.

5. To verify the Kibana installation, open http://localhost:5601 in your browser.

Installation of Kibana on Windows

To install Kibana on Windows, perform the following steps:

1. Open GIT Bash and enter the following command in the terminal:

curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-windows.zip

2. Extract the downloaded ZIP package either by unzipping it using WinRAR or 7-Zip (download one if you don't have it), or by using the following command in GIT Bash:

unzip kibana-4.1.1-windows.zip

This will extract the files and folders into the directory.

3. Navigate through the extracted folder to reach the bin folder, and double-click on the kibana.bat file to run Kibana. Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, an error will be displayed after you click on the kibana.bat file.

4. Again, to verify the Kibana installation, open http://localhost:5601 in your browser.

Additional information

You can change the Elasticsearch configuration for your production environment, where you may need to change parameters such as the cluster name, node name, network address, and so on. This can be done using the information in the upcoming sections.

Changing the Elasticsearch configuration

To change the Elasticsearch configuration, perform the following steps:

1. Run the following command in the terminal to open the configuration file:

sudo vi ~/elasticsearch-1.5.2/config/elasticsearch.yml

Windows users can open the elasticsearch.yml file from the config folder.

2. The cluster name can be changed by editing the cluster.name entry, from #cluster.name: elasticsearch to cluster.name: "your_cluster_name", for example cluster.name: test. Then save the file.

3. To verify that the cluster name has been changed, run Elasticsearch as described in the earlier section and open http://localhost:9200 in the browser. The response shows that cluster_name has been changed to test, as specified earlier.
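For reference, a minimal elasticsearch.yml for such a test setup might contain nothing more than the following; the node name and the value test are only examples, and every other setting can be left at its default:

# elasticsearch.yml (illustrative values only)
cluster.name: test
node.name: "node-1"
# network.host: 192.168.0.1   # uncomment to bind to a specific address

After editing the file, restart the node for the change to take effect.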
Changing the Kibana configuration

To change the Kibana configuration, follow these steps:

1. Run the following command in the terminal to open the configuration file:

sudo vi ~/kibana-4.1.1-linux-x64/config/kibana.yml

Windows users can open the kibana.yml file from the config folder.

2. In this file, you can change various parameters, such as the port on which Kibana runs, the host address on which Kibana listens, the URL of the Elasticsearch instance that you wish to connect to, and so on. For example, the port on which Kibana works can be changed by editing the port entry: port: 5601 can be changed to any other port, such as port: 5604. Then save the file.

3. To check whether Kibana is now running on port 5604, run Kibana as mentioned earlier and open http://localhost:5604 in the browser to verify.

Importing a JSON file into Elasticsearch

To import a JSON file into Elasticsearch, we will use the elasticdump package. It is a set of import and export tools for Elasticsearch that makes it easier to copy, move, and save indexes. To install elasticdump, we will require npm and Node.js as prerequisites.

Installation of npm

In this section, npm will be installed along with Node.js. This section covers the installation of npm and Node.js on Ubuntu and Windows separately.

Installation of npm on Ubuntu 14.04

To install npm on Ubuntu, perform the following steps:

1. Add the official Node.js PPA:

sudo curl --silent --location https://deb.nodesource.com/setup_0.12 | sudo bash -

This command adds the official Node.js repository to the system and updates the apt package database to include all the latest files under the packages. At the end of the execution of this command, we are prompted to install Node.js and npm.

2. Install Node.js by entering this command in the terminal:

sudo apt-get install --yes nodejs

This automatically installs Node.js and npm, as npm is bundled with Node.js.

3. To check whether Node.js has been installed successfully, type the following command in the terminal:

node -v

Upon successful installation, it will display the version of Node.js.

4. Now, to check whether npm has been installed successfully, type the following command in the terminal:

npm -v

Upon successful installation, it will show the version of npm.

Installation of npm on Windows

To install npm on Windows, follow these steps:

1. Download the Windows Installer (.msi) file from https://nodejs.org/en/download/.
2. Double-click on the downloaded file and keep clicking on Next to install the software.
3. To validate the successful installation of Node.js, right-click and select GIT Bash. In GIT Bash, enter:

node -v

Upon successful installation, you will be shown the version of Node.js.

4. To validate the successful installation of npm, enter the following in GIT Bash:

npm -v

Upon successful installation, it will show the version of npm.

Installing elasticdump

In this section, elasticdump will be installed. It will be used to import a JSON file into Elasticsearch and requires npm and Node.js to be installed. This section covers the installation on Ubuntu and Windows separately.
Installing elasticdump on Ubuntu 14.04

Perform these steps to install elasticdump on Ubuntu:

1. Install elasticdump by typing the following command in the terminal:

sudo npm install elasticdump -g

2. Then run elasticdump by typing this command in the terminal:

elasticdump

3. Import the sample data (JSON) file into Elasticsearch. It can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It is imported into Elasticsearch using the following command in the terminal:

elasticdump --bulk=true --input="/home/yuvraj/Desktop/tweet.json" --output=http://localhost:9200/

Here, input provides the location of the file. Data is imported into Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records have been imported successfully. Elasticsearch should be running while importing the sample file.

Installing elasticdump on Windows

To install elasticdump on Windows, perform the following steps:

1. Install elasticdump by typing the following command in GIT Bash:

npm install elasticdump -g

2. Then run elasticdump by typing this command in GIT Bash:

elasticdump

3. Import the sample data (JSON) file into Elasticsearch. It can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It is imported into Elasticsearch using the following command in GIT Bash:

elasticdump --bulk=true --input="C:\Users\ygupta\Desktop\tweet.json" --output=http://localhost:9200/

Here, input provides the location of the file. Data is imported into Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records have been imported successfully. Elasticsearch should be running while importing the sample file.

To verify that the data has been imported into Elasticsearch, open http://localhost:5601 in your browser. When Kibana is opened, you have to configure an index pattern. If the data has been imported, you can enter the index name, which is mentioned in the tweet.json file as index: tweet. After the page loads, you can see the name of the imported index (tweet) to the left, under Index Patterns. Enter the index name as tweet; Kibana will then automatically detect the timestamped field and provide an option to select it. If there are multiple fields, you can select one by clicking on Time-field name, which provides a drop-down list of all available fields. Finally, click on Create to create the index pattern in Kibana. After you have clicked on Create, it will display the various fields present in this index. If you do not get the Time-field name and Create options after entering the index name as tweet, it means that the data has not been imported into Elasticsearch.

Summary

In this article, you learned about Kibana, along with the basic concepts of Elasticsearch that make Kibana easier to understand. We also looked at the prerequisites for installing Kibana, followed by a detailed explanation of how to install each component individually on Ubuntu and Windows.


Creating a View with MySQL Query Browser

Packt
23 Oct 2009
2 min read
Please refer to an earlier article by the author to learn how to build queries visually.

Creating a View from an Existing Query

To create a view from a query, you must have executed the query successfully. To be more precise, the view is created from the latest successfully executed query, not necessarily from the query currently in the Query Area. To further clarify, the following three examples are cases where the view is not created from the current query:

- Your current query fails, and immediately afterwards you create a view from the query. The view created is not from the failed query. If the failed query is the first query in your MySQL Query Browser session, you can't create any view.
- You have just moved forward or backward through the queries in the Query Area without executing, so your current query is not the latest successfully executed one.
- You open a saved query that you have never executed successfully in your active Resultset.

Additionally, if you're changing your Resultset, the view is created from the latest successfully executed query that uses the currently active Resultset to display its output. To make sure your view is built from the query you want, select the query, confirm it as written in the Query Area, execute the query, and then immediately create its view.

You create a view from an existing query by selecting Query | Create View from Select from the Menu bar. Type in the name you want to give the view, and then click Create View. MySQL Query Browser creates the view; when it is successfully created, you can see the view in the Schemata.

You can modify a view by editing it: right-click the view and select Edit View. The CREATE VIEW statement opens in its own Script tab. When you finish editing, execute the modified statement; if it succeeds, the existing view is replaced with the modified one. If you want to create a new view instead of replacing the one you're editing, change the name of the view before you execute it, and remove the DROP VIEW statement so that the original view is kept.
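For readers who prefer to see what the Script tab works with, here is a hedged sketch of the kind of DROP VIEW/CREATE VIEW pair referred to above; the view, table, and column names are invented purely for illustration:

DROP VIEW IF EXISTS big_orders;
CREATE VIEW big_orders AS
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 1000;

Removing the DROP VIEW line and giving the CREATE VIEW statement a new name leaves the original view in place, which is the behavior described at the end of the previous paragraph.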


Including Charts and Graphics in Pentaho Reports (Part 1)

Packt
09 Nov 2009
8 min read
Supported charts

Pentaho Reporting relies on JFreeChart, an open source Java chart library, for charting visualization within reports. From within Report Designer, many chart types are supported. In the chart editor, two areas of properties appear when editing a chart: the first is related to chart rendering, and the second tabbed area is related to the data that populates the chart.

All chart types receive their data from three general types of datasets.

The first type is known as a Category Dataset, where the dataset series and values are grouped by categories. A series is like a sub-group. If the exact category and series appear more than once, the chart will sum the values into a single result. The following table is a simple example of a category dataset:

Category       Series  Sale Price
Store 1 Sales  Cash    $14
Store 1 Sales  Credit  $12
Store 2 Sales  Cash    $100
Store 2 Sales  Credit  $120

Pentaho Reporting builds a Category Dataset using the CategorySetDataCollector. Also available is the PivotCategorySetCollector, which pivots the category and series data. Collector classes implement Pentaho Reporting's Function API.

The second type of dataset is known as an XY Series Dataset, which is a two-dimensional group of values that may be plotted in various forms. In this dataset, the series may be used to draw different lines, and so on. Here is a simple example of an XY series dataset:

Series  Cost of Goods (X)  Sale Price (Y)
Cash    10                 14
Credit  11                 12
Cash    92                 100
Credit  105                120

Note that X is often referred to as the domain, and Y as the range. Pentaho Reporting builds an XY Series Dataset using the XYSeriesCollector. The XYZSeriesCollector also exists for three-dimensional data.

The third type of dataset is known as a Time Series Dataset, which is a two-dimensional group of values that are plotted based on a date and time. The Time Series Dataset is more like an XY Series than a Category Dataset, as the time scale is displayed in a linear fashion with appropriate distances between the different time references.

Time                   Series  Sale Price
May 05, 2009 11:05pm   Cash    $14
June 07, 2009 12:42pm  Credit  $12
June 14, 2009 4:20pm   Cash    $100
June 01, 2009 1:22pm   Credit  $120

Pentaho Reporting builds a Time Series Dataset using the TimeSeriesCollector.

Common chart rendering properties

Most charts share a common set of properties. The following properties are common across most charts; any exceptions are mentioned as part of the specific chart type.

Required Property Group

- name: The name of the chart object within the report. This is not displayed during rendering, but must be unique in the report. A default name is generated for each chart added to the report.
- data-source: The dataset name for the chart, which is automatically populated with the name of the dataset in the Primary DataSource panel of the chart editor.
- no-data-message: The message to display if no data is available to render the chart.

Title Property Group

- chart-title: The title of the chart, which is rendered in the report.
- chart-title-field: A field representing the chart title.
- title-font: The chart title's font family, size, and style.

Options Property Group

- horizontal: If set to True, the chart's X and Y axes are rotated horizontally. The default value is False.
- series-colors: The color in which to render each series. The defaults for the first three series colors are red, blue, and green.

General Property Group

- 3-D: If set to True, renders the chart in a 3D perspective. The default value is False.
- anti-alias: If set to True, renders chart fonts as anti-aliased. The default value is True.
- bg-color: Sets the background around the chart to the specified color. If not set, defaults to gray.
- bg-image: Sets the background of the chart area to the specified image. If not set, the background of the chart area defaults to white. The chart area is the area within the axes of the chart. Supported image types include the PNG, JPG, and GIF file formats.
- show-border: If set to True, displays a border around the chart. The default value is True.
- border-color: Sets the border to the specified color. If not set, defaults to black.
- plot-border: If set to False, clears the default rendering value of the chart border.
- plot-bg-color: Sets the plot background color to the specified color. If not set, defaults to white.
- plot-fg-alpha: Sets the alpha value of the plot foreground colors relative to the plot background. The default value is 1.0.
- plot-bg-alpha: Sets the alpha value of the plot background color relative to the chart background color. The default value is 1.0.

Legend Property Group

- show-legend: If set to True, displays the legend for the chart. The default value is False.
- location: The location of the legend in relation to the chart, which may be set to top, bottom, left, or right. The default location is bottom.
- legend-border: If set to True, renders a border around the legend. The default value is True.
- legend-font: The Java font to render the legend labels in.
- legend-bg-color: Sets the legend background color. If not set, defaults to white.
- legend-font-color: Sets the legend font color. If not set, defaults to black.

Advanced Property Group

- dependencyLevel: Informs the reporting engine what order the chart should be executed in relative to other items in the report. This is useful if you are using special functions that may need to execute prior to generating the chart. The default value is 0; negative values execute before 0, and positive values execute after 0.

Common category series rendering properties

The following properties appear in charts that render category information:

Options Property Group

- stacked: If set to True, the series values will appear layered on top of one another instead of being displayed relative to one another.
- stacked-percent: If set to True, determines the percentages of each series and renders the bar height based on those percentages. The stacked property must be set to True for this property to have an effect.

General Property Group

- gridlines: If set to True, displays category grid lines. This value is True by default.

X-Axis Property Group

- label-rotation: If set, adjusts the inline item label rotation value, specified in degrees. If not specified, labels are rendered horizontally. show-labels must be set to true for this value to be relevant.
- date-format: If the item value is a date, a Java date format string may be provided to format the date appropriately. Please see Java's SimpleDateFormat JavaDoc for formatting details.
- numeric-format: If the item value is a decimal number, a Java decimal format string may be provided to format the number appropriately. Please see Java's DecimalFormat JavaDoc for formatting details.
- text-format: The label format used for displaying category items within the chart. This property is required if you would like to display the category item values. The following parameters may be defined in the format string to access details of the item: {0} accesses the series name, {1} accesses the category, and {2} accesses the item value. To display just the item value, set the format string to "{2}" (see the short example after these property listings).
- x-axis-title: If set, displays a label describing the category axis.
- show-labels: If set to true, displays x-axis labels in the chart.
- x-axis-label-width: Sets the maximum category label width ratio, which determines the maximum length each category label should render in. This might be useful if you have really long category names.
- x-axis-label-rotation: If set, adjusts the category item label rotation value, specified in degrees. If not specified, labels are rendered horizontally.
- x-font: The font to render the category axis title and labels in.

Y-Axis Property Group

- y-axis-title: If set, displays a label along the value axis of the chart.
- label-rotation: If set, determines the upward angle position of the label, where the value passed to JFreeChart is pi divided by the value. Unfortunately, this property is not very flexible and you may find it difficult to use.
- y-tick-interval: The numeric interval value used to separate range ticks in the chart.
- y-font: The font to render the range axis title in.
- y-sticky-0: If the range includes zero in the axis, making it sticky will force truncation of the axis to zero if set to True. The default value of this property is True.
- y-incl-0: If set to True, the range axis will force zero to be included in the axis.
- y-min: The minimum value to render in the range axis.
- y-max: The maximum value to render in the range axis.
- y-tick-font: The font to render the range tick values in.
- y-tick-fmt-str: The DecimalFormat string used to render the numeric range tick values.
- enable-log-axis: If set to true, displays the y-axis on a logarithmic scale.
- log-format: If set to true, presents the logarithmic scale in a human-readable view.
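To make the text-format parameters concrete, here is a small hedged example based on the category dataset shown earlier; the rendered labels are what the standard message formatting described above would produce, not output captured from a specific report. For the row with category "Store 1 Sales", series "Cash", and value 14:

text-format "{2}"        renders the label: 14
text-format "{0}: {2}"   renders the label: Cash: 14
text-format "{1} ({0})"  renders the label: Store 1 Sales (Cash)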

Getting Started with Pentaho Data Integration and Pentaho BI Suite

Vijin Boricha
24 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is a book excerpt from Learning Pentaho Data Integration 8 CE - Third Edition written by María Carina Roldán.  In this book you will explore the features and capabilities of Pentaho Data Integration 8 Community Edition.[/box] In today’s tutorial, we will introduce you to Pentaho Data Integration (PDI) and learn to use it in real world scenario. Pentaho Data Integration (PDI) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading (also known as ETL processes). The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are: Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server. Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. In the Enterprise Edition of Pentaho, you can also generate interactive Reports. Data mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to Weka project. Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). CTools is a set of tools and components created to help the user to build custom dashboards on top of Pentaho. There are specific CTools for different purposes, including a Community Dashboard Editor (CDE), a very powerful charting library (CCC), and a plugin for accessing data with great flexibility (CDA), among others. While the Ctools allow to develop advanced and custom dashboards, there is a Dashboard Designer, available only in Pentaho Enterprise Edition, that allows to build dashboards in an easy way. Data integration: Data integration is used to integrate scattered information from different sources (for example, applications, databases, and files) and make the integrated information available to the final user. PDI—the tool that we will learn to use throughout the book—is the engine that provides this functionality. PDI also interacts with the rest of the tools, as, for example, reading OLAP cubes, generating Pentaho Reports, and doing data mining with R Executor Script and the CPython Script Executor. All of these tools can be used standalone but also integrated. Pentaho tightly couples data integration with analytics in a modern platform: the PDI and Business Analytics Platform. This solution offers critical services, for example: Authentication and authorization Scheduling Security Web services Scalability and failover This set of software and services forms a complete BI Suite, which makes Pentaho the world's leading open source BI option on the market. Note: You can find out more about the platform at https://community.hds.com/community/products-and-solutions/pentaho/. There is also an Enterprise Edition with additional features and support. You can find more on this at http://www.pentaho.com/. Introducing Pentaho Data Integration Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception; Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle. 
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. Since that moment, the tool has grown without pause. Every few months a new release is available, bringing users improvements in performance, existing functionality, new functionality, and ease of use, along with major changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 was released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.

November 2007: PDI 3.0 emerged totally redesigned. Its major library changed to gain massive performance improvements. The look and feel had also changed completely.

April 2009: PDI 3.2 was released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge number of bug fixes.

June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements.

November 2013: PDI 5.0 was released, offering better previewing of data, easier looping, a lot of big data improvements, an improved plugin marketplace, and hundreds of bug fixes and feature enhancements, as in all releases. In its Enterprise version, it offered interesting low-level features, such as step load balancing, Job transactions, and restartability.

December 2015: PDI 6.0 was released with new features such as data services, data lineage, broader support for Big Data, and several changes in the graphical designer to improve the PDI user experience. Some months later, PDI 6.1 was released, including metadata injection, a feature that enables the user to modify Transformations at runtime. Metadata injection had been available in earlier versions, but it was in 6.1 that Pentaho started to put significant effort into implementing this powerful feature.

November 2016: PDI 7.0 emerged with many improvements in the enterprise version, including data inspection capabilities, more support for Big Data technologies, and improved repository management. In the community version, the main change was expanded metadata injection support.

November 2017: Pentaho 8.0 was released. The highlights of this latest version are the optimization of processing resources, a better user experience, and enhanced connectivity to streaming data sources for real-time processing.

Using PDI in real-world scenarios

Judging by its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI does not only serve as a data integrator or an ETL tool; it is such a powerful tool that it is common to see it being used for these and for many other purposes. Here are some examples.

Loading data warehouses or data marts

The loading of a data warehouse or a data mart involves many steps, and there are many variants depending on business area or business rules. However, in every case, with no exception, the process involves the following steps:

Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transforming includes tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

Loading the transformed data into the target database or file store. Depending on the requirements, the loading may overwrite the existing information or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with the tool:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main Enterprise Resource Planning (ERP) application and a Customer Relationship Management (CRM) application, even though they're not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversion, validation, and transfer of data has to be done. PDI is meant to do all these tasks.

Data cleansing

Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying that the data meets certain rules, discarding or correcting data that doesn't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company, of any size, that uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of the budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch or type the information in by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a data flow. Some examples are preprocessing data for an online report, sending emails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.

Installing PDI

In order to work with PDI, you need to install the software. Following are the instructions to install the PDI software, irrespective of the operating system you may be using:

Go to the Download page at http://sourceforge.net/projects/pentaho/files/DataIntegration.
Choose the newest stable release. At this time, it is 8.0, as shown in the following screenshot:
Download the available zip file, which will serve you for all platforms.
Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.

And that's all. You have installed the tool in just a few minutes. We have learned about installing and using PDI. You can learn more about extending PDI functionality and launching the PDI graphical designer in Learning Pentaho Data Integration 8 CE - Third Edition.

Excel 2010 Financials: Identifying the Profitability of an Investment

Packt
12 Jul 2011
5 min read
Excel 2010 Financials Cookbook: Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel.

Calculating the depreciation of assets

Assets within a business are extremely important for a number of reasons. Assets can become investments for growth or investments in another line of business. Assets can also take on many forms, such as computer equipment, vehicles, furniture, buildings, land, and so on. Assets are not only important within the business in which they are used, but they are also used as a method of reducing the tax burden on a business. As a financial manager, you are tasked with calculating the depreciation expense for a laptop computer with a useful life of five years. In this recipe, you will learn to calculate the depreciation of an asset over the life of the asset.

Getting ready

There are several different methods of depreciation. A business may use straight-line depreciation, declining depreciation, double-declining depreciation, or a variation of these methods. Excel has the functionality to calculate each of the methods with a slight variation to the function; however, in this recipe, we will use straight-line depreciation. Straight-line depreciation provides an equal reduction in an asset's value over its life.

How to do it...

We will first need to set up the Excel worksheet to hold the depreciation values for the laptop computer:

In cell A5 list Year and in cell B5 list 1. This will account for the depreciation for the year that the asset was purchased. Continue this list until all five years are listed.
In cell B2, list the purchase price of the laptop computer; the purchase price is $2500.
In cell B3, enter the salvage value of the asset. The salvage value is the estimated resale value of the asset when its useful life, as determined by generally accepted accounting principles, has elapsed. Enter $500 in cell B3.
In cell C5, enter the formula =SLN($B$2,$B$3,5) and press Enter.
Copy the formula from cell C5 and paste it through cell C9.

Excel now has listed the straight-line depreciation expense for each of the five years. As you can see in this schedule, the depreciation expense remains consistent through each year of the asset's useful life.

How it works...

Straight-line depreciation takes the purchase price minus the salvage value and divides the remainder evenly across the useful life. The function accepts inputs as follows: =SLN(purchase price, salvage value, useful life).

There's more...

Excel provides similar functions for other depreciation methods, such as DB (declining balance), DDB (double-declining balance), and SYD (sum-of-years' digits).

Calculating the future versus current value of your money

When working within finance, accounting, or general business, it is important to know how much money you have. However, knowing how much money you have now is only a portion of the whole financial picture. You must also know how much your money will be worth in the future. Knowing the future value allows you to know truly how much your money is worth, and with this knowledge, you can decide what you need to do with it. As a financial manager, you must provide feedback on whether to introduce a new product line. As with any new venture, there will be several related costs, including start-up costs, operational costs, and more. Initially, you must spend $20,000 to account for most start-up costs, and you will potentially, for the sake of the example, earn a profit of $5500 per year for five years.
You also know that, due to expenditures, you expect your cost of capital to be 10%. In this recipe, you will learn to use Excel functions to calculate the future value of the venture and to determine whether it proves to be profitable.

How to do it...

We will first need to enter all known values and variables into the worksheet:

In cell B2, enter the initial cost of the business venture.
In cell B3, enter the discount rate, or cost of capital, of 10%.
In cells B4 through B8, enter the five years' worth of expected net profit from the business venture.
In cell B10, we will calculate the net present value: enter the formula =NPV(B3,B4:B8) and press Enter.

We now see that, accounting for future inflows, the net present value of the business venture is $20,849.33. Our last step is to account for the initial start-up costs and determine the overall profitability. In cell B11, enter the formula =B10/B2 and press Enter. As a financial manager, we now see that for every $1 invested in this venture, you will receive $1.04 in present value inflows.

How it works...

NPV, or net present value, is calculated in Excel using all of the inflow information that was entered across the estimated period. For the five years used in this recipe, the venture shows a profit of $5500 for each year. This number cannot be used directly, because there is a cost to making money. The cost in this instance pertains to taxes and other expenditures. In the NPV formula, we did not include the initial start-up cost, because this cost is exempt from the cost of capital; however, it must be used at the end of the calculation to account for the outflow compared to the inflows.

There's more...

The $1.04 value calculated at the end of this recipe is also known as the profitability index. When this index is greater than one, the venture is said to be a positive investment.
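Although both recipes are worked entirely in Excel, the same arithmetic is easy to sanity-check outside the spreadsheet. The short Python sketch below is not part of the original recipes; the helper functions sln and npv are our own illustrative stand-ins for Excel's SLN and NPV, and they simply reproduce the figures used above.

# Illustrative cross-check of the SLN and NPV recipe figures.
def sln(cost, salvage, life):
    # Straight-line depreciation: an equal expense in every period.
    return (cost - salvage) / life

def npv(rate, cashflows):
    # Excel-style NPV: the first cash flow is discounted by one full period.
    return sum(cf / (1 + rate) ** (i + 1) for i, cf in enumerate(cashflows))

print(sln(2500.0, 500.0, 5))                # 400.0 of depreciation per year
present_value = npv(0.10, [5500.0] * 5)
print(round(present_value, 2))              # 20849.33
print(round(present_value / 20000.0, 2))    # 1.04, the profitability index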

IBM Cognos 10 BI dashboarding components

Packt
16 Jul 2012
7 min read
Introducing IBM Cognos 10 BI Cognos Connection

In this recipe we will be exploring Cognos Connection, which is the user interface presented to the user when he/she logs in to IBM Cognos 10 BI for the first time. IBM Cognos 10 BI, once installed and configured, can be accessed through the Web using supported web browsers. For a list of supported web browsers, refer to the Installation and Configuration Guide shipped with the product.

Getting ready

As stated earlier, make sure that IBM Cognos 10 BI is installed and configured. Install and configure the GO Sales and GO Data Warehouse samples. Use the gateway URI to log on to the web interface called Cognos Connection.

How to do it...

To explore Cognos Connection, perform the following steps:

Log on to Cognos Connection using the gateway URI, which may be similar to http://<HostName>:<PortNumber>/ibmcognos/cgi-bin/cognos.cgi.
Take note of the Cognos Connection interface. It has the GO Sales and GO Data Warehouse samples visible.
Note the blue-colored folder icon shown in the preceding screenshot. It represents metadata model packages that are published to Cognos Connection using the Cognos Framework Manager tool. These packages have objects that represent business data objects, relationships, and calculations, which can be used to author reports and dashboards. Refer to the book IBM Cognos TM1 Cookbook by Packt Publishing to learn how to create metadata model packages.
From the toolbar, click on Launch. This will open a menu showing different studios, each having different functionality, as shown in the following screenshot:
We will use Business Insight and Business Insight Advanced, which are the first two choices in the preceding menu. These are the two components used to create and view dashboards. For the other options, refer to the corresponding books by the same publisher. For instance, refer to the book IBM Cognos 8 Report Studio Cookbook to learn more about creating and distributing complex reports. Query Studio and Analysis Studio are meant to provide business users with the facility to slice and dice business data themselves. Event Studio is meant to define business situations and corresponding actions.
Coming back to Cognos Connection, note that a yellow-colored folder icon represents a user-defined folder, which may or may not contain other published metadata model packages, reports, dashboards, and other content. In our case, we have a user-defined folder called Samples. This was created when we installed and configured the samples shipped with the product.
Click on the New Folder icon on the toolbar to create a user-defined folder. Other options are also visible here, for instance, to create a new dashboard.
Click on the user-defined folder, Samples, to view its contents, as shown in the following screenshot:
As shown in the preceding screenshot, it has more such folders, each having its own content. The top part of the pane shows the navigation path. Let's navigate deeper into Models | Business Insight Samples to show some sample dashboards, created using IBM Cognos Business Insight, as shown in the following screenshot:
Click on one of these links to view the corresponding dashboard. For instance, click on Sales Dashboard (Interactive) to view the dashboard, as shown in the following screenshot:
The dashboard can also be opened in the authoring tool, which in this case is IBM Cognos Business Insight, by clicking on the icon on the extreme right in Cognos Connection.
It will show the same result as shown in the preceding screenshot. We will see the Business Insight interface in detail later in this article.

How it works...

Cognos Connection is the primary user interface that the user sees when he/she logs in for the first time. Business data first has to be identified and imported into a metadata model using the Cognos Framework Manager tool. Relationships (inner/outer joins) and calculations are then created, and the resultant metadata model package is published to the IBM Cognos 10 BI Server. This becomes available on Cognos Connection. Users are given access to the appropriate studios on Cognos Connection, according to their needs. Analyses, reports, and dashboards are then created and distributed using one of these studios. The preceding sample used Business Insight, for instance. Later sections in this article will look more into Business Insight and Business Insight Advanced. The next section focuses on the Business Insight interface details from the navigation perspective.

Exploring the IBM Cognos Business Insight User Interface

In this recipe we will explore the IBM Cognos Business Insight user interface in more detail. We will explore various areas of the UI, each dedicated to performing different actions.

Getting ready

As stated earlier, we will be exploring different sections of Cognos Business Insight. Hence, make sure that the IBM Cognos 10 BI installation is open and the samples are set up properly. We will start the recipe assuming that the IBM Cognos Connection window is already open on the screen.

How to do it...

To explore the IBM Cognos Business Insight user interface, perform the following steps:

In the IBM Cognos Connection window, navigate to Business Insight Samples, as shown in the following screenshot:
Click on one of the dashboards, for instance Marketing Dashboard, to open the dashboard in Business Insight. Different areas are labeled, as shown in the following figure:
The overall layout is termed the Dashboard. The topmost toolbar is called the Application bar. The Application bar contains different icons to manage the dashboard as a whole. For instance, we can create, open, e-mail, share, or save the dashboard using one of the icons on the Application bar. The user can explore the different icons on the Application bar by hovering the mouse pointer over them. Hovering displays a tooltip, which has brief but self-explanatory help text.
Similarly, there is a Widget toolbar for every widget, which gets activated when the user clicks on the corresponding widget. When the mouse is focused away from the widget, the Widget toolbar disappears. It has various options, for instance, to refresh the widget data, print as PDF, resize to fit content, and so on. It also provides the user with the capability to change the chart type as well as the color palette. However, all these options have help text associated with them, which is activated on mouse hover.
The Content tab and Content pane show the list of objects available on Cognos Connection. The directory structure on Cognos Connection can be navigated using the Content pane and Content tab, and hence, available objects can be added to or removed from the dashboard. Drag-and-drop functionality is provided, as a result of which creating and editing a dashboard is as simple as moving objects between the Dashboard area and Cognos Connection.
The Toolbox tab displays additional widgets. The Slider Filter and Select Value Filter widgets allow the user to filter report content.
The other toolbox widgets allow the user to add more report content to the dashboard, such as HTML content, images, RSS feeds, and rich text.

How it works...

In the preceding section, we have seen the basic areas of Business Insight. More than one user can log on to the IBM Cognos 10 BI server and create various objects on Cognos Connection. These objects include packages, reports, cubes, templates, and statistics, to name a few. These objects can be created using one or more of the tools available to users. For instance, reports can be created using one of the available studios. Cubes can be created using IBM Cognos TM1 or IBM Cognos Transformer and published on Cognos Connection. Metadata model packages can be created using IBM Cognos Framework Manager and published on Cognos Connection. These objects can then be dragged, dropped, and formatted as standalone objects in Cognos Business Insight, and hence, dashboards can be created.

Some Basic Concepts of Theano

Packt
21 Feb 2018
13 min read
In this article, Christopher Bourez, the author of the book Deep Learning with Theano, presents Theano as a compute engine and covers the basics of symbolic computing with Theano. Symbolic computing consists of building graphs of operations that are optimized later for a specific architecture, using the computation libraries available for that architecture.

Although this might sound far from practical application, Theano may be defined simply as a library for scientific computing; it has been available since 2007 and is particularly well suited to deep learning. Two important features are at the core of any deep learning library: tensor operations, and the capability to run the same code on either CPU or GPU. These two features enable us to work with massive amounts of multi-dimensional data. Moreover, Theano offers automatic differentiation, a very useful feature for solving a wider range of numeric optimization problems than deep learning alone.

The content of the article covers the following points:

- Theano installation and loading
- Tensors and algebra
- Symbolic programming

Need for tensors

Usually, input data is represented with multi-dimensional arrays:

- Images have three dimensions: the number of channels, and the width and height of the image
- Sounds and time series have one dimension: the time length
- Natural language sequences can be represented by two-dimensional arrays: the time length and the alphabet or vocabulary length

In Theano, multi-dimensional arrays are implemented with an abstraction class named tensor, with many more transformations available than for traditional arrays in a language like Python. At each stage of a neural net, computations such as matrix multiplications involve multiple operations on these multi-dimensional arrays.

Classical arrays in programming languages do not have enough built-in functionality to handle multi-dimensional computations and manipulations well and quickly. Computations on multi-dimensional arrays have a long history of optimization, backed by a wealth of libraries and hardware. One of the most important gains in speed has come from the massively parallel architecture of the graphics processing unit (GPU), with its ability to compute on a large number of cores, from a few hundred to a few thousand. Compared to a traditional CPU, for example a quad-core, 12-core, or 32-core engine, the gain with a GPU can range from a 5x to a 100x speedup, even if part of the code is still executed on the CPU (data loading, GPU piloting, result outputting). The main bottleneck when using a GPU is usually the transfer of data between CPU memory and GPU memory; still, when well programmed, using a GPU brings a significant, order-of-magnitude increase in speed. Getting results in days rather than months, or hours rather than days, is an undeniable benefit for experimentation.

The Theano engine has been designed to address these two challenges of multi-dimensional arrays and architecture abstraction from the beginning. There is another undeniable benefit of Theano for scientific computation: the automatic differentiation of functions of multi-dimensional arrays, a feature well suited to model parameter inference via objective function minimization. Such a feature facilitates experimentation by relieving the pain of computing derivatives, which might not be so complicated, but is prone to many errors.
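To make the automatic differentiation point concrete before moving on to installation, here is a minimal sketch of what it looks like in practice. This snippet is our own illustration rather than part of the original article, and it assumes a working Theano install; the gradient machinery itself is covered in depth later in the book.

import theano
import theano.tensor as T

x = T.scalar('x')
y = x ** 2

# Theano builds the symbolic derivative dy/dx = 2*x for us.
dy_dx = theano.grad(y, x)
f = theano.function([x], dy_dx)
print(f(3.0))   # 6.0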
Installing and loading Theano

Conda package and environment manager

The easiest way to install Theano is to use conda, a cross-platform package and environment manager. If conda is not already installed on your operating system, the fastest way to get it is to download the Miniconda installer from https://conda.io/miniconda.html. For example, for conda under 64-bit Linux and Python 2.7:

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
chmod +x Miniconda2-latest-Linux-x86_64.sh
bash ./Miniconda2-latest-Linux-x86_64.sh

Conda enables you to create new environments in which the versions of Python (2 or 3) and the installed packages may differ. The conda root environment uses the same version of Python as the one installed on the system on which you installed conda.

Install and run Theano on CPU

Lastly, let's install Theano:

conda install theano

Run a Python session and try the following commands to check your configuration:

>>> import theano
>>> theano.config.device
'cpu'
>>> theano.config.floatX
'float64'
>>> print(theano.config)

The last command prints the full Theano configuration. The theano.config object contains keys for many configuration options. To infer the configuration options, Theano looks first at the ~/.theanorc file, then at any available environment variables, which override the former, and last at variables set in the code, which take the highest precedence:

>>> theano.config.floatX = 'float32'

Some of the properties are read-only and cannot be changed in the code, but the floatX property, which sets the default floating-point precision, is among the properties that can be changed directly in code. It is advised to use float32, since GPUs have a long history without float64 support, float64 execution on GPUs is slower (sometimes much slower, 2x to 32x on the latest-generation Pascal hardware), and float32 precision is enough in practice.

GPU drivers and libraries

Theano enables the use of the GPU (graphics processing unit), the unit usually used to compute the graphics displayed on the computer screen. To have Theano work on the GPU as well, a GPU backend library is required on your system.

The CUDA library (for NVIDIA GPU cards only) is the main choice for GPU computation. There is also the OpenCL standard, which is open source, but far less developed and much more experimental and rudimentary in Theano. Most scientific computation still occurs on NVIDIA cards today. If you have an NVIDIA GPU card, download CUDA from the NVIDIA website at https://developer.nvidia.com/cuda-downloads and install it. The installer will first install the latest version of the GPU drivers if they are not already installed. It will install the CUDA library in the /usr/local/cuda directory.

Install the cuDNN library, also from NVIDIA, which offers faster implementations of some operations for the GPU. To install it, I usually copy the /usr/local/cuda directory to a new directory, /usr/local/cuda-{CUDA_VERSION}-cudnn-{CUDNN_VERSION}, so that I can choose the version of CUDA and cuDNN depending on the deep learning technology I use and its compatibility.
In your .bashrc profile, add the following lines to set the $PATH and $LD_LIBRARY_PATH variables:

export PATH=/usr/local/cuda-8.0-cudnn-5.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0-cudnn-5.1/lib64:/usr/local/cuda-8.0-cudnn-5.1/lib:$LD_LIBRARY_PATH

Install and run Theano on GPU

N-dimensional GPU arrays have been implemented in Python in six different GPU libraries (Theano/CudaNdarray, PyCUDA/GPUArray, CUDAMAT/CUDAMatrix, PyOpenCL/GPUArray, Clyther, Copperhead), each a subset of NumPy.ndarray. Libgpuarray is a backend library that presents them behind a common interface with the same properties. To install libgpuarray with conda:

conda install pygpu

To run Theano in GPU mode, you need to configure the config.device variable before execution, since it is a read-only variable once the code is run. Here it is set with the THEANO_FLAGS environment variable:

THEANO_FLAGS="device=cuda,floatX=float32" python
>>> import theano
Using cuDNN version 5110 on context None
Mapped name None to device cuda: Tesla K80 (0000:83:00.0)
>>> theano.config.device
'gpu'
>>> theano.config.floatX
'float32'

The first lines of output show that the GPU device has been correctly detected, and specify which GPU is used. By default, Theano activates CNMeM, a faster CUDA memory allocator; an initial preallocation can be specified with the gpuarray.preallocate option. In the end, my launch command will be:

THEANO_FLAGS="device=cuda,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context None
Preallocating 9151/11439 Mb (0.800000) on cuda
Mapped name None to device cuda: Tesla K80 (0000:83:00.0)

The first line confirms that cuDNN is active, and the second confirms the memory preallocation. The third line gives the default context name (that is, None when the flag device=cuda is set) and the model of the GPU used, while the default context name for the CPU will always be cpu.

It is possible to specify a GPU other than the first one by setting the device to cuda0, cuda1, and so on, on multi-GPU computers. It is also possible to run a program on multiple GPUs in parallel or in sequence (when the memory of one GPU is not sufficient), in particular when training very deep neural nets. In this case, the contexts flag, contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3, activates multiple GPUs instead of one and assigns a context name to each GPU device to be used in the code. For example, on a 4-GPU instance:

THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1;dev2->cuda2;dev3->cuda3,floatX=float32,gpuarray.preallocate=0.8" python
>>> import theano
Using cuDNN version 5110 on context None
Preallocating 9177/11471 Mb (0.800000) on cuda0
Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
Using cuDNN version 5110 on context dev1
Preallocating 9177/11471 Mb (0.800000) on cuda1
Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)
Using cuDNN version 5110 on context dev2
Preallocating 9177/11471 Mb (0.800000) on cuda2
Mapped name dev2 to device cuda2: Tesla K80 (0000:87:00.0)
Using cuDNN version 5110 on context dev3
Preallocating 9177/11471 Mb (0.800000) on cuda3
Mapped name dev3 to device cuda3: Tesla K80 (0000:88:00.0)

To assign computations to a specific GPU in this multi-GPU setting, the names we chose, dev0, dev1, dev2, and dev3, have been mapped to each device (cuda0, cuda1, cuda2, cuda3). This name mapping enables us to write code that is independent of the underlying GPU assignments and libraries (CUDA or other).
To keep the current configuration flags active for every Python session or execution without using environment variables, save your configuration in the ~/.theanorc file as:

[global]
floatX = float32
device = cuda0

[gpuarray]
preallocate = 1

Now, you can simply run the python command. You are all set.

Tensors

In Python, some scientific libraries such as NumPy provide multi-dimensional arrays. Theano doesn't replace NumPy, but works in concert with it. In particular, NumPy is used for the initialization of tensors.

To perform computation on CPU and GPU indifferently, variables are symbolic and represented by the tensor class, an abstraction, and writing numerical expressions consists of building a computation graph of Variable nodes and Apply nodes. Depending on the platform for which the computation graph will be compiled, tensors are replaced either:

By a TensorType variable, whose data has to be on the CPU
By a GpuArrayType variable, whose data has to be on the GPU

That way, code can be written independently of the platform on which it will be executed.

Here are a few tensor objects:

theano.tensor.scalar: a 0-dimensional array, for example 1 or 2.5
theano.tensor.vector: a 1-dimensional array, for example [0, 3, 20]
theano.tensor.matrix: a 2-dimensional array, for example [[2, 3], [1, 5]]
theano.tensor.tensor3: a 3-dimensional array, for example [[[2, 3], [1, 5]], [[1, 2], [3, 4]]]

Playing with these Theano objects in the Python shell gives a better idea:

>>> import theano.tensor as T
>>> T.scalar()
<TensorType(float32, scalar)>
>>> T.iscalar()
<TensorType(int32, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> T.dscalar()
<TensorType(float64, scalar)>

With an i, l, f, or d letter in front of the object name, you instantiate a tensor of a given type: int32, int64, float32, or float64. For real-valued (floating-point) data, it is advised to use the direct form T.scalar() instead of the f or d variants, since the direct form will use your current configuration for floats:

>>> theano.config.floatX = 'float64'
>>> T.scalar()
<TensorType(float64, scalar)>
>>> T.fscalar()
<TensorType(float32, scalar)>
>>> theano.config.floatX = 'float32'
>>> T.scalar()
<TensorType(float32, scalar)>

Symbolic variables either:

Play the role of placeholders, as a starting point to build your graph of numerical operations (such as addition or multiplication): they receive the flow of the incoming data during evaluation, once the graph has been compiled
Represent intermediate or output results

Symbolic variables and operations are both part of a computation graph that will be compiled towards either the CPU or the GPU for fast execution. Let's write a first computation graph consisting of a simple addition:

>>> x = T.matrix('x')
>>> y = T.matrix('y')
>>> z = x + y
>>> theano.pp(z)
'(x + y)'
>>> z.eval({x: [[1, 2], [1, 3]], y: [[1, 0], [3, 4]]})
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)

First, two symbolic variables, or Variable nodes, are created with the names x and y, and an addition operation, an Apply node, is applied between both of them to create a new symbolic variable, z, in the computation graph. The pretty-print function pp prints the expression represented by Theano symbolic variables. eval evaluates the value of the output variable z when the first two variables, x and y, are initialized with two numerical 2-dimensional arrays.
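eval is convenient for quick checks like this, but in practice a graph is usually compiled once into a callable with theano.function and then evaluated many times. The following is a minimal sketch of that pattern, added here for illustration (it is not part of the original excerpt) and reusing the x, y, and z symbols defined above:

>>> f = theano.function(inputs=[x, y], outputs=z)
>>> f([[1, 2], [1, 3]], [[1, 0], [3, 4]])
array([[ 2.,  2.],
       [ 4.,  7.]], dtype=float32)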
The following example makes explicit the difference between the variables and their names:

>>> a = T.matrix()
>>> b = T.matrix()
>>> theano.pp(a + b)
'(<TensorType(float32, matrix)> + <TensorType(float32, matrix)>)'

Without names, it is more complicated to trace the nodes in a large graph. When printing the computation graph, names significantly help to diagnose problems, while the variables themselves are only used to handle the objects in the graph:

>>> x = T.matrix('x')
>>> x = x + x
>>> theano.pp(x)
'(x + x)'

Here, the original symbolic variable, named x, does not change and stays part of the computation graph. x + x creates a new symbolic variable that we assign to the Python variable x.

Note also that the plural form initializes multiple tensors, with their names, at the same time:

>>> x, y, z = T.matrices('x', 'y', 'z')

Now, let's have a look at the different functions to display the graph.

Summary

This article gave a brief idea of how to download and install Theano on various platforms, along with packages such as NumPy and SciPy, and introduced its basic tensor objects.

Resources for Article:

Further resources on this subject:
Introduction to Deep Learning [article]
Getting Started with Deep Learning [article]
Practical Applications of Deep Learning [article]

Python Graphics: Combining Raster and Vector Pictures

Packt
23 Nov 2010
12 min read
Python 2.6 Graphics Cookbook: Over 100 great recipes for creating and animating graphics using Python.

Because we are not altering and manipulating the actual properties of the images, we do not need the Python Imaging Library (PIL) in this chapter. We need to work exclusively with GIF format images because that is what Tkinter deals with. We will also see how to use The GIMP as a tool to prepare images suitable for animation.

Simple animation of a GIF beach ball

We want to animate a raster image derived from a photograph. To keep things simple and clear, we are just going to move a photographic image (in GIF format) of a beach ball across a black background.

Getting ready

We need a suitable GIF image of an object that we want to animate. An example, named beachball.gif, has been provided.

How to do it...

Copy a .gif file from somewhere and paste it into a directory where you want to keep your work-in-progress pictures. Ensure that the path in your computer's file system leads to the image to be used. In the example below, the instruction ball = PhotoImage(file = "/constr/pics2/beachball.gif") says that the image to be used will be found in a directory (folder) called pics2, which is a sub-folder of another folder called constr. Then execute the following code.

# photoimage_animation_1.py
from Tkinter import *

root = Tk()
cycle_period = 100
cw = 300   # canvas width
ch = 200   # canvas height
canvas_1 = Canvas(root, width=cw, height=ch, bg="black")
canvas_1.grid(row=0, column=1)
posn_x = 10
posn_y = 10
shift_x = 2
shift_y = 1
ball = PhotoImage(file = "/constr/pics2/beachball.gif")
for i in range(1,100):               # end the program after 100 position shifts.
    posn_x += shift_x
    posn_y += shift_y
    canvas_1.create_image(posn_x, posn_y, anchor=NW, image=ball)
    canvas_1.update()                # This refreshes the drawing on the canvas.
    canvas_1.after(cycle_period)     # This makes execution pause for 100 milliseconds.
    canvas_1.delete(ALL)             # This erases everything on the canvas.
root.mainloop()

How it works...

The image of the beach ball is shifted across a canvas. Photo-type images always occupy a rectangular area of the screen. The size of this box, called the bounding box, is the size of the image. We have used a black background, so the black corners on the image of our beach ball cannot be seen.

The vector walking creature

We make a pair of walking legs using vector graphics. We want to use these legs together with pieces of raster images and see how far we can go in making appealing animations. We import the Tkinter, math, and time modules. The math module is needed to provide the trigonometry that sustains the geometric relations that move the parts of the leg in relation to each other.

Getting ready

We will be using the Tkinter and time modules to animate the movement of lines and circles. You will see some trigonometry in the code. If you do not like mathematics, you can just cut and paste the code without needing to understand exactly how the maths works.
However, if you are a friend of mathematics it is fun to watch sine, cosine, and tangent working together to make a child smile. How to do it... Execute the program as shown in the previous image. # walking_creature_1.py # >>>>>>>>>>>>>>>> from Tkinter import * import math import time root = Tk() root.title("The thing that Strides") cw = 400 # canvas width ch = 100 # canvas height #GRAVITY = 4 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 100 # time between new positions of the ball (milliseconds). base_x = 20 base_y = 100 hip_h = 40 thy = 20 #=============================================== # Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride. hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60] #15 hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0] #15 step_x = [0, 10, 20, 30, 40, 50, 60, 60] # 8 = Nhip step_y = [0, 35, 45, 50, 43, 32, 10, 0] # The merging of the separate x and y lists into a single sequence. #================================== # Given a line joining two points xy0 and xy1, the base of an isosceles triangle, # as well as the length of one side, "thy" . This returns the coordinates of # the apex joining the equal-length sides. def kneePosition(x0, y0, x1, y1, thy): theta_1 = math.atan2((y1 - y0), (x1 - x0)) L1 = math.sqrt( (y1 - y0)**2 + (x1 - x0)**2) if L1/2 < thy: # The sign of alpha determines which way the knees bend. alpha = -math.acos(L1/(2*thy)) # Avian #alpha = math.acos(L1/(2*thy)) # Mammalian else: alpha = 0.0 theta_2 = alpha + theta_1 x_knee = x0 + thy * math.cos(theta_2) y_knee = y0 + thy * math.sin(theta_2) return x_knee, y_knee def animdelay(): chart_1.update() # This refreshes the drawing on the canvas. chart_1.after(cycle_period) # This makes execution pause for 200 milliseconds. chart_1.delete(ALL) # This erases *almost* everything on the canvas. # Does not delete the text from inside a function. bx_stay = base_x by_stay = base_y for j in range(0,11): # Number of steps to be taken - arbitrary. astep_x = 60*j bstep_x = astep_x + 30 cstep_x = 60*j + 15 aa = len(step_x) -1 for k in range(0,len(hip_x)-1): # Motion of the hips in a stride of each foot. cx0 = base_x + cstep_x + hip_x[k] cy0 = base_y - hip_h - hip_y[k] cx1 = base_x + cstep_x + hip_x[k+1] cy1 = base_y - hip_h - hip_y[k+1] chart_1.create_line(cx0, cy0 ,cx1 ,cy1) chart_1.create_oval(cx1-10 ,cy1-10 ,cx1+10 ,cy1+10, fill="orange") if k >= 0 and k <= len(step_x)-2: # Trajectory of the right foot. ax0 = base_x + astep_x + step_x[k] ax1 = base_x + astep_x + step_x[k+1] ay0 = base_y - step_y[k] ay1 = base_y - step_y[k+1] ax_stay = ax1 ay_stay = ay1 if k >= len(step_x)-1 and k <= 2*len(step_x)-2: # Trajectory of the left foot. 
bx0 = base_x + bstep_x + step_x[k-aa] bx1 = base_x + bstep_x + step_x[k-aa+1] by0 = base_y - step_y[k-aa] by1 = base_y - step_y[k-aa+1] bx_stay = bx1 by_stay = by1 aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy) chart_1.create_line(ax_stay, ay_stay ,aknee_xy[0], aknee_xy[1], width = 3, fill="orange") chart_1.create_line(cx1, cy1 ,aknee_xy[0], aknee_xy[1], width = 3, fill="orange") chart_1.create_oval(ax_stay-5 ,ay1-5 ,ax1+5 ,ay1+5, fill="green") chart_1.create_oval(bx_stay-5 ,by_stay-5 ,bx_stay+5 ,by_stay+5, fill="blue") bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy) chart_1.create_line(bx_stay, by_stay ,bknee_xy[0], bknee_xy[1], width = 3, fill="pink") chart_1.create_line(cx1, cy1 ,bknee_xy[0], bknee_xy[1], width = 3, fill="pink") animdelay() root.mainloop() How it works... Without getting bogged down in detail, the strategy in the program consists of defning the motion of a foot while walking one stride. This motion is defned by eight relative positions given by the two lists step_x (horizontal) and step_y (vertical). The motion of the hips is given by a separate pair of x- and y-positions hip_x and hip_y. Trigonometry is used to work out the position of the knee on the assumption that the thigh and lower leg are the same length. The calculation is based on the sine rule taught in high school. Yes, we do learn useful things at school! The time-animation regulation instructions are assembled together as a function animdelay(). There's more In Python math module, two arc-tangent functions are available for calculating angles given the lengths of two adjacent sides. atan2(y,x) is the best because it takes care of the crazy things a tangent does on its way around a circle - tangent ficks from minus infnity to plus infnity as it passes through 90 degrees and any multiples thereof. A mathematical knee is quite happy to bend forward or backward in satisfying its equations. We make the sign of the angle negative for a backward-bending bird knee and positive for a forward bending mammalian knee. More Info Section 1 This animated walking hips-and-legs is used in the recipes that follow this to make a bird walk in the desert, a diplomat in palace grounds, and a spider in a forest. Bird with shoes walking in the Karroo We now coordinate the movement of four GIF images and the striding legs to make an Apteryx (a fightless bird like the kiwi) that walks. Getting ready We need the following GIF images: A background picture of a suitable landscape A bird body without legs A pair of garish-colored shoes to make the viewer smile The walking avian legs of the previous recipe The images used are karroo.gif, apteryx1.gif, and shoe1.gif. Note that the images of the bird and the shoe have transparent backgrounds which means there is no rectangular background to be seen surrounding the bird or the shoe. In the recipe following this one, we will see the simplest way to achieve the necessary transparency. How to do it... Execute the program shown in the usual way. # walking_birdy_1.py # >>>>>>>>>>>>>>>> from Tkinter import * import math import time root = Tk() root.title("A Walking birdy gif and shoes images") cw = 800 # canvas width ch = 200 # canvas height #GRAVITY = 4 chart_1 = Canvas(root, width=cw, height=ch, background="white") chart_1.grid(row=0, column=0) cycle_period = 80 # time between new positions of the ball (milliseconds). 
im_backdrop = "/constr/pics1/karoo.gif" im_bird = "/constr/pics1/apteryx1.gif" im_shoe = "/constr/pics1/shoe1.gif" birdy =PhotoImage(file= im_bird) shoey =PhotoImage(file= im_shoe) backdrop = PhotoImage(file= im_backdrop) chart_1.create_image(0 ,0 ,anchor=NW, image=backdrop) base_x = 20 base_y = 190 hip_h = 70 thy = 60 #========================================== # Hip positions: Nhip = 2 x Nstep, the number of steps per foot per stride. hip_x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 60, 60] #15 hip_y = [0, 8, 12, 16, 12, 8, 0, 0, 0, 8, 12, 16, 12, 8, 0] #15 step_x = [0, 10, 20, 30, 40, 50, 60, 60] # 8 = Nhip step_y = [0, 35, 45, 50, 43, 32, 10, 0] #============================================= # Given a line joining two points xy0 and xy1, the base of an isosceles triangle, # as well as the length of one side, "thy" this returns the coordinates of # the apex joining the equal-length sides. def kneePosition(x0, y0, x1, y1, thy): theta_1 = math.atan2(-(y1 - y0), (x1 - x0)) L1 = math.sqrt( (y1 - y0)**2 + (x1 - x0)**2) alpha = math.atan2(hip_h,L1) theta_2 = -(theta_1 - alpha) x_knee = x0 + thy * math.cos(theta_2) y_knee = y0 + thy * math.sin(theta_2) return x_knee, y_knee def animdelay(): chart_1.update() # Refresh the drawing on the canvas. chart_1.after(cycle_period) # Pause execution pause for X millise-conds. chart_1.delete("walking") # Erases everything on the canvas. bx_stay = base_x by_stay = base_y for j in range(0,13): # Number of steps to be taken - arbitrary. astep_x = 60*j bstep_x = astep_x + 30 cstep_x = 60*j + 15 aa = len(step_x) -1 for k in range(0,len(hip_x)-1): # Motion of the hips in a stride of each foot. cx0 = base_x + cstep_x + hip_x[k] cy0 = base_y - hip_h - hip_y[k] cx1 = base_x + cstep_x + hip_x[k+1] cy1 = base_y - hip_h - hip_y[k+1] #chart_1.create_image(cx1-55 ,cy1+20 ,anchor=SW, image=birdy, tag="walking") if k >= 0 and k <= len(step_x)-2: # Trajectory of the right foot. ax0 = base_x + astep_x + step_x[k] ax1 = base_x + astep_x + step_x[k+1] ay0 = base_y - 10 - step_y[k] ay1 = base_y - 10 -step_y[k+1] ax_stay = ax1 ay_stay = ay1 if k >= len(step_x)-1 and k <= 2*len(step_x)-2: # Trajectory of the left foot. bx0 = base_x + bstep_x + step_x[k-aa] bx1 = base_x + bstep_x + step_x[k-aa+1] by0 = base_y - 10 - step_y[k-aa] by1 = base_y - 10 - step_y[k-aa+1] bx_stay = bx1 by_stay = by1 chart_1.create_image(ax_stay-5 ,ay_stay + 10 ,anchor=SW, im-age=shoey, tag="walking") chart_1.create_image(bx_stay-5 ,by_stay + 10 ,anchor=SW, im-age=shoey, tag="walking") aknee_xy = kneePosition(ax_stay, ay_stay, cx1, cy1, thy) chart_1.create_line(ax_stay, ay_stay-15 ,aknee_xy[0], aknee_xy[1], width = 5, fill="orange", tag="walking") chart_1.create_line(cx1, cy1 ,aknee_xy[0], aknee_xy[1], width = 5, fill="orange", tag="walking") bknee_xy = kneePosition(bx_stay, by_stay, cx1, cy1, thy) chart_1.create_line(bx_stay, by_stay-15 ,bknee_xy[0], bknee_xy[1], width = 5, fill="pink", tag="walking") chart_1.create_line(cx1, cy1 ,bknee_xy[0], bknee_xy[1], width = 5, fill="pink", tag="walking") chart_1.create_image(cx1-55 ,cy1+20 ,anchor=SW, image=birdy, tag="walking") animdelay() root.mainloop() How it works... The same remarks concerning the trigonometry made in the previous recipe apply here. What we see here now is the ease with which vector objects and raster images can be combined once suitable GIF images have been prepared. There's more... 
For teachers and their students who want to make lessons on a computer, these techniques offer all kinds of possibilities: history tours and re-enactments, geography tours, and science experiments. Get the students to do projects telling stories. Animated year books?
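As a footnote to the arc-tangent note in the walking creature recipe above, the following tiny sketch (ours, not from the book) shows why math.atan2 is the safer choice compared with taking a plain arc-tangent of y/x:

import math

# atan2 uses the signs of both arguments, so it returns the correct quadrant
# and never divides by zero the way atan(y/x) would when x is zero.
print(math.degrees(math.atan2(1, 0)))     # 90.0; math.atan(1/0) would raise ZeroDivisionError
print(math.degrees(math.atan2(1, -1)))    # 135.0; math.atan(1/-1) gives -45.0, the wrong quadrant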

Working with large data sources

Packt
08 Jul 2015
20 min read
In this article, by Duncan M. McGreggor, author of the book Mastering matplotlib, we come across the use of NumPy in the world of matplotlib and big data, problems with large data sources, and the possible solutions to these problems. (For more resources related to this topic, see here.) Most of the data that users feed into matplotlib when generating plots is from NumPy. NumPy is one of the fastest ways of processing numerical and array-based data in Python (if not the fastest), so this makes sense. However by default, NumPy works on in-memory database. If the dataset that you want to plot is larger than the total RAM available on your system, performance is going to plummet. In the following section, we're going to take a look at an example that illustrates this limitation. But first, let's get our notebook set up, as follows: In [1]: import matplotlib        matplotlib.use('nbagg')        %matplotlib inline Here are the modules that we are going to use: In [2]: import glob, io, math, os         import psutil        import numpy as np        import pandas as pd        import tables as tb        from scipy import interpolate        from scipy.stats import burr, norm        import matplotlib as mpl        import matplotlib.pyplot as plt        from IPython.display import Image We'll use the custom style sheet that we created earlier, as follows: In [3]: plt.style.use("../styles/superheroine-2.mplstyle") An example problem To keep things manageable for an in-memory example, we're going to limit our generated dataset to 100 million points by using one of SciPy's many statistical distributions, as follows: In [4]: (c, d) = (10.8, 4.2)        (mean, var, skew, kurt) = burr.stats(c, d, moments='mvsk') The Burr distribution, also known as the Singh–Maddala distribution, is commonly used to model household income. Next, we'll use the burr object's method to generate a random population with our desired count, as follows: In [5]: r = burr.rvs(c, d, size=100000000) Creating 100 million data points in the last call took about 10 seconds on a moderately recent workstation, with the RAM usage peaking at about 2.25 GB (before the garbage collection kicked in). Let's make sure that it's the size we expect, as follows: In [6]: len(r) Out[6]: 100000000 If we save this to a file, it weighs in at about three-fourths of a gigabyte: In [7]: r.tofile("../data/points.bin") In [8]: ls -alh ../data/points.bin        -rw-r--r-- 1 oubiwann staff 763M Mar 20 11:35 points.bin This actually does fit in the memory on a machine with a RAM of 8 GB, but generating much larger files tends to be problematic. We can reuse it multiple times though, to reach a size that is larger than what can fit in the system RAM. Before we do this, let's take a look at what we've got by generating a smooth curve for the probability distribution, as follows: In [9]: x = np.linspace(burr.ppf(0.0001, c, d),                          burr.ppf(0.9999, c, d), 100)          y = burr.pdf(x, c, d) In [10]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.plot(x, y, linewidth=5, alpha=0.7)          axes.hist(r, bins=100, normed=True)          plt.show() The following plot is the result of the preceding code: Our plot of the Burr probability distribution function, along with the 100-bin histogram with a sample size of 100 million points, took about 7 seconds to render. This is due to the fact that NumPy handles most of the work, and we only displayed a limited number of visual elements. 
What would happen if we did try to plot all the 100 million points? This can be checked by the following code: In [11]: (figure, axes) = plt.subplots()          axes.plot(r)          plt.show() formatters.py:239: FormatterWarning: Exception in image/png formatter: Allocated too many blocks After about 30 seconds of crunching, the preceding error was thrown—the Agg backend (a shared library) simply couldn't handle the number of artists required to render all the points. But for now, this case clarifies the point that we stated a while back—our first plot rendered relatively quickly because we were selective about the data we chose to present, given the large number of points with which we are working. However, let's say we have data from the files that are too large to fit into the memory. What do we do about this? Possible ways to address this include the following: Moving the data out of the memory and into the filesystem Moving the data off the filesystem and into the databases We will explore examples of these in the following section. Big data on the filesystem The first of the two proposed solutions for large datasets involves not burdening the system memory with data, but rather leaving it on the filesystem. There are several ways to accomplish this, but the following two methods in particular are the most common in the world of NumPy and matplotlib: NumPy's memmap function: This function creates memory-mapped files that are useful if you wish to access small segments of large files on the disk without having to read the whole file into the memory. PyTables: This is a package that is used to manage hierarchical datasets. It is built on the top of the HDF5 and NumPy libraries and is designed to efficiently and easily cope with extremely large amounts of data. We will examine each in turn. NumPy's memmap function Let's restart the IPython kernel by going to the IPython menu at the top of notebook page, selecting Kernel, and then clicking on Restart. When the dialog box pops up, click on Restart. Then, re-execute the first few lines of the notebook by importing the required libraries and getting our style sheet set up. Once the kernel is restarted, take a look at the RAM utilization on your system for a fresh Python process for the notebook: In [4]: Image("memory-before.png") Out[4]: The following screenshot shows the RAM utilization for a fresh Python process: Now, let's load the array data that we previously saved to disk and recheck the memory utilization, as follows: In [5]: data = np.fromfile("../data/points.bin")        data_shape = data.shape        data_len = len(data)        data_len Out[5]: 100000000 In [6]: Image("memory-after.png") Out[6]: The following screenshot shows the memory utilization after loading the array data: This took about five seconds to load, with the memory consumption equivalent to the file size of the data. This means that if we wanted to build some sample data that was too large to fit in the memory, we'd need about 11 of those files concatenated, as follows: In [7]: 8 * 1024 Out[7]: 8192 In [8]: filesize = 763        8192 / filesize Out[8]: 10.73656618610747 However, this is only if the entire memory was available. Let's see how much memory is available right now, as follows: In [9]: del data In [10]: psutil.virtual_memory().available / 1024**2 Out[10]: 2449.1796875 That's 2.5 GB. So, to overrun our RAM, we'll just need a fraction of the total. 
This is done in the following way: In [11]: 2449 / filesize Out[11]: 3.2096985583224114 The preceding output means that we only need four of our original files to create a file that won't fit in memory. However, in the following section, we will still use 11 files to ensure that the data, if loaded into the memory, would be much larger than the memory. How do we create this large file for demonstration purposes (knowing that in a real-life situation, the data would already be created and potentially quite large)? We can try to use numpy.tile to create a file of the desired size (larger than memory), but this can make our system unusable for a significant period of time. Instead, let's use numpy.memmap, which will treat a file on the disk as an array, thus letting us work with data that is too large to fit into the memory. Let's load the data file again, but this time as a memory-mapped array, as follows: In [12]: data = np.memmap(            "../data/points.bin", mode="r", shape=data_shape) The loading of the array to a memmap object was very quick (compared to the process of bringing the contents of the file into the memory), taking less than a second to complete. Now, let's create a new file to write the data to. This file must be larger in size than our total system memory if held in memory (it will be smaller on the disk): In [13]: big_data_shape = (data_len * 11,)          big_data = np.memmap(              "../data/many-points.bin", dtype="uint8",              mode="w+", shape=big_data_shape) The preceding code creates a 1 GB file, which is mapped to an array that has the shape we requested and just contains zeros: In [14]: ls -alh ../data/many-points.bin          -rw-r--r-- 1 oubiwann staff 1.0G Apr 2 11:35 many-points.bin In [15]: big_data.shape Out[15]: (1100000000,) In [16]: big_data Out[16]: memmap([0, 0, 0, ..., 0, 0, 0], dtype=uint8) Now, let's fill the empty data structure with copies of the data we saved to the 763 MB file, as follows: In [17]: for x in range(11):              start = x * data_len              end = (x * data_len) + data_len              big_data[start:end] = data          big_data Out[17]: memmap([ 90, 71, 15, ..., 33, 244, 63], dtype=uint8) If you check your system memory before and after, you will only see minimal changes, which confirms that we are not creating an 8 GB data structure in memory. Furthermore, checking your system only takes a few seconds. Now, we can do some sanity checks on the resulting data and ensure that we have what we were trying to get, as follows: In [18]: big_data_len = len(big_data)          big_data_len Out[18]: 1100000000 In [19]: data[100000000 - 1] Out[19]: 63 In [20]: big_data[100000000 - 1] Out[20]: 63 Attempting to get the next index from our original dataset will throw an error (as shown in the following code), since it didn't have that index: In [21]: data[100000000] ----------------------------------------------------------- IndexError               Traceback (most recent call last) ... IndexError: index 100000000 is out of bounds ... But our new data does have an index, as shown in the following code: In [22]: big_data[100000000] Out[22]: 90 And then some: In [23]: big_data[1100000000 - 1] Out[23]: 63 We can also plot data from a memory-mapped array without a significant lag time.
However, note that in the following code, we will create a histogram from 1.1 billion points of data, so the plotting won't be instantaneous: In [24]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.hist(big_data, bins=100)          plt.show() The following plot is the result of the preceding code: The plotting took about 40 seconds to generate. The odd shape of the histogram is due to the fact that, with our data file-hacking, we have radically changed the nature of our data since we've increased the sample size linearly without regard for the distribution. The purpose of this demonstration wasn't to preserve a sample distribution, but rather to show how one can work with large datasets. What we have seen is not too shabby. Thanks to NumPy, matplotlib can work with data that is too large for memory, even if it is a bit slow iterating over hundreds of millions of data points from the disk. Can matplotlib do better? HDF5 and PyTables A commonly used file format in the scientific computing community is Hierarchical Data Format (HDF). HDF is a set of file formats (namely HDF4 and HDF5) that were originally developed at the National Center for Supercomputing Applications (NCSA), a unit of the University of Illinois at Urbana-Champaign, to store and organize large amounts of numerical data. The NCSA has been a great source of technical innovation in the computing industry: a Telnet client, the first graphical web browser, a web server that evolved into the Apache HTTP server, and HDF, which is of particular interest to us, were all developed there. It is a little-known fact that NCSA's web browser code was the ancestor of both the Netscape web browser and a prototype of Internet Explorer that was provided to Microsoft by a third party. HDF is supported by Python, R, Julia, Java, Octave, IDL, and MATLAB, to name a few. HDF5 offers significant improvements and useful simplifications over HDF4. It uses B-trees to index table objects and, as such, works well for write-once/read-many time series data. Common use cases span fields such as meteorological studies, biosciences, finance, and aviation. HDF5 files of multiterabyte sizes are common in these applications. Such a file is typically constructed from the analyses of multiple HDF5 source files, thus providing a single (and often extensive) source of grouped data for a particular application. The PyTables library is built on top of the Python HDF5 library and NumPy. As such, it not only provides access to one of the most widely used large data file formats in the scientific computing community, but also links data extracted from these files with the data types and objects provided by the fast Python numerical processing library. PyTables is also used in other projects. Pandas wraps PyTables, thus extending its convenient in-memory data structures, functions, and objects to large on-disk files. To use HDF data with Pandas, you'll want to create a pandas.HDFStore, read from HDF data sources with pandas.read_hdf, or write to one with the DataFrame's to_hdf method. Files that are too large to fit in the memory may be read and written by utilizing chunking techniques. Pandas does support disk-based DataFrame operations, but these are not very efficient due to the required assembly of columns of data upon reading back into the memory.
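To make the Pandas route concrete, here is a small sketch that is not from the book: it assumes PyTables is installed, and the file name, key, and column names are invented purely for illustration. It writes a DataFrame to a table-format HDF5 store, reads a filtered subset back, and iterates over the store in chunks so that the whole file never has to fit in memory.

import numpy as np
import pandas as pd

# Hypothetical store and columns, for illustration only.
store_path = "../data/example-store.h5"
frame = pd.DataFrame({
    "town": np.random.randint(0, 1000, size=1000000),
    "temp": np.random.randn(1000000),
})

# "table" format supports queries and chunked reads; data_columns makes
# the town column usable in where clauses.
frame.to_hdf(store_path, key="weather", format="table", mode="w",
             data_columns=["town"])

# Read back only the rows we need rather than the whole file.
subset = pd.read_hdf(store_path, "weather", where="town < 10")

# Or walk the store in manageable chunks.
with pd.HDFStore(store_path, mode="r") as store:
    total = 0.0
    for chunk in store.select("weather", chunksize=100000):
        total += chunk["temp"].sum()

The same pattern scales to files much larger than RAM, since only one chunk is materialized at a time.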
One project to keep an eye on under the PyData umbrella of projects is Blaze. It's an open wrapper and utility framework that can be used when you wish to work with large datasets and generalize actions such as creation, access, updates, and migration. Blaze supports not only HDF, but also SQL, CSV, and JSON. The API usage between Pandas and Blaze is very similar, and it offers a nice tool for developers who need to support multiple backends. In the following example, we will use PyTables directly to create an HDF5 file that is too large to fit in the memory (for a machine with 8 GB of RAM). We will follow these steps: Create a series of CSV source data files that take up approximately 14 GB of disk space Create an empty HDF5 file Create a table in the HDF5 file and provide the schema metadata and compression options Load the CSV source data into the HDF5 table Query the new data source once the data has been migrated Remember the temperature and precipitation data for St. Francis, in Kansas, USA, from a previous notebook? We are going to generate random data with similar columns for the purposes of the HDF5 example. This data will be generated from a normal distribution, which will be used in the guise of the temperature and precipitation data for hundreds of thousands of fictitious towns across the globe for the last century, as follows: In [25]: head = "country,town,year,month,precip,temp\n"          row = "{},{},{},{},{},{}\n"          filename = "../data/{}.csv"          town_count = 1000          (start_year, end_year) = (1894, 2014)          (start_month, end_month) = (1, 13)          sample_size = (1 + 2                        * town_count * (end_year - start_year)                        * (end_month - start_month))          countries = range(200)          towns = range(town_count)          years = range(start_year, end_year)          months = range(start_month, end_month)          for country in countries:             with open(filename.format(country), "w") as csvfile:                  csvfile.write(head)                  csvdata = ""                  weather_data = norm.rvs(size=sample_size)                  weather_index = 0                  for town in towns:                    for year in years:                          for month in months:                              csvdata += row.format(                                  country, town, year, month,                                  weather_data[weather_index],                                  weather_data[weather_index + 1])                              weather_index += 2                  csvfile.write(csvdata) Note that we generated a sample data population that was twice as large as the expected size in order to pull both the simulated temperature and precipitation data at the same time (from the same set). This will take about 30 minutes to run. When complete, you will see the following files: In [26]: ls -rtm ../data/*.csv          ../data/0.csv, ../data/1.csv, ../data/2.csv,          ../data/3.csv, ../data/4.csv, ../data/5.csv,          ...          ../data/194.csv, ../data/195.csv, ../data/196.csv,          ../data/197.csv, ../data/198.csv, ../data/199.csv A quick look at just one of the files reveals the size of each, as follows: In [27]: ls -lh ../data/0.csv          -rw-r--r-- 1 oubiwann staff 72M Mar 21 19:02 ../data/0.csv With each file being 72 MB in size, we have data that takes up 14 GB of disk space, which exceeds the size of the RAM of the system in question. Furthermore, running queries against so much data in the .csv files isn't going to be very efficient.
It's going to take a long time. So what are our options? Well, to read this data, HDF5 is a very good fit. In fact, it is designed for jobs like this. We will use PyTables to convert the .csv files to a single HDF5. We'll start by creating an empty table file, as follows: In [28]: tb_name = "../data/weather.h5t"          h5 = tb.open_file(tb_name, "w")          h5 Out[28]: File(filename=../data/weather.h5t, title='', mode='w',              root_uep='/', filters=Filters(                  complevel=0, shuffle=False, fletcher32=False,                  least_significant_digit=None))          / (RootGroup) '' Next, we'll provide some assistance to PyTables by indicating the data types of each column in our table, as follows: In [29]: data_types = np.dtype(              [("country", "<i8"),              ("town", "<i8"),              ("year", "<i8"),              ("month", "<i8"),               ("precip", "<f8"),              ("temp", "<f8")]) Also, let's define a compression filter that can be used by PyTables when saving our data, as follows: In [30]: filters = tb.Filters(complevel=5, complib='blosc') Now, we can create a table inside our new HDF5 file, as follows: In [31]: tab = h5.create_table(              "/", "weather",              description=data_types,              filters=filters) With that done, let's load each CSV file, read it in chunks so that we don't overload the memory, and then append it to our new HDF5 table, as follows: In [32]: for filename in glob.glob("../data/*.csv"):          it = pd.read_csv(filename, iterator=True, chunksize=10000)          for chunk in it:              tab.append(chunk.to_records(index=False))            tab.flush() Depending on your machine, the entire process of loading the CSV file, reading it in chunks, and appending to a new HDF5 table can take anywhere from 5 to 10 minutes. However, what started out as a collection of the .csv files that weigh in at 14 GB is now a single compressed 4.8 GB HDF5 file, as shown in the following code: In [33]: h5.get_filesize() Out[33]: 4758762819 Here's the metadata for the PyTables-wrapped HDF5 table after the data insertion: In [34]: tab Out[34]: /weather (Table(288000000,), shuffle, blosc(5)) '' description := { "country": Int64Col(shape=(), dflt=0, pos=0), "town": Int64Col(shape=(), dflt=0, pos=1), "year": Int64Col(shape=(), dflt=0, pos=2), "month": Int64Col(shape=(), dflt=0, pos=3), "precip": Float64Col(shape=(), dflt=0.0, pos=4), "temp": Float64Col(shape=(), dflt=0.0, pos=5)} byteorder := 'little' chunkshape := (1365,) Now that we've created our file, let's use it. 
Let's excerpt a few lines with an array slice, as follows: In [35]: tab[100000:100010] Out[35]: array([(0, 69, 1947, 5, -0.2328834718674, 0.06810312195695),          (0, 69, 1947, 6, 0.4724989007889, 1.9529216219569),          (0, 69, 1947, 7, -1.0757216683235, 1.0415374480545),          (0, 69, 1947, 8, -1.3700249968748, 3.0971874991576),          (0, 69, 1947, 9, 0.27279758311253, 0.8263207523831),          (0, 69, 1947, 10, -0.0475253104621, 1.4530808932953),          (0, 69, 1947, 11, -0.7555493935762, -1.2665440609117),          (0, 69, 1947, 12, 1.540049376928, 1.2338186532516),          (0, 69, 1948, 1, 0.829743501445, -0.1562732708511),          (0, 69, 1948, 2, 0.06924900463163, 1.187193711598)],          dtype=[('country', '<i8'), ('town', '<i8'),                ('year', '<i8'), ('month', '<i8'),                ('precip', '<f8'), ('temp', '<f8')]) In [36]: tab[100000:100010]["precip"] Out[36]: array([-0.23288347, 0.4724989 , -1.07572167,                -1.370025 , 0.27279758, -0.04752531,                -0.75554939, 1.54004938, 0.8297435 ,                0.069249 ]) When we're done with the file, we do the same thing that we would do with any other file-like object: In [37]: h5.close() If we want to work with it again, simply load it, as follows: In [38]: h5 = tb.open_file(tb_name, "r")          tab = h5.root.weather Let's try plotting the data from our HDF5 file: In [39]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.hist(tab[:1000000]["temp"], bins=100)          plt.show() Here's a plot for the first million data points: This histogram was rendered quickly, with a much better response time than what we've seen before. Hence, the process of accessing the HDF5 data is very fast. The next question might be "What about executing calculations against this data?" Unfortunately, running the following will consume an enormous amount of RAM: tab[:]["temp"].mean() We've just asked for all of the data—all of its 288 million rows. We are going to end up loading everything into the RAM, grinding the average workstation to a halt. Ideally though, when you iterate through the source data and create the HDF5 file, you also crunch the numbers that you will need, adding supplemental columns or groups to the HDF5 file that can be used later by you and your peers. If we have data that we will mostly be selecting (extracting portions) and which has already been crunched and grouped as needed, HDF5 is a very good fit. This is why one of the most common use cases that you see for HDF5 is the sharing and distribution of the processed data. However, if we have data that we need to process repeatedly, then we will either need to use another method besides the one that will cause all the data to be loaded into the memory, or find a better match for our data processing needs. We saw in the previous section that the selection of data is very fast in HDF5. What about calculating the mean for a small section of data? If we've got a total of 288 million rows, let's select a divisor of the number that gives us several hundred thousand rows at a time—2,81,250 rows, to be more precise. Let's get the mean for the first slice, as follows: In [40]: tab[0:281250]["temp"].mean() Out[40]: 0.0030696632864265312 This took about 1 second to calculate. What about iterating through the records in a similar fashion? Let's break up the 288 million records into chunks of the same size; this will result in 1024 chunks. 
We'll start by getting the ranges needed for an increment of 281,250 and then, we'll examine the first and the last row as a sanity check, as follows: In [41]: limit = 281250          ranges = [(x * limit, x * limit + limit)              for x in range(2 ** 10)]          (ranges[0], ranges[-1]) Out[41]: ((0, 281250), (287718750, 288000000)) Now, we can use these ranges to generate the mean for each chunk of 281,250 rows of temperature data and print the total number of means that we generated to make sure that we're getting our counts right, as follows: In [42]: means = [tab[x * limit:x * limit + limit]["temp"].mean()              for x in range(2 ** 10)]          len(means) Out[42]: 1024 Depending on your machine, that should take between 30 and 60 seconds. With this work done, it's now easy to calculate the mean for all of the 288 million points of temperature data: In [43]: sum(means) / len(means) Out[43]: -5.3051780413782918e-05 Through HDF5's efficient file format and implementation, combined with the splitting of our operations into tasks that would not copy the HDF5 data into memory, we were able to perform calculations across a significant fraction of a billion records in less than a minute. HDF5 even supports parallelization. So, this can be improved upon with a little more time and effort. However, there are many cases where HDF5 is not a practical choice. You may have some free-form data, and preprocessing it will be too expensive. Alternatively, the datasets may be actually too large to fit on a single machine. This is when you may consider using matplotlib with distributed data. Summary In this article, we covered the role of NumPy in the world of big data and matplotlib as well as the process and problems in working with large data sources. Also, we discussed the possible solutions to these problems using NumPy's memmap function and HDF5 and PyTables. Resources for Article: Further resources on this subject: First Steps [article] Introducing Interactive Plotting [article] The plot function [article]

Creating Themes for a Report using BIRT

Packt
17 Jul 2010
3 min read
Creating themes Using the power of stylesheets and libraries, one has the ability to apply general formatting to an entire report design using themes and reuse these among different reports. Themes provide a simple mechanism to apply a wide range of styles to an entire report design without the need to manually apply them. The following example will move the styles that we have created in our library and will show how to apply them to our report using a theme. For each of the styles we have created, select them under the Outline tab and choose Export to Library…. Choose the ClassicCarsLibrary.rptLibrary file. All of the styles will reside under the defaultTheme themes section, so select this from the drop-down list that appears next to the Theme option. Repeat these steps for all styles we have created in Customer Orders.rptDesign. Delete all of the styles stored in the Customer Orders.rptDesign file. You will notice all the styles disappear from the report designer. In the Outline tab, under the Customer Orders.rptDesign file, select the root element titled Customer Orders.rptDesign. Right-click the element and select Refresh Library. The library should already be shared since we built the report using the library's data source and datasets. If it is not, open the Resource Explorer, choose ClassicCarsLibrary.rptLibrary, right-click and choose Use Library. Under the Property Editor, change the Themes drop down to ClassicCarsLibrarydefaultTheme. When we apply the theme, we will see the detail table header automatically apply the style for table-header. Apply the remaining custom styles to the two columns in the customer information section and the order detail row. Now, we know that we can create several different themes by grouping styles together in libraries. So, when developing reports, you can create several different looks that can be applied to reports, simply by applying themes to reports with the help of libraries. Using external CSS stylesheets Another stylesheet feature is the ability to use external stylesheets and simply link to them. This works out very well when report documents are embedded into existing web portals by using the portals stylesheets to keep a consistent look and feel. This creates a sense of uniformity in the overall site. Imagine that our graphics designer gives us a CSS file and asks us to design our reports around it. There are two ways one can use CSS files in BIRT: Importing CSS files Using CSS as a resource In the following examples we are going to illustrate both scenarios. I have a CSS file containing six styles—five styles that are for predefined elements in reports and one style that is a custom style. The following is the CSS stylesheet for the given report: .page { background-color: #FFFFFF; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; line-height: 24px; color: #336699;}.table-group-footer-1 { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; line-height: 24px; color: #333333; background-color: #FFFFCC;}.title { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 24px; line-height: 40px; background-color: #99CC00; color: #003333; font-weight: bolder;}.table-header { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 20px; background-color: #669900; color: #FFFF33;}.table-footer { font-family: Arial, Helvetica, sans-serif; font-size: 14px; font-weight: bold; line-height: 22px; color: #333333; background-color: #CCFF99;}

Deep learning and regression analysis

Packt
09 Jan 2017
6 min read
In this article by Richard M. Reese and Jennifer L. Reese, authors of the book Java for Data Science, we will discuss how neural networks can be used to perform regression analysis. However, other techniques may offer a more effective solution. With regression analysis, we want to predict a result based on several input variables. (For more resources related to this topic, see here.) We can perform regression analysis using an output layer that consists of a single neuron that sums the weighted input plus bias of the previous hidden layer. Thus, the result is a single value representing the regression. Preparing the data We will use a car evaluation database to demonstrate how to predict the acceptability of a car based on a series of attributes. The file containing the data we will be using can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data. It consists of car data such as price, number of passengers, and safety information, and an assessment of its overall quality. It is this latter element that we will try to predict. The comma-delimited values in each attribute are shown next, along with substitutions. The substitutions are needed because the model expects numeric data:

Attribute | Original value | Substituted value
Buying price | vhigh, high, med, low | 3,2,1,0
Maintenance price | vhigh, high, med, low | 3,2,1,0
Number of doors | 2, 3, 4, 5-more | 2,3,4,5
Seating | 2, 4, more | 2,4,5
Cargo space | small, med, big | 0,1,2
Safety | low, med, high | 0,1,2

There are 1,728 instances in the file. The cars are marked with four classes:

Class | Number of instances | Percentage of instances | Original value | Substituted value
Unacceptable | 1210 | 70.023% | unacc | 0
Acceptable | 384 | 22.222% | acc | 1
Good | 69 | 3.99% | good | 2
Very good | 65 | 3.76% | v-good | 3

Setting up the class We start with the definition of a CarRegressionExample class, as shown next: public class CarRegressionExample { public CarRegressionExample() { try { ... } catch (IOException | InterruptedException ex) { // Handle exceptions } } public static void main(String[] args) { new CarRegressionExample(); } } Reading and preparing the data The first task is to read in the data. We will use the CSVRecordReader class to get the data: RecordReader recordReader = new CSVRecordReader(0, ","); recordReader.initialize(new FileSplit(new File("car.txt"))); DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 1728, 6, 4); With this dataset, we will split the data into two sets. Sixty-five percent of the data is used for training and the rest for testing: DataSet dataset = iterator.next(); dataset.shuffle(); SplitTestAndTrain testAndTrain = dataset.splitTestAndTrain(0.65); DataSet trainingData = testAndTrain.getTrain(); DataSet testData = testAndTrain.getTest(); The data now needs to be normalized: DataNormalization normalizer = new NormalizerStandardize(); normalizer.fit(trainingData); normalizer.transform(trainingData); normalizer.transform(testData); We are now ready to build the model. Building the model A MultiLayerConfiguration instance is created using a series of NeuralNetConfiguration.Builder methods. The following is the code used; we will discuss the individual methods following the code. Note that this configuration uses two layers.
The last layer uses the softmax activation function, which is used for regression analysis: MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder() .iterations(1000) .activation("relu") .weightInit(WeightInit.XAVIER) .learningRate(0.4) .list() .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) .backprop(true).pretrain(false) .build(); Two layers are created. The first is the input layer. The DenseLayer.Builder class is used to create this layer. The DenseLayer class is a feed-forward and fully connected layer. The created layer uses the six car attributes as input. The output consists of three neurons that are fed into the output layer; the layer is duplicated here for your convenience: .layer(0, new DenseLayer.Builder() .nIn(6).nOut(3) .build()) The second layer is the output layer created with the OutputLayer.Builder class. It uses a loss function as the argument of its constructor. The softmax activation function is used since we are performing regression, as shown here: .layer(1, new OutputLayer .Builder(LossFunctions.LossFunction .NEGATIVELOGLIKELIHOOD) .activation("softmax") .nIn(3).nOut(4).build()) Next, a MultiLayerNetwork instance is created using the configuration. The model is initialized, its listeners are set, and then the fit method is invoked to perform the actual training. The ScoreIterationListener instance will display information as the model trains, which we will see shortly in the output of this example. Its constructor argument specifies the frequency with which information is displayed: MultiLayerNetwork model = new MultiLayerNetwork(conf); model.init(); model.setListeners(new ScoreIterationListener(100)); model.fit(trainingData); We are now ready to evaluate the model. Evaluating the model In the next sequence of code, we evaluate the model against the test dataset. An Evaluation instance is created using an argument specifying that there are four classes. The test data is fed into the model using the output method. The eval method takes the output of the model and compares it against the test data classes to generate statistics. The getLabels method returns the expected values: Evaluation evaluation = new Evaluation(4); INDArray output = model.output(testData.getFeatureMatrix()); evaluation.eval(testData.getLabels(), output); out.println(evaluation.stats()); The output of the training follows, which is produced by the ScoreIterationListener class. However, the values you get may differ due to how the data is selected and analyzed.
Notice that the score improves with the iterations but levels out after about 500 iterations: 12:43:35.685 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 0 is 1.443480901811554 12:43:36.094 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 100 is 0.3259061845624861 12:43:36.390 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 200 is 0.2630572026049783 12:43:36.676 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 300 is 0.24061281470878784 12:43:36.977 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 400 is 0.22955121170274934 12:43:37.292 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 500 is 0.22249920540161677 12:43:37.575 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 600 is 0.2169898450109222 12:43:37.872 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 700 is 0.21271599814600958 12:43:38.161 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 800 is 0.2075677126088741 12:43:38.451 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 900 is 0.20047317735870715 This is followed by the results of the stats method as shown next. The first part reports on how examples are classified and the second part displays various statistics: Examples labeled as 0 classified by model as 0: 397 times Examples labeled as 0 classified by model as 1: 10 times Examples labeled as 0 classified by model as 2: 1 times Examples labeled as 1 classified by model as 0: 8 times Examples labeled as 1 classified by model as 1: 113 times Examples labeled as 1 classified by model as 2: 1 times Examples labeled as 1 classified by model as 3: 1 times Examples labeled as 2 classified by model as 1: 7 times Examples labeled as 2 classified by model as 2: 21 times Examples labeled as 2 classified by model as 3: 14 times Examples labeled as 3 classified by model as 1: 2 times Examples labeled as 3 classified by model as 3: 30 times ==========================Scores======================================== Accuracy: 0.9273 Precision: 0.854 Recall: 0.8323 F1 Score: 0.843 ======================================================================== The regression model does a reasonable job with this dataset. Summary In this article, we examined deep learning and regression analysis. We showed how to prepare the data and class, build the model, and evaluate the model. We used sample data and displayed output statistics to demonstrate the relative effectiveness of our model. Resources for Article: Further resources on this subject: KnockoutJS Templates [article] The Heart of It All [article] Bringing DevOps to Network Operations [article]

Did unfettered growth kill Maker Media? Financial crisis leads company to shutdown Maker Faire and lay off all staff

Savia Lobo
10 Jun 2019
5 min read
Updated: On July 10, 2019, Dougherty announced the relaunch of Maker Faire and Maker Media with the new name “Make Community“. Maker Media Inc., the company behind Maker Faire, the popular event that hosts arts, science, and engineering DIY projects for children and their parents, has laid off all its employees--22 employees--and have decided to shut down due to financial troubles. In January 2005, the company first started off with MAKE, an American bimonthly magazine focused on do it yourself and/or DIWO projects involving computers, electronics, robotics, metalworking, woodworking, etc. for both adults and children. In 2006, the company first held its Maker Faire event, that lets attendees wander amidst giant, inspiring art and engineering installations. Maker Faire now includes 200 owned and licensed events per year in over 40 countries. The Maker movement gained momentum and popularity when MAKE magazine first started publishing 15 years ago.  The movement emerged as a dominant source of livelihood as individuals found ways to build small businesses using their creative activity. In 2014, The WhiteHouse blog posted an article stating, “Maker Faires and similar events can inspire more people to become entrepreneurs and to pursue careers in design, advanced manufacturing, and the related fields of science, technology, engineering and mathematics (STEM).” With funding from the Department of Labor, “the AFL-CIO and Carnegie Mellon University are partnering with TechShop Pittsburgh to create an apprenticeship program for 21st-century manufacturing and encourage startups to manufacture domestically.” Recently, researchers from Baylor University and the University of North Carolina, in their research paper, have highlighted opportunities for studying the conditions under which the Maker movement might foster entrepreneurship outcomes. Dale Dougherty, Maker Media Inc.’s founder and CEO, told TechCrunch, “I started this 15 years ago and it’s always been a struggle as a business to make this work. Print publishing is not a great business for anybody, but it works…barely. Events are hard . . . there was a drop off in corporate sponsorship”. “Microsoft and Autodesk failed to sponsor this year’s flagship Bay Area Maker Faire”, TechCrunch reports. Dougherty further told that the company is trying to keep the servers running. “I hope to be able to get control of the assets of the company and restart it. We’re not necessarily going to do everything we did in the past but I’m committed to keeping the print magazine going and the Maker Faire licensing program”, he further added. In 2016, the company laid off 17 of its employees, followed by 8 employees recently in March. “They’ve been paid their owed wages and PTO, but did not receive any severance or two-week notice”, TechCrunch reports. These layoffs may have hinted the staff of the financial crisis affecting the company. Maker Media Inc. had raised $10 million from Obvious Ventures, Raine Ventures, and Floodgate. Dougherty says, “It started as a venture-backed company but we realized it wasn’t a venture-backed opportunity. The company wasn’t that interesting to its investors anymore. It was failing as a business but not as a mission. Should it be a non-profit or something like that? Some of our best successes, for instance, are in education.” The company has a huge public following for its products. Dougherty told TechCrunch that despite the rain, Maker Faire’s big Bay Area event last week met its ticket sales target. 
Also, about 1.45 million people attended its events in 2016. “MAKE: magazine had 125,000 paid subscribers and the company had racked up over one million YouTube subscribers. But high production costs in expensive cities and a proliferation of free DIY project content online had strained Maker Media”, writes TechCrunch. Dougherty told TechCrunch he has been overwhelmed by the support shown by the Maker community. As of now, licensed Maker Faire events around the world will proceed as planned. “Dougherty also says he’s aware of Oculus co-founder Palmer Luckey’s interest in funding the company, and a GoFundMe page started for it”, TechCrunch reports. Mike Senese, Executive Editor, MAKE magazine, tweeted, “Nothing but love and admiration for the team that I got to spend the last six years with, and the incredible community that made this amazing part of my life a reality.” https://twitter.com/donttrythis/status/1137374732733493248 https://twitter.com/xeni/status/1137395288262373376 https://twitter.com/chr1sa/status/1137518221232238592 Former Mythbusters co-host Adam Savage, who was a regular presence at the Maker Faire, told The Verge, “Make Media has created so many important new connections between people across the world. It showed the power from the act of creation. We are the better for its existence and I am sad. I also believe that something new will grow from what they built. The ground they laid is too fertile to lie fallow for long.” On July 10, 2019, Dougherty announced he’ll relaunch Maker Faire and Maker Media with the new name “Make Community“. The official launch of Make Community will supposedly be next week. The company is also working on a new issue of Make Magazine that is planned to be published quarterly and the online archives of its do-it-yourself project guides will remain available. Dougherty told TechCrunch “with the goal that we can get back up to speed as a business, and start generating revenue and a magazine again. This is where the community support needs to come in because I can’t fund it for very long.” GitHub introduces ‘Template repository’ for easy boilerplate code management and distribution 12 Visual Studio Code extensions that Node.js developers will love [Sponsored by Microsoft] Shoshana Zuboff on 21st century solutions for tackling the unique complexities of surveillance capitalism

Machine Learning Tasks

Packt
01 Apr 2016
16 min read
In this article written by David Julian, author of the book Designing Machine Learning Systems with Python, we will first introduce the basic machine learning tasks. Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification. (For more resources related to this topic, see here.) There are cases where we are not interested in discrete classes but rather a real number, for instance, a probability. These types of problems are regression problems. Both classification and regression require a training set of correctly labelled data. They are supervised learning problems. Originating from these basic machine learning tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to a prediction to establish a causal relationship. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain. We can also use a prediction model to find exceptions from a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even the search for extraterrestrial life. An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship but rather a group of instances that are different in ways that are important in the domain. For instance, establishing the subgroups smoker = true and family history = true for a target variable of heart disease = true. Finally, we consider control-type tasks. These act to optimize control settings to maximize a payoff given different conditions. This can be achieved in several ways. We can clone expert behavior: the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn about the relationship between conditions and optimal actions. Clustering, on the other hand, is the task of grouping items without any information on that group; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, which is an unsupervised task to find a certain type of pattern in the data. This task is behind movie recommender systems and the "customers who bought this also bought..." suggestions on the checkout pages of online stores.
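The differences between these task types are easier to see in code. The following is a small illustrative sketch that is not taken from the book; it assumes scikit-learn is available and uses synthetic data to place a classification task, a regression task, and an unsupervised clustering task side by side.

# Synthetic data; the models and numbers are purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(200, 2)

# Classification: predict a discrete label from the features.
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
classifier = LogisticRegression().fit(X, y_class)
print(classifier.predict([[0.5, 0.5]]))   # most likely class 1

# Regression: predict a real-valued target instead of a class.
y_real = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.randn(200) * 0.1
regressor = LinearRegression().fit(X, y_real)
print(regressor.predict([[0.5, 0.5]]))    # roughly 0.5

# Clustering: group instances with no labels at all (unsupervised).
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(clusterer.labels_[:10])

Both the classifier and the regressor needed a correctly labelled training set, while the clustering step worked from the similarity structure of the data alone, which is exactly the supervised/unsupervised distinction described above.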
Data for machine learning When considering raw data for machine learning applications, there are three separate aspects: The volume of the data The velocity of the data The variety of the data Data volume The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and one that we have more control over, is ensuring our algorithms are not wasting precious processing cycles on unnecessary tasks. Scalability is really about brute force, and throwing as much hardware at a problem as you can. With Moore's law, which predicts the trend of computer power doubling every two years and reaching its limit, it is clear that scalability is not, by its self, going to be able to keep pace with the ever increasing amounts of data. Simply adding more memory and faster processors is not, in many cases, going to be a cost effective solution. Parallelism is a growing area of machine learning, and it encompasses a number of different approaches from harnessing capabilities of multi core processors, to large scale distributed computing on many different platforms. Probably, the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries, and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source version, Hadoop. Data velocity The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies such as hard disk read and write times, and the time it takes to transmit data across a network. Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of streaming processing. When input data is at a velocity that makes it impossible to store in its entirety, a level of analysis is necessary as the data streams, in essence, deciding what data is useful and should be stored and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking at the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is becoming increasingly used in applications such as online gaming and stock market trading. It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load. 
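As a toy illustration of the parallelism point above (the same algorithm run many times with different parameter settings), the following sketch is not from the book or any particular framework: it uses Python's multiprocessing pool as a local stand-in for a cluster, and the model and parameter grid are invented for the example.

from multiprocessing import Pool

import numpy as np

def evaluate(alpha):
    # One unit of work: fit a closed-form ridge-style model for a single
    # regularization value and return its training error.
    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    w_true = np.arange(1.0, 6.0)
    y = X @ w_true + rng.randn(500) * 0.5
    w = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
    error = float(np.mean((X @ w - y) ** 2))
    return (alpha, error)

if __name__ == "__main__":
    alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
    with Pool(processes=4) as pool:
        results = pool.map(evaluate, alphas)
    best_alpha, best_error = min(results, key=lambda pair: pair[1])
    print("best alpha:", best_alpha)

Swapping the local pool for a distributed scheduler or a MapReduce-style job changes the plumbing but not the shape of the computation, which is why this pattern is so common when the data volume outgrows a single machine.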
Data variety Collecting data from different sources invariably means dealing with misaligned data structures, and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a pretty different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application than the one it was originally intended for. There is a huge variety of data formats and underlying platforms. Significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned such that each record consists of the same number of features and is measured in the same units. Models The goal in machine learning is not to just solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output, and in turn, changing its behavior to a state that is closer to a solution. A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as degree of similarity. In both cases, the process is iterative. It repeats a well-defined set of tasks, that moves the model closer to a correct hypothesis. There are many models and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem solving powers. However, it can also be a bit daunting for the designer to decide what is the best model, or models, for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project. There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data. The data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are. 
The distinction between grouping and grading is not absolute, and many models contain elements of both. Geometric models One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many of the geometric concepts, such as linear transformations, still apply in this hyper space. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster alike instances and form a decision boundary between them. Probabilistic models Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution. Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the down side, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data, and as we will see later, this problem can be mitigated by using them as ensemble learners. Linear models A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques such as support vector machines and neural networks. They can be applied to any predictive task such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variation in training data. This is because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making it less sensitive to nuanced data. This is described by the terms variance (for over fitting models) and bias (for under fitting models). A linear model is typically low variance and high bias. Linear models are generally best approached from a geometric perspective. 
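Before returning to the geometric view, the variance-versus-bias contrast just described can be made concrete with a small sketch; it is not from the book, assumes scikit-learn is available, and uses synthetic data. The same tree and linear models are refit on two noisy versions of one dataset, and the tree's predictions move around far more than the linear model's.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y_true = np.sin(X).ravel()

# Refit each model on two different noisy samples of the same curve and
# measure how far apart the two sets of predictions end up.
grid = np.linspace(0, 10, 50).reshape(-1, 1)
shifts = {}
for name, make_model in (("tree", DecisionTreeRegressor),
                         ("linear", LinearRegression)):
    predictions = []
    for seed in (2, 3):
        noisy = y_true + np.random.RandomState(seed).randn(100) * 0.3
        predictions.append(make_model().fit(X, noisy).predict(grid))
    shifts[name] = float(np.mean(np.abs(predictions[0] - predictions[1])))

print(shifts)   # the tree's shift is typically several times the linear model's

The unconstrained tree chases the noise in each sample (high variance), while the straight line barely moves but cannot follow the sine curve at all (high bias).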
We know we can easily plot two dimensions of space in a Cartesian co-ordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, but when we start speaking of n dimensions, a physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in a space. Model ensembles Ensemble techniques can be divided broadly into two types. The Averaging Method: With this method, several estimators are run independently, and their predictions are averaged. This includes the random forests and bagging methods. The Boosting Methods: With this method, weak learners are built sequentially using weighted distributions of the data, based on the error rates. Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is to not only build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant developmental bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This simplistic statement actually encompasses a very large and diverse space. Neural nets When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard wired into different parts of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks. 
Like in real brains, using a singular algorithm to describe how each neuron communicates with the other neurons around it allows artificial neural networks to be adaptable and able to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, is through the use of very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning. Human and animal learning occurs at a rate and depth that no machine can match. Many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved to possessing an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representation from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding bicycle. Resources for Article: Further resources on this subject: Python Data Structures [article] Exception Handling in MySQL for Python [article] Python Data Analysis Utilities [article]