
How-To Tutorials


Managing Nano Server with Windows PowerShell and Windows PowerShell DSC

Packt
05 Jul 2017
8 min read
In this article by Charbel Nemnom, the author of the book Getting Started with Windows Nano Server, we will cover the following topics:

- Remote server graphical tools
- Server manager
- Hyper-V manager
- Microsoft management console
- Managing Nano Server with PowerShell

Remote server graphical tools

Without a Graphical User Interface (GUI), it is not easy to carry out the daily management and maintenance of Windows Server. For this reason, Microsoft integrated Nano Server with all the existing graphical tools that you are familiar with, such as Hyper-V Manager, Failover Cluster Manager, Server Manager, Registry Editor, File Explorer, disk and device manager, server configuration, computer management, the users and groups console, and so on. All of those tools and consoles can be used to manage Nano Server remotely, and the GUI is often the easiest way to work. In this section, we will discuss how to access and set the most common configurations in Nano Server with remote graphical tools.

Server manager

Before we start managing Nano Server, we need to obtain the IP address or the computer name of the Nano Server instance, physical or virtual, that we want to connect to and manage remotely. Log in to your management machine and make sure you have installed the latest Remote Server Administration Tools (RSAT) for Windows Server 2016 or Windows 10. You can download the latest RSAT tools from the following link: https://www.microsoft.com/en-us/download/details.aspx?id=45520

Launch Server Manager, as shown in Figure 1, and add the Nano Server(s) that you would like to manage.

Figure 1: Managing Nano Server using server manager

You can refresh the view and browse all events and services as you would expect. Note that Best Practices Analyzer (BPA) is not supported on Nano Server. BPA is completely cmdlet-based and was written in C# back in the days of PowerShell 2.0. It also statically uses some .NET XML library code that was not part of the .NET Framework at that time. So, do not expect to see Best Practices Analyzer in Server Manager.

Hyper-V manager

The next console that you will probably want to access is Hyper-V Manager. Right-click the Nano Server name in Server Manager and select the Hyper-V Manager console, as shown in Figure 2.

Figure 2: Managing Nano Server using Hyper-V manager

Hyper-V Manager launches with full support, as you would expect when managing full Windows Server 2016 Hyper-V, the free Hyper-V Server, Server Core, and Nano Server with the Hyper-V role.

Microsoft management console

You can use the Microsoft Management Console (MMC) to manage Nano Server as well. From the command line, type mmc.exe. From the File menu, click Add/Remove Snap-in…, select Computer Management, and click Add. Choose Another computer and add the IP address or the computer name of your Nano Server machine. Click OK. As shown in Figure 3, you can expand System Tools and use the tools that you are familiar with (Event Viewer, Local Users and Groups, Shares, and Services). Please note that some of these MMC tools, such as Task Scheduler and Disk Management, cannot be used against Nano Server. Also, for certain tools you need to open some ports in the Windows firewall.

Figure 3: Managing Nano Server using Microsoft Management Console

Managing Nano Server with PowerShell

For most IT administrators, the graphical user interface is the easiest way to work. On the other hand, PowerShell enables fast and automated processes.
That's why, in Windows Server 2016, the Nano Server deployment option of Windows Server comes with full PowerShell remoting support. The purpose of the core PowerShell engine is to manage Nano Server instances at scale. PowerShell remoting includes DSC, the Windows Server cmdlets (network, storage, Hyper-V, and so on), remote file transfer, remote script authoring and debugging, and PowerShell Web Access.

Windows PowerShell version 5.1 on Nano Server supports new features such as the following:

- Copying files via PowerShell sessions
- Remote file editing in PowerShell ISE
- Interactive script debugging over a PowerShell session
- Remote script debugging within PowerShell ISE
- Remote host process connect and debug

PowerShell version 5.1 is available in different editions, which denote varying feature sets and platform compatibility: Desktop Edition, targeting Full Server, Server Core, and the Windows desktop, and Core Edition, targeting Nano Server and Windows IoT. You can find a list of the Windows PowerShell features not yet available in Nano Server in Microsoft's documentation. As Nano Server is still evolving, we will see what the next cadence update brings for the unavailable PowerShell features.

If you want to manage your Nano Server, you can use PowerShell remoting, or, if your Nano Server instance is running in a virtual machine, you can also use PowerShell Direct; more on that at the end of this section. In order to manage a Nano Server installation using PowerShell remoting, carry out the following steps:

You may need to start the WinRM service on your management machine to enable remote connections. From the PowerShell console, type the following command:

net start WinRM

If you want to manage Nano Server in a workgroup environment, open a PowerShell console and type the following command, substituting servername or IP with the right value (using your machine name is the easiest option, but if your device is not uniquely named on your network, you can use the IP address instead):

Set-Item WSMan:\localhost\Client\TrustedHosts -Value "servername or IP"

If you want to connect to multiple devices, you can use commas and quotation marks to separate each device:

Set-Item WSMan:\localhost\Client\TrustedHosts -Value "servername or IP, servername or IP"

You can also allow connections to a specific network subnet using the following command:

Set-Item WSMan:\localhost\Client\TrustedHosts -Value 10.10.100.*

To test Windows PowerShell remoting against Nano Server and check whether it is working, you can use the following command:

Test-WSMan -ComputerName "servername or IP" -Credential "servername\Administrator" -Authentication Negotiate

You can now start an interactive session with Nano Server. Open an elevated PowerShell console and type the following command:

Enter-PSSession -ComputerName "servername or IP" -Credential "servername\Administrator"

In the following example, we will create two virtual machines on a Nano Server Hyper-V host using PowerShell remoting.
From your management machine, open an elevated PowerShell console or the PowerShell scripting environment and run the following script (make sure to update the variables to match your environment):

#region Variables
$NanoSRV = 'NANOSRV-HV01'
$Cred = Get-Credential "Demo\SuperNano"
$Session = New-PSSession -ComputerName $NanoSRV -Credential $Cred
$CimSesion = New-CimSession -ComputerName $NanoSRV -Credential $Cred
$VMTemplatePath = 'C:\Temp'
$vSwitch = 'Ext_vSwitch'
$VMName = 'DemoVM-0'
#endregion

# Copying the VM template from the management machine to Nano Server
Get-ChildItem -Path $VMTemplatePath -Filter *.VHDX -Recurse |
    Copy-Item -ToSession $Session -Destination D:\

1..2 | ForEach-Object {
    New-VM -CimSession $CimSesion -Name $VMName$_ -VHDPath "D:\$VMName$_.vhdx" -MemoryStartupBytes 1024MB `
        -SwitchName $vSwitch -Generation 2
    Start-VM -CimSession $CimSesion -VMName $VMName$_ -Passthru
}

In this script, we are creating a PowerShell session and a CIM session to Nano Server. A CIM session is a client-side object representing a connection to a local or remote computer. We then copy the VM templates from the management machine to Nano Server over PowerShell remoting; when the copy is completed, we create two Generation 2 virtual machines and finally start them. After a couple of seconds, you can launch the Hyper-V Manager console and see the new VMs running on the Nano Server host, as shown in Figure 4.

Figure 4: Creating virtual machines on a Nano Server host using PowerShell remoting

If you have installed Nano Server in a virtual machine running on a Hyper-V host, you can use PowerShell Direct to connect directly from your Hyper-V host to your Nano Server VM, without any network connection, by using the following command:

Enter-PSSession -VMName <VMName> -Credential .\Administrator

So, instead of specifying the computer name, we specified the VM name. PowerShell Direct is very powerful; it is one of my favorite features. You can configure a bunch of VMs from scratch in just a couple of seconds without any network connection. Moreover, if you have Nano Server running as a Hyper-V host, as shown in the example earlier, you can use PowerShell remoting first to connect to Nano Server from your management machine, and then leverage PowerShell Direct to manage the virtual machines running on top of Nano Server. In this example, we used two PowerShell technologies (PS remoting and PS Direct). This is very powerful and opens up many possibilities for managing Nano Server effectively. To do that, you can use the following commands:

#region Variables
$NanoSRV = 'NANOSRV-HV01'   # Nano Server name or IP address
$DomainCred = Get-Credential "Demo\SuperNano"
$VMLocalCred = Get-Credential "~\Administrator"
$Session = New-PSSession -ComputerName $NanoSRV -Credential $DomainCred
#endregion

Invoke-Command -Session $Session -ScriptBlock {
    Get-VM
    Invoke-Command -VMName (Get-VM).Name -Credential $Using:VMLocalCred -ScriptBlock {
        hostname
        tzutil /g
    }
}

In this script, we created a PowerShell session to the Nano Server physical host, and then we used PowerShell Direct to list all the VMs, including their hostnames and time zones. The result is shown in Figure 5.

Figure 5: Nested PowerShell remoting

Summary

In this article, we discussed how to manage a Nano Server installation using remote server graphical tools and Windows PowerShell remoting.

Resources for Article:

Further resources on this subject:

- Exploring Windows PowerShell 5.0 [article]
- Exchange Server 2010 Windows PowerShell: Mailboxes and Reports [article]
- Exchange Server 2010 Windows PowerShell: Managing Mailboxes [article]


Azure Feature Pack

Packt
05 Jul 2017
9 min read
In this article by Christian Cote, Matija Lah, and Dejan Sarka, the authors of the book SQL Server 2016 Integration Services Cookbook, we will see how to install the Azure Feature Pack, which in turn installs the Azure control flow tasks and data flow components, and we will also see how to use the Fuzzy Lookup transformation for identity mapping.

In the early years of SQL Server, Microsoft introduced a tool to help developers and database administrators (DBAs) interact with data: Data Transformation Services (DTS). The tool was very primitive compared to SSIS, and it mostly relied on ActiveX and T-SQL to transform the data. SSIS 1.0 appeared in 2005. The tool was a game changer in the ETL world at the time; it was a professional and (pretty much) reliable tool for 2005. The 2008/2008 R2 versions were much the same as 2005, in the sense that they didn't add much functionality, but they made the tool more scalable. In 2012, Microsoft enhanced SSIS in many ways. They rewrote the package XML to ease source control integration and make package code easier to read. They also greatly enhanced the way packages are deployed by using an SSIS catalog in SQL Server. Having the catalog in SQL Server gives us execution reports and many views that provide access to metadata or meta-process information in our projects. Version 2014 didn't add anything for SSIS. Version 2016 brings another set of features, as you will see; we now also have the possibility to integrate with big data.

Business intelligence projects many times reveal previously unseen issues with the quality of the source data. Dealing with data quality includes data quality assessment, or data profiling, data cleansing, and maintaining high quality over time. In SSIS, the Data Profiling task helps you find unclean data. The Data Profiling task is not like the other tasks in SSIS because it is not intended to be run over and over again through a scheduled operation. Think about SSIS being the wrapper for this tool: you use the SSIS framework to configure and run the Data Profiling task, and then you observe the results through the separate Data Profile Viewer. The output of the Data Profiling task will be used to help you in your development and design of the ETL and dimensional structures in your solution. Periodically, you may want to rerun the Data Profiling task to see how the data has changed, but the package you develop will not include the task in the overall recurring ETL process.

Azure tasks and transforms

The Azure ecosystem is becoming predominant in the Microsoft ecosystem, and SSIS has not been left behind over the past few years. The Azure Feature Pack is not an SSIS 2016-specific feature; it is also available for SSIS versions 2012 and 2014. It is worth mentioning that it appeared in July 2015, a few months before the SSIS 2016 release.

Getting ready

This section assumes that you have installed SQL Server Data Tools 2015.

How to do it...

We'll start SQL Server Data Tools and open the CustomLogging project, if not already done. In the SSIS toolbox, scroll to the Azure group. Since the Azure tools are not installed with SSDT, the Azure group is disabled in the toolbox; the tools must be downloaded using a separate installer. Click on the Azure group to expand it and click on Download Azure Feature Pack, as shown in the following screenshot. Your default browser opens, and the Microsoft SQL Server 2016 Integration Services Feature Pack for Azure page opens.
Click on Download, as shown in the following screenshot. From the popup that appears, select both the 32-bit and 64-bit versions. The 32-bit version is necessary for SSIS package development, since SSDT is a 32-bit program. Click Next, as shown in the following screenshot. As shown in the following screenshot, the files are downloaded.

Once the download completes, run one of the downloaded installers. The following screen appears; in this case, the 32-bit (x86) version is being installed. Click Next to start the installation process. As shown in the following screenshot, check the box near I accept the terms in the License Agreement and click Next. The installation then starts. The following screen appears once the installation is completed; click Finish to close the screen. Install the other feature pack you downloaded.

If SSDT is open, close it. Start SSDT again and open the CustomLogging project. In the Azure group in the SSIS toolbox, you should now see the Azure tasks, as in the following screenshot.

Using SSIS fuzzy components

SSIS includes two really sophisticated matching transformations in the data flow. The Fuzzy Lookup transformation is used for mapping identities, and the Fuzzy Grouping transformation is used for de-duplicating. Both of them use the same algorithm for comparing strings and other data. Identity mapping and de-duplication are actually the same problem. For example, instead of mapping the identities of entities in two tables, you can union all of the data in a single table and then do the de-duplication. Or, vice versa, you can join a table to itself and then do identity mapping instead of de-duplication.

Getting ready

This recipe assumes that you have successfully finished the previous recipe.

How to do it…

In SSMS, create a new table in the DQS_STAGING_DATA database in the dbo schema and name it dbo.FuzzyMatchingResults. Use the following code:

CREATE TABLE dbo.FuzzyMatchingResults
(
  CustomerKey INT NOT NULL PRIMARY KEY,
  FullName NVARCHAR(200) NULL,
  StreetAddress NVARCHAR(200) NULL,
  Updated INT NULL,
  CleanCustomerKey INT NULL
);

Switch to SSDT and continue editing the DataMatching package. Add a Fuzzy Lookup transformation below the NoMatch Multicast transformation. Rename it FuzzyMatches and connect it to the NoMatch Multicast transformation with the regular data flow path. Double-click the transformation to open its editor. On the Reference Table tab, select the connection manager you want to use to connect to your DQS_STAGING_DATA database and select the dbo.CustomersClean table. Do not store a new index or use an existing index.

When the package executes the transformation for the first time, it copies the reference table, adds a key with an integer data type to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the transformation input columns; the transformation then uses these tokens in the lookup operation. The match index is a table in a SQL Server database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleansing.

Click the Columns tab. Delete the mapping between the two CustomerKey columns. Clear the check box next to the CleanCustomerKey input column.
Select the check box next to the CustomerKey lookup column, and rename the output alias for this column to CleanCustomerKey. You are replacing the original column with the one retrieved during the lookup. Your mappings should resemble those shown in the following screenshot.

Click the Advanced tab. Raise the Similarity threshold to 0.50 to reduce the matching search space; with a similarity threshold of 0.00, you would get a full cross join. Click OK.

Drag the Union All transformation below the Fuzzy Lookup transformation. Connect it to an output of the Match Multicast transformation and an output of the FuzzyMatches Fuzzy Lookup transformation. You will combine the exact and approximate matches in a single row set.

Drag an OLE DB Destination below the Union All transformation. Rename it FuzzyMatchingResults and connect it to the Union All transformation. Double-click it to open the editor. Connect to your DQS_STAGING_DATA database and select the dbo.FuzzyMatchingResults table. Click the Mappings tab, then click OK. The completed data flow is shown in the following screenshot.

You need to add restartability to your package; you will truncate all the destination tables. Click the Control Flow tab. Drag an Execute T-SQL Statement task above the data flow task. Connect the tasks with the green precedence constraint from the Execute T-SQL Statement task to the data flow task; the Execute T-SQL Statement task must finish successfully before the data flow task starts. Double-click the Execute T-SQL Statement task, use the connection manager for your DQS_STAGING_DATA database, enter the following code in the T-SQL statement textbox, and then click OK:

TRUNCATE TABLE dbo.CustomersDirtyMatch;
TRUNCATE TABLE dbo.CustomersDirtyNoMatch;
TRUNCATE TABLE dbo.FuzzyMatchingResults;

Save the solution. Execute your package in debug mode to test it. Review the results of the Fuzzy Lookup transformation in SSMS. Look for rows for which the transformation did not find a match, and for any incorrect matches. Use the following code:

-- Not matched
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey IS NULL;
-- Incorrect matches
SELECT * FROM FuzzyMatchingResults
WHERE CleanCustomerKey <> CustomerKey * (-1);

You can use the following code to clean up the AdventureWorksDW2014 and DQS_STAGING_DATA databases:

USE AdventureWorksDW2014;
DROP TABLE IF EXISTS dbo.Chapter05Profiling;
DROP TABLE IF EXISTS dbo.AWCitiesStatesCountries;
USE DQS_STAGING_DATA;
DROP TABLE IF EXISTS dbo.CustomersCh05;
DROP TABLE IF EXISTS dbo.CustomersCh05DQS;
DROP TABLE IF EXISTS dbo.CustomersClean;
DROP TABLE IF EXISTS dbo.CustomersDirty;
DROP TABLE IF EXISTS dbo.CustomersDirtyMatch;
DROP TABLE IF EXISTS dbo.CustomersDirtyNoMatch;
DROP TABLE IF EXISTS dbo.CustomersDQSMatch;
DROP TABLE IF EXISTS dbo.DQSMatchingResults;
DROP TABLE IF EXISTS dbo.DQSSurvivorshipResults;
DROP TABLE IF EXISTS dbo.FuzzyMatchingResults;

When you are done, close SSMS and SSDT.

SQL Server Data Quality Services (DQS) is a knowledge-driven data quality solution. This means that it requires you to maintain one or more knowledge bases (KBs). In a KB, you maintain all knowledge related to a specific portion of data, for example, customer data. The idea of Data Quality Services is to mitigate the cleansing process: while the amount of time you need to spend on cleansing decreases, you achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, and so on.
You don't throw away this knowledge; you store it and use it to find and correct the same issues automatically during your next cleansing process.

Summary

We have seen how to install the Azure Feature Pack, the Azure control flow tasks and data flow components, and the Fuzzy Lookup transformation.

Resources for Article:

Further resources on this subject:

- Building A Recommendation System with Azure [article]
- Introduction to Microsoft Azure Cloud Services [article]
- Windows Azure Service Bus: Key Features [article]


OpenDaylight Fundamentals

Packt
05 Jul 2017
14 min read
In this article by Jamie Goodyear, Mathieu Lemay, Rashmi Pujar, Yrineu Rodrigues, Mohamed El-Serngawy, and Alexis de Talhouët, the authors of the book OpenDaylight Cookbook, we will be covering the following recipes:

- Connecting OpenFlow switches
- Mounting a NETCONF device
- Browsing data models with Yang UI

OpenDaylight is a collaborative platform supported by leaders in the networking industry and hosted by the Linux Foundation. The goal of the platform is to enable the adoption of software-defined networking (SDN) and create a solid base for network functions virtualization (NFV).

Connecting OpenFlow switches

OpenFlow is a vendor-neutral standard communications interface defined to enable the interaction between the control and forwarding channels of an SDN architecture. The OpenFlow plugin project intends to support implementations of the OpenFlow specification as it evolves. It currently supports OpenFlow versions 1.0 and 1.3.2. In addition to supporting the core OpenFlow specification, OpenDaylight Beryllium also includes preliminary support for the Table Type Patterns and OF-CONFIG specifications. The OpenFlow southbound plugin currently provides the following components:

- Flow management
- Group management
- Meter management
- Statistics polling

Let's connect an OpenFlow switch to OpenDaylight.

Getting ready

This recipe requires an OpenFlow switch. If you don't have any, you can use a mininet-vm with OvS installed. You can download mininet-vm from https://github.com/mininet/mininet/wiki/Mininet-VM-Images; any version should work. The following recipe will be presented using a mininet-vm with OvS 2.0.2.

How to do it...

Start the OpenDaylight distribution using the karaf script. Using this script will give you access to the karaf CLI:

$ ./bin/karaf

Install the user-facing feature responsible for pulling in all dependencies needed to connect an OpenFlow switch:

opendaylight-user@root>feature:install odl-openflowplugin-all

It might take a minute or so to complete the installation.

Connect an OpenFlow switch to OpenDaylight. We will use mininet-vm as our OpenFlow switch, as this VM runs an instance of OpenVSwitch. Log in to mininet-vm using the username mininet and the password mininet. Let's create a bridge:

mininet@mininet-vm:~$ sudo ovs-vsctl add-br br0

Now let's connect OpenDaylight as the controller of br0:

mininet@mininet-vm:~$ sudo ovs-vsctl set-controller br0 tcp:${CONTROLLER_IP}:6633

Let's look at our topology:

mininet@mininet-vm:~$ sudo ovs-vsctl show
0b8ed0aa-67ac-4405-af13-70249a7e8a96
    Bridge "br0"
        Controller "tcp:${CONTROLLER_IP}:6633"
            is_connected: true
        Port "br0"
            Interface "br0"
                type: internal
    ovs_version: "2.0.2"

${CONTROLLER_IP} is the IP address of the host running OpenDaylight. We're establishing a TCP connection.

Have a look at the created OpenFlow node. Once the OpenFlow switch is connected, send the following request to get information regarding the switch:

Type: GET
Headers: Authorization: Basic YWRtaW46YWRtaW4=
URL: http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/

This will list all the nodes under the opendaylight-inventory subtree of the MD-SAL, which stores the OpenFlow switch information. As we connected our first switch, we should have only one node there. It will contain all the information the OpenFlow switch has, including its tables, its ports, flow statistics, and so on.
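The request above is plain RESTCONF over HTTP, so it can be sent from any HTTP client, not just a browser or REST plugin. The following is a minimal, illustrative Scala sketch (not part of the original recipe) that performs this GET using only the JDK; the admin/admin credentials are the OpenDaylight defaults encoded in the Authorization header shown above, and the object name is just an example.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import scala.io.Source

object ListOpenFlowNodes {
  def main(args: Array[String]): Unit = {
    // Basic-auth value for the default admin/admin credentials
    val auth = "Basic " + Base64.getEncoder.encodeToString(
      "admin:admin".getBytes(StandardCharsets.UTF_8))
    val url = new URL("http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    conn.setRequestProperty("Authorization", auth)
    // Print the inventory subtree returned by RESTCONF
    println(Source.fromInputStream(conn.getInputStream, "UTF-8").mkString)
    conn.disconnect()
  }
}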
How it works...

Once the feature is installed, OpenDaylight listens for connections on ports 6633 and 6640. Setting up the controller on the OpenFlow-capable switch will immediately trigger a callback on OpenDaylight. It will create the communication pipeline between the switch and OpenDaylight so they can communicate in a scalable and non-blocking way.

Mounting a NETCONF device

The OpenDaylight component responsible for connecting remote NETCONF devices is called the NETCONF southbound plugin, also known as the netconf-connector. Creating an instance of the netconf-connector will connect a NETCONF device. The NETCONF device will be seen as a mount point in the MD-SAL, exposing the device's configuration and operational datastores and its capabilities. These mount points allow applications and remote users (over RESTCONF) to interact with the mounted devices. The netconf-connector currently supports RFC 6241, RFC 5277, and RFC 6022. The following recipe will explain how to connect a NETCONF device to OpenDaylight.

Getting ready

This recipe requires a NETCONF device. If you don't have any, you can use the NETCONF test tool provided by OpenDaylight. It can be downloaded from the OpenDaylight Nexus repository: https://nexus.opendaylight.org/content/repositories/opendaylight.release/org/opendaylight/netconf/netconf-testtool/1.0.4-Beryllium-SR4/netconf-testtool-1.0.4-Beryllium-SR4-executable.jar

How to do it...

Start the OpenDaylight karaf distribution using the karaf script. Using this script will give you access to the karaf CLI:

$ ./bin/karaf

Install the user-facing feature responsible for pulling in all dependencies needed to connect a NETCONF device:

opendaylight-user@root>feature:install odl-netconf-topology odl-restconf

It might take a minute or so to complete the installation.

Start your NETCONF device. If you want to use the NETCONF test tool, it is time to simulate a NETCONF device using the following command:

$ java -jar netconf-testtool-1.0.1-Beryllium-SR4-executable.jar --device-count 1

This will simulate one device that will be bound to port 17830.

Configure a new netconf-connector. Send the following request using RESTCONF:

Type: PUT
URL: http://localhost:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/new-netconf-device

By looking closer at the URL, you will notice that the last part is new-netconf-device. This must match the node-id that we will define in the payload.

Headers:
Accept: application/xml
Content-Type: application/xml
Authorization: Basic YWRtaW46YWRtaW4=

Payload:

<node>
  <node-id>new-netconf-device</node-id>
  <host>127.0.0.1</host>
  <port>17830</port>
  <username>admin</username>
  <password>admin</password>
  <tcp-only>false</tcp-only>
</node>

Let's have a closer look at this payload:

- node-id: Defines the name of the netconf-connector.
- host: Defines the IP address of the NETCONF device.
- port: Defines the port for the NETCONF session.
- username: Defines the username of the NETCONF session. This should be provided by the NETCONF device configuration.
- password: Defines the password of the NETCONF session. As with the username, this should be provided by the NETCONF device configuration.
- tcp-only: Defines whether the NETCONF session should use TCP or SSL. If set to true, it will use TCP.

This is the default configuration of the netconf-connector; it actually has more configurable elements, which are covered a little later in this recipe. Once you have completed the request, send it. This will spawn a new netconf-connector that connects to the NETCONF device at the provided IP address and port using the provided credentials.
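If you prefer to script this step rather than using a REST client, the PUT above can be sent programmatically. The following Scala sketch is only an illustration (it is not part of the original recipe): it assumes the <node> payload shown above has been saved to a file named new-netconf-device.xml, and it reuses the admin/admin Basic-auth header from the recipe.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util.Base64

object CreateNetconfConnector {
  def main(args: Array[String]): Unit = {
    val auth = "Basic " + Base64.getEncoder.encodeToString(
      "admin:admin".getBytes(StandardCharsets.UTF_8))
    // The <node> payload shown above, saved next to this program (an assumption for this sketch)
    val payload = new String(Files.readAllBytes(Paths.get("new-netconf-device.xml")), StandardCharsets.UTF_8)
    val url = new URL("http://localhost:8181/restconf/config/network-topology:network-topology/" +
      "topology/topology-netconf/node/new-netconf-device")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    conn.setRequestProperty("Authorization", auth)
    conn.setRequestProperty("Content-Type", "application/xml")
    conn.setRequestProperty("Accept", "application/xml")
    val out = conn.getOutputStream
    out.write(payload.getBytes(StandardCharsets.UTF_8))
    out.close()
    // A 2xx response code means the netconf-connector was created or replaced
    println(s"HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}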
Verify that the netconf-connector has correctly been pushed and get information about the connected NETCONF device. First, you can look at the log to see whether any error occurred. If no error has occurred, you will see something like this:

2016-05-07 11:37:42,470 | INFO | sing-executor-11 | NetconfDevice | 253 - org.opendaylight.netconf.sal-netconf-connector - 1.3.0.Beryllium | RemoteDevice{new-netconf-device}: Netconf connector initialized successfully

Once the new netconf-connector is created, some useful metadata is written into the MD-SAL's operational datastore under the network-topology subtree. To retrieve this information, you should send the following request:

Type: GET
Headers: Authorization: Basic YWRtaW46YWRtaW4=
URL: http://localhost:8181/restconf/operational/network-topology:network-topology/topology/topology-netconf/node/new-netconf-device

We're using new-netconf-device as the node-id because this is the name we assigned to the netconf-connector in a previous step. This request will provide information about the connection status and device capabilities. The device capabilities are all the yang models the NETCONF device provides in its hello-message, which was used to create the schema context.

More configuration for the netconf-connector

As mentioned previously, the netconf-connector contains various configuration elements. Those fields are non-mandatory and have default values. If you do not wish to override any of these values, you shouldn't provide them.

- schema-cache-directory: This corresponds to the destination schema repository for yang files downloaded from the NETCONF device. By default, those schemas are saved in the cache directory ($ODL_ROOT/cache/schema). Using this configuration will define where to save the downloaded schemas relative to the cache directory. For instance, if you assigned new-schema-cache, schemas related to this device would be located under $ODL_ROOT/cache/new-schema-cache/.
- reconnect-on-changed-schema: If set to true, the connector will auto disconnect/reconnect when schemas are changed in the remote device. The netconf-connector will subscribe to base NETCONF notifications and listen for the netconf-capability-change notification. The default value is false.
- connection-timeout-millis: Timeout in milliseconds after which the connection must be established. The default value is 20000 milliseconds.
- default-request-timeout-millis: Timeout for blocking operations within transactions. Once this timer is reached, if the request is not yet finished, it will be canceled. The default value is 60000 milliseconds.
- max-connection-attempts: Maximum number of connection attempts. A non-positive or null value is interpreted as infinity. The default value is 0, which means it will retry forever.
- between-attempts-timeout-millis: Initial timeout in milliseconds between connection attempts. This will be multiplied by the sleep-factor for every new attempt. The default value is 2000 milliseconds.
- sleep-factor: Back-off factor used to increase the delay between connection attempts. The default value is 1.5.
- keepalive-delay: The netconf-connector sends keep-alive RPCs while the session is idle to ensure session connectivity. This delay specifies the timeout between keep-alive RPCs in seconds. Providing a value of 0 will disable this mechanism. The default value is 120 seconds.
Using this configuration, your payload would look like this:

<node>
  <node-id>new-netconf-device</node-id>
  <host>127.0.0.1</host>
  <port>17830</port>
  <username>admin</username>
  <password>admin</password>
  <tcp-only>false</tcp-only>
  <schema-cache-directory>new_netconf_device_cache</schema-cache-directory>
  <reconnect-on-changed-schema>false</reconnect-on-changed-schema>
  <connection-timeout-millis>20000</connection-timeout-millis>
  <default-request-timeout-millis>60000</default-request-timeout-millis>
  <max-connection-attempts>0</max-connection-attempts>
  <between-attempts-timeout-millis>2000</between-attempts-timeout-millis>
  <sleep-factor>1.5</sleep-factor>
  <keepalive-delay>120</keepalive-delay>
</node>

How it works...

Once the request to connect a new NETCONF device is sent, OpenDaylight will set up the communication channel used for managing and interacting with the device. At first, the remote NETCONF device will send its hello-message, defining all of the capabilities it has. Based on this, the netconf-connector will download all the YANG files provided by the device. All those YANG files will define the schema context of the device.

At the end of the process, some exposed capabilities might end up as unavailable, for two possible reasons:

- The NETCONF device provided a capability in its hello-message but hasn't provided the schema.
- ODL failed to mount a given schema due to YANG violation(s). OpenDaylight parses YANG models as per RFC 6020; if a schema does not respect the RFC, it could end up as an unavailable capability.

If you encounter one of these situations, looking at the logs will pinpoint the reason for such a failure.
There's more...

Once the NETCONF device is connected, all its capabilities are available through the mount point. View it as a pass-through directly to the NETCONF device.

Get datastore

To see the data contained in the device datastore, use the following request:

Type: GET
Headers: Authorization: Basic YWRtaW46YWRtaW4=
URL: http://localhost:8080/restconf/config/network-topology:network-topology/topology/topology-netconf/node/new-netconf-device/yang-ext:mount/

Adding yang-ext:mount/ to the URL will access the mount point created for new-netconf-device. This will show the configuration datastore. If you want to see the operational one, replace config with operational in the URL.

If your device defines a yang model, you can access its data using the following request:

Type: GET
Headers: Authorization: Basic YWRtaW46YWRtaW4=
URL: http://localhost:8080/restconf/config/network-topology:network-topology/topology/topology-netconf/node/new-netconf-device/yang-ext:mount/<module>:<container>

The <module> represents a schema defining the <container>. The <container> can either be a list or a container. It is not possible to access a single leaf. You can access containers/lists within containers/lists. The last part of the URL would look like this:

…/yang-ext:mount/<module>:<container>/<sub-container>

Invoke RPC

In order to invoke an RPC on the remote device, you should use the following request:

Type: POST
Headers:
Accept: application/xml
Content-Type: application/xml
Authorization: Basic YWRtaW46YWRtaW4=
URL: http://localhost:8080/restconf/config/network-topology:network-topology/topology/topology-netconf/node/new-netconf-device/yang-ext:mount/<module>:<operation>

This URL is accessing the mount point of new-netconf-device, and through this mount point we're accessing the <module> to call its <operation>. The <module> represents a schema defining the RPC, and <operation> represents the RPC to call.

Delete a netconf-connector

Removing a netconf-connector will drop the NETCONF session, and all resources will be cleaned up. To perform such an operation, use the following request:

Type: DELETE
Headers: Authorization: Basic YWRtaW46YWRtaW4=
URL: http://localhost:8181/restconf/config/network-topology:network-topology/topology/topology-netconf/node/new-netconf-device

By looking closer at the URL, you can see that we are removing the NETCONF node-id new-netconf-device.

Browsing data models with Yang UI

Yang UI is a user interface application through which one can navigate among all the yang models available in the OpenDaylight controller. Not only does it aggregate all data models, it also enables their usage. Using this interface, you can create, remove, update, and delete any part of the model-driven datastore. It provides a nice, smooth user interface, making it easier to browse through the model(s). This recipe will guide you through those functionalities.

Getting ready

This recipe only requires the OpenDaylight controller and a web browser.

How to do it...

Start your OpenDaylight distribution using the karaf script. Using this client will give you access to the karaf CLI:

$ ./bin/karaf

Install the user-facing feature responsible for pulling in all dependencies needed to use Yang UI:

opendaylight-user@root>feature:install odl-dlux-yangui

It might take a minute or so to complete the installation.

Navigate to http://localhost:8181/index.html#/yangui/index and log in with the username admin and the password admin. Once logged in, all modules will be loading until you see this message at the bottom of the screen: Loading completed successfully. You should see the API tab listing all yang models in the following format:

<module-name> rev.<revision-date>

For instance:

cluster-admin rev.2015-10-13
config rev.2013-04-05
credential-store rev.2015-02-26

By default, there isn't much you can do with the provided yang models. So, let's connect an OpenFlow switch to better understand how to use this Yang UI. Once done, refresh your web page to load the newly added modules. Look for opendaylight-inventory rev.2013-08-19 and select the operational tab, as nothing will yet be in the config datastore. Then click on nodes and you'll see a request bar at the bottom of the page with multiple options. You can either copy the request to the clipboard to use it in your browser, send it, show a preview of it, or define a custom API request. For now, we will only send the request. You should see Request sent successfully, and under this message should be the retrieved data. As we only have one switch connected, there is only one node, and all the switch's operational information is now printed on your screen. You could do the same request by specifying the node-id in the request. To do that, you will need to expand nodes and click on node {id}, which will enable a more fine-grained search.

How it works...

OpenDaylight has a model-driven architecture, which means that all of its components are modeled using YANG. While installing features, OpenDaylight loads YANG models, making them available within the MD-SAL datastore. Yang UI is a representation of this datastore. Each schema represents a subtree based on the name of the module and its revision-date. Yang UI aggregates and parses all those models. It also acts as a REST client; through its web interface we can execute functions such as GET, POST, PUT, and DELETE.
There's more…

The example shown previously can be improved upon, as there was no user yang model loaded. For instance, if you mount a NETCONF device containing its own yang model, you could interact with it through Yang UI. You would use the config datastore to push/update some data, and you would see the operational datastore updated accordingly. In addition, accessing your data would be much easier than having to define the exact URL.

See also

Using API doc as a REST API client.

Summary

Throughout this article, we learned recipes such as connecting OpenFlow switches, mounting a NETCONF device, and browsing data models with Yang UI.

Resources for Article:

Further resources on this subject:

- Introduction to SDN - Transformation from legacy to SDN [article]
- The OpenFlow Controllers [article]


Object-Oriented Scala

Packt
05 Jul 2017
9 min read
In this article by Md. Rezaul Karim and Sridhar Alla, the authors of Scala and Spark for Big Data Analytics, we will discuss the basic object-oriented features in Scala. In a nutshell, the following topic will be covered in this article:

- Variables in Scala

Variables in Scala

Before entering the depths of OOP features, we first need to know the details of the different types of variables and data types in Scala. To declare a variable in Scala, you need to use the var or val keyword. The formal syntax of declaring a variable in Scala is as follows:

val or var VariableName : DataType = Initial_Value

For example, let's see how we can declare two variables whose data types are explicitly specified, as shown:

var myVar : Int = 50
val myVal : String = "Hello World! I've started learning Scala."

You can even declare a variable without specifying the data type. For example, let's see how to declare a variable using var or val:

var myVar = 50
val myVal = "Hello World! I've started learning Scala."

There are two types of variables in Scala, mutable and immutable, which can be defined as follows:

- Mutable: The ones whose values you can change later
- Immutable: The ones whose values you cannot change once they have been set

In general, the var keyword is used for declaring a mutable variable, while the val keyword is used for specifying an immutable variable. To see an example of using mutable and immutable variables, let's consider the following code segment:

package com.chapter3.OOP

object VariablesDemo {
  def main(args: Array[String]) {
    var myVar : Int = 50
    val myVal : String = "Hello World! I've started learning Scala."
    myVar = 90
    myVal = "Hello world!"
    println(myVar)
    println(myVal)
  }
}

The preceding code works fine until myVar = 90, since myVar is a mutable variable. However, if you try to change the value of the immutable variable (that is, myVal), as shown earlier, your IDE will show a compilation error saying reassignment to val, as follows:

Figure 1: Reassignment of immutable variables is not allowed in Scala

Don't worry about the object and method in the preceding code! We will discuss classes, methods, and objects later in this article, and then things will get clearer.

Variable scope

In Scala, variables can have three different scopes, depending on the place where you have declared them:

- Fields: Fields are variables belonging to an instance of a class in your Scala code. Fields are, therefore, accessible from inside every method in the object. However, depending on the access modifiers, fields can be accessible to instances of other classes. As discussed earlier, object fields can be mutable or immutable (based on the declaration type, using either var or val); however, they can't be both at the same time.
- Method arguments: When a method is called, these variables can be used to pass values into the method. Method parameters are accessible only from inside the method; however, the objects that are passed in may be accessible from the outside. It is to be noted that method parameters/arguments are always immutable, no matter what keyword is specified.
- Local variables: These variables are declared inside a method and are accessible only from inside the method itself. However, the calling code can access the returned value.

A short sketch illustrating these three scopes follows.
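The following small, self-contained sketch (an illustration added here, not code from the book) shows all three: count is a field, by is a method argument, and newCount is a local variable.

class Counter {
  // Field: visible to every method of this instance (mutable because it is a var)
  private var count: Int = 0

  // 'by' is a method argument: always immutable, visible only inside the method
  def increment(by: Int): Int = {
    val newCount = count + by   // local variable, visible only inside this method
    count = newCount
    count                       // the calling code sees only the returned value
  }
}

object ScopesDemo {
  def main(args: Array[String]): Unit = {
    val c = new Counter
    println(c.increment(5))   // prints 5
    println(c.increment(3))   // prints 8
  }
}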
Reference versus value immutability

According to the earlier section, val is used to declare immutable variables, so can we change the values of these variables? Also, will it be similar to the final keyword in Java? To help us understand more about this, we will use this code snippet:

scala> var testVar = 10
testVar: Int = 10

scala> testVar = testVar + 10
testVar: Int = 20

scala> val testVal = 6
testVal: Int = 6

scala> testVal = testVal + 10
<console>:12: error: reassignment to val
       testVal = testVal + 10
               ^
scala>

If you run the preceding code, you will notice an error at compilation time, telling you that you are trying to reassign to a val variable. In general, mutable variables bring a performance advantage. The reason is that this is closer to how the computer behaves, and introducing immutable values forces the computer to create a whole new instance of an object whenever a change (no matter how small) to that instance is required.

Data types in Scala

As mentioned, Scala is a JVM language, so it shares lots of commonalities with Java. One of these commonalities is the data types; Scala shares the same data types as Java, with the same memory footprint and precision. Objects are almost everywhere in Scala and all data types are objects; you can call methods on them, as illustrated in the following table:

1   Byte: An 8-bit signed value; range is from -128 to 127
2   Short: A 16-bit signed value; range is from -32768 to 32767
3   Int: A 32-bit signed value; range is from -2147483648 to 2147483647
4   Long: A 64-bit signed value; range is from -9223372036854775808 to 9223372036854775807
5   Float: A 32-bit IEEE 754 single-precision float
6   Double: A 64-bit IEEE 754 double-precision float
7   Char: A 16-bit unsigned Unicode character; ranges from U+0000 to U+FFFF
8   String: A sequence of Chars
9   Boolean: Either the literal true or the literal false
10  Unit: Corresponds to no value
11  Null: Null or empty reference
12  Nothing: The subtype of every other type; it includes no values
13  Any: The supertype of any type; any object is of the Any type
14  AnyRef: The supertype of any reference type

Table 1: Scala data types, description, and range

All the data types listed in the preceding table are objects. However, note that there are no primitive types as in Java. This means that you can call methods on an Int, Long, and so on:

val myVal = 20
// use the println method to print it to the console; you will also notice that it will be inferred as Int
println(myVal + 10)
val myVal = 40
println(myVal * "test")

Now you can start playing around with these variables. Let's get some ideas on how to initialize a variable and work with type annotations.

Variable initialization

In Scala, it's good practice to initialize variables once declared. However, it is to be noted that uninitialized variables aren't necessarily nulls (consider types such as Int, Long, Double, and Char), and initialized variables aren't necessarily non-null (for example, val s: String = null). The actual reasons are the following:

- In Scala, types are inferred from the assigned value. This means that a value must be assigned for the compiler to infer the type (how should the compiler consider this code: val a? Since a value isn't given, the compiler can't infer the type; since it can't infer the type, it wouldn't know how to initialize it).
- In Scala, most of the time you'll use val; since these are immutable, you wouldn't be able to declare them and initialize them afterward.

Although the Scala language requires that you initialize your instance variable before using it, Scala does not provide a default value for your variable.
Instead, you have to set its value manually using the wildcard underscore, which acts like a default value, as follows:

var name:String = _

Instead of using names such as val1 and val2, you can define your own names:

scala> val result = 6 * 5 + 8
result: Int = 38

You can even use these names in subsequent expressions, as follows:

scala> 0.5 * result
res0: Double = 19.0

Type annotations

If you used the val or var keyword to declare a variable, its data type will be inferred automatically according to the value that you assigned to it. You also have the luxury of explicitly stating the data type of the variable at declaration time:

val myVal : Integer = 10

Now, let's see some other aspects that will be needed while working with variables and data types in Scala. We will see how to work with type ascription and lazy variables.

Type ascriptions

Type ascription is used to tell the compiler what type you expect out of an expression, from all the possible valid types. Consequently, a type is valid if it respects the existing constraints, such as variance and type declarations, and it is either one of the types the expression it applies to "is a", or there is a conversion that applies in scope. So, technically, java.lang.String extends java.lang.Object; therefore, any String is also an Object. Consider the following example:

scala> val s = "Ahmed Shadman"
s: String = Ahmed Shadman

scala> val p = s:Object
p: Object = Ahmed Shadman

scala>

Lazy val

The main characteristic of a lazy val is that the bound expression is not evaluated immediately, but once, on the first access. Here lies the main difference between val and lazy val. When the initial access happens, the expression is evaluated and the result is bound to the identifier of the lazy val. On subsequent accesses, no further evaluation occurs; instead, the stored result is returned immediately. Let's look at an interesting example:

scala> lazy val num = 1 / 0
num: Int = <lazy>

If you run the preceding code in the Scala REPL, you will note that it runs very well without giving any error, even though you divided an integer by 0! Let's look at a better example:

scala> val x = {println("x"); 20}
x
x: Int = 20

scala> x
res1: Int = 20

scala>

This works, and you can access the value of the x variable when required later on. These are just a few examples of using lazy val concepts. Interested readers should visit https://blog.codecentric.de/en/2016/02/lazy-vals-scala-look-hood/ for more details.

Summary

Structure your code in a sane way with classes and traits, enhance the reusability of your code with generics, and create a project with standard and widespread tools. Improve on the basics to learn how Scala implements the OO paradigm to allow the building of modular software systems.

Resources for Article:

Further resources on this subject:

- Spark for Beginners [article]
- Getting Started with Apache Spark [article]
- Spark – Architecture and First Program [article]


DevOps Concepts and Assessment Framework

Packt
05 Jul 2017
21 min read
In this article by Mitesh Soni, the author of the book DevOps Bootcamp, we will discuss how to get a quick understanding of DevOps from 10,000 feet, with real-world examples of how to prepare for changing a culture. This will allow us to build the foundation of the DevOps concepts by discussing what our goals are, as well as getting buy-in from organization management. Basically, we will try to cover DevOps practices that can make application lifecycle management easy and effective.

It is very important to understand that DevOps is not a framework, tool, or technology. It is more about the culture of an organization. It is also the way people work in an organization, using defined processes and utilizing automation tools to make daily work more effective and less manual.

To understand the basic importance of DevOps, we will cover the following topics in this article:

- Need for DevOps
- How DevOps culture can evolve
- Importance of PPT - People, Process, and Technology
- Why DevOps is not all about tools
- DevOps assessment questions

Need for DevOps

There is a famous quote by Harriet Tubman, which you can find at http://harriettubmanbiography.com. It says:

Every great dream begins with a dreamer. Always remember, you have within you the strength, the patience, and the passion to reach for the stars to change the world

Change is the law of life, and that is also applicable to organizations. If an organization or individual looks only at past or present patterns, culture, or practices, then they are certain to miss the future best practices. In the dynamic IT world, we need to keep pace with the evolution of technology. We can relate to George Bernard Shaw's saying:

Progress is impossible without change, and those who cannot change their minds cannot change anything.

Here, we are focusing on changing the way we manage the application lifecycle. The important question is whether we really need this change. Do we really need to go through the pain of this change? The answer is yes. One may say that such a change in business or culture must not be forced, and that is agreed; so let's first understand the pain points faced by organizations in application lifecycle management in the modern world.

Considering the changing patterns and competitive environment in business, it is the need of the hour to improve application lifecycle management. Are there any factors in these modern times that can help us improve application lifecycle management? Yes. Cloud computing has changed the game. It has opened doors for many path-breaking solutions and innovations. Let's understand what cloud computing is, and then we will see an overview of DevOps and how the cloud is useful in DevOps.

Overview of Cloud Computing

Cloud computing is a type of computing that provides multi-tenant or dedicated computing resources, such as compute, storage, and network, which are delivered to cloud consumers on demand. It comes in different flavors, which include cloud deployment models and cloud service models. The most important aspect is the way its pricing model works: pay as you go.
Cloud deployment models describe the way cloud resources are deployed:

- Private Cloud: Cloud resources deployed behind the firewall and on premises, exclusively for a specific organization
- Public Cloud: Cloud resources available to all organizations and individuals
- Community Cloud: Cloud resources available to a specific set of organizations that share similar types of interests or requirements
- Hybrid Cloud: Cloud resources that combine two or more deployment models

Cloud service models describe the way cloud resources are made available to cloud consumers. They can be in the form of pure infrastructure, where virtual machines are accessible and controlled by the cloud consumer or end user, that is, Infrastructure as a Service (IaaS); a platform, where runtime environments are provided and the installation and configuration of all the software needed to run the application is managed by the cloud service provider, that is, Platform as a Service (PaaS); or Software as a Service (SaaS), where the whole application is made available by the cloud service provider, with responsibility for the infrastructure and platform remaining with the cloud service provider. Many service models have emerged during the last few years, but IaaS, PaaS, and SaaS are based on the National Institute of Standards and Technology (NIST) definition.

Cloud computing has a few significant characteristics, such as multi-tenancy; pay-as-you-use billing, similar to an electricity or gas connection; on-demand self-service; resource pooling for better utilization of compute, storage, and network resources; rapid elasticity for scaling resources up and down based on needs in an automated fashion; and measured service for billing.

Over the years, the usage of different cloud deployment models has varied based on use cases. Initially, Public Cloud was used for applications that were considered non-critical, while Private Cloud was used for critical applications where security was a major concern. Hybrid and Public Cloud usage evolved over time, with experience and confidence in the services provided by cloud service providers. Similarly, the usage of different cloud service models has varied based on use cases and flexibility. IaaS was the most popular in the early days, but PaaS is catching up in its maturity and ease of use, with enterprise capabilities.

Overview of DevOps

DevOps is all about the culture of an organization, processes, and technology to develop communication and collaboration between the development and IT operations teams in order to manage the application lifecycle more effectively than the existing ways of doing it. We often tend to work based on patterns, to find reusable solutions for similar kinds of problems or challenges. Over the years, achievements, failed experiments, best practices, automation scripts, configuration management tools, and methodologies become an integral part of the culture. This helps to define practices for a way of designing, a way of developing, a way of testing, a way of setting up resources, a way of managing environments, a way of configuration management, a way of deploying an application, a way of gathering feedback, a way of code improvements, and a way of doing innovations. Several visible benefits can be achieved by implementing DevOps practices.
DevOps culture is considered as innovative package to integrate Dev and Ops team in effective manner that includes components such as Continuous Build Integration, Continuous Testing, Cloud Resource Provisioning, Continuous Delivery, Continuous Deployment, Continuous Monitoring, Continuous Feedback, Continuous Improvement, and Continuous Innovation to make application delivery faster as per the demand of Agile methodology. However, it is not only about development and operations team that are involved. Testing team, Business Analysts, Build Engineers, Automation team, Cloud Team, and many other stakeholders are involved in this exercise of evolving existing culture. DevOps culture is not much different than the Organization culture which has shared values and behavioral aspect. It needs adjustment in mindsets and processes to align with new technology and tools. Challenges for Development and Operations Team There are some challenges why this scenario has occurred and that is why DevOps is going in upward direction and talk of the town in all Information Technology related discussions. Challenges for the Development Team Developers are enthusiastic and willing to adopt new technologies and approaches to solve problems. However they face many challenges including below: The competitive market creates pressure of on-time delivery They have to take care of production-ready code management and new feature implementation The release cycle is often long and hence the development team has to make assumptions before the application deployment finally takes place. In such a scenario, it takes more time to fix the issues that occurred during deployment in the staging or production environment Challenges for the Operations Team Operations team is always careful in changing resources or using any new technologies or new approaches as they want stability. However they face many challenges including below: Resource contention: It's difficult to handle increasing resource demands Redesigning or tweaking: This is needed to run the application in the production environment Diagnosing and rectifying: They are supposed to diagnose and rectify issues after application deployment in isolation Considering all the challenges faced by development and operations team, how should we improve existing processes, make use of automation tools to make processes more effective, and change people's mindset? Let's see in the next section on how to evolve DevOps culture in the organization and improve efficiency and effectiveness. How DevOps culture can evolve? Inefficient estimation, long time to market, and other issues led to a change in the waterfall model, resulting in the agile model. Evolving a culture is not a time bound or overnight process. It can be a step by step and stage wise process that can be achieved without dependencies on the other stages. We can achieve Continuous Integration without Cloud Provisioning. We can achieve Cloud Provisioning without Configuration Management. We can achieve Continuous Testing without any other DevOps practices. Following are different types of stages to achieve DevOps practices. Agile Development Agile development or the agile based methodology are useful for building an application by empowering individuals and encouraging interactions, giving importance to working software, customer collaboration—using feedback for improvement in subsequent steps—and responding to change in efficient manner. 
One of the most attractive benefits of agile development is continuous delivery in short time frames or, in agile terms, sprints. Thus, the agile approach of application development, improvement in technology, and disruptive innovations and approaches have created a gap between development and operations teams. DevOps DevOps attempts to fill these gaps by developing a partnership between the development and operations teams. The DevOps movement emphasizes communication, collaboration, and integration between software developers and IT operations. DevOps promotes collaboration, and collaboration is facilitated by automation and orchestration in order to improve processes. In other words, DevOps essentially extends the continuous development goals of the agile movement to continuous integration and release. DevOps is a combination of agile practices and processes leveraging the benefits of cloud solutions. Agile development and testing methodologies help us meet the goals of continuously integrating, developing, building, deploying, testing, and releasing applications. Build Automation An automated build helps us create an application build using build automation tools such as Gradle, Apache Ant and Apache Maven. An automated build process includes the activities such as Compiling source code into class files or binary files, Providing references to third-party library files, Providing the path of configuration files, Packaging class files or binary files into Package files, Executing automated test cases, Deploying package files on local or remote machines and Reducing manual effort in creating the package file. Continuous Integration In simple words, Continuous Integration or CI is a software engineering practice where each check-in made by a developer is verified by either of the following: Pull mechanism: Executing an automated build at a scheduled time and Push mechanism: Executing an automated build when changes are saved in the repository. This step is followed by executing a unit test against the latest changes available in the source code repository. Continuous integration is a popular DevOps practice that requires developers to integrate code into a code repositories such as Git and SVN multiple times a day to verify integrity of the code. Each check-in is then verified by an automated build, allowing teams to detect problems early. Cloud Provisioning Cloud provisioning has opened the door to treat Infrastructure as a Code and that makes the entire process extremely efficient and effective as we are automating process that involved manual intervention to a huge extent. Pay as you go billing model has made required resources more affordable to not only large organizations but also to mid and small scale organizations as well as individuals. It helps to go for improvements and innovations as earlier resource constraints were blocking organizations to go for extra mile because of cost and maintenance. Once we have agility in infrastructure resources then we can think of automating installation and configuration of packages that are required to run the application. Configuration Management Configuration management (CM) manages changes in the system or, to be more specific, the server run time environment. There are many tools available in the market with which we can achieve configuration management. Popular tools are Chef, Puppet, Ansible, Salt, and so on. Let's consider an example where we need to manage multiple servers with same kind of configuration. 
For example, we need to install Tomcat on each server. What if we need to change the port on all servers or update some packages or provide rights to some users? Any kind of modification in this scenario is a manual and, if so, error-prone process. As the same configuration is being used for all the servers, automation can be useful here. Continuous Delivery Continuous Delivery and Continuous Deployment are used interchangeably. However, there is a small difference between them. Continuous delivery is a process of deploying an application in any environment in an automated fashion and providing continuous feedback to improve its quality. Automated approach may not change in Continuous Delivery and Continuous Deployment. Approval process and some other minor things can change. Continuous Testing and Deployment Continuous Testing is a very important phase of end to end application lifecycle management process. It involves functional testing, performance testing, security testing and so on. Selenium, Appium, Apache JMeter, and many other tools can be utilized for the same. Continuous deployment, on the other hand, is all about deploying an application with the latest changes to the production environment. Continuous Monitoring Continuous monitoring is a backbone of end-to-end delivery pipeline, and open source monitoring tools are like toppings on an ice cream scoop. It is desirable to have monitoring at almost every stage in order to have transparency about all the processes, as shown in the following diagram. It also helps us troubleshoot quickly. Monitoring should be a well thought-out implementation of a plan. Let's try to depict entire process as continuous approach in the diagram below. We need to understand here that it is a phased approach and it is not necessary to automate every phase of automation at once. It is more effective to take one DevOps practice at a time, implement it and realize its benefit before implementing another one. This way we are safe enough to assess the improvements of changing culture in the organization and remove manual efforts from the application lifecycle management. Importance of PPT – People, Process, and Technology PPT is an important word in any organization. Wait! We are not talking about Powerpoint Presentation. Here, we are focusing on People, Processes, and Tools / Technology. Let's understand why and how they are important in changing culture of any organization. People As per the famous quote from Jack Canfield : Successful people maintain a positive focus in life no matter what is going on around them. They stay focused on their past successes rather than their past failures, and on the next action steps they need to take to get them closer to the fulfillment of their goals rather than all the other distractions that life presents to them. Curious question can be, why People matter? In one sentence, if we try to answer it then it would be: Because We are trying to change Culture. So? People are important part of any culture and only people can drive the change or change themselves to adapt to new processes or defining new processes and to learn new tools or technologies. Let's understand how and why with “Formula for Change“. David Gleicher created the “Formula for Change” in early 1960s as per references available in Wikipedia. Kathie Dannemiller refined it in 1980. This formula provides a model to assess the relative strengths affecting the possible success of organisational change initiatives. 
Gleicher (original) version: C = (ABD) > X, where: C = change, A = the status quo dissatisfaction, B = a desired clear state, D = is practical steps to the desired state, X = the cost of the change. Dannemiller version: D x V x F > R; where D, V, and F must be present for organizational change to take place where: D = Dissatisfaction with how things are now; V = Vision of what is possible; F = First, concrete steps that can be taken towards the vision; If the product of these three factors is greater than R = Resistance then change is possible. Essentially, it implies that there has to be strong Dissatisfaction with existing things or processes, Vision of what is possible with new trends, technologies, and innovations with respect to market scenario; concrete steps that can be taken towards achieving the vision. For More Details on 'Formula for change' you can visit this wiki page : https://en.wikipedia.org/wiki/Formula_for_change#cite_note-myth-1 If it comes to sharing an experience, I would say it is very important to train people to adopt new culture. It is a game of patience. We can't change mindset of people overnight and we need to understand first before changing the culture. Often I see Job Opening with a DevOps knowledge or DevOps Engineers and I feel that they should not be imported but people should be trained in the existing environment with Changing things gradually to manage resistance. We don't need special DevOps team, we need more communication and collaboration between developers, test teams, automation enablers, and cloud or infrastructure team. It is essential for all to understand pain points of each other. In number of organization I have worked, we used to have COE (Center of Excellence) in place to manage new technologies, innovations or culture. As an automation enabler and part of DevOps team, we should be working as facilitator only and not a part of silo. Processes Here is a famous quote from Tom Peters which says : Almost all quality improvement comes via simplification of design, manufacturing… layout, processes, and procedures Quality is extremely important when we are dealing with evolving a culture. We need processes and policies for doing things in proper way and standardized across the projects so sequence of operations, constraints, rules and so on are well defined to measure success. We need to set processes for following things: Agile Planning Resource Planning & Provisioning Configuration Management Role based Access Control to Cloud resources and other tools used in Automation Static Code Analysis – Rules for Programming Languages Testing Methodology and Tools  Release Management These processes are also important for measuring success in the process of evolving DevOps culture. Technology Here is a famous quote from Steve Jobs which says: Technology is nothing. What's important is that you have a faith in people, that they're basically good and smart, and if you give them tools, they'll do wonderful things with them Technology helps people and organizations to bring creativity and innovations while changing the culture. Without Technology, it is difficult to achieve speed and effectiveness in the daily and routine automation operations. Cloud Computing, Configuration Management tools, and Build Pipeline are among few that is useful in resource provisioning, installing runtime environment, and orchestration. Essentially, it helps to speed up different aspects of application lifecycle management. 
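To make the build-automation and delivery ideas above a little more concrete, here is a minimal, hypothetical shell sketch of the kind of script a CI server such as Jenkins might run on every commit. It assumes a Maven-based Java project; the project directory, artifact name, server hostname (deploy.example.com), and paths are placeholders, not values from this book.

#!/usr/bin/env bash
# Hypothetical build-and-deploy helper. The project location, artifact name,
# host name, and target path below are placeholders.
set -euo pipefail

APP_DIR="$HOME/projects/sample-app"    # assumed Maven project
DEPLOY_HOST="deploy.example.com"       # placeholder staging server
DEPLOY_PATH="/opt/sample-app"          # placeholder install directory

cd "$APP_DIR"

# Build automation: compile, run unit tests, and package the application.
mvn clean package

# Continuous delivery step: copy the packaged artifact to the target server
# and restart the service. A CI server would run these commands automatically
# on every commit instead of a person running them by hand.
scp target/sample-app.war "$DEPLOY_HOST:$DEPLOY_PATH/"
ssh "$DEPLOY_HOST" "sudo systemctl restart sample-app"

In a fuller pipeline, stages for static code analysis, integration tests, and environment provisioning would sit between the build and the deploy, which is exactly the sequence the practices above describe.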
Why DevOps is not all about Tools

Tools on their own are not the deciding factor in changing the culture of an organization. The reason is simple: no matter which technology we use, we will still perform Continuous Integration, Cloud Provisioning, Configuration Management, Continuous Delivery, Continuous Deployment, Continuous Monitoring, and so on. Different tool sets can be used in each category, but they all do similar things; only the way a tool performs an operation differs, and the outcome is the same. The following are some tools based on the categories:

Category: Tools
Build Automation: NAnt, MSBuild, Maven, Ant, Gradle
Repository: Git, SVN
Static Code Analysis: Sonar, PMD
Continuous Integration: Jenkins, Atlassian Bamboo, VSTS
Configuration Management: Chef, Puppet, Ansible, Salt
Cloud Platforms: AWS, Microsoft Azure
Cloud Management Tool: RightScale
Application Deployment: Shell Scripts, Plugins
Functional Testing: Selenium, Appium
Load Testing: Apache JMeter
Repositories: Artifactory, Nexus, Fabric

Let's see how different tools can be useful at different stages and for different operations. This may change based on the number of environments or the number of DevOps practices we follow in different organizations. If we need to categorize tools based on different DevOps best practices, we can also categorize them as open source or commercial. The following are just sample examples, with one column per orchestration product:

Components: Open Source / IBM UrbanCode / Electric Cloud
Build Tools: Ant or Maven or MSBuild / Ant or Maven or MSBuild / Ant or Maven or MSBuild
Code Repositories: Git or Subversion / Git or Atlassian Stash or Subversion or StarTeam / Git or Subversion or StarTeam
Code Analysis Tools: Sonar / Sonar / Sonar
Continuous Integration: Jenkins / Jenkins or Atlassian Bamboo / Jenkins or ElectricAccelerator
Continuous Delivery: Chef / Artifactory and IBM UrbanCode Deploy / ElectricFlow

In this book, we will focus on the open source category as well as commercial tools. We will use Jenkins and Visual Studio Team Services for all the major automation and orchestration related activities.

DevOps Assessment Questions

DevOps is a culture, and we are very much aware of that fact. However, before implementing automation, putting processes in place, and evolving the culture, we need to understand the existing state of the organization's culture and whether we need to introduce new processes or automation tools. We need to be very clear that the goal is to make the existing culture more efficient rather than to import a culture. Accommodating an assessment framework is difficult, but we will try to provide some questions and hints that make it easier to create one. Create categories for which we want to ask questions and get responses for a specific application. A few sample questions:

Do you follow Agile principles such as Scrum or Kanban?
Do you use any tool to keep track of Scrum or Kanban?
What is the normal sprint duration (2 weeks or 3 weeks)?
Is there a definitive and explicit definition of done for all phases of work?
Are you using any source code repository? Which source code repository do you use?
Are you using any build automation tool such as Ant, Maven, or Gradle?
Are you using any custom script for build automation?
Do you have Android and iOS based applications?
Are you using any tools for static code analysis?
Are you using multiple environments for application deployment for different teams, such as Dev, Test, Stage, pre-prod, and prod?
Are you using on-premise infrastructure or cloud-based infrastructure?
Are you using any configuration management tool or script for installing application packages or the runtime environment?
Are you using any automated scripts to deploy applications in prod and non-prod environments?
Are you using manual approval before an application release to any specific environment?
Are you using any orchestration tool or script for application lifecycle management?
Are you using automation tools for functional testing, load testing, security testing, and mobile testing?
Are you using any tools for application and infrastructure monitoring?
How are defects logged, triaged, and prioritized for resolution?
Are you using notification services to let stakeholders know about the status of application lifecycle management?

Once the questions are ready, prepare responses, and based on those responses decide a rating for each answer. Make the framework flexible, so that even if we change any question in any category it is handled automatically. Once ratings are given, capture the responses and calculate overall ratings by introducing different conditions and intelligence into the framework. Create category-wise final ratings and build different kinds of charts from the final ratings to improve their readability. The important thing to note here is the significance of the organization's expertise in each area of application lifecycle management. It gives the assessment framework a new dimension to add intelligence and makes it more effective.

Summary

In this article, we have set many goals to achieve throughout this book. We have covered Continuous Integration, resource provisioning in the Cloud environment, Configuration Management, Continuous Delivery, Continuous Deployment, and Continuous Monitoring.

Setting goals is the first step in turning the invisible into the visible. - Tony Robbins

We have seen how Cloud Computing has changed the way innovation was perceived earlier and how feasible it has become now. We have also covered the need for DevOps and all the different DevOps practices in brief. People, Processes, and Technology are also important in this whole process of changing the existing culture of an organization, and we touched on the reasons why. Tools are important but not the show stopper; any toolset can be utilized, and changing a culture doesn't need a specific set of tools. We have also discussed the DevOps Assessment Framework in brief. It will help to get going on the path of changing culture.

Resources for Article:
Further resources on this subject:
Introduction to DevOps [article]
DevOps Tools and Technologies [article]
Command Line Tools for DevOps [article]
SQL Server basics
Packt
05 Jul 2017
14 min read
In this article by Jasmin Azemović, author of the book SQL Server 2017 for Linux, we will cover a basic overview of SQL Server and learn about backup. Linux, or to be precise GNU/Linux, is one of the best alternatives to Windows; in many cases, it is the first choice of environment for daily tasks such as system administration, running different kinds of services, or simply as a desktop environment.

Linux's native working interface is the command line. Yes, KDE and GNOME are great graphical user interfaces, and from a user's perspective clicking is much easier than typing; but this observation is relative. GUI is something that changed the perception of modern IT and computer usage. Some tasks are very difficult without a mouse, but not impossible. On the other hand, the command line lets you solve some tasks quicker, more efficiently, and better than a GUI. You don't believe me? Imagine these situations and try to implement them through your favorite GUI tool:

In a folder of 1000 files, copy only those with names that start with A, end with Z, and have the .txt extension
Rename 100 files at the same time
Redirect console output to a file

There are many such examples; in each of them, Command Prompt is superior, and Linux Bash even more so.

Microsoft SQL Server is considered to be one of the most commonly used systems for database management in the world. This popularity has been gained by a high degree of stability, security, and business intelligence and integration functionality. Microsoft SQL Server for Linux is a database server that accepts queries from clients, evaluates them, and then internally executes them to deliver results to the client. The client is an application that produces queries and, through a database provider and communication protocol, sends requests to the server and retrieves the result for client-side processing and/or presentation.

(For more resources related to this topic, see here.)

Overview of SQL Server

When writing queries, it's important to understand that the interaction between the tool of choice and the database is based on client-server architecture, and to understand the processes that are involved. It's also important to understand which components are available and what functionality they provide. With a broader understanding of the full product and its components and tools, you'll be able to make better use of its functionality and also benefit from using the right tool for specific jobs.

Client-server architecture concepts

In a client-server architecture, the client is described as a user and/or device, and the server as a provider of some kind of service.

Figure: SQL Server client-server communication

As you can see in the preceding figure, the client is represented as a machine, but in reality it can be anything:

Custom application (desktop, mobile, web)
Administration tool (SQL Server Management Studio, dbForge, sqlcmd…)
Development environment (Visual Studio, KDevelop…)

SQL Server Components

Microsoft SQL Server consists of many different components to serve a variety of organizational needs of their data platform. Some of these are:

Database Engine is the relational database management system (RDBMS), which hosts databases and processes queries to return results of structured, semi-structured, and non-structured data in online transactional processing solutions (OLTP).

Analysis Services is the online analytical processing engine (OLAP) as well as the data mining engine.
OLAP is a way of building multi-dimensional data structures for fast and dynamic analysis of large amounts of data, allowing users to navigate hierarchies and dimensions to reach granular and aggregated results to achieve a comprehensive understanding of business values. Data mining is a set of tools used to predict and analyse trends in data behaviour and much more. Integration Services supports the need to extract data from sources, transform it, and load it in destinations (ETL) by providing a central platform that distributes and adjusts large amounts of data between heterogeneous data destinations. Reporting Services is a central platform for delivery of structured data reports and offers a standardized, universal data model for information workers to retrieve data and model reports without the need of understanding the underlying data structures. Data Quality Services (DQS) is used to perform a variety data cleaning, correction and data quality tasks, based on knowledge base. DQS is mostly used in ETL process before loading DW. R services (advanced analytics) is a new service that actually incorporate powerful R language for advanced statistic analytics. It is part of database engine and you can combine classic SQL code with R scripts. While writing this book, only one service was actually available in SQL Server for Linux and its database engine. This will change in the future and you can expect more services to be available. How it works on Linux? SQL Server is a product with a 30-year-long history of development. We are speaking about millions of lines of code on a single operating system (Windows). The logical question is how Microsoft successfully ports those millions of lines of code to the Linux platform so fast. SQL Server@Linux, officially became public in the autumn of 2016. This process would take years of development and investment. Fortunately, it was not so hard. From version 2005, SQL Server database engine has a platform layer called SQL Operating system (SOS). It is a setup between SQL Server engine and the Windows operating systems. The main purpose of SOS is to minimize the number of system calls by letting SQL Server deal with its own resources. It greatly improves performance, stability and debugging process. On the other hand, it is platform dependent and does not provide an abstraction layer. That was the first big problem for even start thinking to make Linux version. Project Drawbridge is a Microsoft research project created to minimize virtualization resources when a host runs many VM on the same physical machine. The technical explanation goes beyond the scope of this book (https://www.microsoft.com/en-us/research/project/drawbridge/). Drawbridge brings us to the solution of the problem. Linux solution uses a hybrid approach, which combines SOS and Liberty OS from Drawbridge project to create SQL PAL (SQL Platform Abstraction Layer). This approach creates a set of SOS API calls which does not require Win32 or NT calls and separate them from platform depended code. This is a dramatically reduced process of rewriting SQL Server from its native environment to a Linux platform. This figure gives you a high-level overview of SQL PAL( https://blogs.technet.microsoft.com/dataplatforminsider/2016/12/16/sql-server-on-linux-how-introduction/). SQL PAL architecture Retrieving and filtering data Databases are one of the cornerstones of modern business companies. 
Data retrieval is usually made with SELECT statement and is therefore very important that you are familiar with this part of your journey. Retrieved data is often not organized in the way you want them to be, so they require additional formatting. Besides formatting, accessing very large amount of data requires you to take into account the speed and manner of query execution which can have a major impact on system performance Databases usually consist of many tables where all data are stored. Table names clearly describe entities whose data are stored inside and therefore if you need to create a list of new products or a list of customers who had the most orders, you need to retrieve those data by creating a query. A query is an inquiry into the database by using the SELECT statement which is the first and most fundamental SQL statement that we are going to introduce in this chapter. SELECT statement consists of a set of clauses that specifies which data will be included into query result set. All clauses of SQL statements are the keywords and because of that will be written in capital letters. Syntactically correct SELECT statement requires a mandatory FROM clause which specifies the source of the data you want to retrieve. Besides mandatory clauses, there are a few optional ones that can be used to filter and organize data: INTO enables you to insert data (retrieved by the SELECT clause) into a different table. It is mostly used to create table backup. WHERE places conditions on a query and eliminates rows that would be returned by a query without any conditions. ORDER BY displays the query result in either ascending or descending alphabetical order. GROUP BY provides mechanism for arranging identical data into groups. HAVING allows you to create selection criteria at the group level. SQL Server recovery models When it comes to the database, backup is something that you should consider and reconsider really carefully. Mistakes can cost you: money, users, data and time and I don't know which one has bigger consequences. Backup and restore are elements of a much wider picture known by the name of disaster recovery and it is science itself. But, from the database perspective and usual administration task these two operations are the foundation for everything else. Before you even think about your backups, you need to understand recovery models that SQL Server internally uses while the database is in operational mode. Recovery model is about maintaining data in the event of a server failure. Also, it defines amount of information that SQL Server writes in log file with purpose of recovery. SQL Server has three database recovery models: Simple recovery model Full recovery model Bulk-logged recovery model Simple recovery model This model is typically used for small databases and scenarios were data changes are infrequent. It is limited to restoring the database to the point when the last backup was created. It means that all changes made after the backup are gone. You will need to recreate all changes manually. Major benefit of this model is that it takes small amount of storage space for log file. How to use it and when, depends on business scenarios. Full recovery model This model is recommended when recovery from damaged storage is the highest priority and data loss should be minimal. SQL Server uses copies of database and log files to restore database. Database engine logs all changes to the database including bulk operation and most DDL commands. 
If the transaction log file is not damaged, SQL Server can recover all data except transaction which are in process at the time of failure (not committed in to database file). All logged transactions give you an opportunity of point in time recovery, which is a really cool feature. Major limitation of this model is the large size of the log files which leads you to performance and storage issues. Use it only in scenarios where every insert is important and loss of data is not an option. Bulk-logged recovery model This model is somewhere in the middle of simple and full. It uses database and log backups to recreate database. Comparing to full recovery model, it uses less log space for: CREATE INDEX and bulk load operations such as SELECT INTO. Let's look at this example. SELECT INTO can load a table with 1, 000, 000 records with a single statement. The log will only record occurrence of this operations but details. This approach uses less storage space comparing to full recovery model. Bulk-logged recovery model is good for databases which are used to ETL process and data migrations. SQL Server has system database model. This database is the template for each new one you create. If you use just CREATE DATABASE statement without any additional parameters it simply copies model database with all properties and metadata. It also inherits default recovery model which is full. So, conclusion is that each new database will be in full recovery mode. This can be changed during and after creation process. Elements of backup strategy Good backup strategy is not just about creating a backup. This is a process of many elements and conditions that should be filed to achieve final goal and this is the most efficient backup strategy plan. To create a good strategy, we need to answer the following questions: Who can create backups? Backup media Types of backups Who can create backups? Let's say that SQL Server user needs to be a member of security role which is authorized to execute backup operations. They are members of: sysadmin server role Every user with sysadmin permission can work with backups. Our default sa user is a member of the sysadmin role. db_owner database role Every user who can create databases by default can execute any backup/restore operations. db_backupoperator database role Some time you need just a person(s) to deal with every aspect of backup operation. This is common for large-scale organizations with tens or even hundreds of SQL Server instances. In those environments, backup is not trivial business. Backup media An important decision is where to story backup files and how to organize while backup files and devices. SQL Server gives you a large set of combinations to define your own backup media strategy. Before we explain how to store backups, let's stop for a minute and describe the following terms: Backup disk is a hard disk or another storage device that contains backup files. Back file is just ordinary file on the top of file system. Media set is a collection of backup media in ordered way and fixed type (example: three type devices, Tape1, Tape2, and Tape3). Physical backup device can be a disk file of tape drive. You will need to provide information to SQL Server about your backup device. A backup file that is created before it is used for a backup operation is called a backup device. Figure Backup devices The simplest way to store and handle database backups is by using a back disk and storing them as regular operating system files, usually with the extension .bak. 
Linux does not care much about extension, but it is good practice to mark those files with something obvious. This chapter will explain how to use backup disk devices because every reader of this book should have a hard disk with an installation of SQL Server on Linux; hope so! Tapes and media sets are used for large-scale database operations such as enterprise-class business (banks, government institutions and so on). Disk backup devices can anything such as a simple hard disk drive, SSD disk, hot-swap disk, USB drive and so on. The size of the disk determines the maximum size of the database backup file. It is recommended that you use a different disk as backup disk. Using this approach, you will separate database data and log disks. Imagine this. Database files and backup are on the same device. If that device fails, your perfect backup strategy will fall like a tower of cards. Don't do this. Always separate them. Some serious disaster recovery strategies (backup is only smart part of it) suggest using different geographic locations. This makes sense. A natural disaster or something else of that scale can knock down the business if you can't restore your system from a secondary location in a reasonably small amount of time. Summary Backup and restore is not something that you can leave aside. It requires serious analyzing and planning, and SQL Server gives you powerful backup types and options to create your disaster recovery policy on SQL Server on Linux. Now you can do additional research and expand your knowledge A database typically contains dozens of tables, and therefore it is extremely important that you master creating queries over multiple tables. This implies the knowledge of the functioning JOIN operators with a combination with elements of string manipulation. Resources for Article: Further resources on this subject: Review of SQL Server Features for Developers [article] Configuring a MySQL linked server on SQL Server 2008 [article] Exception Handling in MySQL for Python [article]
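As a closing illustration of the recovery models and backup devices discussed above, here is a small, hedged T-SQL sketch. The database name (SampleDB) and the backup path under /var/opt/mssql are assumptions for the example and should be adjusted to your own environment.

-- The database name and backup path below are placeholders.
-- Check the current recovery model of a database.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'SampleDB';

-- Switch to the simple recovery model (new databases inherit full from model).
ALTER DATABASE SampleDB SET RECOVERY SIMPLE;

-- Take a full database backup to a backup disk file.
BACKUP DATABASE SampleDB
TO DISK = '/var/opt/mssql/backup/SampleDB_full.bak'
WITH INIT, NAME = 'SampleDB full backup';

You can run the same statements from Linux with the sqlcmd tool mentioned earlier, for example sqlcmd -S localhost -U sa -i backup.sql.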
Econometric Analysis
Packt
05 Jul 2017
10 min read
In this article by Param Jeet and Prashant Vats, the authors of the book Learning Quantitative Finance with R, we will discuss the types of regression and how we can build regression models in R for building predictive models. We will also see how we can implement variable selection methods and other aspects associated with regression. This article will not contain the theoretical description, but will simply guide you on how to implement regression models in R in the financial space. Regression analysis can be used for forecasting on cross-sectional data in the financial domain. This article covers the following topics:

Simple linear regression
Multivariate linear regression
Multicollinearity
ANOVA

(For more resources related to this topic, see here.)

Simple linear regression

In simple linear regression we try to predict one variable in terms of a second variable called the predictor variable. The variable we are trying to predict is called the dependent variable and is denoted by y, and the independent variable is denoted by x. In simple linear regression we assume a linear relationship between the dependent attribute and the predictor attribute. First we need to plot the data to understand the linear relationship between the dependent variable and the independent variable. Here our data consists of two variables:

YPrice: Dependent variable
XPrice: Predictor variable

In this case we are trying to predict YPrice in terms of XPrice. StockXPrice is the independent variable and StockYPrice is the dependent variable. For every element of StockXPrice there is an element of StockYPrice, which implies a one-to-one mapping between the elements of StockXPrice and StockYPrice. A few lines of the data used for the following analysis are displayed using the following code:

> head(Data)
  StockYPrice StockXPrice
1       80.13       72.86
2       79.57       72.88
3       79.93       71.72
4       81.69       71.54
5       80.82       71
6       81.07       71.78

Scatter plot

First we will plot a scatter plot between y and x to understand the type of linear relationship between x and y. The following code, when executed, gives the following scatterplot:

> YPrice = Data$StockYPrice
> XPrice = Data$StockXPrice
> plot(XPrice, YPrice, xlab = "XPrice", ylab = "YPrice")

Here our dependent variable is YPrice and our predictor variable is XPrice. Please note this example is just for illustration purposes.

Figure 3.1: Scatter plot of two variables

Once we have examined the relationship between the dependent variable and the predictor variable, we try to fit the best straight line through the points, which represents the predicted Y value for each given value of the predictor variable. A simple linear regression is represented by the following equation describing the relationship between the dependent and predictor variables:

y = α + βx + ε

Where α and β are parameters and ε is the error term. α is also known as the intercept and β as the coefficient of the predictor variable; they are obtained by minimizing the sum of squares of the error term ε. All statistical software gives the option of estimating the coefficients, and so does R. We can fit the linear regression model using the lm function in R as shown here:

> LinearR.lm = lm(YPrice ~ XPrice, data = Data)

Where Data is the input data and YPrice and XPrice are the dependent and predictor variables respectively.
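The Data object above comes from a file that is not reproduced here. If you want to run the example end to end, the following hedged R sketch simulates a data frame with the same column names; the simulated intercept, slope, and noise level are arbitrary illustration values, chosen only to be in the same ballpark as the fitted values quoted later.

# Simulated stand-in for the CSV used in the text, so the example is runnable.
set.seed(123)
StockXPrice <- runif(100, min = 70, max = 75)
StockYPrice <- 92.7 - 0.17 * StockXPrice + rnorm(100, sd = 0.5)
Data <- data.frame(StockYPrice, StockXPrice)

YPrice <- Data$StockYPrice
XPrice <- Data$StockXPrice

# Same fit as above, now against the simulated data.
LinearR.lm <- lm(YPrice ~ XPrice, data = Data)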
Once we have fit the model we can extract our parameters using the following code: > coeffs = coefficients(LinearR.lm); coeffs The preceding resultant gives the value of intercept and coefficient: (Intercept) XPrice 92.7051345 -0.1680975 So now we can write our model as: > YPrice = 92.7051345 + -0.1680975*(Xprice) This can give the predicted value for any given Xprice. Also, we can execute the given following code to get predicted value using the fit linear regression model on any other data say OutofSampleData by executing the following code: > predict(LinearR.lm, OutofSampleData) Coefficient of determination We have fit our model but now we need to test how good the model is fitting to the data. There are few measures available for it but the main is coefficient of determination. This is given by the following code: > summary(LinearR.lm)$r.squared By definition, it is proportion of the variance in the dependent variable that is explained by the independent variable and is also known as R2. Significance test Now we need to examine that the relationship between the variables in linear regression model is significant or not at 0.05 significance level. When we execute the following code will look like: > summary(LinearR.lm) It gives all the relevant statistics of the linear regression model as shown here: Figure 3.2: Summary of linear regression model If the Pvalue associated with Xprice is less than 0.05 then the predictor is explaining the dependent variable significantly at 0.05 significance level. Confidence interval for linear regression model One of the important issues for the predicted value is to find the confidence interval around the predicted value. So let us try to find 95% confidence interval around predicted value of the fit model. This can be achieved by executing the following code: > Predictdata = data.frame(XPrice=75) > predict(LinearR.lm, Predictdata, interval=“confidence“) Here we are estimating the predicted value for given value of Xprice = 75 and then the next we try to find the confidence interval around the predicted value. The output generated by executing the preceding code is shown in the following screenshot:: Figure 3.3: Prediction of confidence interval for linear regression model Residual plot Once we have fitted the model then we compare it with the observed value and find the difference which is known as residual. Then we plot the residual against the predictor variable to see the performance of model visually. The following code can be executed to get the residual plot: > LinearR.res = resid(LinearR.lm) > plot(XPrice, LinearR.res, ylab=“Residuals“, xlab=“XPrice“, main=“Residual Plot“) Figure 3.4: Residual plot of linear regression model We can also plot the residual plot for standardized residual by just executing the following code in the previous mentioned code: > LinearRSTD.res = rstandard(LinearR.lm) > plot(XPrice, LinearRSTD.res, ylab=“Standardized Residuals“, xlab=“XPrice“, main=“Residual Plot“) Normality distribution of errors One of the assumption of linear regression is that errors are normally distributed and after fitting the model we need to check that errors are normally distributed. 
Which can be checked by executing the following code and can be compared with theoretical normal distribution: > qqnorm(LinearRSTD.res, ylab=“Standardized Residuals“, xlab=“Normal Scores“, main=“Error Normal Distribution plot“) > qqline(LinearRSTD.res) Figure 3.5: QQ plot of standardized residuals Further detail of the summary function for linear regression model can be found in the R documentation. The following command will open a window which has complete information about linear regression model, that is, lm(). It also has information about each and every input variable including their data type, what are all the variable this function returns and how output variables can be extracted along with the examples: > help(summary.lm) Multivariate linear regression In multiple linear regression, we try to explain the dependent variable in terms of more than one predictor variable. The multiple linear regression equation is given by the following formula: Where α, β1 …βk are multiple linear regression parameters and can be obtained by minimizing the sum of squares which is also known as OLS method of estimation. Let us an take an example where we have the dependent variable StockYPrice and we are trying to predict it in terms of independent variables StockX1Price, StockX2Price, StockX3Price, StockX4Price, which is present in dataset DataMR. Now let us fit the multiple regression model and get parameter estimates of multiple regression: > MultipleR.lm = lm(StockYPrice ~ StockX1Price + StockX2Price + StockX3Price + StockX4Price, data=DataMR) > summary(MultipleR.lm) When we executed the preceding code, it fits the multiple regression model on the data and gives the basic summary of statistics associated with the multiple regression: Figure 3.6: Summary of multivariate linear regression Just like simple linear regression model the lm function estimates the coefficients of multiple regression model as shown in the previous summary and we can write our prediction equation as follows: > StockYPrice = 88.42137 +(-0.16625)*StockX1Price + (-0.00468) * StockX2Price + (.03497)*StockX3Price+ (.02713)*StockX4Price For any given set of independent variable we can find the predicted dependent variable by using the previous equation. For any out of sample data we can obtain the forecast by executing the following code: > newdata = data.frame(StockX1Price=70, StockX2Price=90, StockX3Price=60, StockX4Price=80) > predict(MultipleR.lm, newdata) Which gives the output 80.63105 as the predicted value of dependent variable for given set of independent variables. Coefficient of determination For checking the adequacy of model the main statistics is coefficient of determination and adjusted coefficient of determination which has been displayed in the summary table as R-Squared and Adjusted R-Squared matrices. Also we can obtain them by the following code: > summary(MultipleR.lm)$r.squared > summary(MultipleR.lm)$adj.r.squared From the summary table we can see which variables are coming significant. If the Pvalue associated with the variables in the summary table are <0.05 then the specific variable is significant, else it is insignificant. 
Confidence interval We can find the prediction interval for 95% confidence interval for the predicted value by multiple regression model by executing the following code: > predict(MultipleR.lm, newdata, interval=“confidence“) The following code generates the following output:  Figure 3.7: Prediction of confidence interval for multiple regression model Multicollinearity If the predictor variables are correlated then we need to detect multicollinearity and treat it. Recognition of multicollinearity is very crucial because two of more variables are correlated which shows strong dependence structure between those variables and we are using correlated variables as independent variables which end up having double effect of these variables on the prediction because of relation between them. If we treat the multicollinearity and consider only variables which are not correlated then we can avoid the problem of double impact. We can find multicollinearity by executing the following code: > vif(MultipleR.lm) This gives the multicollinearity table for the predictor variables:  Figure 3.8: VIF table for multiple regression model Depending upon the values of VIF we can drop the irrelevant variable. ANOVA ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent groups. In case of only two samples we can use the t-test to compare the means of the samples but in case of more than two samples it may be very complicated. We are going to study the relationship between a quantitative dependent variable returns and single qualitative independent variable stock. We have five levels of stock stock1, stock2, .. stock5. We can study the four levels of stocks by means of boxplot and we can compare by executing the following code: > DataANOVA = read.csv(“C:/Users/prashant.vats/Desktop/Projects/BOOK R/DataAnova.csv“) >head(DataANOVA) This displays few lines of the data used for analysis in the tabular format:   Returns Stock 1 1.64 Stock1 2 1.72 Stock1 3 1.68 Stock1 4 1.77 Stock1 5 1.56 Stock1 6 1.95 Stock1 >boxplot(DataANOVA$Returns ~ DataANOVA$Stock) This gives the following output and boxplot it: Figure 3.9: Boxplot of different levels of stocks The preceding boxplot shows that level stock has higher returns. If we repeat the procedure we are most likely going to get different returns. It may be possible that all the levels of stock gives similar numbers and we are just seeing random fluctuation in one set of returns. Let us assume that there is no difference at any level and it be our null hypothesis. Using ANOVA, let us test the significance of hypothesis: > oneway.test(Returns ~ Stock, var.equal=TRUE) Executing the preceding code gives the following outcome: Figure 3.10: Output of ANOVA for different levels of Stocks Since Pvalue is less than 0.05 so the null hypothesis gets rejected. The returns at the different levels of stocks are not similar. Summary This article has been proven very beneficial to know some basic quantitative implementation with R. Moreover, you will also get to know the information regarding the packages that R use. Resources for Article: Further resources on this subject: What is Quantitative Finance? [article] Stata as Data Analytics Software [article] Using R for Statistics, Research, and Graphics [article]
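The ANOVA example above reads its returns from DataAnova.csv, which is not included here. The following hedged R sketch simulates returns for five stocks with slightly different means so that the boxplot and the one-way test can be reproduced; the means and standard deviation are arbitrary illustration values.

# Simulated stand-in for DataAnova.csv; means and spread are arbitrary.
set.seed(42)
DataANOVA <- data.frame(
  Returns = c(rnorm(30, mean = 1.7, sd = 0.1),   # Stock1
              rnorm(30, mean = 1.6, sd = 0.1),   # Stock2
              rnorm(30, mean = 1.8, sd = 0.1),   # Stock3
              rnorm(30, mean = 1.5, sd = 0.1),   # Stock4
              rnorm(30, mean = 1.9, sd = 0.1)),  # Stock5
  Stock = rep(paste0("Stock", 1:5), each = 30)
)

boxplot(DataANOVA$Returns ~ DataANOVA$Stock)

# One-way ANOVA assuming equal variances, as in the text.
oneway.test(Returns ~ Stock, data = DataANOVA, var.equal = TRUE)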
Planning and Preparation
Packt
05 Jul 2017
9 min read
In this article by Jason Beltrame, authors of the book Penetration Testing Bootcamp, Proper planning and preparation is key to a successful penetration test. It is definitely not as exciting as some of the tasks we will do within the penetration test later, but it will lay the foundation of the penetration test. There are a lot of moving parts to a penetration test, and you need to make sure that you stay on the correct path and know just how far you can and should go. The last thing you want to do in a penetration test is cause a customer outage because you took down their application server with an exploit test (unless, of course, they want us to get to that depth) or scanned the wrong network. Performing any of these actions would cause our penetration-testing career to be a rather short-lived career. In this article, following topics will be covered: Why does penetration testing take place? Building the systems for the penetration test Penetration system software setup (For more resources related to this topic, see here.) Why does penetration testing take place? There are many reasons why penetration tests happen. Sometimes, a company may want to have a stronger understanding of their security footprint. Sometimes, they may have a compliance requirement that they have to meet. Either way, understanding why penetration testing is happening will help you understand the goal of the company. Plus, it will also let you know whether you are performing an internal penetration test or an external penetration test. External penetration tests will follow the flow of an external user and see what they have access to and what they can do with that access. Internal penetration tests are designed to test internal systems, so typically, the penetration box will have full access to that environment, being able to test all software and systems for known vulnerabilities. Since tests have different objectives, we need to treat them differently; therefore, our tools and methodologies will be different. Understanding the engagement One of the first tasks you need to complete prior to starting a penetration test is to have a meeting with the stakeholders and discuss various data points concerning the upcoming penetration test. This meeting could be you as an external entity performing a penetration test for a client or you as an internal security employee doing the test for your own company. The important element here is that the meeting should happen either way, and the same type of information needs to be discussed. During the scoping meeting, the goal is to discuss various items of the penetration test so that you have not only everything you need, but also full management buy-in with clearly defined objectives and deliverables. Full management buy-in is a key component for a successful penetration test. Without it, you may have trouble getting required information from certain teams, scope creep, or general pushback. Building the systems for the penetration test With a clear understanding of expectations, deliverables, and scope, it is now time to start working on getting our penetration systems ready to go. For the hardware, I will be utilizing a decently powered laptop. The laptop specifications are a Macbook Pro with 16 GB of RAM, 256 GB SSD, and a quad-core 2.3 Ghz Intel i7 running VMware Fusion. I will also be using the Raspberry Pi 3. The Raspberry Pi 3 is a 1.2 Ghz ARMv8 64-bit Quad Core, with 1GB of RAM and a 32 GB microSD. 
Obviously, there is quite a power discrepancy between the laptop and the Raspberry Pi. That is okay, though, because I will be using these two devices differently. Any task that requires any sort of processing power will be done on the laptop. I love using the Raspberry Pi because of its small form factor and flexibility. It can be placed in just about any location we need and, if needed, it can be easily concealed. For software, I will be using Kali Linux as my operating system of choice. Kali is a security-oriented Linux distribution that contains a bunch of security tools already installed. Its predecessor, Backtrack, was also a very popular security operating system. One of the benefits of Kali Linux is that it is also available for the Raspberry Pi, which is perfect in our circumstance. This way, we can have a consistent platform between the devices we plan to use in our penetration-testing labs. Kali Linux can be downloaded from their site at https://www.kali.org. For the Raspberry Pi, the Kali images are managed by Offensive Security at https://www.offensive-security.com. Even though I am using Kali Linux as my software platform of choice, feel free to use whichever software platform you feel most comfortable with. We will be using a bunch of open source tools for testing. A lot of these tools are available for other distributions and operating systems.

Penetration system software setup

Setting up Kali Linux on the two systems is a bit different since they are different platforms. We won't be diving into a lot of detail on the installs, but we will hit all the major points. This is the process you can use to get the software up and running. We will start with the installation on the Raspberry Pi:

1. Download the image from Offensive Security at https://www.offensive-security.com/kali-linux-arm-images/.
2. Open the Terminal app on OS X.
3. Using the xz utility, decompress the Kali image that was downloaded:
   xz -d kali-2.1.2-rpi2.img.xz
4. Next, insert the USB microSD card reader with the microSD card into the laptop and list the disks that are installed so that you know the correct disk to put the Kali image on:
   diskutil list
5. Once you know the correct disk, unmount it to prepare for writing to it:
   diskutil unmountDisk /dev/disk2
6. Now that you have the correct disk unmounted, write the image to it using the dd command. This process can take some time, so if you want to check on the progress, you can press Ctrl + T at any time:
   sudo dd if=kali-2.1.2-rpi2.img of=/dev/disk2 bs=1m
7. Since the image is now written to the microSD card, eject it with the following command:
   diskutil eject /dev/disk2
8. Remove the USB microSD card reader, place the microSD card in the Raspberry Pi, and boot it up for the first time. The default login credentials are as follows:
   Username: root
   Password: toor
9. Change the default password on the Raspberry Pi to make sure no one else can get into it. Run the following command and enter a new password when prompted:
   passwd
10. Making sure the software is up to date is important for any system, especially a secure penetration-testing system. You can accomplish this with the following commands:
    apt-get update
    apt-get upgrade
    apt-get dist-upgrade

After a reboot, you are ready to go on the Raspberry Pi.
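Before running dd in step 6, it can be worth verifying the downloaded image and double-checking the target disk, since dd will silently overwrite whatever device you point it at. The following is a small, hedged sketch of such checks on OS X; the checksum file name is a placeholder, and the exact name published alongside the image may differ.

# Optional safety checks before writing the image (step 6 above).
# The .sha256sum file name is an assumption; use whatever checksum file is
# published next to the image you downloaded.
shasum -a 256 kali-2.1.2-rpi2.img.xz
cat kali-2.1.2-rpi2.img.xz.sha256sum      # compare the two values manually

# Confirm that /dev/disk2 really is the microSD card (check its size and name)
# before it is overwritten.
diskutil info /dev/disk2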
Next, it's on to setting up the Kali Linux install on the Mac. Since you will be installing Kali as a VM within Fusion, the process varies compared to another hypervisor or a bare-metal install. For me, I like having the flexibility of OS X running as well, so that I can run commands on there too:

1. Similar to the Raspberry Pi setup, you need to download the image. You will do that directly via the Kali website. They offer virtual images for download as well; if you go to select these, you will be redirected to the Offensive Security site at https://www.offensive-security.com/kali-linux-vmware-virtualbox-image-download/.
2. Now that you have the Kali Linux image downloaded, you need to extract the VMDK. We used 7z via the CLI to accomplish this task.
3. Since the VMDK is now ready to import, go into VMware Fusion and navigate to File | New. A screen similar to the following should be displayed.
4. Click on Create a custom virtual machine. You can select the OS as Other | Other and click on Continue.
5. Now, you will need to import the previously decompressed VMDK. Click on the Use an existing virtual disk radio button, and hit Choose virtual disk. Browse to the VMDK and click on Continue. Then, on the last screen, click on the Finish button. The disk should now start to copy; give it a few minutes to complete.
6. Once completed, the Kali VM will boot. Log in with the same default credentials used for the Raspberry Pi image:
   Username: root
   Password: toor
7. You then need to change the default password to make sure no one else can get into it. Open up a terminal within the Kali Linux VM, run the following command, and enter a new password when prompted:
   passwd
8. Make sure the software is up to date, as you did for the Raspberry Pi. To accomplish this, you can use the following commands:
   apt-get update
   apt-get upgrade
   apt-get dist-upgrade

Once this is complete, the laptop VM is ready to go.

Summary

Now that we have reached the end of this article, we should have everything that we need for the penetration test. Having had the scoping meeting with all the stakeholders, we were able to get answers to all the questions that we required. Once we completed the planning portion, we moved on to the preparation phase. In this case, the preparation phase involved setting up Kali Linux on the Raspberry Pi as well as setting it up as a VM on the laptop. We went through the steps of installing and updating the software on each platform as well as some basic administrative tasks.

Resources for Article:
Further resources on this subject:
Introducing Penetration Testing [article]
Web app penetration testing in Kali [article]
BackTrack 4: Security with Penetration Testing Methodology [article]
Network Evidence Collection

Packt
05 Jul 2017
16 min read
In this article, Gerard Johansen, author of the book Digital Forensics and Incident Response, explains that the traditional focus of digital forensics has been to locate evidence on the host hard drive. Law enforcement officers interested in criminal activity such as fraud or child exploitation can find the vast majority of evidence required for prosecution on a single hard drive. In the realm of Incident Response, though, it is critical that the focus goes far beyond a suspected compromised system. There is a wealth of information to be obtained at the points along the flow of traffic from a compromised host to an external Command and Control server, for example. (For more resources related to this topic, see here.) This article focuses on the preparation, identification, and collection of evidence that is commonly found among network devices and along the traffic routes within an internal network. This collection is critical during an incident where an external threat source is in the process of commanding internal systems or pilfering data out of the network. Network-based evidence is also useful when examining host evidence, as it provides a second source of event corroboration, which is extremely useful in determining the root cause of an incident.

Preparation

The ability to acquire network-based evidence is largely dependent on the preparations that are undertaken by an organization prior to an incident. Without some critical components of a proper infrastructure security program, key pieces of evidence will not be available to incident responders in a timely manner. The result is that evidence may be lost while the CSIRT members hunt down critical pieces of information. In terms of preparation, organizations can aid the CSIRT by having proper network documentation, up-to-date configurations of network devices, and a central log management solution in place. Aside from the technical preparation for network evidence collection, CSIRT personnel need to be aware of any legal or regulatory issues regarding the collection of network evidence. CSIRT personnel need to be aware that capturing network traffic can be considered an invasion of privacy absent any other policy. Therefore, the legal representative of the CSIRT should ensure that all employees of the organization understand that their use of the information system can be monitored. This should be expressly stated in policies prior to any evidence collection that may take place.

Network diagram

To identify potential sources of evidence, incident responders need to have a solid understanding of what the internal network infrastructure looks like. One method that can be employed by organizations is to create and maintain an up-to-date network diagram. This diagram should be detailed enough that incident responders can identify individual network components such as switches, routers, or wireless access points. It should also contain internal IP addresses so that incident responders can immediately access those systems through remote methods. For instance, examine the simple network diagram below: This diagram allows for quick identification of potential evidence sources. In the above diagram, for example, suppose that the laptop connected to the switch at 192.168.2.1 is identified as communicating with a known malware Command and Control server.
A CSIRT analyst could examine the network diagram and ascertain that the C2 traffic would have to traverse several network hardware components on its way out of the internal network. For example, there would be traffic traversing the switch at 192.168.10.1, through the firewall at 192.168.0.1 and finally the router out to the Internet. Configuration Determining if an attacker has made modifications to a network device such as a switch or a router can be made easier if the CSIRT has a standard configuration immediately available. Organizations should already have configurations for network devices stored for Disaster Recovery purposes but should have these available for CSIRT members in the event that there is an incident. Logs and log management The lifeblood of a good incident investigation is evidence from a wide range of sources. Even something as a malware infection on a host system requires corroboration among a variety of sources. One common challenge with Incident Response, especially in smaller networks is how the organization handles log management. For a comprehensive investigation, incident response analysts need access to as much network data as possible. All to often, organizations do not dedicate the proper resources to enabling the comprehensive logs from network devices and other systems. Prior to any incident, it is critical to clearly define the how and what an organization will log and as well as how it will maintain those logs. This should be established within a log management policy and associated procedure. The CSIRT personnel should be involved in any discussion as what logs are necessary or not as they will often have insight into the value of one log source over another. NIST has published a short guide to log management available at: http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-92.pdf. Aside from the technical issues regarding log management, there are legal issues that must be addressed. The following are some issues that should be addressed by the CSIRT and its legal support prior to any incident. Establish logging as a normal business practice: Depending on the type of business and the jurisdiction, users may have a reasonable expectation of privacy absent any expressly stated monitoring policy. In addition, if logs are enabled strictly to determine a user's potential malicious activity, there may be legal issues. As a result, the logging policy should establish that logging of network activity is part of the normal business activity and that users do not have a reasonable expectation of privacy. Logging as close to the event: This is not so much an issue with automated logging as they are often created almost as the event occurs. From an evidentiary standpoint, logs that are not created close to the event lose their value as evidence in a courtroom. Knowledgeable Personnel: The value of logs is often dependent on who created the entry and whether or not they were knowledgeable about the event. In the case of logs from network devices, the logging software addresses this issue. As long as the software can be demonstrated to be functioning properly, there should be no issue. Comprehensive Logging: Enterprise logging should be configured for as much of the enterprise as possible. In addition, logging should be consistent. A pattern of logging that is random will have less value in a court than a consistent patter of logging across the entire enterprise. Qualified Custodian: The logging policy should name a Data Custodian. 
This individual would speak to the logging practices and the types of software utilized to create the logs. They would also be responsible for testifying to the accuracy of the logs and the logging software used. Document Failures: Prolonged failures, or a history of failures, in the logging of events may diminish their value in a courtroom. It is imperative that any logging failure be documented and a reason associated with such failure. Log File Discovery: Organizations should be made aware that logs utilized within a courtroom proceeding are going to be made available to opposing legal counsel. Logs from compromised systems: Logs that originate from a known compromised system are suspect. In the event that these logs are to be introduced as evidence, the custodian or incident responder will often have to testify at length concerning the veracity of the data contained within the logs. Original copies are preferred: Log files can be copied from the log source to media. As a further step, any logs should be archived off the system as well. Incident responders should establish a chain of custody for each log file used throughout the incident, and these logs should be maintained as part of the case until an order from the court is obtained allowing their destruction.

Network device evidence

There are a number of log sources that can provide CSIRT personnel and incident responders with good information. Each of these network devices is provided by a range of manufacturers. As a preparation task, CSIRT personnel should become familiar with how to access these devices and obtain the necessary evidence:

Switches: These are spread throughout a network, from core switches that handle traffic from a range of network segments to edge switches that handle the traffic for individual segments. As a result, traffic that originates on a host and travels out of the internal network will traverse a number of switches. Switches have two key points of evidence that should be addressed by incident responders. First is the Content Addressable Memory (CAM) table. This CAM table maps the physical ports on the switch to the Network Interface Card (NIC) on each device connected to the switch. Incident responders can utilize this information when tracing connections to specific network jacks, which can aid in the identification of possible rogue devices. The second way switches can aid in an incident investigation is by facilitating network traffic capture.

Routers: Routers allow organizations to connect multiple LANs into either Metropolitan Area Networks or Wide Area Networks. As a result, they handle an extensive amount of traffic. The key piece of evidentiary information that routers contain is the routing table. This table holds the information for the specific physical ports that map to the networks. Routers can also be configured to deny specific traffic between networks and to maintain logs on allowed traffic and data flow.

Firewalls: Firewalls have changed significantly since the days when they were considered just a different type of router. Next-generation firewalls contain a wide variety of features such as Intrusion Detection and Prevention, web filtering, Data Loss Prevention, and detailed logs about allowed and denied traffic. Firewalls oftentimes serve as the detection mechanism that alerts security personnel to potential incidents. Incident responders should have as much visibility as possible into how their organization's firewalls function and what data can be obtained prior to an incident.
Network Intrusion Detection and Prevention systems: These systems were purposefully designed to provide security personnel and incident responders with information concerning potential malicious activity on the network infrastructure. These systems utilize a combination of network monitoring and rule sets to determine whether there is malicious activity. Intrusion Detection Systems are often configured to alert on specific malicious activity, while Intrusion Prevention Systems can not only detect but also block potential malicious activity. In either case, the logs of both types of platforms are an excellent place for incident responders to locate specific evidence of malicious activity.

Web Proxy Servers: Organizations often utilize Web Proxy Servers to control how users interact with websites and other internet-based resources. As a result, these devices can give an enterprise-wide picture of web traffic that both originates from and is destined for internal hosts. Web proxies also have the additional feature of alerting on connections to known malware Command and Control (C2) servers or websites that serve up malware. A review of web proxy logs in conjunction with a possibly compromised host may identify a source of malicious traffic or a C2 server exerting control over the host.

Domain Controllers / Authentication Servers: Serving the entire network domain, authentication servers are the primary location that incident responders can leverage for details on successful or unsuccessful logins, credential manipulation, or other credential use.

DHCP Server: Maintaining a list of the IP addresses assigned to workstations or laptops within the organization requires an inordinate amount of upkeep. The use of the Dynamic Host Configuration Protocol allows for the dynamic assignment of IP addresses to systems on the LAN. DHCP servers often contain logs on the assignment of IP addresses mapped to the MAC address of the host's NIC. This becomes important if an incident responder has to track down a specific workstation or laptop that was connected to the network at a specific date and time.

Application Servers: A wide range of applications, from email to web applications, is housed on network servers. Each of these can provide logs specific to the type of application.

Network devices such as switches, routers, and firewalls also have their own internal logs that maintain data on access and changes. Incident responders should become familiar with the types of network devices on their organization's network and also be able to access these logs in the event of an incident.

Security Information and Event Management system

A significant challenge that a great many organizations have is the nature of logging on network devices. With limited space, log files are often rolled over, where new log files are written over older log files. The result is that, in some cases, an organization may only have a few days or even a few hours of important logs. If a potential incident happened several weeks ago, the incident response personnel will be without critical pieces of evidence. One tool that has been embraced by a number of enterprises is a Security Information and Event Management (SIEM) system. These appliances have the ability to aggregate log and event data from network sources and combine them into a single location. This allows the CSIRT and other security personnel to observe activity across the entire network without having to examine individual systems.
The diagram below illustrates how a SIEM integrates into the overall network: A variety of sources from security controls to SQL databases are configured to send logs to the SIEM. In this case, the SQL database located at 10.100.20.18 indicates that the user account USSalesSyncAcct was utilized to copy a database to the remote host located at 10.88.6.12. The SIEM allows for quick examination of this type of activity. For example, if it is determined that the account USSalesSyncAcct had been compromised, CSIRT analysts can quickly query the SIEM for any usage of that account. From there, they would be able to see the log entry that indicated a copy of a database to the remote host. Without that SIEM, CSIRT analysts would have to search each individual system that might have been accessed, a process that may be prohibitive. From the SIEM platform, security and network analysts have the ability to perform a number of different tasks related to Incident Response: Log Aggregation: Typical enterprises have several thousand devices within the internal network, each with their own logs; the SIEM can be deployed to aggregate these logs in a central location. Log Retention: Another key feature that SIEM platforms provide is a platform to retain logs. Compliance frameworks such as the Payment Card Industry Data Security Standard (PCI-DSS) stipulate that logs should be maintained for a period of one year with 90 days immediately available. SIEM platforms can aid with log management by providing a system that archives logs in an orderly fashion and allows for the immediate retrieval. Routine Analysis: It is advisable with a SIEM platform to conduct period reviews of the information. SIEM platforms often provide a dashboard that highlights key elements such as the number of connections, data flow, and any critical alerts. SIEMs also allow for reporting so that stakeholders can keep informed of activity. Alerting: SIEM platforms have the ability to alert to specific conditions that may indicate malicious activity. This can include alerting from security controls such as anti-virus, Intrusion Prevention or Detection Systems. Another key feature of SIEM platforms is event correlation. This technique examines the log files and determines if there is a link or any commonality in the events. The SIEM then has the capability to alert on these types of events. For example, if a user account attempts multiple logins across a number of systems in the enterprise, the SIEM can identify that activity and alert to it. Incident Response: As the SIEM becomes the single point for log aggregation and analysis; CSIRT analysts will often make use of the SIEM during an incident. CSIRT analysis will often make queries on the platform as well as download logs for offline analysis. Because of the centralization of log files, the time to conduct searches and event collection is significantly reduced. For example, a CSIRT analysis has indicated a user account has been compromised. Without a SIEM, the CSIRT analyst would have to check various systems for any activity pertaining to that user account. With a SIEM in place, the analyst simply conducts a search of that user account on the SIEM platform, which has aggregated user account activity, logs from systems all over the enterprise. The result is the analyst has a clear idea of the user account activity in a fraction of the time it would have taken to examine logs from various systems throughout the enterprise. 
SIEM platforms do entail a good deal of time and money to purchase and implement. Adding to that cost is the constant upkeep, maintenance and modification to rules that is necessary. From an Incident Response perspective though, a properly configured and maintained SIEM is vital to gathering network-based evidence in a timely manner. In addition, the features and capability of SIEM platforms can significantly reduce the time it takes to determine a root cause of an incident once it has been detected. The following article has an excellent breakdown and use cases of SIEM platforms in enterprise environments: https://gbhackers.com/security-information-and-event-management-siem-a-detailed-explanation/. Security onion Full-featured SIEM platforms may be cost prohibitive for some organizations. One option that is available is the open source platform Security Onion. The Security Onion ties a wide range of security tools such as OSSEC, Suricata, and Snort into a single platform. Security Onion also has features such as dashboards and tools for deep analysis of log files. For example, the following screenshot shows the level of detail available: Although installing and deploying the Security Onion may require some resources in time, it is a powerful low cost alternative providing a solution to organizations that cannot deploy a full-featured SIEM solution. (The Security Onion platform and associated documentation is available at https://securityonion.net/). Summary Evidence that is pertinent to incident responders is not just located on the hard drive of a compromised host. There is a wealth of information available from network devices spread throughout the environment. With proper preparation, a CSIRT may be able to leverage the evidence provided by these devices through solutions such as a SIEM. CSIRT personnel also have the ability to capture the network traffic for later analysis through a variety of methods and tools. Behind all of these techniques though, is the legal and policy implications that CSIRT personnel and the organization at large needs to navigate. By preparing for the legal and technical challenges of network evidence collection, CSIRT members can leverage this evidence and move closer to the goal of determining the root cause of an incident and bringing the organization back up to operations. Resources for Article: Further resources on this subject: Selecting and Analyzing Digital Evidence [article] Digital and Mobile Forensics [article] BackTrack Forensics [article]
Getting Started with Predictive Analytics

Packt
04 Jul 2017
32 min read
In this article by Ralph Winters, the author of the book Practical Predictive Analytics we will explore the idea of how to start with predictive analysis. "In God we trust, all other must bring Data" – Deming (For more resources related to this topic, see here.) I enjoy explaining Predictive Analytics to people because it is based upon a simple concept:  Predicting the probability of future events based upon historical data. Its history may date back to at least 650 BC. Some early examples include the Babylonians, who tried to predict short term weather changes based cloud appearances and haloes Medicine also has a long history of a need to classify diseases.  The Babylonian king Adad-apla-iddina decreed that medical records be collected to form the “Diagnostic Handbook”. Some “predictions” in this corpus list treatments based on the number of days the patient had been sick, and their pulse rate. One of the first instances of bioinformatics! In later times, specialized predictive analytics were developed at the onset of the Insurance underwriting industries. This was used as a way to predict the risk associated with insuring Marine Vessels. At about the same time, Life Insurance companies began predicting the age that a person would live in order to set the most appropriate premium rates. [i]Although the idea of prediction always seemed to be rooted early in humans’ ability to want to understand and classify, it was not until the 20th century, and the advent of modern computing that it really took hold. In addition to aiding the US government in the 1940 with breaking the code, Alan Turing also worked on the initial computer chess algorithms which pitted man vs. machine.  Monte Carlo simulation methods originated as part of the Manhattan project, where mainframe computers crunched numbers for days in order to determine the probability of nuclear attacks. In the 1950’s Operation Research theory developed, in which one could optimize the shortest distance between two points. To this day, these techniques are used in logistics by companies such as UPS and Amazon. Non mathematicians have also gotten into the act.  In the 1970’s, Cardiologist Lee Goldman (who worked aboard a submarine) spend years developing a decision tree which did this efficiently.  This helped the staff determine whether or not the submarine needed to resurface in order to help the chest pain sufferer! What many of these examples had in common was that history was used to predict the future.  Along with prediction, came understanding of cause and effect and how the various parts of the problem were interrelated.  Discovery and insight came about through methodology and adhering to the scientific method. Most importantly, the solutions came about in order to find solutions to important, and often practical problems of the times.  That is what made them unique. Predictive Analytics adopted by some many different industries We have come a long way from then, and Practical Analytics solutions have furthered growth in so many different industries.  The internet has had a profound effect on this; it has enabled every click to be stored and analyzed. More data is being collected and stored, some with very little effort. That in itself has enabled more industries to enter Predictive Analytics. Marketing has always been concerned with customer acquisition and retention, and has developed predictive models involving various promotional offers and customer touch points, all geared to keeping customers and acquiring new ones.  
This is very pronounced in certain industries, such as wireless and online shopping cards, in which customers are always searching for the best deal. Specifically, advanced analytics can help answer questions like "If I offer a customer 10% off with free shipping, will that yield more revenue than 15% off with no free shipping?".  The 360-degree view of the customer has expanded the number of ways one can engage with the customer, therefore enabling marketing mix and attribution modeling to become increasingly important.  Location based devices have enabled marketing predictive applications to incorporate real time data to issue recommendation to the customer while in the store. Predictive Analytics in Healthcare has its roots in clinical trials, which uses carefully selected samples to test the efficacy of drugs and treatments.  However, healthcare has been going beyond this. With the advent of sensors, data can be incorporated into predictive analytics to monitor patients with critical illness, and to send alerts to the patient when he is at risk. Healthcare companies can now predict which individual patients will comply with courses of treatment advocated by health providers.  This will send early warning signs to all parties, which will prevent future complications, as well as lower the total costs of treatment. Other examples can be found in just about every other industry.  Here are just a few: Finance:    Fraud detection is a huge area. Financial institutions can monitor clients internal and external transactions for fraud, through pattern recognition, and then alert a customer concerning suspicious activity.    Wall Street program trading. Trading algorithms will predict intraday highs and lows, and will decide when to buy and sell securities. Sports Management    Sports management are able to predict which sports events will yield the greatest attendance and institute variable ticket pricing based upon audience interest.     In baseball, a pitchers’ entire game can be recorded and then digitally analyzed. Sensors can also be attached to their arm, to alert when future injury might occur Higher Education    Colleges can predict how many, and which kind of students are likely to attend the next semester, and be able to plan resources accordingly.    Time based assessments of online modules can enable professors to identify students’ potential problems areas, and tailor individual instruction. Government    Federal and State Governments have embraced the open data concept, and have made more data available to the public. This has empowered “Citizen Data Scientists” to help solve critical social and government problems.    The potential use of using the data for the purpose of emergency service, traffic safety, and healthcare use is overwhelmingly positive. Although these industries can be quite different, the goals of predictive analytics are typically implement to increase revenue, decrease costs, or alter outcomes for the better. Skills and Roles which are important in Predictive Analytics So what skills do you need to be successful in Predictive Analytics? I believe that there are 3 basic skills that are needed: Algorithmic/Statistical/programming skills -These are the actual technical skills needed to implement a technical solution to a problem. I bundle these all together since these skills are typically used in tandem. Will it be a purely statistical solution, or will there need to be a bit of programming thrown in to customize an algorithm, and clean the data?  
There are always multiple ways of doing the same task and it will be up to you, the predictive modeler to determine how it is to be done. Business skills –These are the skills needed for communicating thoughts and ideas among groups of all of the interested parties.  Business and Data Analysts who have worked in certain industries for long periods of time, and know their business very well, are increasingly being called upon to participate in Predictive Analytics projects.  Data Science is becoming a team sport, and most projects include working with others in the organization, Summarizing findings, and having good presentation and documentation skills are important.  You will often hear the term ‘Domain Knowledge’ associated with this, since it is always valuable to know the inner workings of the industry you are working in.  If you do not have the time or inclination to learn all about the inner workings of the problem at hand yourself, partner with someone who does. Data Storage / ETL skills:   This can refer to specialized knowledge regarding extracting data, and storing it in a relational, or non-relational NoSQL data store. Historically, these tasks were handled exclusively within a data warehouse.  But now that the age of Big Data is upon us, specialists have emerged who understand the intricacies of data storage, and the best way to organize it. Related Job skills and terms Along with the term Predictive Analytics, here are some terms which are very much related: Predictive Modeling:  This specifically means using a Mathematical/statistical model to predict the likelihood of a dependent or Target Variable Artificial Intelligence:  A broader term for how machines are able to rationalize and solve problems.  AI’s early days were rooted in Neural Networks Machine Learning- A subset of Artificial Intelligence. Specifically deals with how a machine learns automatically from data, usually to try to replicate human decision making or to best it. At this point, everyone knows about Watson, who beat two human opponents in “Jeopardy” [ii] Data Science - Data Science encompasses Predictive Analytics but also adds algorithmic development via coding, and good presentation skills via visualization. Data Engineering - Data Engineering concentrates on data extract and data preparation processes, which allow raw data to be transformed into a form suitable for analytics. A knowledge of system architecture is important. The Data Engineer will typically produce the data to be used by the Predictive Analysts (or Data Scientists) Data Analyst/Business Analyst/Domain Expert - This is an umbrella term for someone who is well versed in the way the business at hand works, and is an invaluable person to learn from in terms of what may have meaning, and what may not Statistics – The classical form of inference, typically done via hypothesis testing. Predictive Analytics Software Originally predictive analytics was performed by hand, by statisticians on mainframe computers using a progression of various language such as FORTRAN etc.  Some of these languages are still very much in use today.  FORTRAN, for example, is still one of the fasting performing languages around, and operators with very little memory. Nowadays, there are some many choices on which software to use, and many loyalists remain true to their chosen software.  The reality is, that for solving a specific type of predictive analytics problem, there exists a certain amount of overlap, and certainly the goal is the same. 
Once you get a hang of the methodologies used for predictive analytics in one software packages, it should be fairly easy to translate your skills to another package. Open Source Software Open source emphasis agile development, and community sharing.  Of course, open source software is free, but free must also be balance in the context of TCO (Total Cost of Ownership) R The R language is derived from the "S" language which was developed in the 1970’s.  However, the R language has grown beyond the original core packages to become an extremely viable environment for predictive analytics. Although R was developed by statisticians for statisticians, it has come a long way from its early days.  The strength of R comes from its 'package' system, which allows specialized or enhanced functionality to be developed and 'linked' to the core system. Although the original R system was sufficient for statistics and data mining, an important goal of R was to have its system enhanced via user written contributed packages.  As of this writing, the R system contains more than 8,000 packages.  Some are of excellent quality, and some are of dubious quality.  Therefore, the goal is to find the truly useful packages that add the most value.  Most, if not all of the R packages in use, address most of the common predictive analytics tasks that you will encounter.  If you come across a task that does not fit into any category, chances are good that someone in the R community has done something similar.  And of course, there is always a chance that someone is developing a package to do exactly what you want it to do.  That person could be eventually be you!. Closed Source Software Closed Source Software such as SAS and SPSS were on the forefront of predictive analytics, and have continued to this day to extend their reach beyond the traditional realm of statistics and machine learning.  Closed source software emphasis stability, better support, and security, with better memory management, which are important factors for some companies.  There is much debate nowadays regarding which one is 'better'.  My prediction is that they both will coexist peacefully, with one not replacing the other.  Data sharing and common API's will become more common.  Each has its place within the data architecture and ecosystem is deemed correct for a company.  Each company will emphasis certain factors, and both open and closed software systems and constantly improving themselves. Other helpful tools Man does not live by bread alone, so it would behoove you to learn additional tools in addition to R, so as to advance your analytic skills. SQL - SQL is a valuable tool to know, regardless of which language/package/environment you choose to work in. Virtually every analytics tool will have a SQL interface, and a knowledge of how to optimize SQL queries will definitely speed up your productivity, especially if you are doing a lot of data extraction directly from a SQL database. Today’s common thought is to do as much preprocessing as possible within the database, so if you will be doing a lot of extracting from databases like MySQL, Postgre, Oracle, or Teradata, it will be a good thing to learn how queries are optimized within their native framework. In the R language, there are several SQL packages that are useful for interfacing with various external databases.  We will be using SQLDF which is a popular R package for interfacing with R dataframes.  
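To illustrate the idea, here is a minimal sketch of using the sqldf package to run a SQL query directly against an R data frame. The data frame and column names below are made up for the example; the point is simply that familiar SQL syntax can be applied to in-memory R data.

# Minimal sketch: querying an R data frame with SQL via the sqldf package.
# The `customers` data frame is illustrative only.
install.packages("sqldf")   # one-time install
library(sqldf)

customers <- data.frame(
  id    = 1:4,
  state = c("NY", "NJ", "NY", "CA"),
  spend = c(120, 80, 200, 50)
)

# Standard SQL against the data frame, returned as another data frame
sqldf("SELECT state, COUNT(*) AS n, AVG(spend) AS avg_spend
       FROM customers
       GROUP BY state
       ORDER BY avg_spend DESC")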
There are other packages which are specifically tailored for the specific database you will be working with Web Extraction Tools –Not every data source will originate from a data warehouse. Knowledge of API’s which extract data from the internet will be valuable to know. Some popular tools include Curl, and Jsonlite. Spreadsheets.  Despite their problems, spreadsheets are often the fastest way to do quick data analysis, and more importantly, enable them to share your results with others!  R offers several interface to spreadsheets, but again, learning standalone spreadsheet skills like PivotTables, and VBA will give you an advantage if you work for corporations in which these skills are heavily used. Data Visualization tools: Data Visualization tools are great for adding impact to an analysis, andfor concisely encapsulating complex information.  Native R visualization tools are great, but not every company will be using R.  Learn some third party visualization tools such as D3.js, Google Charts, Qlikview, or Tableau Big data Spark, Hadoop, NoSQL Database:  It is becoming increasingly important to know a little bit about these technologies, at least from the viewpoint of having to extract and analyze data which resides within these frameworks. Many software packages have API’s which talk directly to Hadoop and can run predictive analytics directly within the native environment, or extract data and perform the analytics locally. After you are past the basics Given that the Predictive Analytics space is so huge, once you are past the basics, ask yourself what area of Predictive analytics really interests you, and what you would like to specialize in.  Learning all you can about everything concerning Predictive Analytics is good at the beginning, but ultimately you will be called upon because you are an expert in certain industries or techniques. This could be research, algorithmic development, or even for managing analytics teams. But, as general guidance, if you are involved in, or are oriented towards data the analytics or research portion of data science, I would suggest that you concentrate on data mining methodologies and specific data modeling techniques which are heavily prevalent in the specific industries that interest you.  For example, logistic regression is heavily used in the insurance industry, but social network analysis is not. Economic research is geared towards time-series analysis, but not so much cluster analysis. If you are involved more on the data engineering side, concentrate more on data cleaning, being able to integrate various data sources, and the tools needed to accomplish this.  If you are a manager, concentrate on model development, testing and control, metadata, and presenting results to upper management in order to demonstrate value. Of course, predictive analytics is becoming more of a team sport, rather than a solo endeavor, and the Data Science team is very much alive.  There is a lot that has been written about the components of a Data Science team, much of it which can be reduced to the 3 basic skills that I outlined earlier. Two ways to look at predictive analytics Depending upon how you intend to approach a particular problem, look at how two different analytical mindsets can affect the predictive analytics process. Minimize prediction error goal: This is a very commonly used case within machine learning. The initial goal is to predict using the appropriate algorithms in order to minimize the prediction error. 
If done incorrectly, an algorithm will ultimately fail and it will need to be continually optimized to come up with the “new” best algorithm. If this is performed mechanically without regard to understanding the model, this will certainly result in failed outcomes.  Certain models, especially over optimized ones with many variables can have a very high prediction rate, but be unstable in a variety of ways. If one does not have an understanding of the model, it can be difficult to react to changes in the data inputs.  Understanding model goal: This came out of the scientific method and is tied closely with the concept of hypothesis testing.  This can be done in certain kinds of models, such as regression and decision trees, and is more difficult in other kinds of models such as SVM and Neural Networks.  In the understanding model paradigm, understanding causation or impact becomes more important than optimizing correlations. Typically, “Understanding” models have a lower prediction rate, but have the advantage of knowing more about the causations of the individual parts of the model, and how they are related. E.g. industries which rely on understanding human behavior emphasize model understanding goals.  A limitation to this orientation is that we might tend to discard results that are not immediately understood Of course the above examples illustrate two disparate approaches. Combination models, which use the best of both worlds should be the ones we should strive for.  A model which has an acceptable prediction error, is stable over, and is simple enough to understand. You will learn later that is this related to Bias/Variance Tradeoff R Installation R Installation is typically done by downloading the software directly from the CRAN site Navigate to https://cran.r-project.org/ Install the version of R appropriate for your operating system Alternate ways of exploring R Although installing R directly from the CRAN site is the way most people will proceed, I wanted to mention some alternative R installation methods. These methods are often good in instances when you are not always at your computer. Virtual Environment: Here are few methods to install R in the virtual environment: Virtual Box or VMware- Virtual environments are good for setting up protected environments and loading preinstalled operating systems and packages.  Some advantages are that they are good for isolating testing areas, and when you do not which to take up additional space on your own machine. Docker – Docker resembles a Virtual Machine, but is a bit more lightweight since it does not emulate an entire operating system, but emulates only the needed processes.  (See Rocker, Docker container) Cloud Based- Here are few methods to install R in the cloud based environment: AWS/Azure – These are Cloud Based Environments.  Advantages to this are similar to the reasons as virtual box, with the additional capability to run with very large datasets and with more memory.  Not free, but both AWS and Azure offer free tiers. Web Based - Here are few methods to install R in the web based environment: Interested in running R on the Web?  These sites are good for trying out quick analysis etc. R-Fiddle is a good choice, however there are other including: R-Web, ideaone.com, Jupyter, DataJoy, tutorialspoint, and Anaconda Cloud are just a few examples. Command Line – If you spend most of your time in a text editor, try ESS (Emacs Speaks Statistics) How is a predictive analytics project organized? 
After you install R on your own machine, I would give some thought about how you want to organize your data, code, documentation, etc. There probably be many different kinds of projects that you will need to set up, all ranging from exploratory analysis, to full production grade implementations.  However, most projects will be somewhere in the middle, i.e. those projects which ask a specific question or a series of related questions.  Whatever their purpose, each project you will work on will deserve their own project folder or directory. Set up your Project and Subfolders We will start by creating folders for our environment. Create a sub directory named “PracticalPredictiveAnalytics” somewhere on your computer. We will be referring to it by this name throughout this book. Often project start with 3 sub folders which roughly correspond with 1) Data Source, 2) Code Generated Outputs, and 3) The Code itself (in this case R) Create 3 subdirectories under this Project Data, Outputs, and R. The R directory will hold all of our data prep code, algorithms etc.  The Data directory will contain our raw data sources, and the Output directory will contain anything generated by the code.  This can be done natively within your own environment, e.g. you can use Windows Explorer to create these folders. Some important points to remember about constructing projects It is never a good idea to ‘boil the ocean’, or try to answer too many questions at once. Remember, predictive analytics is an iterative process. Another trap that people fall into is not having their project reproducible.  Nothing is worse than to develop some analytics on a set of data, and then backtrack, and oops! Different results. When organizing code, try to write code as building block, which can be reusable.  For R, write code liberally as functions. Assume that anything concerning requirements, data, and outputs will change, and be prepared. Considering the dynamic nature of the R language. Changes in versions, and packages could all change your analysis in various ways, so it is important to keep code and data in sync, by using separate folders for the different levels of code, data, etc.  or by using version management package use as subversion, git, or cvs GUI’s R, like many languages and knowledge discovery systems started from the command line (one reason to learn Linux), and is still used by many.  However, predictive analysts tend to prefer Graphic User Interfaces, and there are many choices available for each of the 3 different operating systems.   Each of them have their strengths and weakness, and of course there is always a matter of preference.  Memory is always a consideration with R, and if that is of critical concern to you, you might want to go with a simpler GUI, like the one built in with R. If you want full control, and you want to add some productive tools, you could choose RStudio, which is a full blown GUI and allows you to implement version control repositories, and has nice features like code completion.   RCmdr, and Rattle’s unique features are that they offer menus which allow guided point and click commands for common statistical and data mining tasks.  They are always both code generators.  This is a good for learning, and you can learn by looking at the way code is generated. Both RCmdr and RStudio offer GUI's which are compatible with Windows, Apple, and Linux operator systems, so those are the ones I will use to demonstrate examples in this book.  
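Before moving on, it can help to see the working-directory commands from the console section above in one place. This is just a sketch; the path shown is an example and should be replaced with wherever you created the PracticalPredictiveAnalytics project on your own machine.

# Sketch: checking and changing the working directory in the R console.
# The path below is an example; substitute your own project location.
getwd()                                            # show the current working directory
dir()                                              # list everything in that directory
setwd("C:/Projects/PracticalPredictiveAnalytics")  # switch to the project folder
getwd()                                            # confirm the change took effect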
But bear in mind that they are only user interfaces, and not R proper, so, it should be easy enough to paste code examples into other GUI’s and decide for yourself which ones you like.   Getting started with RStudio After R installation has completed, download and install the RStudio executable appropriate for your operating system Click the RStudio Icon to bring up the program:  The program initially starts with 3 tiled window panes, as shown below. Before we begin to do any actual coding, we will want to set up a new Project. Create a new project by following these steps: Identify the Menu Bar, above the icons at the top left of the screen. Click “File” and then “New Project”   Select “Create project from Directory” Select “Empty Project” Name the directory “PracticalPredictiveAnalytics” Then Click the Browse button to select your preferred directory. This will be the directory that you created earlier Click “Create Project” to complete The R Console  Now that we have created a project, let’s take a look at of the R Console Window.  Click on the window marked “Console” and perform the following steps: Enter getwd() and press enter – That should echo back the current working directory Enter dir() – That will give you a list of everything in the current working directory The getwd() command is very important since it will always tell you which directory you are in. Sometimes you will need to switch directories within the same project or even to another project. The command you will use is setwd().  You will supply the directory that you want to switch to, all contained within the parentheses. This is a situation we will come across later. We will not change anything right now.  The point of this, is that you should always be aware of what your current working directory is. The Script Window The script window is where all of the R Code is written.  You can have several script windows open, all at once. Press Ctrl + Shift + N  to create a new R script.  Alternatively, you can go through the menu system by selecting File/New File/R Script.   A new blank script window will appear with the name “Untitled1” Our First Predictive Model Now that all of the preliminary things are out of the way, we will code our first extremely simple predictive model. Our first R script is a simple two variable regression model which predicts women’s height based upon weight.  The data set we will use is already built into the R package system, and is not necessary to load externally.   For quick illustration of techniques, I will sometimes use sample data contained within specific R packages to demonstrate. Paste the following code into the “Untitled1” scripts that was just created: require(graphics) data(women) head(women) utils::View(women) plot(women$height,women$weight) Click Ctrl+Shift+Enter to run the entire code.  The display should change to something similar as displayed below. Code Description What you have actually done is: Load the “Women” data object. The data() function will load the specified data object into memory. In our case, data(women)statement says load the 'women' dataframe into memory. Display the raw data in three different ways: utils::View(women) – This will visually display the dataframe. Although this is part of the actual R script, viewing a dataframe is a very common task, and is often issued directly as a command via the R Console. As you can see in the figure above, the “Women” data frame has 15 rows, and 2 columns named height and weight. 
plot(women$height,women$weight) – This uses the native R plot function, which plots the values of the two variables against each other. It is usually the first step one takes to begin to understand the relationship between two variables. As you can see, the relationship is very linear.

head(women) – This displays the first N rows of the women data frame to the console. If you want no more than a certain number of rows, add that as the second argument of the function, e.g. head(women, 99) will display up to 99 rows in the console. The tail() function works similarly, but displays the last rows of data.

The very first statement in the code, require, is just a way of saying that R needs a specific package to run. In this case, require(graphics) specifies that the graphics package is needed for the analysis, and it will load it into memory. If it is not available, you will get an error message. However, graphics is a base package and should be available.

To save this script, press Ctrl-S (File Save), navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_DataSource.

Your 2nd script

Create another R script by pressing Ctrl + Shift + N. A new blank script window will appear with the name "Untitled2". Paste the following into the new script window:

lm_output <- lm(women$height ~ women$weight)
summary(lm_output)
prediction <- predict(lm_output)
error <- women$height-prediction
plot(women$height,error)

Press Ctrl+Shift+Enter to run the entire code. The display should change to something similar to what is displayed below.

Code Description

Here are some notes and explanations for the script code that you have just run:

lm() function: This runs a simple linear regression using the lm() function, which predicts a woman's height based upon the value of her weight. In statistical parlance, you will be 'regressing' height on weight. The line of code which accomplishes this is:

lm_output <- lm(women$height ~ women$weight)

There are two operators that you will become very familiar with when running predictive models in R.

The ~ operator (also called the tilde) is a shorthand way of separating what you want to predict from what you are using to predict. This is expressed in formula syntax. What you are predicting (the dependent or target variable) is usually on the left side of the formula, and the predictors (independent variables, features) are on the right side. Here the dependent variable is height and the independent variable is weight; to improve readability, I have specified them explicitly by using the data frame name together with the column name, i.e. women$height and women$weight.

The <- operator (also called assignment) says to assign whatever is produced on the right side to the object on the left side. This will always create a new object, or replace an existing one, that you can further display or manipulate. In this case we are creating a new object called lm_output, which is created using the function lm(), which builds a linear model based on the formula contained within the parentheses.

Note that the execution of this line does not produce any displayed output. You can see whether the line was executed by checking the console. If there is any problem with running the line (or any line for that matter), you will see an error message in the console.
summary(lm_output): The following statement displays some important summary information about the object lm_output and writes to output to the R Console as pictured above summary(lm_output) The results will appear in the Console window as pictured in the figure above. Look at the lines market (Intercept), and women$weight which appear under the Coefficients line in the console.  The Estimate Column shows the formula needed to derive height from weight.  Like any linear regression formula, it includes coefficients for each independent variable (in our case only one variable), as well as an intercept. For our example the English rule would be "Multiply weight by 0.2872 and add 25.7235 to obtain height". Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 25.723456 1.043746 24.64 2.68e-12 *** women$weight 0.287249 0.007588 37.85 1.09e-14 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.44 on 13 degrees of freedom Multiple R-squared: 0.991, Adjusted R-squared: 0.9903 F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14 We have already assigned the output of the lm() function to the lm_output object. Let’s apply another function to lm_output as well. The predict() function “reads” the output of the lm function and predicts (or scores the value), based upon the linear regression equation.  In the code we have assigned the output of this function to a new object named "prediction”. Switch over to the console area, and type “prediction” to see the predicted values for the 15 women. The following should appear in the console. > prediction 1 2 3 4 5 6 7 58.75712 59.33162 60.19336 61.05511 61.91686 62.77861 63.64035 8 9 10 11 12 13 14 64.50210 65.65110 66.51285 67.66184 68.81084 69.95984 71.39608 15 72.83233 There are 15 predictions.  Just to verify that we have one for each of our original observations we will use the nrow() function to count the number of rows. At the command prompt in the console area, enter the command: nrow(women) The following should appear: >nrow(women) [1] 15 The error object is a vector that was computed by taking the difference between the predicted value of height and the actual height.  These are also known as the residual errors, or just residuals. Since the error object is a vector, you cannot use the nrows() function to get its size.   But you can use the length() function: >length(error) [1] 15 In all of the above cases, the counts all compute as 15, so all is good. plot(women$height,error) :This plots the predicted height vs. the residuals.  It shows how much the prediction was ‘off’ from the original value.  You can see that the errors show a non-random pattern.  This is not good.  In an ideal regression model, you expect to see prediction errors randomly scatter around the 0 point on the why axis. Some important points to be made regarding this first example: The R-Square for this model is artificially high. Regression is often used in an exploratory fashion to explore the relationship between height and weight.  This does not mean a causal one.  As we all know, weight is caused by many other factors, and it is expected that taller people will be heavier. A predictive modeler who is examining the relationship between height and weight would want probably want to introduce additional variables into the model at the expense of a lower R-Square. 
R-Squares can be deceiving, especially when they are artificially high After you are done, press Ctrl-S (File Save), navigate to the PracticalPredictiveAnalytics/R folder that was created, and name it Chapter1_LinearRegression Installing a package Sometimes the amount of information output by statistic packages can be overwhelming. Sometime we want to reduce the amount of output and reformat it so it is easier on the eyes. Fortunately, there is an R package which reformats and simplifies some of the more important statistics. One package I will be using is named “stargazer”. Create another R script by Ctrl + Shift + N  to create a new R script.  Enter the following lines and then Press Ctrl+Shift+Enter to run the entire script.  install.packages("stargazer") library(stargazer) stargazer(lm_output, title="Lm Regression on Height", type="text") After the script has been run, the following should appear in the Console: Code Description install.packages("stargazer") This line will install the package to the default package directory on your machine.  Make sure you choose a CRAN mirror before you download. library(stargazer) This line loads the stargazer package stargazer(lm_output, title="Lm Regression on Height", type="text") The reformatted results will appear in the R Console. As you can see, the output written to the console is much cleaner and easier to read  After you are done, press Ctrl-S (File Save), navigate to the PracticalPredictiveAnalytics/Outputs folder that was created, and name it Chapter1_LinearRegressionOutput Installing other packages The rest of the book will concentrate on what I think are the core packages used for predictive modeling. There are always new packages coming out. I tend to favor packages which have been on CRAN for a long time and have large user base. When installing something new, I will try to reference the results against other packages which do similar things.  Speed is another reason to consider adopting a new package. Summary In this article we have learned a little about what predictive analytics is and how they can be used in various industries. We learned some things about data, and how they can be organized in projects.  Finally, we installed RStudio, and ran a simple linear regression, and installed and used our first package. We learned that it is always good practice to examine data after it has been brought into memory, and a lot can be learned from simply displaying and plotting the data. Resources for Article: Further resources on this subject: Stata as Data Analytics Software [article] Metric Analytics with Metricbeat [article] Big Data Analytics [article]

Writing Your First Cucumber Appium Test

Packt
27 Jun 2017
12 min read
In this article, by Nishant Verma, author of the book Mobile Test Automation with Appium, you will learn about creating a new Cucumber, Appium Java project in IntelliJ. Next, you will learn to write a sample feature and automate it, thereby learning how to start an Appium server session with an app using the Appium app, find locators using the Appium inspector, and write Java classes for each step implementation in the feature file. We will also discuss how to write a test for the mobile web and use the Chrome Developer Tools to find the locators. Let's get started! In this article, we will discuss the following topics:

Create a sample Java project (using Gradle)
Introduction to Cucumber
Writing the first Appium test
Starting an Appium server session and finding locators
Write a test for mobile web

(For more resources related to this topic, see here.)

Create a sample Java project (using Gradle)

Let's create a sample Appium Java project in IntelliJ. The below steps will help you do the same:

Launch IntelliJ and click Create New Project on the Welcome screen.
On the New Project screen, select Gradle from the left pane. Project SDK should get populated with the Java version. Click on Next, enter the GroupId as com.test and the ArtifactId as HelloAppium. Version will already be populated. Click on Next.
Check the option Use Auto-Import and make sure Gradle JVM is populated. Click on Next.
The Project name field will be auto populated with what you gave as the ArtifactId. Choose a Project location and click on Finish.
IntelliJ will run the background task (Gradle build), which can be seen in the status bar. We should now have a project created with the default structure.
Open the build.gradle file. You will see a message as shown below; click on Ok, apply suggestion!
Enter the below two lines in build.gradle. This adds Appium and cucumber-jvm under dependencies:

compile group: 'info.cukes', name: 'cucumber-java', version: '1.2.5'
compile group: 'io.appium', name: 'java-client', version: '5.0.0-BETA6'

This is how the Gradle file should look:

group 'com.test'
version '1.0-SNAPSHOT'

apply plugin: 'java'

sourceCompatibility = 1.5

repositories {
    mavenCentral()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.11'
    compile group: 'info.cukes', name: 'cucumber-java', version: '1.2.5'
    compile group: 'io.appium', name: 'java-client', version: '5.0.0-BETA6'
}

Once done, navigate to View -> Tools Window -> Gradle and click on the Refresh all gradle projects icon. This pulls all the dependencies into External Libraries.
Navigate to Preferences -> Plugins, search for Cucumber for Java and click on Install (if it's not previously installed).
Repeat the above step for Gherkin and install it as well. Once done, restart IntelliJ if it prompts you to.

We are now ready to write our first sample feature file, but before that let's get a brief understanding of Cucumber.

Introduction to Cucumber

Cucumber is a test framework which supports behaviour-driven development (or BDD). The core idea behind BDD is a domain-specific language (known as DSL), where the tests are written in normal English, expressing how the application or system has to behave. A DSL test is an executable test, which starts with a known state, performs some actions and verifies the expected state. For example:

Feature: Registration with Facebook/Google

Scenario: Registration Flow Validation via App
  As a user I should be able to see Facebook/Google button when I try to register myself in Quikr.
  Given I launch the app
  When I click on Register
  Then I should see register with Facebook and Google

Dan North (creator of BDD) defined behaviour-driven development in 2009 as:

BDD is a second-generation, outside-in, pull-based, multiple-stakeholder, multiple-scale, high-automation, agile methodology. It describes a cycle of interactions with well-defined outputs, resulting in the delivery of working, tested software that matters.

Cucumber feature files serve as living documentation and can be implemented in many languages. It was first implemented in Ruby and later extended to Java. Some of the basic concepts of Cucumber are:

The core of Cucumber is text files called features, which contain scenarios. These scenarios express the system or application behaviour.
Scenarios comprise steps, which are written following the syntax of Gherkin.
Each step has a step implementation, which is the code behind that interacts with the application.

So in the above example, Feature, Scenario, Given, When, Then are keywords.

Feature: Cucumber tests are grouped into features. We use this name because we want engineers to describe the features that a user will be able to use.
Scenario: A scenario expresses the behaviour we want; each feature contains several scenarios. Each scenario is an example of how the system should behave in a particular situation. The expected behaviour of the feature is the sum of its scenarios; for a feature to pass, all scenarios must pass.
Test Runner: There are different ways to run the feature file; we will be using the JUnit runner initially and then move on to the Gradle command for command-line execution (a minimal runner class is sketched at the end of this introduction).

So I am hoping now we have a brief idea of what Cucumber is. Further details can be read on their site (https://cucumber.io/).
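As a reference for the JUnit runner mentioned above, a minimal runner class might look like the following sketch. The class name, the features path, and the glue package are assumptions that match the project layout created in the next section, so adjust them to your own structure. Note that running it also requires the cucumber-junit dependency (group 'info.cukes', name 'cucumber-junit', version '1.2.5') alongside cucumber-java.

package runner;

import cucumber.api.CucumberOptions;
import cucumber.api.junit.Cucumber;
import org.junit.runner.RunWith;

// Running this empty class with JUnit picks up the feature files under
// src/test/java/features and binds them to the step definitions in the
// "steps" package.
@RunWith(Cucumber.class)
@CucumberOptions(
        features = "src/test/java/features",   // where the .feature files live (assumption)
        glue = "steps",                         // package containing the step definitions (assumption)
        plugin = {"pretty"}                     // readable console output
)
public class RunCucumberTest {
}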
In the coming section, we will create a feature file, write a scenario, implement the code behind, and execute it.

Writing the first Appium test

Till now we have created a sample Java project and added the Appium dependency. Next we need to add a Cucumber feature file and implement the code behind. Let's start:

Under the project folder, create the directory structure src/test/java/features.
Right click on the features folder, select New -> File and enter the name Sample.feature.
In the Sample.feature file, write a scenario as shown below, which is about logging in using Google:

Feature: Hello World

Scenario: Registration Flow Validation via App
  As a user I should be able to see my google account when I try to register myself in Quikr.
  When I launch Quikr app
  And I choose to log in using Google
  Then I see account picker screen with my email address "testemail@gmail.com"

Right click on the java folder in IntelliJ, select New -> Package and enter the name steps.
The next step is to implement the Cucumber steps. Click on the first line in the Sample.feature file, When I launch Quikr app, press Alt+Enter, and select the option Create step definition. It will present you with a pop-up to enter File name, File location and File type. Enter the below values:

File name: HomePageSteps
File location: browse to the steps folder created above
File type: Java

The idea is that the steps belong to a page, and each page typically has its own step implementation class. Once you click on OK, it will create a sample template in the HomePageSteps class file. Now we need to implement these methods and write the code behind to launch the Quikr app on the emulator.
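The generated template typically looks something like the following sketch; the exact method names and regular expressions produced by the IntelliJ Cucumber plugin may differ slightly from what is shown here.

package steps;

import cucumber.api.PendingException;
import cucumber.api.java.en.And;
import cucumber.api.java.en.Then;
import cucumber.api.java.en.When;

public class HomePageSteps {

    @When("^I launch Quikr app$")
    public void iLaunchQuikrApp() throws Throwable {
        // Write code here that turns the phrase above into concrete actions
        throw new PendingException();
    }

    @And("^I choose to log in using Google$")
    public void iChooseToLogInUsingGoogle() throws Throwable {
        throw new PendingException();
    }

    @Then("^I see account picker screen with my email address \"([^\"]*)\"$")
    public void iSeeAccountPickerScreenWithMyEmailAddress(String email) throws Throwable {
        throw new PendingException();
    }
}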
Starting an Appium server session and finding locators

The first thing we need to do is download a sample app (the Quikr apk in this case):

Download the Quikr app (version 9.16).
Create a folder named app under the HelloAppium project and copy the downloaded apk into that folder.
Launch the Appium GUI app.
Launch the emulator or connect your device (assuming you have Developer Options enabled).
On the Appium GUI app, click on the Android icon and select the below options:

App Path: browse to the .apk location under the app folder.
Platform Name: Android.
Automation Name: Appium.
Platform Version: select the version which matches the emulator from the dropdown; it allows you to edit the value.
Device Name: enter any string, for example Nexus6.

Once the above settings are done, click on the General Settings icon and choose the below mentioned settings. Once the setup is done, click on the icon again to close the pop-up:

Select Prelaunch Application
Select Strict Capabilities
Select Override Existing Sessions
Select Kill Processes Using Server Port Before Launch
Select New Command Timeout and enter the value 7200

Click on Launch. This starts the Appium server session. Once you click on Appium Inspector, it will install the app on the emulator and launch it. If you click on the Record button, it will generate the boilerplate code which has the Desired Capabilities respective to the run environment and app location.

We can copy the generated lines and put them into the code template generated for the step When I launch Quikr app. This is how the code should look after copying it into the method:

@When("^I launch Quikr app$")
public void iLaunchQuikrApp() throws Throwable {
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability("appium-version", "1.0");
    capabilities.setCapability("platformName", "Android");
    capabilities.setCapability("platformVersion", "5.1");
    capabilities.setCapability("deviceName", "Nexus6");
    capabilities.setCapability("app", "/Users/nishant/Development/HelloAppium/app/quikr.apk");
    AppiumDriver wd = new AppiumDriver(new URL("http://0.0.0.0:4723/wd/hub"), capabilities);
    wd.manage().timeouts().implicitlyWait(60, TimeUnit.SECONDS);
}

Now, the above code only sets the Desired Capabilities; the Appium server is yet to be started. For now, we can start it from outside, for example from the terminal (or command prompt), by running the command appium. We can close the Appium inspector and stop the Appium server by clicking on Stop in the Appium GUI app.

To run the above test, we need to do the following:

Start the Appium server via the command line (command: appium --session-override).
In IntelliJ, right click on the feature file and choose the option Run....

Since the scope of AppiumDriver is local to the method, we can refactor and extract appiumDriver as a field. To continue with the automation of the other steps, we can use the Appium inspector to find the element handles. We can launch the Appium inspector using the above mentioned steps, then click on the element whose locator we want to find out, as shown in the below mentioned screen. Once we have the locator, we can use the Appium API (as shown below) to click it:

appiumDriver.findElement(By.id("sign_in_button")).click();

This way we can implement the remaining steps.
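To illustrate, the second step of our scenario could be wired to that locator roughly as follows. The sign_in_button id is the one found via the inspector above; whether it identifies the Google log-in button in your build of the app is an assumption you should verify in the inspector. The snippet assumes the cucumber.api.java.en.And and org.openqa.selenium.By imports.

@And("^I choose to log in using Google$")
public void iChooseToLogInUsingGoogle() throws Throwable {
    // appiumDriver is the field extracted from the previous step implementation.
    // The id below was located with the Appium inspector; adjust it if your app
    // version exposes a different resource id.
    appiumDriver.findElement(By.id("sign_in_button")).click();
}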
Write a small test for mobile web

To automate a mobile web app, we don't need to install the app on the device. We need a browser and the app URL, which is sufficient to start the test automation. We can tweak the above written code by adding the desired capability browserName. We can write a similar scenario and make it mobile web specific:

Scenario: Registration Flow Validation via web
  As a user I want to verify that I get the option of choosing Facebook when I choose to register
  When I launch Quikr mobile web
  And I choose to register
  Then I should see an option to register using Facebook

So the method for mobile web would look like:

@When("^I launch Quikr mobile web$")
public void iLaunchQuikrMobileWeb() throws Throwable {
    DesiredCapabilities desiredCapabilities = new DesiredCapabilities();
    desiredCapabilities.setCapability("platformName", "Android");
    desiredCapabilities.setCapability("deviceName", "Nexus");
    desiredCapabilities.setCapability("browserName", "Browser");
    URL url = new URL("http://127.0.0.1:4723/wd/hub");
    appiumDriver = new AppiumDriver(url, desiredCapabilities);
    appiumDriver.get("http://m.quikr.com");
}

In the above code, we don't need platformVersion, but we do need a valid value for the browserName parameter. Possible values for browserName are:

Chrome: for the Chrome browser on Android
Safari: for the Safari browser on iOS
Browser: for the stock browser on Android

We can follow the same steps as above to run the test.

Finding locators in a mobile web app

To implement the remaining steps of the above mentioned feature, we need to find locators for the elements we want to interact with. Once the locators are found, we need to perform the desired operation, which could be click, send keys, and so on. Below mentioned are the steps which will help us find the locators for a mobile web app:

Launch the Chrome browser on your machine and navigate to the mobile site (in our case: http://m.quikr.com).
Select More Tools -> Developer Tools from the Chrome menu.
In the Developer Tools menu items, click on the Toggle device toolbar icon. Once done, the page will be displayed in a mobile layout format.
In order to find the locator of any UI element, click on the first icon of the dev tool bar and then click on the desired element. The HTML in the dev tool layout will change to highlight the selected element. Refer to the below screenshot which shows the same.
In the highlight panel on the right side, we can see the following properties: name=query and id=query. We can choose to use id and implement the step as:

appiumDriver.findElement(By.id("query")).click();

Using the above way, we can find the locators of the various elements we need to interact with and proceed with our test automation.

Summary

In this article, we briefly described how we would go about writing tests for a native app as well as for the mobile web. We discussed how to create a project in IntelliJ and write a sample feature file. We also learned how to start the Appium inspector and look for locators. We learned about the Chrome dev tools and how we can use them to find locators for the mobile web.

Resources for Article:

Further resources on this subject:

Appium Essentials [article]
Ensuring Five-star Rating in the MarketPlace [article]
Testing in Agile Development and the State of Agile Adoption [article]

Getting Started with Ansible 2

Packt
23 Jun 2017
5 min read
In this article, by Jonathan McAllister, author of the book Implementing DevOps with Ansible 2, we will learn what Ansible is, how users can leverage it, its architecture, and the key differentiators of Ansible from other configuration management tools. We will also see the organizations that were able to successfully leverage Ansible.

(For more resources related to this topic, see here.)

What is Ansible?

Ansible is a relatively new addition to the DevOps and configuration management space. Its simplicity, structured automation format and development paradigm have caught the eyes of both small and large corporations alike. Organizations as large as Twitter have successfully managed to leverage Ansible for highly scaled deployments and configuration management across thousands of servers simultaneously. And Twitter isn't the only organization that has managed to implement Ansible at scale; other well-known organizations that have successfully leveraged Ansible include Logitech, NASA, NEC, Microsoft and hundreds more. As it stands today, Ansible is in use by major players around the world, managing thousands of deployments and configuration management solutions worldwide.

Ansible's automation architecture

Ansible was created with an incredibly flexible and scalable automation engine. It allows users to leverage it in many diverse ways and can be conformed to be used in the way that best suits your specific needs. Since Ansible is agentless (meaning there is no permanently running daemon on the systems it manages or executes from), it can be used locally to control a single system (without any network connectivity), or leveraged to orchestrate and execute automation against many systems via a control server. In addition to the aforementioned architectures, Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically. This type of solution basically allows the Ansible user to bootstrap their hardware or infrastructure provisioning by running an Ansible playbook (or several). If you happen to be a Vagrant user, there are instructions for the HashiCorp Ansible provisioner located at https://www.vagrantup.com/docs/provisioning/ansible.html.

Ansible is open source, module based, pluggable, and agentless. These key differentiators from other configuration management solutions give Ansible a significant edge. Let's take a look at each of these differentiators in detail and see what they actually mean for Ansible developers and users:

Open source: It is no secret that successful open source solutions are usually extraordinarily feature rich. This is because instead of a simple 8-person (or even 100-person) engineering team, there are potentially thousands of developers. Each development and enhancement has been designed to fit a unique need. As a result, the end deliverable product provides the consumers of Ansible with a very well-rounded solution that can be adapted or leveraged in numerous ways.

Module based: Ansible has been developed with the intention of integrating with numerous other open and closed source software solutions. This means that Ansible is currently compatible with multiple flavors of Linux, Windows and cloud providers. Aside from its OS-level support, Ansible currently integrates with hundreds of other software solutions, including EC2, Jira, Jenkins, Bamboo, Microsoft Azure, DigitalOcean, Docker, Google and many, many more.
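To make the idea of modules concrete, here is a minimal sketch of a playbook that calls two of the standard modules (apt and service). The host group name and the package chosen are illustrative assumptions, not taken from the book:

---
# install_nginx.yml -- a minimal illustrative playbook
- hosts: webservers          # assumed inventory group
  become: yes                # escalate privileges for package/service changes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled at boot
      service:
        name: nginx
        state: started
        enabled: yes

Such a playbook could be run with ansible-playbook -i inventory install_nginx.yml; each task simply hands its parameters to the named module, which does the actual work on the target host.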
For a complete list of Ansible modules, please consult the official module support list located at http://docs.ansible.com/ansible/modules_by_category.html.

Agentless: One of the key differentiators that gives Ansible an edge over the competition is the fact that it is 100% agentless. This means there are no daemons that need to be installed on remote machines, no firewall ports that need to be opened (besides traditional SSH), no monitoring that needs to be performed on the remote machines, and no management that needs to be performed on the infrastructure fleet. In effect, this makes Ansible very self-sufficient.

Since Ansible can be implemented in a few different ways, the aim of this section is to highlight these options and help us get familiar with the architecture types that Ansible supports. Generally, the architecture of Ansible can be categorized into three distinct architecture types. These are described next.

Pluggable: While Ansible comes out of the box with a wide spectrum of software integration support, it is often a requirement to integrate the solution with a company's internal software or with a software solution that has not already been integrated into Ansible's robust playbook suite. The answer to such a requirement is to create a plugin-based solution for Ansible, thus providing the custom functionality necessary.

Summary

In this article, we discussed the architecture of Ansible and the key differentiators that set Ansible apart from other configuration management tools. We learnt that Ansible can also be leveraged via Vagrant or Docker to provision infrastructure automatically. We also saw that Ansible has been successfully leveraged by large organizations like Twitter, Microsoft, and many more.

Resources for Article:

Further resources on this subject:

Getting Started with Ansible [article]
Introduction to Ansible [article]
System Architecture and Design of Ansible [article]

Spatial Data

Packt
23 Jun 2017
12 min read
In this article by Dominik Mikiewicz, the author of the book Mastering PostGIS, we will look at exporting data from PostgreSQL/PostGIS to files or other data sources. Sharing data via the Web is no less important, but it has its own specific process. There may be different reasons for having to export data from a database, but certainly sharing it with others is among the most popular ones. Backing the data up or transferring it to other software packages for further processing are other common reasons for learning the export techniques.

(For more resources related to this topic, see here.)

In this article we'll have a closer look at the following:

Exporting data using \copy (and COPY)
Exporting vector data using pgsql2shp
Exporting vector data using ogr2ogr
Exporting data using GIS clients
Outputting rasters using GDAL
Outputting rasters using psql
Using the PostgreSQL backup functionality

We just do the steps the other way round compared to importing. In other words, this article may give you a bit of a déjà vu feeling.

Exporting data using \copy in psql

When we were importing data, we used the psql \copy ... FROM command to copy data from a file to a table. This time, we'll do it the other way round, from a table to a file, using the TO variant of the command. It can not only copy a full table, but also the result of a SELECT query, which means we can output filtered subsets of the source tables. Similarly to the method we used to import, we can execute \copy or COPY in different scenarios: we'll use psql in interactive and non-interactive mode, and we'll also do the very same thing in PgAdmin. It is worth remembering that the server-side COPY can only read/write files that can be accessed by the server instance, so usually files that reside on the same machine as the database server. For detailed information on COPY syntax and parameters, type:

\h copy

Exporting data in psql interactively

In order to export the data in interactive mode, we first need to connect to the database using psql:

psql -h localhost -p 5434 -U postgres

Then type the following:

\c mastering_postgis

Once connected, we can execute a simple command:

\copy data_import.earthquakes_csv TO earthquakes.csv WITH DELIMITER ';' CSV HEADER

The preceding command exported the data_import.earthquakes_csv table to a file named earthquakes.csv, with ';' as the column separator. A header in the form of column names has also been added to the beginning of the file. The output should be similar to the following:

COPY 50

Basically, the database told us how many records have been exported. The content of the exported file should exactly resemble the content of the table we exported from:

time;latitude;longitude;depth;mag;magtype;nst;gap;dmin;rms;net;id;updated;place;type;horizontalerror;deptherror;magerror;magnst;status;locationsource;magsource
2016-10-08 14:08:08.71+02;36.3902;-96.9601;5;2.9;mb_lg;;60;0.029;0.52;us;us20007csd;2016-10-08 14:27:58.372+02;15km WNW of Pawnee, Oklahoma;earthquake;1.3;1.9;0.1;26;reviewed;us;us

As mentioned, \copy can also output the results of a SELECT query. This means we can tailor the output to very specific needs. In the next example, we'll export data from a 'spatialized' earthquakes table, but the geometry will be converted to a WKT (well-known text) representation.
We'll also export only a subset of columns:

\copy (select id, ST_AsText(geom) FROM data_import.earthquakes_subset_with_geom) TO earthquakes_subset.csv WITH CSV DELIMITER '|' FORCE QUOTE * HEADER

Once again, the output just specifies the number of records exported:

COPY 50

The executed command exported only the id column and a WKT-encoded geometry column. The export force-wrapped the data in quote symbols, with a pipe (|) symbol used as the delimiter. The file has a header:

id|st_astext
"us20007csd"|"POINT(-96.9601 36.3902)"
"us20007csa"|"POINT(-98.7058 36.4314)"

Exporting data in psql non-interactively

If you're still in psql, you can execute a script by simply typing the following:

\i path/to/the/script.sql

For example:

\i code/psql_export.sql

The output will not surprise us, as it will simply state the number of records that were outputted:

COPY 50

If you happen to have already quit psql, the command-line equivalent of \i is -f, so the command should look like this:

psql -h localhost -p 5434 -U postgres -d mastering_postgis -f code/psql_export.sql

Not surprisingly, the output is once again the following:

COPY 50

Exporting data in PgAdmin

In PgAdmin, the command is COPY rather than \copy. The rest of the code remains the same. Another difference is that we need to use an absolute path, while in psql we can use paths relative to the directory we started psql in. So the first psql query 'translated' to the PgAdmin SQL version looks like this:

COPY data_import.earthquakes_csv TO 'F:\mastering_postgis\chapter06\earthquakes.csv' WITH DELIMITER ';' CSV HEADER

The second query looks like this:

COPY (select id, ST_AsText(geom) FROM data_import.earthquakes_subset_with_geom) TO 'F:\mastering_postgis\chapter06\earthquakes_subset.csv' WITH CSV DELIMITER '|' FORCE QUOTE * HEADER

Both produce a similar output, but this time it is logged in the 'Messages' tab of PgAdmin's query output pane:

Query returned successfully: 50 rows affected, 55 msec execution time.

It is worth remembering that COPY is executed as part of an SQL command, so it is effectively the DB server that tries to write to the file. Therefore, it may be the case that the server is not able to access a specified directory. If your DB server is on the same machine as the directory that you are trying to write to, relaxing the directory access permissions should help.

Exporting vector data using pgsql2shp

pgsql2shp is a command-line tool that can be used to output PostGIS data into shapefiles. Similarly to the outgoing COPY, it can either export a full table or the result of a query, which gives us flexibility when we only need a subset of data to be outputted and we do not want to either modify the source tables or create temporary, intermediate ones.

pgsql2shp command line

In order to get some help with the tool, just type the following in the console:

pgsql2shp

The general syntax for the tool is as follows:

pgsql2shp [<options>] <database> [<schema>.]<table>
pgsql2shp [<options>] <database> <query>

Shapefile is a format that is made up of a few files. The minimum set is SHP, SHX, and DBF. If PostGIS is able to determine the projection of the data, it will also export a PRJ file that will contain the SRS information, which should be understandable by software able to consume a shapefile. If a table does not have a geometry column, then only a DBF file, which is the equivalent of the table data, will be exported.
Let's export a full table first: pgsql2shp -h localhost -p 5434 -u postgres -f full_earthquakes_dataset mastering_postgis data_import.earthquakes_subset_with_geom The following output should be expected: Initializing... Done (postgis major version: 2). Output shape: Point Dumping: X [50 rows] Now let's do the same, but this time with the result of a query: pgsql2shp -h localhost -p 5434 -u postgres -f full_earthquakes_dataset mastering_postgis "select * from data_import.earthquakes_subset_with_geom limit 1" To avoid being prompted for a password, try providing it within the command via the -P switch. The output will be very similar to what we have already seen: Initializing... Done (postgis major version: 2). Output shape: Point Dumping: X [1 rows] In the data we previously imported, we do not have examples that would manifest shapefile limitations. It is worth knowing about them, though. You will find a decent description at https://en.wikipedia.org/wiki/Shapefile#Limitations. The most important ones are as follows: Column name length limit: The shapefile can only handle column names with a maximum length of 10 characters; pgsql2shp will not produce duplicate columns, though—if there were column names that would result in duplicates when truncated, then the tool will add a sequence number. Maximum field length: The maximum field length is 255; psql will simply truncate the data upon exporting. In order to demonstrate the preceding limitations, let's quickly create a test PostGIS dataset: Create a schema if an export does not exist: CREATE SCHEMA IF NOT EXISTS data_export; CREATE TABLE IF NOT EXISTS data_export.bad_bad_shp ( id character varying, "time" timestamp with time zone, depth numeric, mag numeric, very_very_very_long_column_that_holds_magtype character varying, very_very_very_long_column_that_holds_place character varying, geom geometry); INSERT INTO data_export.bad_bad_shp select * from data_import.earthquakes_subset_with_geom limit 1; UPDATE data_export.bad_bad_shp SET very_very_very_long_column_that_holds_magtype = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce id mauris eget arcu imperdiet tristique eu sed est. Quisque suscipit risus eu ante vestibulum hendrerit ut sed nulla. Nulla sit amet turpis ipsum. Curabitur nisi ante, luctus nec dignissim ut, imperdiet id tortor. In egestas, tortor ac condimentum sollicitudin, nisi lacus porttitor nibh, a tempus ex tellus in ligula. Donec pharetra laoreet finibus. Donec semper aliquet fringilla. Etiam faucibus felis ac neque facilisis vestibulum. Vivamus scelerisque at neque vel tincidunt. Phasellus gravida, ipsum vulputate dignissim laoreet, augue lacus congue diam, at tempus augue dolor vitae elit.'; Having prepared a vigilante dataset, let's now export it to SHP to see if our SHP warnings were right: pgsql2shp -h localhost -p 5434 -u postgres -f bad_bad_shp mastering_postgis data_export.bad_bad_shp When you now open the exported shapefile in a GIS client of your choice, you will see our very, very long column names renamed to VERY_VERY_ and VERY_VE_01. The content of the very_very_very_long_column_that_holds_magtype field has also been truncated to 255 characters, and is now 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce id mauris eget arcu imperdiet tristique eu sed est. Quisque suscipit risus eu ante vestibulum hendrerit ut sed nulla. Nulla sit amet turpis ipsum. Curabitur nisi ante, luctus nec dignissim ut'. 
For the sake of completeness, we'll also export a table without geometry so we are certain that pgsql2shp exports only a DBF file:

pgsql2shp -h localhost -p 5434 -u postgres -f a_lonely_dbf mastering_postgis "select id, place from data_import.earthquakes_subset_with_geom limit 1"

pgsql2shp GUI

We have already seen PgAdmin's GUI for importing shapefiles. As you surely remember, the pgsql2shp GUI also has an Export tab. If you happen to encounter difficulties locating the pgsql2shp GUI in pgAdmin 4, try calling it from the shell/command line by executing shp2pgsql-gui. If for some reason it is not recognized, try to locate the utility in your DB directory under bin/postgisgui/shp2pgsql-gui.exe.

In order to export a shapefile from PostGIS, go to Plugins, then PostGIS Shapefile and DBF Loader 2.2 (the version may vary); then you have to switch to the Export tab. It is worth mentioning that you have some options to choose from when exporting; they are rather self-explanatory. When you press the Export button, you can choose the output destination. The log is displayed in the 'Log Window' area of the exporter GUI.

Exporting vector data using ogr2ogr

We have already seen a little preview of ogr2ogr exporting data when we made sure that our KML import had actually brought in the proper data. This time we'll expand on the subject a bit and also export a few more formats to give you an idea of how sound a tool ogr2ogr is. In order to get some information on the tool, simply type the following in the console:

ogr2ogr

Alternatively, if you would like to get some more descriptive info, visit http://www.gdal.org/ogr2ogr.html. You could also type the following:

ogr2ogr --long-usage

The nice thing about ogr2ogr is that the tool is very flexible and offers options that allow us to export exactly what we are after. You can specify what data you would like to select by listing the columns in the -select parameter. The -where parameter lets you specify the filtering for your dataset in case you want to output only a subset of data. Should you require more sophisticated output preparation logic, you can use the -sql parameter.

This is obviously not all there is. The usual GDAL/ogr2ogr parameters are available too. You can reproject the data on the fly using the -t_srs parameter, and if, for some reason, the SRS of your data has not been clearly defined, you can use -s_srs to tell ogr2ogr what the source coordinate system of the processed dataset is. There are advanced options too: should you wish to clip your dataset to a specified bounding box, polygon, or coordinate system, have a look at the -clipsrc and -clipdst parameters and their variations. The last important parameter to know is -dsco, the dataset creation options. It accepts values in the form NAME=VALUE; when you want to pass more than one option this way, simply repeat the parameter. The actual dataset creation options depend on the format used, so it is advised that you consult the appropriate format information pages available via the ogr2ogr website.
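As a rough illustration of how these switches combine, an export of our earthquake subset to KML might look like the following sketch; the connection details mirror the earlier examples, while the output file name and the mag > 2 filter are illustrative assumptions to be adapted to your environment:

ogr2ogr -f "KML" earthquakes_subset.kml \
    PG:"host=localhost port=5434 user=postgres dbname=mastering_postgis" \
    -select "id,place" \
    -where "mag > 2" \
    -t_srs EPSG:4326 \
    data_import.earthquakes_subset_with_geom

The PG: connection string names the source database, the last argument is the layer (schema-qualified table) to export, and the remaining switches select columns, filter rows, and reproject the geometries on the fly.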
Summary

There are many ways of getting data out of a database. Some are PostgreSQL specific, some are PostGIS specific. The point is that you can use and mix whatever tools you prefer. There will be scenarios where simple data extraction procedures will do just fine; other cases will require a more specialized setup, SQL or psql, or even writing custom code in external languages. I do hope this article gives you a toolset you can use with confidence in your daily activities.

Resources for Article:

Further resources on this subject:

Spatial Data Services [article]
Data Around Us [article]
R ─ Classification and Regression Trees [article]

Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book Python High Performance - Second Edition, we will see that Python is a mature and widely used language and there is a large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects, Numba and PyPy, that approach compilation in slightly different ways. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically.

(For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly, that can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, the speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution.

Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.

Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i]**2
    return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

import numba as nb

@nb.jit
def sum_sq(a):
    ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function. To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute.
The timings for the two functions are as follows: import numpy as np x = np.random.rand(10000) # Original %timeit sum_sq.py_func(x) 100 loops, best of 3: 6.11 ms per loop # Numba %timeit sum_sq(x) 100000 loops, best of 3: 11.7 µs per loop You can see how the Numba version is order of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators: %timeit (x**2).sum() 10000 loops, best of 3: 14.8 µs per loop In this case, the Numba compiled function is marginally faster than NumPy vectorized operations. The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum in comparison with the in-place operations performed by our sum_sq function. As we didn't use array-specific methods in sum_sq, we can also try to apply the same function on a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension: x_list = x.tolist() %timeit sum_sq(x_list) 1000 loops, best of 3: 199 µs per loop %timeit sum([x**2 for x in x_list]) 1000 loops, best of 3: 1.28 ms per loop Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic. In the following sections, we will dig deeper and understand how Numba works and evaluate the benefits and limitations of the Numba compiler. Type specializations As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example. Numba exposes the specialized types using the signatures attribute. Right after the sum_sq definition, we can inspect the available specialization by accessing the sum_sq.signatures, as follows: sum_sq.signatures # Output: # [] If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly. If we also apply the function on an array of float32, we can see how a new entry is added to the sum_sq.signatures list: x = np.random.rand(1000).astype('float64') sum_sq(x) sum_sq.signatures # Result: # [(array(float64, 1d, C),)] x = np.random.rand(1000).astype('float32') sum_sq(x) sum_sq.signatures # Result: # [(array(float64, 1d, C),), (array(float32, 1d, C),)] It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the type we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument: @nb.jit((nb.float64[:],)) def sum_sq(a): Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass an array, x, as float32, Numba will raise a TypeError: sum_sq(x.astype('float32')) # TypeError: No matching definition for argument type(s) array(float32, 1d, C) Another way to declare signatures is through type strings. 
For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using a [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows: @nb.jit("float64(float64[:])") def sum_sq(a): You can also pass multiple signatures by passing a list: @nb.jit(["float64(float64[:])", "float64(float32[:])"]) def sum_sq(a): Object mode versus native mode So far, we have shown how Numba behaves when handling a fairly simple function. In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists.The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython. When Numba cannot infer variable types, it will still try and compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the intepreter-free scenario, called native mode. Numba provides a function, called inspect_types, that helps understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function: sum_sq.inspect_types() When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and types associated with them. For example, we can examine the N = len(a) line: # --- LINE 4 --- # a = arg(0, name=a) :: array(float64, 1d, A) # $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>) # $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64 # N = $0.3 :: int64 N = len(a) For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return type of the len function is also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type. Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode. As a counter example, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings, and compiles it as follows: @nb.jit def concatenate(strings): result = '' for s in strings: result += s return result Now, we can invoke this function with a list of strings and inspect the types: concatenate(['hello', 'world']) concatenate.signatures # Output: [(reflected list(str),)] concatenate.inspect_types() Numba will return the output of the function for the reflected list (str) type. We can, for instance, examine how line 3 gets inferred. 
The output of concatenate.inspect_types() is reproduced here:

# --- LINE 3 ---
#   strings = arg(0, name=strings)  :: pyobject
#   $const0.1 = const(str, )  :: pyobject
#   result = $const0.1  :: pyobject
#   jump 6
# label 6

result = ''

You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than the pure Python counterpart:

x = ['hello'] * 1000

%timeit concatenate.py_func(x)
10000 loops, best of 3: 111 µs per loop

%timeit concatenate(x)
1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaints even though it is inefficient. The main reason for this is that Numba can still compile some sections of the code in an efficient manner while falling back to the Python interpreter for the other parts. This compilation strategy is called object mode.

It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on first invocation:

@nb.jit(nopython=True)
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result

concatenate(x)
# Exception:
# TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and for ensuring that all the code is fast and correctly typed.

Numba and NumPy

Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast ufuncs. We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Universal functions that take multiple arguments still work according to the broadcasting rules; examples of such ufuncs are np.add and np.subtract.

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance it with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function is a function that encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations. The Cantor pairing function can be written as follows:

import numpy as np

def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

@np.vectorize
def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

cantor(np.array([1, 2]), 2)
# Result:
# array([ 8, 12])

Except for the convenience, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead. For this reason, ufunc implementation is usually done in C or Cython, but Numba beats all these methods with its convenience. All that is needed to perform the conversion is to use the equivalent decorator, nb.vectorize.
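The book's exact listing is not reproduced in this excerpt, but the conversion might look roughly like the following sketch; the scalar body is the same as the pure Python version above, and leaving out an explicit signature lets Numba infer the types lazily on first call.

import numba as nb

# The same scalar definition as above, compiled by Numba into a fast,
# broadcasting-aware ufunc.
@nb.vectorize
def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

cantor(np.array([1, 2]), 2)
# Expected result, as in the pure Python version:
# array([ 8, 12])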
We can compare the speed of the standard np.vectorize version (which, in the following code, is called cantor_py), the Numba version, and the same computation implemented using standard NumPy operations:

# Pure Python
%timeit cantor_py(x1, x2)
100 loops, best of 3: 6.06 ms per loop
# Numba
%timeit cantor(x1, x2)
100000 loops, best of 3: 15 µs per loop
# NumPy
%timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well here because the function is simple and type inference is possible. An additional advantage of universal functions is that, since they operate on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="cpu" or target="gpu" keyword argument to the nb.vectorize decorator.

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication. In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

a = np.random.rand(3, 3)
b = np.random.rand(3, 3)

c = np.matmul(a, b)
c.shape
# Result:
# (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over an array of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we will expect np.matmul to match the matrices pairwise and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

a = np.random.rand(10, 3, 3)
b = np.random.rand(10, 3, 3)

c = np.matmul(a, b)
c.shape
# Output
# (10, 3, 3)

The usual rules for broadcasting work in a similar way. For example, if we have an array of (3, 3) matrices, which will have a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix will be repeated to obtain a size of (10, 3, 3):

a = np.random.rand(10, 3, 3)
b = np.random.rand(3, 3) # Broadcasted to shape (10, 3, 3)

c = np.matmul(a, b)
c.shape
# Result:
# (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
The nb.guvectorize decorator requires two arguments: The types of the input and output: two 1D arrays as input and a scalar as output The so called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n), and we output a scalar In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator: @nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n), (n) - > ()') def euclidean(a, b, out): N = a.shape[0] out[0] = 0.0 for i in range(N): out[0] += (a[i] - b[i])**2 There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes, however, Numba treats scalar argument as arrays of size 1. That's why it was declared as float64[:]. Similarly, the layout string indicates that we have two arrays of size (n) and the output is a scalar, denoted by empty brackets--(). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written in the out array. The letter n in the layout string is completely arbitrary; you may choose to use k  or other letters of your liking. Also, if you want to combine arrays of uneven sizes, you can use layouts strings, such as (n, m). Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example: a = np.random.rand(2) b = np.random.rand(2) c = euclidean(a, b) # Shape: (1,) a = np.random.rand(10, 2) b = np.random.rand(10, 2) c = euclidean(a, b) # Shape: (10,) a = np.random.rand(10, 2) b = np.random.rand(2) c = euclidean(a, b) # Shape: (10,) How does the speed of euclidean compare to standard NumPy? In the following code, we benchmark a NumPy vectorized version with our previously defined euclidean function: a = np.random.rand(10000, 2) b = np.random.rand(10000, 2) %timeit ((a - b)**2).sum(axis=1) 1000 loops, best of 3: 288 µs per loop %timeit euclidean(a, b) 10000 loops, best of 3: 35.6 µs per loop The Numba version, again, beats the NumPy version by a large margin! Summary Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications.  Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application. Resources for Article: Further resources on this subject: Getting Started with Python Packages [article] Python for Driving Hardware [article] Python Data Science Up and Running [article]

Use in Real World Application

Packt
23 Jun 2017
11 min read
In this article by Giorgio Zarrelli, author of the book Mastering Bash, we are moving a step into the real world, creating something that can turn out to be handy for your daily routine; during this process we will have a look at the common pitfalls in coding and at how to make our script reliable. Be it a short or a long script, we must always ask ourselves the same questions:

What do we really want to accomplish?
How much time do we have?
Do we have all the resources needed?
Do we have the knowledge required for the task?

(For more resources related to this topic, see here.)

We will start coding with a Nagios plugin, which will give us a broad understanding of how this monitoring system works and of how to make a script dynamically interact with other programs.

What is Nagios

Nagios is one of the most widely adopted open source IT infrastructure monitoring tools, and its most interesting feature is the fact that it does not know how to monitor anything. That can sound like a joke, but Nagios is really an evaluating core which takes some information as input and reacts accordingly. How is this information gathered? That is not the main concern of this tool, and this leads us to an interesting point: Nagios leaves the task of getting the monitored data to an external plugin, which:

Knows how to connect to the monitored services
Knows how to collect the data from the monitored services
Knows how to evaluate the data
Informs Nagios whether the values gathered are beyond or within the boundaries used to raise an alarm

So, a plugin does a lot of things, and one would ask: what does Nagios do then? Imagine it as an exchange pod where information flows in and out and decisions are taken based on the configuration set; the core triggers the plugin to monitor a service, the plugin returns some information, and Nagios takes a decision about:

Whether to raise an alarm
Whether to send a notification
Whom to notify
For how long
Which action, if any, is taken in order to get back to normality

The core Nagios program does everything except actually knock at the door of a service, ask for information, and decide whether this information shows some issues or not. Planning must be done, but it can be fun.

Active and passive checks

To understand how to code a plugin, we first have to grasp how, on a broad scale, a Nagios check works. There are two different kinds of checks:

Active check: Based on a time range, or manually triggered, an active check sees a plugin actively connecting to a service and collecting information. A typical example could be a plugin to check the disk space: once invoked, it interfaces with (usually) the operating system, executes a df command, works on the output, extracts the value related to the disk space, evaluates it against some thresholds, and reports back a status like OK, WARNING, CRITICAL or UNKNOWN.

Passive check: In this case, Nagios does not trigger anything but waits to be contacted by some means by the service which must be monitored. It may seem confusing, so let's make a real-life example. How would you monitor whether a disk backup has been completed successfully? One quick answer would be: knowing when the backup task starts and how long it lasts, we can define a time and invoke a script to check the task at that given hour. Nice, but when we plan something we must have a full understanding of how real life goes, and a backup is not our little pet in the living room; it's rather a beast which does what it wants.
A backup can last a variable amount of time, depending on unpredictable factors. For instance, your typical backup task copies 1 TB of data in 2 hours, starting at 03:00, out of a 6 TB disk. So, the next backup task would start at 03:00 + 02:00 = 05:00 AM, give or take some minutes; you set up an active check for it at 05:30 and it works well for a couple of months. Then, one early morning, you receive a notification on your smartphone: the backup is CRITICAL. You wake up, connect to the backup console, and see that at 06:00 in the morning the backup task has not even been started by the console. You then have to wait until 08:00 AM, when one of your colleagues shows up at the office, to find out that the day before the disk you back up was filled with 2 extra TB of data due to an unscheduled data transfer. So, the backup task preceding the one you are monitoring lasted not a couple of hours but 6 hours, and the task you are monitoring started at 09:30 AM. Long story short, your active check fired too early, and that is why it failed. Maybe you are tempted to move your schedule some hours ahead, but simply do not do it: these time slots are not sliding frames. If you move your check ahead, you should then move all the checks for the subsequent tasks ahead as well. Do it, and within a week the project manager will have someone delete the 2 TB in excess (they are of no more use for the project), your schedules will be hours off, and your monitoring will be useless.

So, as we insisted before, planning and analyzing the context are the key factors in making a good script and, in this case, a good plugin. We have a service that does not run 24/7 like a web service or a mail service; what is specific to the backup is that it runs periodically, but we do not know exactly when. The best approach to this kind of monitoring is letting the service itself notify us when it has finished its task and what the outcome was. That is usually accomplished using the ability of most backup programs to send a Simple Network Management Protocol (SNMP) trap to a destination to inform it of the outcome; in our case the destination would be the Nagios server, configured to receive the trap and analyze it. Add to this an event horizon, so that if you do not receive that specific trap within, let's say, 24 hours, an alarm is raised anyway, and you are covered: whenever the backup task gets completed, or when it times out, we receive a notification.

Figure: Nagios notifications flowchart

Return codes and thresholds

Before coding a plugin we must face some concepts that will be the stepping stones of our Nagios coding in Bash, one of them being the return codes of the plugin itself. As we already discussed, once the plugin collects the data about how the service is doing, it evaluates this data and determines which of the following statuses the situation falls under:

Return code 0, status OK: the plugin checked the service and the results are inside the acceptable range.
Return code 1, status WARNING: the plugin checked the service and the results are beyond the warning threshold. We must keep an eye on the service.
Return code 2, status CRITICAL: the plugin checked the service and the results are beyond the critical threshold, or the service is not responding. We must react now.
Return code 3, status UNKNOWN: either we passed the wrong arguments to the plugin or there is some internal error in it.
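To see these return codes in action, here is a minimal sketch of an active check written in Bash that inspects the usage of a filesystem with df and maps the result onto the statuses above. The mount point, thresholds, and output wording are illustrative assumptions for this sketch, not the plugin we will build later:

#!/bin/bash
# Minimal sketch of an active check: disk usage on a mount point.
# Assumes integer percentage thresholds passed as positional parameters.

STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

MOUNT_POINT="${1:-/}"   # filesystem to check, defaults to /
WARN="${2:-80}"         # warning threshold, percent used
CRIT="${3:-90}"         # critical threshold, percent used

# Ask df for the usage of the mount point and strip the trailing % sign
USED=$(df -P "$MOUNT_POINT" 2>/dev/null | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ -z "$USED" ]; then
    echo "DISK UNKNOWN - could not read usage for $MOUNT_POINT"
    exit "$STATE_UNKNOWN"
fi

if [ "$USED" -ge "$CRIT" ]; then
    echo "DISK CRITICAL - ${USED}% used on $MOUNT_POINT"
    exit "$STATE_CRITICAL"
elif [ "$USED" -ge "$WARN" ]; then
    echo "DISK WARNING - ${USED}% used on $MOUNT_POINT"
    exit "$STATE_WARNING"
else
    echo "DISK OK - ${USED}% used on $MOUNT_POINT"
    exit "$STATE_OK"
fi

The single line written to stdout becomes the description Nagios displays, while the exit code tells the core which of the four states we are in.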
So, our plugin will check a service, evaluate the results and, based on a threshold, return to Nagios one of the values listed in the table above, together with a meaningful message like the ones we see in the description column of a typical Nagios service status page. There, some checks are green, meaning OK, and they carry an explanatory message in the description section: what we see in that section is the output the plugin writes to stdout, and it is exactly what we will craft as our response to Nagios. Pay attention to the ssh check: it is red, and it is failing because it checks the service on the default port, which is 22, but on this server the ssh daemon is listening on a different port. This leads us to a consideration: our plugin will need a command line parser able to receive some configuration options as well as some threshold limits, because we need to know what to check, where to check it, and what the acceptable working limits for a service are:

Where: In Nagios there can be a host without service checks (except for the implicit host-alive check carried out by a ping), but there can be no service without a host to run it against. So any plugin must receive on the command line an indication of the host it is run against; it can even be a dummy host, but there must be one.

How: This is where our coding comes in; we will have to write the lines of code that instruct the plugin on how to connect to the server, query it, and collect and parse the answer.

What: We must instruct the plugin, usually with some meaningful options on the command line, on what the acceptable working limits are, so that it can evaluate the collected values and decide whether to notify an OK, WARNING, or CRITICAL message.

That is all our script needs. Whom to notify, when, how, for how many times, and so forth: these are tasks carried out by the core, and a Nagios plugin is unaware of all of them. What it really must know, for effective monitoring, is which values identify a working service. We can pass our script two different kinds of value:

Range: A series of numeric values with a starting and an ending point, like from 3 to 7, or from one number to infinity.

Threshold: A range with an associated alert level.

So, when our plugin performs its check, it collects a numeric value that is either within or outside a range, according to the thresholds we impose; based on that evaluation, it replies to Nagios with a return code and a message. How do we specify ranges on the command line? Essentially in the following way:

[@]start_value:end_value

If the range starts from 0, the part from the : to the left can be omitted.
The start_value must always be a lower number than the end_value.
If the range is written as start_value: (nothing after the colon), it means from that number to infinity.
Negative infinity can be specified using ~.
An alert is generated when the collected value lies outside the specified range, endpoints included.
If @ is specified, the alert is generated when the value lies inside the range.
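As a rough illustration of how such a range could be evaluated inside a Bash plugin, the following sketch parses a [@]start_value:end_value specification and decides whether a collected value should raise an alert. The function name, the large sentinel numbers standing in for infinity, and the assumption that the value is an integer are choices made for this example, not code taken from the plugin we will develop:

# in_alert VALUE RANGE
# Returns 0 (success) if VALUE should raise an alert for RANGE, 1 otherwise.
in_alert() {
    local value="$1" range="$2"
    local invert=0 start end

    # A leading @ inverts the logic: alert when the value is inside the range
    if [[ "$range" == @* ]]; then
        invert=1
        range="${range#@}"
    fi

    if [[ "$range" == *:* ]]; then
        start="${range%%:*}"
        end="${range#*:}"
    else
        start=0             # "10" is a shorthand for "0:10"
        end="$range"
    fi

    # ~ (or an empty start) stands for negative infinity, an empty end for positive infinity
    [[ "$start" == "~" || -z "$start" ]] && start=-999999999
    [[ -z "$end" ]] && end=999999999

    local outside=0
    (( value < start || value > end )) && outside=1

    if (( invert )); then
        (( outside == 0 ))   # alert when the value lies inside the range
    else
        (( outside == 1 ))   # default: alert when the value lies outside the range
    fi
}

# Example: with a warning range of 10:20, the value 25 lies outside and raises an alert
if in_alert 25 "10:20"; then
    echo "WARNING - the collected value is outside the 10:20 range"
fi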
Let's see some practical examples of how we would call our script, imposing some thresholds:

./my_plugin -c 10 -> CRITICAL if the collected value is lower than 0 or higher than 10
./my_plugin -w 10:20 -> WARNING if the collected value is lower than 10 or higher than 20
./my_plugin -w ~:15 -c 16 -> WARNING if the collected value is higher than 15, CRITICAL if it is lower than 0 or higher than 16
./my_plugin -c 35: -> CRITICAL if the collected value is below 35
./my_plugin -w @100:200 -> WARNING if the collected value is between 100 and 200 inclusive, OK otherwise

We have covered the basic requirements for our plugin, which in its simplest form should be called with the following syntax:

./my_plugin -h hostaddress|hostname -w value -c value

We already talked about the need to relate a check to a host, and we can do this using either a host name or a host address. It is up to us which to use, but we will not fill in this piece of information ourselves, because it will be drawn from the service configuration as a standard macro. We have just introduced a new concept, the service configuration, which is essential in making our script work in Nagios.

Summary

In this article, we took our scripting into the real world by starting work on a Nagios plugin and, during this process, we also looked at the common pitfalls in coding and at how to make our script reliable.