

Create Your First Augmented Reality Experience Using the Programming Language You Already Know

Andreas Zeitler
13 Feb 2017
7 min read
This post gives a short summary of the different ways to get AR up and running with the tools available in different programming languages. We will outline the advantages and drawbacks of specific solutions. Pick your language and follow along. We will introduce the tools that you will need.

I know Objective-C

This one is for iPhone and iPad users only and requires a Mac. The weapons of choice would be Vuforia, one of the industry leaders, or Kudan, a bit of an up-and-comer and sometimes a bit unprofessional when it comes to maintaining their website, but the tech is solid. Both frameworks allow you to code in Objective-C. In order to do so, download and install Xcode. You will also need an Apple Developer Program membership, which comes at 99 USD per year. This is necessary to install your own app on your own phone or tablet. You can ask your school or university; a lot of them are Apple partners by now and provide free access. Read this guide to get the basics of building and running an iPhone app on an iOS device out of the way.

After this, head over to the Vuforia download area and download the Vuforia SDK and what they call the samples for Core Features for iOS. You need to sign up for a free Vuforia Developer account beforehand. Extract both ZIP files and open the folders side by side, and then copy the samples folder into the SDK folder, as shown in the image. Open VuforiaSamples.xcodeproj in Xcode. Press Command + B to build and run with your iOS device connected. The first try will fail because, by default, the project's code signing is messed up. Click on the error message and follow the fixing instructions. After that, build and run again. It will succeed and launch the app on your device.

However, if you try one of the sample screens in the app, it will tell you that the license key is missing. Head to the developer portal, generate a free Development license key, and add it to the code of your app. Open the file SampleApplicationSession.mm and look for this line:

Vuforia::setInitParameters(mVuforiaInitFlags, "");

Replace it with the following:

Vuforia::setInitParameters(mVuforiaInitFlags, "<YOUR_KEY_HERE>");

Save, build, and run. The trigger images are located in the media folder, which is part of the unzipped samples folder. Just pull one up on the screen and follow the instructions in the app. You get a nice little teapot, which has been the default 3D example visualization for the past decades.

I know Java

Hold on to your Android phone… or tablet. This one is easy; you can use any PC running Windows or Linux, or a Mac. You need Android Studio. Android can be very tricky when it comes to actually building and running your app because the IDE can run on so many different operating systems. Take a look at this excellent tutorial if you need to, and circle back here once you have your first Android app up and running. After that, head over to the Vuforia download area and download the Vuforia SDK and what they call the samples for Core Features for Android. You need to sign up for a free Vuforia Developer account beforehand. After that, the steps are identical to the ones outlined in the previous section (I know Objective-C).

I know Swift

Ask again later. Sorry. As AR is basically an extension of computer vision, it is hugely reliant on running complex algorithms as often as possible every second (30 FPS minimum). Thus, most frameworks are built with earlier generation programming languages, which are closer to the metal and allow for more optimization.
No AR framework provides examples or tutorials written in Swift today.

I know JavaScript

Wikitude is for you. Their approach is to offer a native SDK for iOS and Android. The SDK then provides an interface that allows you to create AR views with JavaScript only. Wikitude just released version 6 of their SDK, adding support for SLAM instant tracking. The features and software quality are good. The development version will display a watermark. Download the JavaScript SDK here. There's also a native SDK that can be programmed with Objective-C or Java, very similar to Vuforia.

I know C++

Use the Android NDK in combination with Vuforia for Android. Alternatively, use Xcode and Objective-C, again combined with Vuforia, because the sample code provided there is actually Objective-C++ code (Objective-C with C++ mixed in).

I know C# (a.k.a. I know Unity 3D)

Unity 3D is a game engine and authoring environment for 2D and 3D interactive experiences. If you are not familiar with any of the programming languages above, it will be much more efficient to dive into Unity 3D. It allows you to code in C# backed by the .NET framework (well, its open source cousin called Mono). This provides a good API, and Unity 3D allows you to publish the same code to Android, iOS, and PC without much modification. Of course, handling a 3D engine and authoring environment has its very own challenges, but all in all, it will still be much more efficient to learn Unity 3D than to learn Objective-C/Xcode and Java/Android Studio at the same time. There's a nice guide available here on how to use Unity 3D and Vuforia with a comprehensive set of samples. Please note that you will always need Xcode and an Apple Developer Program account, even if you use Unity 3D.

I don't know any of the programming languages

You can sign up with Augment, Blippar, or WakingApp. All three allow you to create AR experiences with the help of a more or less user-friendly online editor. You can upload your own content and create some basic interactions. After that, you can run it in the company's respective AR viewer app available for mobile devices. The overall process for this is more often than not pretty bumpy, and the results are severely limited. However, it's a good way to get a taste and try AR first hand if you don't know how to code. In case you are already working with professional design tools, such as Unity 3D, you can head over to vuframe.com and sign up there. It offers the same as any of the three tools mentioned above (and more) and is targeted at professional users. You can publish professional AR with a single click.

Summary

At this point, if you have tried one of the coding approaches and have never created an app before, you will be exhausted. Creating AR experiences with today's technology is exhausting because you spend at least as much time on preparing the underlying infrastructure (the mobile app) as you spend on designing and implementing your AR experience. This is a significant obstacle on the road to large-scale AR adoption—for now.

About the Author

Andreas is the founder and CEO at Vuframe. He's been working with augmented and virtual reality on a daily basis for the last 8 years. Vuframe's mission is to democratize AR and VR by removing the tech barrier for everyone.


Review of SQL Server Features for Developers

Packt
13 Feb 2017
43 min read
In this article by Dejan Sarka, Miloš Radivojević, and William Durkin, the authors of the book SQL Server 2016 Developer's Guide, we make a quick recapitulation of the SQL Server features for developers that were already available in previous versions of SQL Server, before delving into the new features in SQL Server 2016. Recapitulating the most important features will help you remember what you already have in your development toolbox and also understand the need for and the benefits of the new or improved features in SQL Server 2016.

The recapitulation starts with the mighty T-SQL SELECT statement. Besides the basic clauses, advanced techniques like window functions, common table expressions, and the APPLY operator are explained. Then you will pass quickly through creating and altering database objects, including tables and programmable objects, like triggers, views, user-defined functions, and stored procedures. You will also review the data modification language statements. Of course, errors might appear, so you have to know how to handle them. In addition, data integrity rules might require that two or more statements are executed as an atomic, indivisible block. You can achieve this with the help of transactions.

The last section of this article deals with the parts of the SQL Server Database Engine marketed under the common name "Beyond Relational". This is nothing beyond the relational model; "beyond relational" is really just a marketing term. Nevertheless, you will review the following:

How SQL Server supports spatial data
How you can enhance the T-SQL language with Common Language Runtime (CLR) elements written in some .NET language, like Visual C#
How SQL Server supports XML data

The code in this article uses the WideWorldImportersDW demo database. In order to test the code, this database must be present in the SQL Server instance you are using for testing, and you must also have SQL Server Management Studio (SSMS) as the client tool.

This article will cover the following points:

Core Transact-SQL SELECT statement elements
Advanced SELECT techniques
Error handling
Using transactions
Spatial data
XML support in SQL Server

The Mighty Transact-SQL SELECT

You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage:

SELECT to define the columns returned, or a projection of all table columns
FROM to list the tables used in the query and how they are associated, or joined
WHERE to filter the data to return only the rows that satisfy the condition in the predicate
GROUP BY to define the groups over which the data is aggregated
HAVING to filter the data after the grouping with conditions that refer to aggregations
ORDER BY to sort the rows returned to the client application

Besides these basic clauses, SELECT offers a variety of advanced possibilities as well. These advanced techniques are unfortunately less exploited by developers, although they are really powerful and efficient. Therefore, I urge you to review them and potentially use them in your applications. First, though, here is a compact illustration of the basic clauses working together.
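The following sketch is self-contained and uses a throwaway table variable with invented sample values rather than the WideWorldImportersDW database; treat it only as an illustration of how the clauses combine, not as part of the book's demo code.

DECLARE @Sales AS TABLE (Customer NVARCHAR(20), Amount DECIMAL(10, 2));
INSERT INTO @Sales (Customer, Amount)
VALUES (N'CustA', 100.00), (N'CustA', 50.00), (N'CustB', 20.00), (N'CustB', NULL);
SELECT Customer,                   -- projection
 SUM(Amount) AS TotalAmount        -- aggregate per group
FROM @Sales                        -- source rowset
WHERE Amount IS NOT NULL           -- row filter, applied before grouping
GROUP BY Customer                  -- grouping
HAVING SUM(Amount) > 30.00         -- group filter, applied after grouping
ORDER BY TotalAmount DESC;         -- sort of the returned rows
-- Returns a single row: CustA, 150.00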
The advanced query techniques presented here include:

Queries inside queries, or, shortly, subqueries
Window functions
TOP and OFFSET...FETCH expressions
The APPLY operator
Common table expressions, or CTEs

Core Transact-SQL SELECT Statement Elements

Let us start with the simplest concept of SQL, which every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the SELECT clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the WideWorldImportersDW database context and selects all data from the Dimension.Customer table.

USE WideWorldImportersDW;
SELECT * FROM Dimension.Customer;

The code returns 403 rows, all customers with all columns. Using SELECT * is not recommended in production. Such queries can return an unexpected result when the table structure changes, and they are also not suitable for good optimization. Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only four columns from the table.

SELECT [Customer Key], [WWI Customer ID], [Customer], [Buying Group]
FROM Dimension.Customer;

Below is the shortened result, limited to the first three rows only.

Customer Key WWI Customer ID Customer                      Buying Group
------------ --------------- ----------------------------- -------------
0            0               Unknown                       N/A
1            1               Tailspin Toys (Head Office)   Tailspin Toys
2            2               Tailspin Toys (Sylvanite, MT) Tailspin Toys

You can see that the column names in the WideWorldImportersDW database include spaces. Names that include spaces are called delimited identifiers. In order to make SQL Server properly understand them as column names, you must enclose delimited identifiers in square brackets. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following query returns exactly the same data as the previous one, just with the columns renamed by aliases to avoid delimited names.

SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup
FROM Dimension.Customer;

You might have noticed in the result set returned from the last two queries that there is also a row in the table for an unknown customer. You can filter this row with the WHERE clause.

SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup
FROM Dimension.Customer
WHERE [Customer Key] <> 0;

In a relational database, you typically have data spread across multiple tables. Each table represents a set of entities of the same kind, like the customers in the examples you have seen so far. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of join is the inner join. The rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false, or NULL. For an inner join, the order of the tables involved in the join is not important.
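To make the three-valued logic concrete, here is a minimal, self-contained sketch; it uses a table variable with made-up values rather than the WideWorldImportersDW tables, so it is only an illustration of how a predicate that evaluates to NULL filters rows out.

DECLARE @Customers AS TABLE (CustomerId INT, BuyingGroup NVARCHAR(50));
INSERT INTO @Customers (CustomerId, BuyingGroup)
VALUES (1, N'Tailspin Toys'), (2, NULL);
-- The predicate BuyingGroup = NULL evaluates to NULL (unknown) for every row, so no rows are returned
SELECT CustomerId FROM @Customers WHERE BuyingGroup = NULL;
-- Use IS NULL to test for the unknown value; this returns the row for customer 2
SELECT CustomerId FROM @Customers WHERE BuyingGroup IS NULL;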
In the following example, you can see the Fact.Sale table joined with an inner join to the Dimension.Customer table.

SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key];

In the query, you can see that table aliases are used. If a column's name is unique across all tables in the query, then you can use it without the table name. If not, you need to use the table name in front of the column, to avoid ambiguous column names, in the format table.column. In the previous query, the [Customer Key] column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.

The previous query returns 228,265 rows. It is always recommendable to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or, said differently, of whether the query is written logically correctly. The query returns the unknown customer and the orders associated with this customer, or, more precisely, associated with this placeholder for an unknown customer. Of course, you can use the WHERE clause to filter the rows in a query that joins multiple tables, like you use it for a single-table query. The following query filters out the unknown customer rows.

SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0;

The query returns 143,968 rows. You can see that a lot of sales are associated with the unknown customer. Of course, the Dimension.Customer table is not the only table the Fact.Sale table can be joined to. The following query joins it to the Dimension.Date table. Again, the join performed is an inner join.

SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key]
FROM Fact.Sale AS f
INNER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date;

The query returns 227,981 rows. The query that joined the Fact.Sale table to the Dimension.Customer table returned 228,265 rows. It looks like not all Fact.Sale table rows have a known delivery date, so not all rows can match the Dimension.Date table rows. You can use an outer join to check this. With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like you get from an inner join, plus the preserved rows.
Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, then the rows from the left table are preserved. If you use RIGHT OUTER JOIN, then the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With a FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query preserves the rows from the Fact.Sale table, which is on the left side of the join to the Dimension.Date table. In addition, the query sorts the result set by the invoice date, descending, using the ORDER BY clause.

SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key], f.[Invoice Date Key]
FROM Fact.Sale AS f
LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date
ORDER BY f.[Invoice Date Key] DESC;

The query returns 228,265 rows. Here is the partial result of the query.

Date       Total Excluding Tax  Delivery Date Key Invoice Date Key
---------- -------------------- ----------------- ----------------
NULL       180.00               NULL              2016-05-31
NULL       120.00               NULL              2016-05-31
NULL       160.00               NULL              2016-05-31
…          …                    …                 …
2016-05-31 2565.00              2016-05-31        2016-05-30
2016-05-31 88.80                2016-05-31        2016-05-30
2016-05-31 50.00                2016-05-31        2016-05-30

For the last invoice date (2016-05-31), the delivery date is NULL. The NULL in the Date column from the Dimension.Date table is there because the data from this table is unknown for the rows with an unknown delivery date in the Fact.Sale table.

Joining more than two tables is not tricky if all of the joins are inner joins. The order of the joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query joins the Fact.Sale table with an inner join to the Dimension.Customer, Dimension.City, Dimension.[Stock Item], and Dimension.Employee tables, and with a left outer join to the Dimension.Date table.

SELECT cu.[Customer Key] AS CustomerKey, cu.Customer, ci.[City Key] AS CityKey, ci.City, ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory, d.Date, d.[Calendar Month Label] AS CalendarMonth, d.[Calendar Year] AS CalendarYear, s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color, e.[Employee Key] AS EmployeeKey, e.Employee, f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit
FROM (Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Employee AS e ON f.[Salesperson Key] = e.[Employee Key])
LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date;

The query returns 228,265 rows. Note that with the usage of the parentheses, the order of joins is defined in the following way:

Perform all inner joins, with an arbitrary order among them
Execute the left outer join after all of the inner joins

So far, I have tacitly assumed that the Fact.Sale table has 228,265 rows, and that the previous query needed only one outer join of the Fact.Sale table with the Dimension.Date table to return all of the rows. It would be good to check this number in advance.
You can check the number of rows by aggregating them using the COUNT(*) aggregate function. The following query introduces that function.

SELECT COUNT(*) AS SalesCount
FROM Fact.Sale;

Now you can be sure that the Fact.Sale table has exactly 228,265 rows. Many times you need to aggregate data in groups. This is the point where the GROUP BY clause becomes handy. The following query aggregates the sales data for each customer.

SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer;

The query returns 402 rows, one for each known customer. In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. You need to get a scalar, a single aggregated value, for each column not included in the GROUP BY list. Sometimes you need to filter aggregated data. For example, you might need to find only frequent customers, defined as customers with more than 400 rows in the Fact.Sale table. You can filter the result set on the aggregated data by using the HAVING clause, like the following query shows.

SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
HAVING COUNT(*) > 400;

The query returns 45 rows for the 45 most frequent known customers. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clauses of the query, and the aliases are not known yet. However, the ORDER BY clause executes after the SELECT clause, and therefore the column aliases are already known and you can refer to them. The following query shows all of the basic SELECT statement clauses used together to aggregate the sales data over the known customers, filter the data to include the frequent customers only, and sort the result set descending by the number of rows of each customer in the Fact.Sale table.

SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
HAVING COUNT(*) > 400
ORDER BY InvoiceLinesCount DESC;

The query returns 45 rows. Below is the shortened result set.

Customer                              TotalQuantity TotalAmount  SalesCount
------------------------------------- ------------- ------------ -----------
Tailspin Toys (Vidrine, LA)           18899         340163.80    455
Tailspin Toys (North Crows Nest, IN)  17684         313999.50    443
Tailspin Toys (Tolna, ND)             16240         294759.10    443

Advanced SELECT Techniques

Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only – either one row for the whole input rowset or one row per group. Sometimes you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, queries inside queries. The following query shows an example of using two subqueries in a single query. In the SELECT clause, a subquery calculates the sum of the quantity for each customer. It returns a scalar value. The subquery refers to the customer key from the outer query.
The subquery can't execute without the outer query. This is a correlated subquery. There is another subquery in the FROM clause that calculates the overall quantity for all customers. This query returns a table, although it is a table with a single row and a single column. It is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table.

Another type of join is used to add the overall total to each detail row. A cross join is a Cartesian product of two input rowsets—each row from one side is associated with every single row from the other side. No join condition is needed. A cross join can produce an unwanted, huge result set. For example, if you cross join just 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, you typically want to avoid a cross join in production. However, in the example in the following query, 143,968 rows from the left side are cross joined to a single row from the subquery, therefore producing 143,968 rows only. Effectively, this means that the overall total column is added to each detail row.

SELECT c.Customer, f.Quantity, (SELECT SUM(f1.Quantity) FROM Fact.Sale AS f1 WHERE f1.[Customer Key] = c.[Customer Key]) AS TotalCustomerQuantity, f2.TotalQuantity
FROM (Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key])
CROSS JOIN (SELECT SUM(f2.Quantity) FROM Fact.Sale AS f2 WHERE f2.[Customer Key] <> 0) AS f2(TotalQuantity)
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

Here is an abbreviated output of the query.

Customer                     Quantity    TotalCustomerQuantity TotalQuantity
---------------------------- ----------- --------------------- -------------
Tailspin Toys (Absecon, NJ)  360         12415                 5667611
Tailspin Toys (Absecon, NJ)  324         12415                 5667611
Tailspin Toys (Absecon, NJ)  288         12415                 5667611

In the previous example, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total to the FROM clause, where it logically executes only once. Although SQL Server can many times optimize correlated subqueries and convert them to joins, there also exists a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using window functions.

The following query uses the window aggregate function SUM to calculate the total over each customer and the overall total. The OVER clause defines the partitions, or the windows, of the calculation. The first calculation is partitioned over each customer, meaning that the total quantity per customer is reset to zero for each new customer. The second calculation uses an OVER clause without specifying partitions, meaning the calculation is done over the whole input rowset. This query produces exactly the same result as the previous one.

SELECT c.Customer, f.Quantity, SUM(f.Quantity) OVER(PARTITION BY c.Customer) AS TotalCustomerQuantity, SUM(f.Quantity) OVER() AS TotalQuantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

You can use many other functions for window calculations. For example, you can use the ranking functions, like ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, a rank can be defined only over some order of the calculation.
You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A stand-alone, outer ORDER BY at the end of the query defines the order of the result.

The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each customer and once over the whole input rowset. The logical order of the calculation is over quantity, descending, meaning that row number 1 gets the largest quantity, either the largest for each customer or the largest in the whole input rowset.

SELECT c.Customer, f.Quantity, ROW_NUMBER() OVER(PARTITION BY c.Customer ORDER BY f.Quantity DESC) AS CustomerOrderPosition, ROW_NUMBER() OVER(ORDER BY f.Quantity DESC) AS TotalOrderPosition
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

The query produces the following result, again abbreviated to a couple of rows only.

Customer                      Quantity    CustomerOrderPosition TotalOrderPosition
----------------------------- ----------- --------------------- --------------------
Tailspin Toys (Absecon, NJ)   360         1                     129
Tailspin Toys (Absecon, NJ)   324         2                     162
Tailspin Toys (Absecon, NJ)   288         3                     374
…                             …           …                     …
Tailspin Toys (Aceitunas, PR) 288         1                     392
Tailspin Toys (Aceitunas, PR) 250         4                     1331
Tailspin Toys (Aceitunas, PR) 250         3                     1315
Tailspin Toys (Aceitunas, PR) 250         2                     1313
Tailspin Toys (Aceitunas, PR) 240         5                     1478

Note the position, or the row number, for the second customer. The order does not look to be completely correct – it is 1, 4, 3, 2, 5, and not 1, 2, 3, 4, 5, like you might expect. This is due to the repeating value for the second largest quantity, the quantity 250. The quantity is not unique, and thus the order is not deterministic. The order of the result is defined over the quantity, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined on unique values. Please also note that you might get a different order when you execute the same query on your SQL Server instance.

Window functions are useful for some advanced calculations, like running totals and moving averages, as well. However, the calculation of these values can't be performed over the complete partition. You can additionally frame the calculation to a subset of rows of each partition only. The following query calculates the running total of the quantity per customer (the column alias Q_RT in the query), ordered by the sale key and framed differently for each row. The frame is defined from the first row in the partition to the current row. Therefore, the running total is calculated over one row for the first row, over two rows for the second row, and so on. Additionally, the query calculates the moving average of the quantity (the column alias Q_MA in the query) over the last three rows.
SELECT c.Customer, f.[Sale Key] AS SaleKey, f.Quantity, SUM(f.Quantity) OVER(PARTITION BY c.Customer ORDER BY [Sale Key] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Q_RT, AVG(f.Quantity) OVER(PARTITION BY c.Customer ORDER BY [Sale Key] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Q_MA
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.[Sale Key];

The query returns the following (abbreviated) result.

Customer                     SaleKey  Quantity    Q_RT        Q_MA
---------------------------- -------- ----------- ----------- -----------
Tailspin Toys (Absecon, NJ)  2869     216         216         216
Tailspin Toys (Absecon, NJ)  2870     2           218         109
Tailspin Toys (Absecon, NJ)  2871     2           220         73

Let's find the top three orders by quantity for the Tailspin Toys (Aceitunas, PR) customer! You can do this by using the OFFSET…FETCH clause after the ORDER BY clause, like the following query shows.

SELECT c.Customer, f.[Sale Key] AS SaleKey, f.Quantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)'
ORDER BY f.Quantity DESC
OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY;

This is the complete result of the query.

Customer                       SaleKey  Quantity
------------------------------ -------- -----------
Tailspin Toys (Aceitunas, PR)  36964    288
Tailspin Toys (Aceitunas, PR)  126253   250
Tailspin Toys (Aceitunas, PR)  79272    250

But wait… Didn't the second largest quantity, the value 250, repeat three times? Which two rows were selected in the output? Again, because the calculation is done over a non-unique column, the result is somewhat nondeterministic. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way you don't know the number of rows in the output in advance. The following query shows this approach.

SELECT TOP 3 WITH TIES c.Customer, f.[Sale Key] AS SaleKey, f.Quantity
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)'
ORDER BY f.Quantity DESC;

This is the complete result of the previous query – this time it is four rows.

Customer                       SaleKey  Quantity
------------------------------ -------- -----------
Tailspin Toys (Aceitunas, PR)  36964    288
Tailspin Toys (Aceitunas, PR)  223106   250
Tailspin Toys (Aceitunas, PR)  126253   250
Tailspin Toys (Aceitunas, PR)  79272    250

The next task is to get the top three orders by quantity for each customer. You need to perform the calculation for each customer. The APPLY Transact-SQL operator comes in handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even if the tabular expression on the right side does not return any row, similarly to the LEFT OUTER JOIN. Of course, columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate the top three orders by quantity for each customer that actually does have some orders.
SELECT c.Customer, t3.SaleKey, t3.Quantity
FROM Dimension.Customer AS c
CROSS APPLY (SELECT TOP(3) f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f WHERE f.[Customer Key] = c.[Customer Key] ORDER BY f.Quantity DESC) AS t3
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, t3.Quantity DESC;

Below is the result of this query, shortened to the first nine rows.

Customer                           SaleKey  Quantity
---------------------------------- -------- -----------
Tailspin Toys (Absecon, NJ)        5620     360
Tailspin Toys (Absecon, NJ)        114397   324
Tailspin Toys (Absecon, NJ)        82868    288
Tailspin Toys (Aceitunas, PR)      36964    288
Tailspin Toys (Aceitunas, PR)      126253   250
Tailspin Toys (Aceitunas, PR)      79272    250
Tailspin Toys (Airport Drive, MO)  43184    250
Tailspin Toys (Airport Drive, MO)  70842    240
Tailspin Toys (Airport Drive, MO)  630      225

For the final task in this section, assume that you need to calculate some statistics over the totals of customers' orders. You need to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of orders per customer. This means you need to calculate the totals over customers in advance, and then use the aggregate functions AVG() and STDEV() on these aggregates. You could do the aggregations over customers in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. Such a subquery is called a common table expression, or CTE. CTEs are more readable than derived tables, and might also be more efficient. You can use the result of the same CTE multiple times in the outer query. If you use derived tables, then you need to define them multiple times if you want to use them multiple times in the outer query. The following query shows the usage of a CTE to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of orders per customer.

WITH CustomerSalesCTE AS
(
SELECT c.Customer, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
GROUP BY c.Customer
)
SELECT ROUND(AVG(TotalAmount), 6) AS AvgAmountPerCustomer, ROUND(STDEV(TotalAmount), 6) AS StDevAmountPerCustomer, AVG(InvoiceLinesCount) AS AvgCountPerCustomer
FROM CustomerSalesCTE;

It returns the following result.

AvgAmountPerCustomer  StDevAmountPerCustomer AvgCountPerCustomer
--------------------- ---------------------- -------------------
270479.217661         38586.082621           358

Transactions and Error Handling

In a real-world application, errors always appear. Syntax or even logical errors can be in the code, the database design might be incorrect, and there might even be a bug in the database management system you are using. Even if everything works correctly, you might get an error because the users insert wrong data. With Transact-SQL error handling you can catch such user errors and decide what to do about them. Typically, you want to log the errors, inform the users about the errors, and sometimes even correct them in the error handling code.

Error handling for user errors works on the statement level. If you send SQL Server a batch of two or more statements and the error is in the last statement, the previous statements execute successfully. This might not be what you desire.
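You can see this statement-level behavior in isolation with the following minimal sketch; it uses a throwaway table variable with a CHECK constraint rather than the book's dbo.SimpleOrders objects, so treat it only as an illustration of how a failing statement does not undo the statements executed before it.

DECLARE @OrderDetails AS TABLE (ProductId INT, Quantity INT CHECK (Quantity <> 0));
INSERT INTO @OrderDetails (ProductId, Quantity) VALUES (1, 5);  -- succeeds
INSERT INTO @OrderDetails (ProductId, Quantity) VALUES (2, 0);  -- fails with a CHECK constraint violation
-- The batch continues after the failed statement; the first row is still there
SELECT ProductId, Quantity FROM @OrderDetails;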
Many times you need to execute a batch of statements as a unit, and fail all of the statements if one of the statements fails. You can achieve this by using transactions. You will learn in this section about:

Error handling
Transaction management

Error Handling

You can see there is a need for error handling by producing an error. The following code tries to insert an order and a detail row for this order.

EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE';
EXEC dbo.InsertSimpleOrderDetail @OrderId = 6, @ProductId = 2, @Quantity = 0;

In SQL Server Management Studio, you can see that an error happened. You should get a message that the error 547 occurred, saying that the INSERT statement conflicted with the CHECK constraint. If you remember, in the order details, only rows where the value for the quantity is not equal to zero are allowed. The error occurred in the second statement, in the call of the procedure that inserts an order detail. The procedure that inserted an order executed without an error. Therefore, an order with an id equal to six must be in the dbo.SimpleOrders table. The following code tries to insert order six again.

EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE';

Of course, another error occurred. This time it should be error 2627, a violation of the PRIMARY KEY constraint. The values of the OrderId column must be unique. Let's check the state of the data after these successful and unsuccessful inserts.

SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

The previous query checks only orders and their associated details where the order id value is greater than five. The query returns the following result set.

OrderId     OrderDate  Customer ProductId   Quantity
----------- ---------- -------- ----------- -----------
6           2016-07-06 CustE    NULL        NULL

You can see that only the first insert of the order with the id 6 succeeded. The second insert of an order with the same id and the insert of the detail row for order six did not succeed.

You start handling errors by enclosing the statements in the batch you are executing in the BEGIN TRY … END TRY block. You can catch the errors in the BEGIN CATCH … END CATCH block. The BEGIN CATCH statement must be immediately after the END TRY statement. The control of the execution is passed from the try part to the catch part immediately after the first error occurs. In the catch part, you can decide how to handle the errors. If you want to log the data about the error or inform an end user about the details of the error, the following functions might be very handy:

ERROR_NUMBER() – this function returns the number of the error.
ERROR_SEVERITY() – it returns the severity level. The severity of the error indicates the type of problem encountered. Severity levels 11 to 16 can be corrected by the user.
ERROR_STATE() – this function returns the error state number. The error state gives more details about a specific error. You might want to use this number together with the error number to search the Microsoft knowledge base for the specific details of the error you encountered.
ERROR_PROCEDURE() – it returns the name of the stored procedure or trigger where the error occurred, or NULL if the error did not occur within a stored procedure or trigger.
ERROR_LINE() – it returns the line number at which the error occurred.
This might be the line number in a routine if the error occurred within a stored procedure or trigger, or the line number in the batch.
ERROR_MESSAGE() – this function returns the text of the error message.

The following code uses the try…catch block to handle possible errors in the batch of statements, and returns the information about the error using the above-mentioned functions. Note that the error happens in the first statement of the batch.

BEGIN TRY
EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustF';
EXEC dbo.InsertSimpleOrderDetail @OrderId = 6, @ProductId = 2, @Quantity = 5;
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() AS ErrorLine;
END CATCH

There was a violation of the PRIMARY KEY constraint again, because the code tried to insert an order with id six again. The second statement would succeed if you executed it in its own batch, without error handling. However, because of the error handling, the control was passed to the catch block immediately after the error in the first statement, and the second statement never executed. You can check the data with the following query.

SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

The result set should be the same as the result set of the last check of the orders with an id greater than five – a single order without details. The following code produces an error in the second statement.

BEGIN TRY
EXEC dbo.InsertSimpleOrder @OrderId = 7, @OrderDate = '20160706', @Customer = N'CustF';
EXEC dbo.InsertSimpleOrderDetail @OrderId = 7, @ProductId = 2, @Quantity = 0;
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() AS ErrorLine;
END CATCH

You can see that the insert of the order detail violates the CHECK constraint for the quantity. If you check the data with the same query as the last two times again, you would see that there are orders with id six and seven in the data, both without order details.

Using Transactions

Your business logic might require that the insert of the first statement fails when the second statement fails. You might need to revert the changes of the first statement on the failure of the second statement. You can define that a batch of statements executes as a unit by using transactions. The following code shows how to use transactions. Again, the second statement in the batch in the try block is the one that produces an error.

BEGIN TRY
BEGIN TRANSACTION
EXEC dbo.InsertSimpleOrder @OrderId = 8, @OrderDate = '20160706', @Customer = N'CustG';
EXEC dbo.InsertSimpleOrderDetail @OrderId = 8, @ProductId = 2, @Quantity = 0;
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() AS ErrorLine;
IF XACT_STATE() <> 0
ROLLBACK TRANSACTION;
END CATCH

You can check the data again.

SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

Here is the result of the check:

OrderId     OrderDate  Customer ProductId   Quantity
----------- ---------- -------- ----------- -----------
6           2016-07-06 CustE    NULL        NULL
7           2016-07-06 CustF    NULL        NULL

You can see that the order with id 8 does not exist in your data.
Because the insert of the detail row for this order failed, the insert of the order was rolled back as well. Note that in the catch block, the XACT_STATE() function was used to check whether the transaction still exists. If the transaction was rolled back automatically by SQL Server, then the ROLLBACK TRANSACTION statement would produce a new error. The following code drops the objects (in the correct order, due to object constraints) created for the explanation of the DDL and DML statements, programmatic objects, error handling, and transactions.

DROP FUNCTION dbo.Top2OrderDetails;
DROP VIEW dbo.OrdersWithoutDetails;
DROP PROCEDURE dbo.InsertSimpleOrderDetail;
DROP PROCEDURE dbo.InsertSimpleOrder;
DROP TABLE dbo.SimpleOrderDetails;
DROP TABLE dbo.SimpleOrders;

Beyond Relational

The term "beyond relational" is actually only a marketing term. The relational model, used in relational database management systems, is in no way limited to specific data types or specific languages only. However, with the term beyond relational, we typically mean specialized and complex data types that might include spatial and temporal data, XML or JSON data, and extending the capabilities of the Transact-SQL language with CLR languages like Visual C#, or statistical languages like R. SQL Server in versions before 2016 already supported some of the features mentioned. Here is a quick review of this support, which includes:

Spatial data
CLR support
XML data

Defining Locations and Shapes with Spatial Data

In modern applications, many times you want to show your data on a map, using the physical location. You might also want to show the shape of the objects that your data describes. You can use spatial data for tasks like these. You can represent the objects with points, lines, or polygons. From simple shapes you can create complex geometrical or geographical objects, for example cities and roads. Spatial data appears in many contemporary databases. Acquiring spatial data has become quite simple with the Global Positioning System (GPS) and other technologies. In addition, many software packages and database management systems help you work with spatial data. SQL Server supports two spatial data types, both implemented as .NET Common Language Runtime (CLR) data types, from version 2008:

The geometry type represents data in a Euclidean (flat) coordinate system.
The geography type represents data in a round-earth coordinate system.

We need two different spatial data types because of some important differences between them. These differences include units of measurement and orientation.

In the planar, or flat-earth, system, you define the units of measurement. The length of a distance and the surface of an area are given in the same unit of measurement as you use for the coordinates of your coordinate system. You, as the database developer, know what the coordinates mean and what the unit of measure is. In geometry, the distance between the points described with the coordinates (1, 3) and (4, 7) is 5 units, regardless of the units used. You, as the database developer who created the database where you are storing this data, know the context. You know what these 5 units mean: is this 5 kilometers, or 5 inches?

When talking about locations on earth, coordinates are given in degrees of latitude and longitude. This is the round-earth, or ellipsoidal, system. Lengths and areas are usually measured in the metric system, in meters and square meters. However, the metric system is not used for spatial data everywhere in the world.
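As a brief illustration of the planar geometry type and the 5-unit distance mentioned above, here is a minimal sketch; it is not part of the book's demo code, and the SRID value 0 is simply used here to mark the coordinates as unitless planar data.

DECLARE @p1 AS GEOMETRY = geometry::STGeomFromText('POINT (1 3)', 0);
DECLARE @p2 AS GEOMETRY = geometry::STGeomFromText('POINT (4 7)', 0);
-- Returns 5; the unit is whatever the coordinates themselves represent
SELECT @p1.STDistance(@p2) AS DistanceInUnits;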
The spatial reference identifier (SRID) of the geography instance defines the unit of measure. Therefore, whenever measuring some distance or area in the ellipsoidal system, you should always also quote the SRID used, which defines the units.

In the planar system, the ring orientation of a polygon is not an important factor. For example, a polygon described by the points ((0, 0), (10, 0), (0, 5), (0, 0)) is the same as a polygon described by ((0, 0), (5, 0), (0, 10), (0, 0)). You can always rotate the coordinates appropriately to get the same feeling of the orientation. However, in geography, the orientation is needed to completely describe a polygon. Just think of the equator, which divides the earth into two hemispheres. Is your spatial data describing the northern or the southern hemisphere?

The Wide World Importers data warehouse includes the city location in the Dimension.City table. The following query retrieves it for cities in the main part of the USA.

SELECT City, [Sales Territory] AS SalesTerritory, Location AS LocationBinary, Location.ToString() AS LocationLongLat
FROM Dimension.City
WHERE [City Key] <> 0 AND [Sales Territory] NOT IN (N'External', N'Far West');

Here is the partial result of the query.

City         SalesTerritory  LocationBinary       LocationLongLat
------------ --------------- -------------------- -------------------------------
Carrollton   Mideast         0xE6100000010C70...  POINT (-78.651695 42.1083969)
Carrollton   Southeast       0xE6100000010C88...  POINT (-76.5605078 36.9468152)
Carrollton   Great Lakes     0xE6100000010CDB...  POINT (-90.4070632 39.3022693)

You can see that the location is actually stored as a binary string. When you use the ToString() method of the location, you get the default string representation of the geographical point, which is the degrees of longitude and latitude.

If, in SSMS, you send the results of the previous query to a grid, you also get an additional representation of the spatial data in the results pane. Click the Spatial results tab, and you can see the points represented in the longitude – latitude coordinate system, as you can see in the following figure.

Figure 2-1: Spatial results showing customers' locations

If you executed the query, you might have noticed that the spatial data representation control in SSMS has some limitations. It can show only 5,000 objects. The result displays only the first 5,000 locations. Nevertheless, as you can see from the previous figure, this is enough to realize that these points form a contour of the main part of the USA. Therefore, the points represent the customers' locations for customers from the USA.

The following query gives you the details, like location and population, for Denver, Colorado.

SELECT [City Key] AS CityKey, City, [State Province] AS State, [Latest Recorded Population] AS Population, Location.ToString() AS LocationLongLat
FROM Dimension.City
WHERE [City Key] = 114129 AND [Valid To] = '9999-12-31 23:59:59.9999999';

Spatial data types have many useful methods. For example, the STDistance() method returns the shortest distance between two geography instances. This is a close approximation of the geodesic distance, defined as the shortest route between two points on the Earth's surface. The following code calculates this distance between Denver, Colorado, and Seattle, Washington.
DECLARE @g AS GEOGRAPHY;
DECLARE @h AS GEOGRAPHY;
DECLARE @unit AS NVARCHAR(50);
SET @g = (SELECT Location FROM Dimension.City WHERE [City Key] = 114129);
SET @h = (SELECT Location FROM Dimension.City WHERE [City Key] = 108657);
SET @unit = (SELECT unit_of_measure FROM sys.spatial_reference_systems WHERE spatial_reference_id = @g.STSrid);
SELECT FORMAT(@g.STDistance(@h), 'N', 'en-us') AS Distance, @unit AS Unit;

The result of the previous batch is below.

Distance      Unit
------------- ------
1,643,936.69  metre

Note that the code uses the sys.spatial_reference_systems catalog view to get the unit of measure for the distance of the SRID used to store the geographical instances of data. The unit is the meter. You can see that the distance between Denver, Colorado, and Seattle, Washington, is more than 1,600 kilometers.

The following query finds the major cities within a circle of 1,000 km around Denver, Colorado. Major cities are defined as cities with a population larger than 200,000.

DECLARE @g AS GEOGRAPHY;
SET @g = (SELECT Location FROM Dimension.City WHERE [City Key] = 114129);
SELECT DISTINCT City, [State Province] AS State, FORMAT([Latest Recorded Population], '000,000') AS Population, FORMAT(@g.STDistance(Location), '000,000.00') AS Distance
FROM Dimension.City
WHERE Location.STIntersects(@g.STBuffer(1000000)) = 1 AND [Latest Recorded Population] > 200000 AND [City Key] <> 114129 AND [Valid To] = '9999-12-31 23:59:59.9999999'
ORDER BY Distance;

Here is the result, abbreviated to the twelve closest cities to Denver, Colorado.

City              State       Population  Distance
----------------- ----------- ----------- -----------
Aurora            Colorado    325,078     013,141.64
Colorado Springs  Colorado    416,427     101,487.28
Albuquerque       New Mexico  545,852     537,221.38
Wichita           Kansas      382,368     702,553.01
Lincoln           Nebraska    258,379     716,934.90
Lubbock           Texas       229,573     738,625.38
Omaha             Nebraska    408,958     784,842.10
Oklahoma City     Oklahoma    579,999     809,747.65
Tulsa             Oklahoma    391,906     882,203.51
El Paso           Texas       649,121     895,789.96
Kansas City       Missouri    459,787     898,397.45
Scottsdale        Arizona     217,385     926,980.71

There are many more useful methods and properties implemented in the two spatial data types. In addition, you can improve the performance of spatial queries with the help of specialized spatial indexes. Please refer to the MSDN article "Spatial Data (SQL Server)" at https://msdn.microsoft.com/en-us/library/bb933790.aspx for more details on the spatial data types, their methods, and spatial indexes.

XML Support in SQL Server

SQL Server in version 2005 started to feature extended support for XML data inside the database engine, although some basic support was already included in version 2000. The support starts with generating XML data from tabular results. You can use the FOR XML clause of the SELECT statement for this task. The following query generates an XML document from the regular tabular result set by using the FOR XML clause with the AUTO option, to generate an element-centric XML instance, with the namespace and inline schema included.

SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Dimension.Customer AS c
INNER JOIN Fact.Sale AS f ON c.[Customer Key] = f.[Customer Key]
WHERE c.[Customer Key] IN (127, 128)
FOR XML AUTO, ELEMENTS, ROOT('CustomersOrders'), XMLSCHEMA('CustomersOrdersSchema');
GO

Here is the partial result of this query.
The first part of the result is the inline schema.

<CustomersOrders>
<xsd:schema targetNamespace="CustomersOrdersSchema" …
<xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes" …
<xsd:element name="c">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="CustomerKey" type="sqltypes:int" />
<xsd:element name="CustomerId" type="sqltypes:int" />
<xsd:element name="Customer">
<xsd:simpleType>
<xsd:restriction base="sqltypes:nvarchar" …
<xsd:maxLength value="100" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
…
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
<c >
<CustomerKey>127</CustomerKey>
<CustomerId>127</CustomerId>
<Customer>Tailspin Toys (Point Roberts, WA)</Customer>
<BuyingGroup>Tailspin Toys</BuyingGroup>
<f>
<Quantity>3</Quantity>
<Amount>48.00</Amount>
<Profit>31.50</Profit>
</f>
<f>
<Quantity>9</Quantity>
<Amount>2160.00</Amount>
<Profit>1363.50</Profit>
</f>
</c>
<c >
<CustomerKey>128</CustomerKey>
<CustomerId>128</CustomerId>
<Customer>Tailspin Toys (East Portal, CO)</Customer>
<BuyingGroup>Tailspin Toys</BuyingGroup>
<f>
<Quantity>84</Quantity>
<Amount>420.00</Amount>
<Profit>294.00</Profit>
</f>
</c>
…
</CustomersOrders>

You can also do the opposite process: convert XML to tables. Converting XML to relational tables is known as shredding XML. You can do this by using the nodes() method of the XML data type or with the OPENXML() rowset function. Inside SQL Server, you can also query the XML data from Transact-SQL to find specific elements, attributes, or XML fragments. XQuery is a standard language for browsing XML instances and returning XML, and is supported inside the XML data type methods.

You can store XML instances inside a SQL Server database in a column of the XML data type. The XML data type includes five methods that accept XQuery as a parameter. The methods support querying (the query() method), retrieving atomic values (the value() method), existence checks (the exist() method), modifying sections within the XML data (the modify() method), as opposed to overriding the whole thing, and shredding XML data into multiple rows in a result set (the nodes() method).

The following code creates a variable of the XML data type to store an XML instance in it. Then it uses the query() method to return XML fragments from the XML instance. This method accepts an XQuery query as a parameter. The XQuery query uses FLWOR expressions to define and shape the XML returned.

DECLARE @x AS XML;
SET @x = N'
<CustomersOrders>
<Customer custid="1">
<!-- Comment 111 -->
<companyname>CustA</companyname>
<Order orderid="1">
<orderdate>2016-07-01T00:00:00</orderdate>
</Order>
<Order orderid="9">
<orderdate>2016-07-03T00:00:00</orderdate>
</Order>
<Order orderid="12">
<orderdate>2016-07-12T00:00:00</orderdate>
</Order>
</Customer>
<Customer custid="2">
<!-- Comment 222 -->
<companyname>CustB</companyname>
<Order orderid="3">
<orderdate>2016-07-01T00:00:00</orderdate>
</Order>
<Order orderid="10">
<orderdate>2016-07-05T00:00:00</orderdate>
</Order>
</Customer>
</CustomersOrders>';
SELECT @x.query('for $i in CustomersOrders/Customer/Order
let $j := $i/orderdate
where $i/@orderid < 10900
order by ($j)[1]
return
<Order-orderid-element>
<orderid>{data($i/@orderid)}</orderid>
{$j}
</Order-orderid-element>')
AS [Filtered, sorted and reformatted orders with let clause];

Here is the result of the previous query.
<Order-orderid-element>
  <orderid>1</orderid>
  <orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>3</orderid>
  <orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>9</orderid>
  <orderdate>2016-07-03T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>10</orderid>
  <orderdate>2016-07-05T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>12</orderid>
  <orderdate>2016-07-12T00:00:00</orderdate>
</Order-orderid-element>
Summary In this article, you got a review of the SQL Server features for developers that already exist in previous versions. As you can see, this support goes well beyond basic SQL statements, and even beyond pure Transact-SQL. Resources for Article: Further resources on this subject: Configuring a MySQL linked server on SQL Server 2008 [article] Basic Website using Node.js and MySQL database [article] Exception Handling in MySQL for Python [article]

Bitcoin

Packt
10 Feb 2017
13 min read
In this article by Imran Bashir, the author of the book Mastering Blockchain, we will look at bitcoin and its importance as an electronic cash system. (For more resources related to this topic, see here.) Bitcoin is the first application of blockchain technology. In this article, readers will be introduced to the bitcoin technology in detail. Bitcoin started a revolution by introducing the very first fully decentralized digital currency, one that has proven to be extremely secure and stable. It has also sparked great interest in academic and industrial research and opened up many new research areas. Since its introduction in 2008, bitcoin has gained much popularity and is currently the most successful digital currency in the world, with billions of dollars invested in it. It is built on decades of research in the fields of cryptography, digital cash, and distributed computing. In the following section, a brief history is presented to provide the background required to understand the foundations behind the invention of bitcoin. Digital currencies have been an active area of research for many decades. Early proposals to create digital cash go as far back as the early 1980s. In 1982, David Chaum proposed a scheme that used blind signatures to build an untraceable digital currency. In this scheme, a bank would issue digital money by signing a blinded, random serial number presented to it by the user. The user could then use the digital token signed by the bank as currency. The limitation of this scheme is that the bank has to keep track of all used serial numbers. It is a central system by design and has to be trusted by its users. Later, in 1990, David Chaum proposed a refined version named ecash that used not only blind signatures but also some private identification data to craft a message that was then sent to the bank. This scheme allowed detection of double spending but did not prevent it: if the same token was used at two different locations, the identity of the double spender would be revealed. ecash could only represent a fixed amount of money. Adam Back's hashcash, introduced in 1997, was originally proposed to thwart email spam. The idea behind hashcash is to solve a computational puzzle that is easy to verify but comparatively difficult to compute. For a single user sending a single email, the extra computational effort is not noticeable, but someone sending a large number of spam emails would be discouraged, as the time and resources required to run the spam campaign would increase substantially. B-money was proposed by Wei Dai in 1998 and introduced the idea of using proof of work to create money. A major weakness of the system was that an adversary with higher computational power could generate unsolicited money without giving the network a chance to adjust to an appropriate difficulty level. The system also lacked details on the consensus mechanism between nodes, and some security issues, such as Sybil attacks, were not addressed. Around the same time, Nick Szabo introduced the concept of bit gold, which was also based on a proof of work mechanism and had the same problems as b-money, with one exception: the network difficulty level was adjustable. Tomas Sander and Amnon Ta-Shma introduced an ecash scheme in 1999 that, for the first time, used merkle trees to represent coins and zero-knowledge proofs to prove possession of coins. The scheme required a central bank that kept a record of all used serial numbers, but it allowed users to be fully anonymous, albeit at some computational cost.
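The hashcash idea mentioned above, a puzzle that is cheap to verify but costly to produce, is easy to sketch in a few lines of Python. This is a toy illustration of the proof-of-work principle rather than the actual hashcash stamp format; the message string and the difficulty of four leading zero hex digits are arbitrary choices made for this example.

# pow_sketch.py -- hashcash-style proof of work: hard to produce, cheap to verify
import hashlib
from itertools import count

def mine(message: str, difficulty: int) -> int:
    """Find a nonce whose SHA256(message + nonce) starts with `difficulty` zero hex digits."""
    target = '0' * difficulty
    for nonce in count():
        digest = hashlib.sha256(f'{message}{nonce}'.encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(message: str, nonce: int, difficulty: int) -> bool:
    """Verification is a single hash, no matter how long mining took."""
    digest = hashlib.sha256(f'{message}{nonce}'.encode()).hexdigest()
    return digest.startswith('0' * difficulty)

if __name__ == '__main__':
    msg = 'spam-resistant stamp for alice@example.com'
    nonce = mine(msg, difficulty=4)      # a few seconds of work at most
    print(nonce, verify(msg, nonce, 4))  # True

The cost of mining grows exponentially with the difficulty parameter, while verification stays a single hash; this asymmetry is exactly what hashcash, b-money, and later bitcoin rely on.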
RPOW (Reusable Proof of Work) was introduced by Hal Finney in 2004 and used Adam Back's hashcash scheme as proof of the computational resources spent to create the money. This, too, was a centralized system that kept a central database to track all used PoW tokens. It was an online system that used remote attestation, made possible by a trusted computing platform (TPM hardware). All the schemes mentioned above are intelligently designed, but each is weak in one aspect or another. In particular, all of them rely on a central server that the users are required to trust. Bitcoin In 2008, the bitcoin paper, Bitcoin: A Peer-to-Peer Electronic Cash System, was written by Satoshi Nakamoto. The first key idea introduced in the paper is that bitcoin is a purely peer-to-peer electronic cash that does not need an intermediary bank to transfer payments between peers. Bitcoin is built on decades of cryptographic research, such as merkle trees, hash functions, public key cryptography, and digital signatures. Moreover, ideas such as bit gold, b-money, hashcash, and cryptographic time-stamping provided the foundations for the invention of bitcoin. All these technologies are cleverly combined in bitcoin to create the world's first decentralized currency. The key issue addressed in bitcoin is an elegant solution to the Byzantine Generals problem, along with a practical solution to the double-spend problem. The value of bitcoin has increased significantly since 2011, as shown in the graph below: Bitcoin price and volume since 2012 (on logarithmic scale) Regulation of bitcoin is a controversial subject; as much as it is a libertarian's dream, law enforcement agencies and governments are proposing various regulations to control it, such as the BitLicense issued by the New York State Department of Financial Services, a license issued to businesses that perform activities related to virtual currencies. The growth of bitcoin is also due to the so-called network effect. Also called demand-side economies of scale, this concept basically means that the more users use the network, the more valuable it becomes. Over time, an exponential increase has been seen in the growth of the bitcoin network. Even though the price of bitcoin is quite volatile, it has increased significantly over the last few years. Currently (at the time of writing), the bitcoin price is 815 GBP. Bitcoin definition Bitcoin can be defined in various ways: it is a protocol, a digital currency, and a platform. It is a combination of a peer-to-peer network, protocols, and software that facilitate the creation and usage of the digital currency named bitcoin. Note that Bitcoin with a capital B is used to refer to the Bitcoin protocol, whereas bitcoin with a lowercase b is used to refer to bitcoin, the currency. Nodes in this peer-to-peer network talk to each other using the Bitcoin protocol. Decentralization of currency was made possible for the first time with the invention of Bitcoin. Moreover, the double-spending problem was solved in an elegant and ingenious way in bitcoin. The double-spending problem arises when, for example, a user sends coins to two different users at the same time and both transactions are verified independently as valid. Keys and addresses Elliptic curve cryptography is used to generate public and private key pairs in the Bitcoin network. The Bitcoin address is created by taking the public key corresponding to a private key and hashing it twice, first with the SHA256 algorithm and then with RIPEMD160.
The resultant 160-bit hash is then prefixed with a version number and finally encoded with Base58Check encoding scheme. The bitcoin addresses are 26 to 35 characters long and begin with digit 1 or 3. A typical bitcoin address looks like a string shown as follows: 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt This is also commonly encoded in a QR code for easy sharing. The QR code of the preceding shown address is as follows: QR code of a bitcoin address 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt There are currently two types of addresses, commonly used P2PKH and another P2SH type starting with 1 and 3 respectively. In early days bitcoin used direct Pay-to-Pubkey which is now superseded by P2PKH. However direct Pay-to-Pubkey is still used in bitcoin for coinbase addresses. Addresses should not be used for more than once otherwise privacy and security issues can arise. Avoiding address reuse circumvents anonymity issues to some extent, bitcoin has some other security issues also, such as transaction malleability which requires different approach to resolve. from bitaddress.org private key and bitcoin address in a paper wallet Public keys in bitcoin In Public key cryptography, public keys are generated from private keys. Bitcoin uses ECC based on SECP256K1 standard. A private key is randomly selected and is 256-bit in length. Public keys can be presented in uncompressed or compressed format. Public keys are basically x and y coordinates on an elliptic curve and in uncompressed format are presented with a prefix of 04 in hexadecimal format. X and Y co-ordinates are both 32-bit in length. In total the compressed public key is 33 bytes long as compared to 65 bytes in uncompressed format. Compressed version of public keys basically include only X part, since Y part can be derived from it. The reason why compressed version of public keys works is that bitcoin client initially used uncompressed keys, but starting from bitcoin core client 0.6 compressed keys are used as standard. Keys are identified by various prefixes described as follows: Uncompressed public keys used 0x04 as prefix. Compressed public key starts with 0x03 if the y 32-bit part of the public key is odd. Compressed public key starts with 0x02 if the y 32-bit part of the public key is even. More mathematical description and reason why it works is described later. If the ECC graph is visualized it reveals that the y co-ordinate can be either below the x-axis or above the x-axis and as the curve is symmetric only the location in the prime field is required to be stored. Private keys in bitcoin Private keys are basically 256-bit numbers chosen in the range specified by SECP256K1 ECDSA recommendation. Any randomly chosen 256-bit number from 0x1 to 0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFE BAAE DCE6 AF48 A03B BFD2 5E8C D036 4140 is a valid private key. Private keys are usually encoded using Wallet Import Format (WIF) in order to make them easier to copy and use. WIF can be converted to private key and vice versa. Steps are described later. Also Mini Private Key Format is used sometimes to encode the key in under 30 characters to allow storage where physical space is limited, for example, etching on physical coins or damage resistant QR codes. Bitcoin core client also allows encryption of the wallet which contains the private keys. Bitcoin currency units Bitcoin currency units are described as follows. Smallest bitcoin denomination is Satoshi. 
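Before looking at Base58Check more closely, here is a minimal Python sketch of the address derivation described above: SHA256, then RIPEMD160, then a version prefix and a Base58Check encoding (covered in the next section). It is for illustration only, not for real wallets; the public key below is a made-up placeholder, and hashlib.new('ripemd160') assumes that the OpenSSL build backing your Python installation provides RIPEMD-160.

# address_sketch.py -- illustrative only; not for real wallets
import hashlib

B58_ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

def base58check_encode(version: bytes, payload: bytes) -> str:
    # Append a 4-byte checksum (double SHA256) and convert to base 58.
    data = version + payload
    checksum = hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]
    full = data + checksum
    num = int.from_bytes(full, 'big')
    encoded = ''
    while num > 0:
        num, rem = divmod(num, 58)
        encoded = B58_ALPHABET[rem] + encoded
    # Each leading zero byte is represented by the character '1'.
    pad = len(full) - len(full.lstrip(b'\x00'))
    return '1' * pad + encoded

def pubkey_to_address(pubkey: bytes) -> str:
    # hash160 = RIPEMD160(SHA256(public key)); needs OpenSSL's ripemd160.
    sha = hashlib.sha256(pubkey).digest()
    h160 = hashlib.new('ripemd160', sha).digest()
    # 0x00 is the version prefix for mainnet P2PKH addresses (they start with '1').
    return base58check_encode(b'\x00', h160)

if __name__ == '__main__':
    # Placeholder compressed public key (33 bytes) purely for illustration.
    dummy_pubkey = bytes.fromhex('02' + '11' * 32)
    print(pubkey_to_address(dummy_pubkey))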
Base58Check encoding This encoding is used to limit the confusion between various characters such as 0OIl as they can look same in different fonts. The encoding basically takes the binary byte arrays and converts them into human readable string. This string is composed by utilizing a set of 58 alphanumeric symbols. More explanation and logic can be found in base58.h source file in bitcoin source code. Explanation from bitcoin source code Bitcoin addresses are encoded using Base58check encoding. Vanity addresses As the bitcoin addresses are based on base 58 encoding, it is possible to generate addresses that contain human readable messages. An example is shown as follows: Public address encoded in QR Vanity addresses are generated by using a purely brute force method. An example is shown as follows: Vanity address generated from https://bitcoinvanitygen.com/ Transactions Transactions are at the core of bitcoin ecosystem. Transactions can be as simple as just sending some bitcoins to a bitcoin address or it can be quite complex depending on the requirements. Each transaction is composed of at least one input and output. Inputs can be thought of as coins being spent that have been created in a previous transaction and outputs as coins being created. If a transaction is minting new coins then there is no input and therefore no signature is needed. If a transaction is to send coins to some other user (a bitcoin address), then it needs to be signed by the sender with their private key and also a reference is required to the previous transaction to show the origin of the coins. Coins are in fact unspent transactions outputs represented in Satoshis. Transactions are not encrypted and are publicly visible in the blockchain. Blocks are made up of transactions and these can be viewed by using any online blockchain explorer. Transaction life cycle A user/sender sends a transaction using wallet software or some other interface. Wallet software signs the transaction using the sender's private key. Transaction is broadcasted to the Bitcoin network using a flooding algorithm. Mining nodes include this transaction in the next block to be mined. Mining starts and once a miner who solves the Proof of Work problem broadcasts the newly mined block to the network. Proof of Work is explained in detail later. The nodes verify the block and propagate the block further and confirmation start to generate. Finally the confirmations start to appear in the receiver's wallet and after approximately 6 confirmations the transaction is considered finalized and confirmed. However 6 is just a recommended number , the transaction can be considered final even after first confirmation. The key idea behind waiting for six confirmations is that the probability of double spending virtually eliminates after 6 confirmations. Transaction structure A transaction at a high level contains metadata, inputs and outputs. Transactions are combined together to create a block. The transaction structure is shown in the following table: Field Size Description Version Number 4 bytes Used to specify rules to be used by the miners and nodes for transaction processing. Input counter 1 to 9 bytes Number of inputs included in the transaction. list of inputs variable Each input is composed of several fields including Previous Transaction hash, Previous Txout-index, Txin-script length, Txin-script and optional sequence number. The first transaction in a block is also called coinbase transaction. Specifies on or more transaction inputs. 
Out-counter 1 to 9 bytes positive integer representing the number of outputs. list of outputs variable Outputs included in the transaction. lock_time 4 bytes It defines the earliest time when a transaction becomes valid. It is either a Unix timestamp or block number. MetaData: This part of the transaction contains some values like size of transaction, number of inputs and outputs, hash of the transaction and a lock_time field. Every transaction has a prefix specifying the version number. Inputs: Generally each input spends a previous output. Each output is considered a UTXO, Unspent transaction output until an input consumes it. Outputs: Outputs have only two fields and it contains instructions for sending bitcoins. First field contains the amount of Satoshis where as second field is a locking script which contains the conditions that needs to be met in order for the output to be spent. More information about transaction spending by using locking and unlocking scripts and producing outputs is discussed later. Verification: Verification is performed by using Bitcoin's scripting language Summary In this article, we learned the importance of bitcoin in digital currency and how bitcoins are encoded using various private keys and encoding techniques. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [article] Protecting Your Bitcoins [article] FPGA Mining [article]

Detecting Fake Reviews

Janu Verma
10 Feb 2017
6 min read
Product and service reviews play an important role in making purchase decisions. In current times, when we are faced with many choices, the opinion-based reviews help us narrow down the options and make decisions based on our needs. This is especially true online, where the reviews are easily accessible. Some companies where review-based decisions are very prominent are Amazon, TripAdviser, Yelp, and AirBnB, to name a few. From a business point of view, positive reviews can result in significant financial benefits. This also provides opportunities for deception, where fake reviews can be generated to garner positive opinion about a product, or to disrepute some business. To ensure credibility of the reviews posted on a platform, it is important to use a strong detecting model. In this post, we’ll talk about some methods for detecting fake reviews. The models discussed here fall into three categories: Textbased, Sentimentbased, and Userbased. Text-based Model This approach to classify fake and non-fake reviews is very similar to the ideas used in spam—classification. By creating the linguistic n-gram features and using a supervised learning algorithm such as Naive Bayes or SVM, one can construct the classification model. This approach, of course, relies on the assumption that the fake and non-fake reviews consist of words with significantly different frequencies. In case the spammers had a little knowledge of the product, or they didn’t have a genuine interest in writing the reviews (for example, the cheaply paid spammers), there are more chances of them creating reviews linguistically different from the non-fake ones. We don’t have any reason to believe that the spammer won’t be careful enough to create reviews linguistically similar to the genuine ones, or have strong inclinations to write fake opinions. In that case, the pure text-based models won’t be successful. We will need to incorporate more information. Other features from the review Length of the review: Even if a spammer tried to use words similar to real reviews, he probably didn’t spend much time in writing the review. Thus, length of the fake-review is smaller than the other reviews of the same product. Lack of domain knowledge also increases the chances of a shorter review. Also, it could have happened that the spammer tried to overdo his job and wrote a longer review. Deviation from the average rating: There is a high probability for the spamming review to deviate from the general consensus rating for the product or the service. Sentiment-based Model Because the fake reviews are created to enhance the positive opinion or tarnish the image, these reviews should have a strong positive or negative sentiment. Therefore, sentiment analysis of the reviews can be an important tool to separate the spam reviews. Though more sophisticated sentiment analysis methods can be employed, the static AFINN model should give high accuracy as it contains the sentiment scores for the terms, which project very high and low sentiment, and such words are going to be very prominent in the fake reviews. Some of these words include 'nice', 'great', 'awesome', 'amazing', 'bad', 'awful', 'helpful', 'shitty', and so on. In AFINN model, the authors have computed sentiment scores for a list of words. Compute the sentiment of a review based on the sentiment scores of the terms in the review. The sentiment of the review is defined to be equal to the sum of the sentiment scores for each term in the review. 
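As a concrete sketch of this scoring rule, the short Python function below tokenizes a review and sums the per-term AFINN scores. The file name and the tiny inline dictionary in the demo are illustrative stand-ins; in practice you would load the full AFINN word list described next.

# afinn_sentiment.py -- minimal sketch of term-sum sentiment scoring
import re

def load_afinn(path='AFINN-111.txt'):
    """Load the AFINN word list: one 'term<TAB>score' pair per line."""
    scores = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            term, score = line.rsplit('\t', 1)
            scores[term] = int(score)
    return scores

def review_sentiment(review, afinn):
    """Sentiment of a review = sum of AFINN scores of its terms."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return sum(afinn.get(token, 0) for token in tokens)

if __name__ == '__main__':
    # Tiny hand-made subset used only so the example runs without the file.
    afinn = {'great': 3, 'awesome': 4, 'amazing': 4, 'bad': -3, 'awful': -3}
    print(review_sentiment('Awesome product, great value!', afinn))   # 7
    print(review_sentiment('Awful quality and bad support.', afinn))  # -6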
The AFINN is a list of English words rated for valence with an integer value between -5 and 5. The words have been manually labelled by Finn Arup Neilsen in 2009-2010. If the spammer made an attempt to sound convincing by using words that have high positive or negative sentiment, this model can be very successful. Even if the numeric rating of the spammer does not deviate much from the general consensus, the text reviews are going to be overwhelmingly positive. In such cases, sentiment scores of the reviews can shed light on the problem of detecting fake reviews, for example, you can compute the sentiment scores of all the 5-star reviews and see if some reviews have extremely high sentiment scores. User-based Model The user-based model asserts that a spamming user displays an abnormal behavior, and it is possible to classify users as spammers and non-spammers. The user information can be extracted from their public profiles. The relevant features include: Number of reviews: A spammer is likely to create a lot of reviews, and this can be used to identify fake reviewers. Most of the users create not more than 1 review per day. Average review length: As mentioned earlier, a spammer is not going to invest much time in creating his reviews (especially when you are being paid by number of the reviews you write) and is more likely to create shorter reviews. Number of positive votes: Most of the fake reviews tend to be extremely positive. A high percent of strong positive votes indicated abnormal behavior. Non-fake reviewers have varying rating levels. Deviation from other users: One can compute the deviation of the spammer from the rest of the users by averaging over all of his reviews the deviation from the other ratings for the same product. A standard learning algorithm,such as SVM or Random Forests, on these features can create a classification model for fake reviewers and non-fake reviewers. Other than these important features, there are some other features that can be extracted from the user’s profile, which can be used in detecting fake reviews. Geographical Information: A user who is reviewing location-based products (for example, businesses on Yelp) at two or more locations in a day is surely exhibiting suspicious behavior. The credit card companies use this kind of information to track down scams. Activity: On social sites (for example, Yelp, Foursquare, and more), the account activity can also be an indicator of abnormal behavior. Users with a friend base and who post share check-ins on Facebook and Twitter are mostly genuine. In fact, linking your other accounts is a positive indicator. Useful votes: Yelp also allows its users to vote on a review, and the number of people of ’useful’ votes for a review can also be used to classify spammers and non-spammers. Conclusions In this post, we discussed three different categories of models for detecting online fake reviews. Though the basic text-mining approach should detect spam reviews with reasonable accuracy, a smart spammer can make it harder for this model to classify. The sentiment-based model and user behavior can help achieve better accuracy in filtering false opinions. We propose that a combination of these models can be very effective in detecting fake reviews. 
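To close with something runnable, here is a minimal sketch of the user-based model: a Random Forest trained on the behavioral features discussed above (reviews per day, average review length, share of strongly positive votes, and rating deviation). All feature values and labels below are fabricated purely to show the shape of the data; real values would be extracted from user profiles as described in the user-based section.

# user_model_sketch.py -- illustrative classifier on user-level features
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: reviews_per_day, avg_review_length, share_of_5_star, rating_deviation
# The numbers below are made up purely to show the shape of the data.
X = np.array([
    [0.2, 420, 0.30, 0.4],   # ordinary user
    [0.5, 380, 0.45, 0.6],   # ordinary user
    [6.0,  60, 0.95, 2.1],   # suspected spammer
    [4.5,  75, 0.90, 1.8],   # suspected spammer
])
y = np.array([0, 0, 1, 1])   # 1 = spammer, 0 = genuine

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Score a new user profile.
new_user = np.array([[5.2, 55, 0.97, 2.0]])
print(clf.predict(new_user))          # likely [1]
print(clf.predict_proba(new_user))    # class probabilities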
References
[1] Paul Graham, "A Plan for Spam".
[2] More information and a link to download the AFINN word list are available online.
[3] Finn Årup Nielsen, "Evaluation of a word list for sentiment analysis in microblogs", Proceedings of the ESWC2011 Workshop on "Making Sense of Microposts: Big things come in small packages", CEUR Workshop Proceedings vol. 718, pp. 93-98, May 2011.

Cloud Native Applications

Packt
09 Feb 2017
5 min read
In this article by Ranga Rao Karanam, the author of the book Mastering Spring, we will see what are Cloud Native applications and Twelve Factor App. (For more resources related to this topic, see here.) Cloud Native applications Cloud is disrupting the world. A number of possibilities emerge that were never possible before. Organizations are able to provision computing, network and storage devices on demand. This has high potential to reduce costs in a number of industries. Consider the retail industry where there is high demand in pockets (Black Friday, Holiday Season and so on). Why should they pay for hardware round the year when they could provision it on demand? While we would like to be benefit from the possibilities of the cloud, these possibilities are limited by architecture and the nature of applications. How do we build applications that can be easily deployed on the cloud? That's where Cloud Native applications come into picture. Cloud Native applications are those that can easily be deployed on the cloud. These applications share a few common characteristics. We will begin with looking at Twelve Factor App - A combination of common patterns among Cloud Native applications. Twelve Factor App Twelve Factor App evolved from experiences of engineers at Heroku. This is a list of patterns that are typically used in Cloud Native application architectures. It is important to note, that an App here refers to a single deployable unit. Essentially every microservice is an App (because each microservice is independently deployable). One codebase Each App has one codebase in revision control. There can be multiple environments where the App can be deployed. However, all these environments use code from a single codebase. An example for anti-pattern is building a deployable from multiple codebases. Dependencies Explicitly declare and isolate dependencies. Typical Java applications use build management tools like Maven and Gradle to isolate and track dependencies. The following screenshot shows the typical Java applications managing dependencies using Maven: The following screenshot shows the content of the file: Config All applications have configuration that varies from one environment to another environment. Configuration is typically littered at multiple locations - Application code, property files, databases, environment variables, Java Naming and Directory Interface (JNDI) and system variables are a few examples. A Twelve Factor App should store config in the environment. While environment variables are recommended to manage configuration in a Twelve Factor App, other alternatives like having a centralized repository for application configuration should be considered for more complex systems. Irrespective of mechanism used, we recommended to manage configuration outside application code (independent of the application deployable unit). Use one standardized way of configuration Backing services Typically applications depend on other services being available - data-stores, external services among others. Twelve Factor App treats backing services as attached resources. A backing service is typically declared via an external configuration. Loose coupling to a backing service has many advantages including ability to gracefully handle an outage of a backing service. Build, release, run Strictly separate build and run stages. Build: Creates an executable bundle (ear, war or jar) from code and dependencies that can be deployed to multiple environments. 
Release: Combine the executable bundle with specific environment configuration to deploy in an environment. Run: Run the application in an execution environment using a specific release An anti-pattern is to build separate executable bundles specific for each environment. Stateless A Twelve Factor App does not have state. All data that it needs is stored in a persistent store. An anti-pattern is a sticky session. Port binding A Twelve Factor App exposes all services using port binding. While it is possible to have other mechanisms to expose services, these mechanisms are implementation dependent. Port binding gives full control of receiving and handling messages irrespective of where an application is deployed. Concurrency A Twelve Factor App is able to achieve more concurrency by scaling out horizontally. Scaling vertically has its limits. Scaling out horizontally provides opportunities to expand without limits. Disposability A Twelve Factor App should promote elastic scaling. Hence, they should be disposable. They can be started and stopped when needed. A Twelve Factor App should: Have minimum start up time. Long start up times means long delay before an application can take requests. Shutdown gracefully. Handle hardware failures gracefully. Environment parity All the environments - development, test, staging, and production - should be similar. They should use same processes and tools. With continuous deployment, they should have similar code very frequently. This makes finding and fixing problems easier. Logs as event streams Visibility is critical to a Twelve Factor App. Since applications are deployed on the cloud and are automatically scaled, it is important to have a centralized visibility into what's happening across different instances of the applications. Treating all logs as stream enables routing of the log stream to different destinations for viewing and archival. This stream can be used to debug issues, perform analytics and create alerting systems based on error patterns. No distinction of admin processes Twelve Factor Apps treat administrative tasks (migrations, scripts) similar to normal application processes. Summary This article thus explains about Cloud Native applications and what are Twelve Factor Apps. Resources for Article: Further resources on this subject: Cloud and Async Communication [article] Setting up of Software Infrastructure on the Cloud [article] Integrating Accumulo into Various Cloud Platforms [article]

Deploying and Synchronizing Azure Active Directory

Packt
08 Feb 2017
5 min read
In this article by Jan-henrik Damaschke and Oliver Michalski, authors of the book Implementing Azure Solutions, we will see that as cloud services grow, identity management, security, and access policies within cloud environments become more essential and important. Microsoft's central instance for this is Azure Active Directory: every security policy or identity Microsoft provides for its cloud services is based on Azure Active Directory. In this article you will learn the basics of Azure Active Directory, and how to implement Azure AD and hybrid Azure Active Directory with a connection to Active Directory Domain Services. We are going to explore the following topics: Azure Active Directory overview Azure Active Directory subscription options Azure Active Directory deployment Azure Active Directory user and subscription management How to deploy Azure Active Directory hybrid identities with Active Directory Domain Services Azure Active Directory hybrid highly available and non-highly available deployments (For more resources related to this topic, see here.) Azure Active Directory Azure Active Directory, or Azure AD, is a multi-tenant, cloud-based directory and identity management service developed by Microsoft. Azure AD includes a full suite of identity management capabilities, including multi-factor authentication, device registration, self-service password management, self-service group management, privileged account management, role-based access control, application usage monitoring, rich auditing, and security monitoring and alerting. Azure AD can be integrated with an existing Windows Server Active Directory, giving organizations the ability to leverage their existing on-premises identities to manage access to cloud-based SaaS applications. After this article you will know how to set up Azure AD and Azure AD Connect, and you will also be able to design a highly available infrastructure for identity replication. The following figure describes the general structure of Azure AD in a hybrid deployment with Active Directory Domain Services: Source: https://azure.microsoft.com/en-us/documentation/articles/active-directory-whatis/ Customers who already use Office 365, CRM Online, or Intune are already using Azure AD for their service. You can easily tell that you use Azure AD if you have a username like user@domain.onmicrosoft.com (.de or .cn are also possible if you are using Microsoft Cloud Germany or Azure China). Azure AD is a multi-tenant, geo-distributed, highly available service running in 28 or more datacenters around the world. Microsoft has implemented automated failover and keeps a minimum of two copies of your directory in other regional or global datacenters. Normally, your directory runs in your primary datacenter and is replicated into two others in your region. If you only have two Azure datacenters in your region, as in Europe, the copy is distributed to another region outside yours: Azure Active Directory options There are currently three selectable options for Azure Active Directory, each with different features. A fourth option, Azure Active Directory Premium P2, will become available later in 2016.
Azure AD free Supports common features such as: Directory objects with up to 500,000 objects User/group management (add/update/delete)/ user-based provisioning, device registration Single Sign-On for up to 10 applications per user Self-service password change for cloud users Connect and sync with on-premises Active Directory Domain Service Up to 3 basic security and usage reports Azure AD basic Supports common features such as: Directory objects with unlimited objects User/group management (add/update/delete)/ user-based provisioning, device registration Single Sign-On for up to 10 applications per user Self-service password change for cloud users Connect and sync with on-premises Active Directory Domain Service Up to 3 basic security and usage reports Basic features such as: Group-based access management/provisioning Self-service password reset for cloud users Company branding (logon pages/ access panel customization) Application proxy Service level agreement 99,9% Azure AD premium P1 Supports common features such as: Directory Objects with unlimited objects User/group management (add/update/delete)/ user-based provisioning, device registration Single Sign-On for up to 10 applications per user Self-service password change for cloud users Connect and sync with on-premises Active Directory Domain Service Up to 3 basic security and usage reports Basic features such as: Group-based access management/provisioning Self-service password reset for cloud users Company branding (logon pages/ access panel customization) Application proxy Service level agreement 99,9% Premium features such as: Self-service group and app management/self-service application additions/ dynamic groups Self-service password reset/change/unlock with on-premises write-back Multi-factor authentication (the Cloud and on-premises (MFA server)) MIM CAL with MIM server Cloud app discovery Connect health Automatic password rollover for group accounts Within Q3/2016 Microsoft will also enable customers to use the Azure Active Directory P2 plan, includes all the capabilities in Azure AD Premium P1 as well as the new identity protection and privileged identity management capabilities. That is a necessary step for Microsoft to extend it’s offering for Windows 10 Device Management with Azure AD. Currently Azure AD enable Windows 10 customers to join a device to Azure AD, implement SSO for desktops, Microsoft passport for Azure AD and central Administrator Bitlocker recovery. It also adds automatic classification to Active Directory Rights Management Service in Azure AD. Depending on what you plan to do with your Azure Environment, you should choose your Azure Active Directory Option.  

Basic Routing with Express.js

Antonio Cucciniello
08 Feb 2017
6 min read
Want an easy way to have a few routes in your Node.js web application? If you answered yes, you willunderstand how to do exactly that by the end of this post. But first, what is routing? In simple terms, it is how an application is told to respond to a client on a specific endpoint. An endpoint is basically a path or URI and one of the HTTP request methods (GET, POST, and so on). Express is a Node.js framework that allows you to help organize your web applications on the server side. It is one of the most popular node packages and many of the other popular packages on NPM are built using Express. So, today we are going to learn how to use this in our web app! This tutorial assumes that you have Node installed; if you do not, visit this link. Install and Setup The first step is to setup your environment. Make sure that you create a new project directory for this sample app using the following: $ mkdir basic-routing-example I will be using basic-routing-example, but you can use whatever name you would like. Enter that directory using this: $ cd basic-routing-example Now that you are in that directory, create a file called server.jsfor our code. This is the entry point of our application. Now, you need to install the Express.js package with the following command: $ npm install express --save Then, if you are familiar with Node.js development, we need to create a package.json file using this command: $ npm init Now that you are all setup to use Express, it's time to start coding! Code Open the server.js file we created earlier and start editing it by adding this to the top of your file: const express = require('express') const app = express() var port = 3000 The first line allows us to use the Express.js package in our code. The second line creates a variable called app that is an instance of Express. This is how we will access the functionalities of Express. We then create a variable called port to store the port the server will listen on when it is started. Then below our variables, add this: app.get('/', function (req, res) { res.send('This is our home page!') }) app.listen(port, function () { console.log('We are listening on port ' + port) }) The first bit here is our first example of handling a route. It says that we would like to handle a HTTP GET request method on the route /, which in this case is the homepage of our web application. If there is a GET request to the route /, then it will handle it by calling the callback function specified. So upon receiving a GET request, it will send a response to the page with the text This is our home page!. In order to get this to show, we need to have our server listen on a port. We do that with the second bit of code: app.listen(). To test that you have handled this request properly, save the file, and in the command line enter,$ node server.js. The command line should outputWe are listening on port 3000. Then, go to your web browser and go tothe pagelocalhost:3000. Here is a sample image of what you should see on that page:   GET_reqImage Now, let's test this with a second web page for your web application. Suppose you had an About page as part of your site. The route you would want for this page would be /about. Let's see how we would handle a GET Request to that page: app.get('/about', function (req, res) { res.send('This is a basic app with routing in express') }) This is similar to how we handled the GET request for the home page route of /. 
Here, we change the route to /about to specify the About page of our application and we change how we would like to handle it by changing the message being sent in with res.send(). Now, let's test to see if this works by saving our file, running it with: $ node server.js, and then opening up a web browser and going to the URLlocalhost:3000/about. This image is what your webpage should look like:   GET_aboutImage Let's say you wanted to handle another request such as a POST request. We would add the following code to our application: app.post('/', function (req, res) { res.send('we got a POST req from the client') }) This handles a POST request on the route /. To make this fully functional, we would have to go out of the scope of this tutorial, but this is simply an example of how you would handle a different type of HTTP request for a route. Now, what if you had multiple types of HTTP request methods that need to be handled for one route? We could use app.route(). Remove the code for app.get('/') and app.post('/') handlers and add this following code snippet: app.route('/') .get(function (req, res) { res.send('This is our home page!') }) .post(function (req, res) { res.send('we got a POST req from the client') }) This allows us to handle multiple types of HTTP requests for one specific route. In this case, the route is /, and it has a handler for a GET and POST request. Conclusion You made it! You now have a simple web app that can handle different HTTP request methods on different routes. Here is a high-level overview of what we did here: We installed Express.js. Created an instance of Express called appin our code. Used that instance to handle multiple routes. Used that instance to handle different types of HTTP requests. Learned how to use app.route() to simplify handling of HTTP requests for one route. If you enjoyed this post, share it on twitter. Check out the code for this tutorial on GitHub. About the Author Antonio Cucciniello is a software engineer with a background in C, C++, and Javascript (Node.Js) from New Jersey. His most recent project called Edit Docs is an Amazon Echo skill that allows users to edit Google Drive files using theirvoice. He loves building cool things with software and reading books on self-help and improvement, finance, and entrepreneurship. You can find Antonio on Twitter @antocucciniello and on GitHub.

Writing Applications that Scale

Packt
08 Feb 2017
13 min read
In this article by Anand Balachandran Pillai, the author of the book Software Architecture with Python, learn to build complex software architecture for a software application using Python. (For more resources related to this topic, see here.) Imagine the checkout counter of a supermarket on a Saturday evening, the usual rush hour time. It is common to see long queues of people waiting to check out with their purchases. What would a store manager do to reduce the rush and waiting time? A typical manager would try a few approaches includingtelling those manning the checkout counters to pick up their speed and try and redistribute people to different queues so that each queue roughly has the same wait time. In other words, the manager would manage the current load with available resources by optimizing performance of existing resources. However, if the store has existing counters which are not in operation and enough people at hand to manage them, the manager could enable those counters and move people to these new counters,in other words, add resources to the store to scalethe operation. Software systems too scale in a similar way. An existing software application can be scaled by adding compute resources to it. When the system scales by either adding or making better use of resources inside a compute node, such as CPU or RAM, it is said to scale verticallyor scale up. On the contrary, when a system scales by adding more compute nodes to it, such as a creating a load balanced cluster, it is said to scale horizontallyor scale out. The degree to which a software system is able to scale when compute resources are added is called its scalability. Scalability is measured in terms of how much the systems performance characteristics, such as throughput or latency, improve with respect to the addition of resources. For example, if a system doubles its capacity by doubling the number of servers, it is scaling linearly. Increasing the concurrency of a system often increases its scalability. In the preceding supermarket example, the manager is able to scale out his operations by opening additional counters. In other words, he increases the amount of concurrent processing done in his store. Concurrency is the amount of work that gets done simultaneously in a system. We look at different techniques of scaling a software application with Python. We start with concurrency techniques within a machine, such as multithreading and multiprocessing, and go on to discuss asynchronous execution. We also look at how to scale outan application across multiple servers and also some theoretical aspects of scalability and its relation to availability. Scalability andperformance How do we measure the scalability of a system? Let's take an example and see how this could be done. Let's say our application is a simple report generation system for employees. It is able to load employee data from a database and generate a variety of reports at bulk, such as payslips, tax deduction reports, employee leave reports, and so on. The system is able to generate 120 reports per minute––this is the throughputor capacityof the system expressed as the number of successfully completed operations in a given unit of time. Let's say the time it takes to generate a report at the server side (latency) is roughly 2seconds. Let's say, the architect decides to scale up the system by doubling the RAM on its server ––scaling upthe system. 
Once this is done, a test shows that the system is able to increase its throughput to 180 reports per minute. The latency remains the same at 2 seconds. So at this point, the system has scaled close to linearin terms of the memory added. The scalability of the system expressed in terms of throughput increase is as follows: Scalability (throughput) = 180/120 = 1.5X As the second step, the architect decides to double the number of servers on the backend all with the same memory. After this step, he finds that the system's performance throughput has now increased to 350 reports per minute. The scalability achieved by this step is as follows: Scalability (throughput) = 350/180 = 1.9X The system has now responded much better, with a close to linear increase in scalability. After further analysis, the architect finds that by rewriting the code that was processing reports on the server to run in multiple processes instead of a single process, he is able to reduce the processing time at the server and hence the latency of each request by roughly 1 second per request at peak time. The latency has now gone down from 2seconds to 1 second. The system's performance with respect to latency has become better,as follows: Performance (latency) X = 2/1 = 2X How does this affect thescalability ? Since the latency per request has come down, the system overall would be able to respond to similar loads at a faster rate (since processing time per request is lesser now) than what it was able to earlier. In other words, with the exact same resources, the system's throughput performance, and hence scalability, would have increased, assuming other factors remain the same. Let's summarize what we discussed so far in the following lists: First, the architect increased the throughput of a single system by scaling it up by adding extra memory as a resource, which increased the overall scalability of the system. In other words, he scaled the performance of a single system by scaling up which boosted overall performance of the whole system. Next, he added more nodes to the system, and hence its ability to perform work concurrently, and found that the system responded well by rewarding him with a near linear scalability factor. Simply put, he increased the throughput of the system by scaling its resource capacity. In other words, he increased scalability of the system by scaling out by adding more compute nodes. Finally, he made a critical fix by running a computation in more than one process. In other words, he increased the concurrency of a single system by dividing the computation to more than one part. He found that this increased the performance characteristic of the application by reducing its latency, potentially setting up the application to handle workloads better at high stress. 
We find that there is a relation between scalability, performance, concurrency, and latency, as follows: When the performance of a single system goes up, the scalability of the total system goes up. When an application scales in a single machine by increasing its concurrency, it has the potential to improve performance, and hence the net scalability of the system in deployment. When a system reduces its processing time at the server, that is, its latency, it positively contributes to scalability. We have captured the relationships between concurrency, latency, performance, and scalability in the following table:
Concurrency  Latency  Performance  Scalability
High         Low      High         High
High         High     Variable     Variable
Low          High     Poor         Poor
An ideal system is one that has good concurrency and low latency: a system that has high performance and would respond better to scaling up and/or scaling out. A system with high concurrency but also high latency would have variable characteristics; its performance, and hence scalability, would be potentially very sensitive to other factors such as network load, current system load, geographical distribution of compute resources and requests, and so on. A system with low concurrency and high latency is the worst case: it would be difficult to scale such a system, as it has poor performance characteristics. The latency and concurrency issues should be addressed before the architect decides to scale the system either horizontally or vertically. Scalability is always described in terms of variation in performance throughput. Concurrency A system's concurrency is the degree to which the system is able to perform work simultaneously instead of sequentially. An application written to be concurrent can, in general, execute more units of work in a given time than one written to be sequential or serial. When we make a serial application concurrent, we make the application make better use of the existing compute resources in the system (CPU and/or RAM) at a given time. Concurrency, in other words, is the cheapest way of making an application scale inside a machine in terms of the cost of compute resources. Concurrency can be achieved using different techniques. The common techniques are as follows (a short sketch contrasting them appears after this list): Multithreading: The simplest form of concurrency is to rewrite the application to perform parallel tasks in different threads. A thread is the simplest sequence of programming instructions that can be performed by a CPU. A program can consist of any number of threads. By distributing tasks to multiple threads, a program can execute more work simultaneously. All threads run inside the same process. Multiprocessing: The next step up is to scale the program to run in multiple processes instead of a single process. Multiprocessing involves more overhead than multithreading in terms of message passing and shared memory. However, programs that perform a lot of latent operations, such as disk reads, and those that perform a lot of CPU-heavy computation can benefit more from multiple processes than from multiple threads. Asynchronous processing: In this technique, operations are performed asynchronously, in other words, there is no ordering of concurrent tasks with respect to time. Asynchronous processing usually picks tasks from a queue and schedules them to execute at a future time, often receiving the results in callback functions or special future objects. Typically, operations are performed in a single thread.
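To make these three techniques concrete before returning to scalability, here is a small, self-contained sketch that runs the same artificial one-second task three times: first with threads, then with processes, then with asyncio coroutines. The task is just a sleep, chosen as a stand-in for an I/O-bound operation so the example runs anywhere; each variant completes in roughly one second rather than three.

# concurrency_sketch.py -- the same toy task run with threads, processes, and asyncio
import asyncio
import threading
import time
from multiprocessing import Process

def slow_task(name, delay=1.0):
    """Stand-in for an I/O-bound operation (for example, a network call)."""
    time.sleep(delay)
    print(f'{name} done')

def with_threads(n=3):
    threads = [threading.Thread(target=slow_task, args=(f'thread-{i}',)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def with_processes(n=3):
    procs = [Process(target=slow_task, args=(f'process-{i}',)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

async def async_task(name, delay=1.0):
    await asyncio.sleep(delay)   # cooperative: yields control while 'waiting'
    print(f'{name} done')

async def run_all(n):
    await asyncio.gather(*(async_task(f'coro-{i}') for i in range(n)))

def with_asyncio(n=3):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(run_all(n))
    finally:
        loop.close()

if __name__ == '__main__':   # the __main__ guard matters for multiprocessing
    for runner in (with_threads, with_processes, with_asyncio):
        start = time.time()
        runner()
        print(f'{runner.__name__}: {time.time() - start:.2f}s')  # roughly 1s each, not 3s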
There are other forms of concurrent computing, but in this article, we will focus our attention to only these three, hence we are not introducing any other types of concurrent computing here. Python, especially Python3, has built-in support for all these types of concurrent computing techniques in its standard library. For example, it supports multithreading via its threading module and multiple processes via its multiprocessing module. Asynchronous execution support is available via the asynciomodule. A form of concurrent processing which combines asynchronous execution with threads and processes is available via the concurrent.futuresmodule. Concurrency versusparallelism We will take a brief look at the concept of concurrency and its close cousin, namely parallelism. Both concurrency and parallelism are about executing work simultaneously than sequentially. However, in concurrency, the two tasks need not be executing at the exact same time. Instead, they just need to be scheduled to be executed simultaneously. Parallelism, on the other hand, requires that both the tasks execute together at a given pointin time. To take a real-life example, let's say you are painting two exterior walls of your house. You have employed just one painter and you find that he is taking a lot more time than you thought. You can solve the problem in two ways. Instruct the painter to paint a few coats on one wall before switching to the next wall and doing the same there. Assuming he is efficient, he will work on both walls simultaneously (though not at the same time) and achieve same degree completion on both walls for a given time. This is a concurrent solution. Employ one more painter, and instruct first painter to paint the first wall and second painter to paint the second wall. This is a truly parallel solution. For example, two threads performing byte code computations in a single core CPU are not exactly performing parallel computation as the CPU can accommodate only one thread at a time. However, from a programmer's perspective, they are concurrent, since the CPU scheduler performs fast switching in and out of the threads, so theylook parallel in all appearances and purposes. But they are not truly parallel. However on a multi-core CPU, two threads can perform parallel computations at any given time in its different cores. This is true parallelism. Parallel computation requires computation resources to increase at least linearly with respect to its scale. Concurrent computation can be achieved using techniques of multi-tasking where work is scheduled and executed in batches, making better use of existing resources. In this article, we will use the term concurrent nearly uniformly to indicate both types of execution. In some places, it may indicate concurrent processing in the traditional way, and in some other, it may indicate true parallel processing. Kindly use the context to disambiguate. Concurrency in Python –multithreading We will start our discussion on concurrent techniques in Python with multithreading. Python supports multiple threads in programming via its threading module. The threading module exposes a Threadclass thatencapsulates a thread of execution. 
Along with it, it also exposes the following synchronization primitives: Lock object:This is useful for synchronized, protected access to share resources and its cousin RLock Condition object:This is useful for threads to synchronize while waiting for arbitrary conditions Event object: This provides a basic signaling mechanism between threads Semaphore object:This allows synchronized access to limited resources Barrier object:This allows a set of fixed number of threads to wait for each other and synchronize to a particular state and proceed The thread objects in Python can be combined with the synchronized Queueclass in the queue module for implementing thread-safe producer/consumer workflows. Thumbnail generator Let's start our discussion of multithreading in Python with the example of a program which is used to generate thumbnails of image URLs. We use the Python Imaging Library (PIL) for performing the following operation: # thumbnail_converter.py from PIL import Image import urllib.request def thumbnail_image(url, size=(64, 64), format='.png'): """ Save thumbnail of an image URL """ im = Image.open(urllib.request.urlopen(url)) # filename is last part of the URL minus extension + '.format' pieces = url.split('/') filename = ''.join((pieces[-2],'_',pieces[-1].split('.')[0],'_thumb',format)) im.thumbnail(size, Image.ANTIALIAS) im.save(filename) print('Saved',filename) This works very well for single URLs. Let's say we want to convert five image URLs to their thumbnails as shown in the following code snippet: img_urls = ['https://dummyimage.com/256x256/000/fff.jpg', 'https://dummyimage.com/320x240/fff/00.jpg', 'https://dummyimage.com/640x480/ccc/aaa.jpg', 'https://dummyimage.com/128x128/ddd/eee.jpg', 'https://dummyimage.com/720x720/111/222.jpg'] The code forusing the preceding function would be as follows: for url in img_urls: thumbnail_image(urls) Let's see how such a function performs with respect to the time taken: Let's now scale the program to multiple threads so that we can perform the conversions concurrently. Here is the rewritten code to run each conversion in its own thread (not showing the function itself as it hasn't changed): import threading for url in img_urls: t=threading.Thread(target=thumbnail_image,args=(url,)) t.start() Take a look at the response time of the threaded thumbnail convertor for five URLs as shown in the following screenshot: With this change, the program returns in 1.76 seconds, almost equal to the time taken by a single URL in the serial execution we saw earlier. In other words, the program has now linearly scaled with respect to the number of threads. Note that we had to make no change to the function itself to get this scalability boost. Summary In this article, you learned the importance of writing Scalable applications. We also saw the relationships between concurrency, latency, performance, and scalability and the techniques we can use to achieve concurrency. You also learned how to generate the thumbnail of image URLs using PIL. Resources for Article: Further resources on this subject: Putting the Fun in Functional Python [article] Basics of Jupyter Notebook and Python [article] Jupyter and Python Scripting [article]

What is functional reactive programming?

Packt
08 Feb 2017
4 min read
Reactive programming is, quite simply, a programming paradigm where you are working with an asynchronous data flow. There are a lot of books and blog posts that argue about what reactive programming is, exactly, but if you delve too deeply too quickly it's easy to get confused, and then reactive programming doesn't seem useful at all. Functional reactive programming takes the principles of functional programming and uses them to enhance reactive programming. You take functions, such as map, filter, and reduce, and use them to better manage streams of data.

Read now: What is the Reactive manifesto?

How does imperative programming compare to reactive programming?

Imperative programming makes you describe the steps a computer must take to execute a task. In comparison, functional reactive programming gives you the constructs to propagate changes. This means you have to think more about what to do than how to do it.

This can be illustrated with a simple sum of two numbers. In imperative programming, this could be expressed as a = b + c. A single line of code expresses the sum - that's straightforward, right? However, if we then change the value of b or c, the value of a doesn't change - and you wouldn't want it to change if you were using an imperative approach. In reactive programming, by contrast, the value of a would react to those changes accordingly. Imagine the sum in a Microsoft Excel spreadsheet: every time you change the value of column b or c, it recalculates the value of a. This is a very basic form of change propagation in software.

You probably already use an asynchronous data flow. Every time you add a listener for a mouse click or a keystroke in a web page, you pass a function to react to that user input. So, a mouse click can be seen as a stream of events which you can observe, and you can then execute a function when it happens. But this is only one way of using event streams. You might want more sophistication and control over your streams. Reactive programming takes this to the next level. When you use it, you can react to changes in anything. That could be changes in:

user inputs
external sources
database changes
changes to variables and properties

This means you can create a stream of events following on from specific actions. For example, we can see the changing value of a stock as an EventStream. If you can do this, you can then use it to show a user when to buy or sell that stock in real time. Facebook and Twitter are other good examples of software reacting to changes in external source streams; reactive programming is an important component in developing the really dynamic UIs that are characteristic of social media sites.

Functional reactive programming

Functional reactive programming, then, gives you the ability to do a lot with streams of data or events. You can filter, combine, map, and buffer them, for example. Going back to the stock example above, you can 'listen' to different stocks and use a filter function to present only the ones worth buying to the user in real time.

Why do I need functional reactive programming?

Functional reactive programming is especially useful for:

Graphical user interfaces
Animation
Robotics
Simulation
Computer vision

A few years ago, all a user could do in a web app was fill in a form with bits of data and post it to a server. Today, web and mobile apps are much richer for users. To go into more detail, by using reactive programming, you can abstract the source of data away from the business logic of the application.
What this means in practice is that you can write more concise and decoupled code. In turn, this makes your code much more reusable and testable, as you can easily mock streams to exercise your business logic when testing the application.

Read more:

Introduction to JavaScript
Breaking into Microservices Architecture
JSON with JSON.Net
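To make the stream idea concrete, here is a small, hand-rolled Python sketch of an observable stream with map and filter, in the spirit of the stock example above. It is an illustration only, not the API of any particular reactive library; real libraries such as Rx add many more operators, scheduling, and error handling.

# stream_sketch.py - a toy observable stream with map/filter
class Stream:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, fn):
        """Run fn for every value pushed into this stream."""
        self._subscribers.append(fn)
        return self

    def emit(self, value):
        """Push a new value to all subscribers."""
        for fn in self._subscribers:
            fn(value)

    def map(self, fn):
        """Return a new stream carrying fn(value) for each value."""
        out = Stream()
        self.subscribe(lambda v: out.emit(fn(v)))
        return out

    def filter(self, predicate):
        """Return a new stream carrying only the values matching predicate."""
        out = Stream()
        self.subscribe(lambda v: out.emit(v) if predicate(v) else None)
        return out

# React to stock ticks and surface only the ones worth buying.
ticks = Stream()
buys = (ticks
        .map(lambda t: {'symbol': t[0], 'price': t[1]})
        .filter(lambda q: q['price'] < 100))
buys.subscribe(lambda q: print('BUY signal:', q))

ticks.emit(('ACME', 95.0))    # prints a BUY signal
ticks.emit(('ACME', 120.0))   # filtered out, nothing printed

Because the business logic lives in the small functions passed to map and filter, testing it is a matter of emitting mock ticks into the stream and asserting on what the subscriber receives, which is exactly the testability benefit described above.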

Measuring Geographic Distributions with ArcGIS Tool

Packt
08 Feb 2017
5 min read
In this article by Eric Pimpler, the author of the book Spatial Analytics with ArcGIS, you will be introduced to the spatial statistics tools available in ArcGIS for solving complex geographic analysis problems. Obtaining basic spatial statistics about a dataset is often the first step in the analysis of geographic data. The Measuring Geographic Distributions toolset in the ArcGIS Spatial Statistics Tools toolbox contains tools that provide descriptive geographic statistics, such as the Central Feature tool. In this article, you will learn how to use the Central Feature tool to obtain basic spatial statistical information about a dataset, covering the following topics:

Preparing for geographic analysis
Measuring geographic centrality with the Central Feature tool

(For more resources related to this topic, see here.)

Measuring geographic centrality

The Central Feature tool in the Measuring Geographic Distributions toolset can be used to measure the geographic centrality of spatial data. In this exercise, the Central Feature tool will be used to obtain descriptive spatial statistics about crime data for the city of Denver.

Preparation

Let's get prepared for the geographic analysis by performing the following steps:

In ArcMap, open the C:GeospatialTrainingSpatialStatsDenverCrime.mxd file. You should see a point feature class called Crime, as shown in the following screenshot. The Crime feature class contains point locations for all crimes in the city of Denver in 2013.
The first thing we need to do is isolate a type of crime for our analysis. Open the attribute table for the Crime feature class. Use the Select by Attributes tool to select all records where OFFENSE_CATEGORY_ID = 'burglary', as shown in the following screenshot. This will select 25,743 burglaries from the dataset. These are burglaries within the city limits of Denver in 2013.
Close the attribute table. In the Table of Contents, right-click on the Crime layer and select Properties. Go to the Source tab and, under Geographic Coordinate System, note that the value is GCS_WGS_1984. Data is often stored in the WGS84 Web Mercator coordinate system for display purposes on the web. The WGS84 Web Mercator coordinate system that is so popular today for online mapping applications is not suitable for use with the Spatial Statistics tools. These tools require accurate distance measurements that aren't possible with WGS84 Web Mercator, so it's important to project your datasets to a coordinate system that supports accurate distance measurements. Close this dialog by clicking on the Cancel button.
Now, right-click on the Layers data frame, select Properties…, and then Coordinate System. The current coordinate system of the data frame should be set to NAD_1983_UTM_Zone_13N, which is acceptable for our analysis.
With the records from the Crime layer still selected, right-click on the Crime layer and go to Data | Export Data. The next dialog is very important. Select the data frame option as the coordinate system, as shown in the following screenshot. Name the layer Burglary and export it to the crime geodatabase in C:GeospatialTrainingSpatialStatsExercisesDatacrime.gdb, and then click on OK.
The new Burglary layer will be added to the Layers data frame. Rename the layer to Denver Burglary. You can now remove the Crime layer. Save your map document file.

Run the Central Feature tool

The Central Feature tool identifies the most centrally located feature from a point, line, or polygon feature class.
It sums the distances from each feature to every other feature; the feature with the shortest accumulated distance is the central feature. This tool creates an output feature class containing a single feature that represents the most centrally located feature. For example, if you have a feature class of burglaries, the Central Feature tool will identify the crime location that is the most central location in the group and create a new feature class with a single point feature that represents this location.

If necessary, open ArcToolbox and find the Spatial Statistics Tools toolbox. Open the toolbox and expand the Measuring Geographic Distributions toolset. Double-click on Central Feature to display the tool, as shown in the following screenshot.
Select Denver Burglary as the Input Feature Class, C:GeospatialTrainingSpatialStatsDatacrime.gdbBurglary_CentralFeature as the Output Feature Class, and EUCLIDEAN_DISTANCE as the Distance Method. Euclidean distance is the straight-line distance between two points. The other distance method is Manhattan distance, which is the distance between two points measured along axes at right angles; it is calculated by summing the differences between the x and y coordinates.
There are three optional parameters for the Central Feature tool: Weight Field, Self Potential Weight Field, and Case Field. We won't use any of these optional parameters for this analysis, but they do warrant an explanation:

Weight Field (optional): This parameter is a numeric field used to weight distances in the origin-destination matrix. For example, if you had a dataset containing real-estate sales information, each point might contain a sales price. The sales price could be used to weight the output of the Central Feature tool.
Self Potential Weight Field (optional): This is a field representing self-potential, that is, the distance or weight between a feature and itself.
Case Field (optional): This parameter is a field used to group features for separate central feature computations. This field can be an integer, date, or string.

Click on the OK button. The most centrally located burglary will be displayed, as shown in the following screenshot. The output is a single point feature.

Summary

This article covered the use of a descriptive spatial statistics tool, the Central Feature tool, found in the Measuring Geographic Distributions toolset. The Central Feature tool returns basic spatial statistical information about a dataset.

Resources for Article:

Further resources on this subject:

Introduction to Mobile Web ArcGIS Development [article]
Learning to Create and Edit Data in ArcGIS [article]
ArcGIS – Advanced ArcObjects [article]
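For readers who prefer scripting the workflow just described, the following sketch shows how the selection, export, and Central Feature steps could be driven with arcpy instead of the ArcMap dialogs. It is an illustration, not part of the original exercise: the paths are assumptions you will need to adapt to your own machine, and it presumes the standard Spatial Statistics geoprocessing tools are available in your ArcGIS Desktop Python environment.

# central_feature_script.py - scripted version of the exercise (illustrative)
import arcpy

# Assumed locations; adjust to wherever the exercise data lives on your machine.
crime_fc = r"C:\GeospatialTraining\SpatialStats\Data\crime.gdb\Crime"
out_gdb = r"C:\GeospatialTraining\SpatialStats\Data\crime.gdb"

arcpy.env.overwriteOutput = True

# 1. Isolate the burglaries with an attribute selection.
arcpy.MakeFeatureLayer_management(crime_fc, "crime_lyr")
arcpy.SelectLayerByAttribute_management("crime_lyr", "NEW_SELECTION",
                                        "OFFENSE_CATEGORY_ID = 'burglary'")

# 2. Export only the selected records, then project them to NAD 1983 UTM Zone 13N,
#    since WGS84 Web Mercator does not support accurate distance measurements.
arcpy.CopyFeatures_management("crime_lyr", out_gdb + r"\Burglary_wgs84")
utm_13n = arcpy.SpatialReference("NAD 1983 UTM Zone 13N")
arcpy.Project_management(out_gdb + r"\Burglary_wgs84", out_gdb + r"\Burglary", utm_13n)

# 3. Run the Central Feature tool with straight-line (Euclidean) distances.
arcpy.CentralFeature_stats(out_gdb + r"\Burglary",
                           out_gdb + r"\Burglary_CentralFeature",
                           "EUCLIDEAN_DISTANCE")
print("Most central burglary written to Burglary_CentralFeature")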

Encapsulation of Data with Properties

Packt
07 Feb 2017
7 min read
In this article by Gastón C. Hillar, the author of the book Swift 3 Object Oriented Programming - Second Edition, you will learn about all the elements that might compose a class. We will start organizing data in blueprints that generate instances. We will work with examples to understand how to encapsulate and hide data by working with properties combined with access control. In addition, you will learn about properties, methods, and mutable versus immutable classes. (For more resources related to this topic, see here.) Understanding the elements that compose a class So far, we worked with a very simple class and many instances of this class in the Playground, the Swift REPL and the web-based Swift Sandbox. Now, it is time to dive deep into the different members of a class. The following list enumerates the most common element types that you can include in a class definition in Swift and their equivalents in other programming languages. We have already worked with a few of these elements: Initializers: This is equivalent to constructors in other programming languages Deinitializer: This is equivalent to destructors in other programming languages Type properties: This is equivalent to class fields or class attributes in other programming languages Type methods: This is equivalent to class methods in other programming languages Subscripts: This is also known as shortcuts Instance properties: This is equivalent to instance fields or instance attributes in other programming languages Instance methods: This is equivalent to instance functions in other programming languages Nested types: These are types that only exist within the class in which we define them We could access the instance property without any kind of restrictions as a variable within an instance. However, as it happens sometimes in real-world situations, restrictions are necessary to avoid serious problems. Sometimes, we want to restrict access or transform specific instance properties into read-only attributes. We can combine the restrictions with computed properties that can define getters and/or setters. Computed properties can define get and or set methods, also known as getters and setters. Setters allow us to control how values are set, that is, these methods are used to change the values of related properties. Getters allow us to control the values that we return when computed properties are accessed. Getters don't change the values of related properties. Sometimes, all the members of a class share the same attribute, and we don't need to have a specific value for each instance. For example, the superhero types have some profile values, such as the average strength, average running speed, attack power, and defense power. We can define the following type properties to store the values that are shared by all the instances: averageStrength, averageRunningSpeed, attackPower, and defensePower. All the instances have access to the same type properties and their values. However, it is also possible to apply restrictions to their access. It is also possible to define methods that don't require an instance of a specific class to be called; therefore, you can invoke them by specifying both the class and method names. These methods are known as type methods, operate on a class as a whole, and have access to type properties, but they don't have access to any instance members, such as instance properties or methods, because there is no instance at all. 
Type methods are useful when you want to include methods related to a class and don't want to generate an instance to call them. Type methods are also known as static or class methods. However, we have to pay attention to the keyword we use to declare type methods in Swift because a type method declared with the static keyword has a different behavior from a type method declared with the class keyword. Declaring stored properties When we design classes, we want to make sure that all the necessary data is available to the methods that will operate on this data; therefore, we encapsulate data. However, we just want relevant information to be visible to the users of our classes that will create instances, change values of accessible properties, and call the available methods. Thus, we want to hide or protect some data that is just needed for internal use. We don't want to make accidental changes to sensitive data. For example, when we create a new instance of any superhero, we can use both its name and birth year as two parameters for the initializer. The initializer sets the values of two properties: name and birthYear. The following lines show a sample code that declares the SuperHero class. class SuperHero { var name: String var birthYear: Int init(name: String, birthYear: Int) { self.name = name self.birthYear = birthYear } } The next lines create two instances that initialize the values of the two properties and then use the print function to display their values in the Playground: var antMan = SuperHero(name: "Ant-Man", birthYear: 1975) print(antMan.name) print(antMan.birthYear) var ironMan = SuperHero(name: "Iron-Man", birthYear: 1982) print(ironMan.name) print(ironMan.birthYear) The following screenshot shows the results of the declaration of the class and the execution of the lines in the Playground: The following screenshot shows the results of running the code in the Swift REPL. The REPL displays details about the two instances we just created: antMan and ironMan. The details include the values of the name and birthYear properties: The following lines show the output that the Swift REPL displays after we create the two SuperHero instances: antMan: SuperHero = { name = "Ant-Man" birthYear = 1975 } ironMan: SuperHero = { name = "Iron-Man" birthYear = 1982 } We can read the two lines as follows: the antMan variable holds an instance of SuperHero with its name set to "Ant-Man" and its birthYear set to 1975. The ironMan variable holds an instance of SuperHero with its name set to "Iron-Man" and its birthYear set to 1982. The following screenshot shows the results of running the code in the web-based IBM Swift Sandbox: We don't want a user of our SuperHero class to be able to change a superhero's name after an instance is initialized because the name is not supposed to change. There is a simple way to achieve this goal in our previously declared class. We can use the let keyword to define an immutable name stored property of type string instead of using the var keyword. We can also replace the var keyword with let when we define the birthYear stored property because the birth year will never change after we initialize a superhero instance. The following lines show the new code that declares the SuperHero class with two stored immutable properties: name and birthYear. 
Note that the initializer code hasn't changed, and it is possible to initialize both the immutable stored properties with the same code: class SuperHero { let name: String let birthYear: Int init(name: String, birthYear: Int) { self.name = name self.birthYear = birthYear } } Stored immutable properties are also known as stored nonmutating properties. The next lines create an instance that initializes the values of the two immutable stored properties and then use the print function to display their values in the Playground. Then, the two highlighted lines of code try to assign a new value to both properties and fail to do so because they are immutable properties: var antMan = SuperHero(name: "Ant-Man", birthYear: 1975) print(antMan.name) print(antMan.birthYear) antMan.name = "Batman" antMan.birthYear = 1976 The Playground displays the following two error messages for the last two lines, as shown in the next screenshot. We will see similar error messages in the Swift REPL and in the Swift Sandbox: Cannot assign to property: 'name' is a 'let' constant Cannot assign to property: 'birthYear' is a 'let' constant When we use the let keyword to declare a stored property, we can initialize the property, but it becomes immutable-that is, a constant-after its initialization. Summary In this article, you learned about the different members of a class or blueprint. We worked with instance properties, type properties, instance methods, and type methods. We worked with stored properties, getters, setters. Resources for Article: Further resources on this subject: Swift's Core Libraries [article] The Swift Programming Language [article] Network Development with Swift [article]

Deploying OpenStack – the DevOps Way

Packt
07 Feb 2017
16 min read
In this article by Chandan Dutta Chowdhury and Omar Khedher, authors of the book Mastering OpenStack - Second Edition, we will cover deploying an OpenStack environment based on the profiled design. Although we created our design by taking care of several aspects related to scalability and performance, we still have to make it real. If you are still looking at OpenStack as a single block system, it is time to change that view and treat it as a set of services that can be deployed and operated through automation. (For more resources related to this topic, see here.)

Furthermore, in the introductory section of this article, we covered the role of OpenStack in the next generation of data centers. A large-scale infrastructure used by cloud providers, with a few thousand servers, needs a very different approach to set up. In our case, deploying and operating the OpenStack cloud is not as simple as you might think. Thus, you need to make the operational tasks easier or, in other words, automated. In this article, we will cover new topics about the ways to deploy OpenStack. The next part will cover the following points:

Learning what the DevOps movement is and how it can be adopted in the cloud
Knowing how to see your infrastructure as code and how to maintain it
Getting closer to the DevOps way by including configuration management aspects in your cloud
Making your OpenStack environment design deployable via automation
Starting your first OpenStack environment deployment using Ansible

DevOps in a nutshell

The term DevOps is a conjunction of development (software developers) and operations (managing and putting software into production). Many IT organizations have started to adopt such a concept, but the question is how and why? Is it a job? Is it a process or a practice? DevOps is development and operations compounded, which basically defines a methodology of software development. It describes practices that streamline the software delivery process. It is about increasing communication and integration between developers, operators (including administrators), and quality assurance teams. The essence of the DevOps movement lies in leveraging the benefits of collaboration. Different disciplines can relate to DevOps in different ways and bring their experiences and skills together under the DevOps banner to gain shared values. So, DevOps is a methodology that integrates the efforts of several disciplines, as shown in the following figure.

This new movement is intended to resolve the conflict between developers and operators. Delivering a new release affects the production systems and puts different teams in conflicting positions by setting different goals for them; for example, the development team wants their latest code to go live, while the operations team wants more time to test and stage the changes before they go to production. DevOps fills the gap and streamlines the process of bringing in change by fostering collaboration between developers and operators. DevOps is neither a toolkit nor a job; it is the synergy that streamlines the process of change. Let's see how DevOps can incubate a cloud project.

DevOps and cloud – everything is code

Let's look at the architecture of cloud computing. While discussing a cloud infrastructure, we must remember that we are talking about a large, scalable environment! The amazing switch to bigger environments requires us to simplify everything as much as possible. System architecture and software design are becoming more and more complicated. Every new release of software brings new functions and new configurations.
Administering and deploying a large infrastructure would not be possible without adopting a new philosophy: infrastructure as code. When infrastructure is seen as code, the components of a given infrastructure are modeled as modules of code. What you need to do is to abstract the functionality of the infrastructure into discrete reusable components, design the services provided by the infrastructure as modules of code, and finally implement them as blocks of automation. Furthermore, in such a paradigm, it will be essential to adhere to the same well-established discipline of software development as an infrastructure developer. The essence of DevOps mandates that developers, network engineers, and operators must work alongside each other to deploy, operate, and maintain cloud infrastructure which will power our next-generation data center. DevOps and OpenStack OpenStack is an open source project, and its code is extended, modified, and fixed in every release. It is composed of multiple projects and requires extensive skills to deploy and operate. Of course, it is not our mission to check the code and dive into its different modules and functions. So what can we do with DevOps, then? Deploying complex software on a large-scale infrastructure requires adopting new strategy. The ever-increasing complexity of software such as OpenStack and deployment of huge cloud infrastructure must be simplified. Everything in a given infrastructure must be automated! This is where OpenStack meets DevStack. Breaking down OpenStack into pieces Let's gather what we covered previously and signal a couple of steps towards our first OpenStack deployment: Break down the OpenStack infrastructure into independent and reusable services. Integrate the services in such a way that you can provide the expected functionalities in the OpenStack environment. It is obvious that OpenStack includes many services, What we need to do is see these services as packages of code in our infrastructure as code experience. The next step will investigate how to integrate the services and deploy them via automation. Deploying service as code is similar to writing a software application. Here are important points you should remember during the entire deployment process: Simplify and modularize the OpenStack services Develop OpenStack services as building blocks which integrate with other components to provide a complete system Facilitate the customization and improvement of services without impacting the complete system. Use the right tool to build the services Be sure that the services provide the same results with the same input Switch your service vision from how to do it to what we want to do Automation is the essence of DevOps. In fact, many system management tools are intensely used nowadays due to their efficiency of deployment. In other words, there is a need for automation! You have probably used some of the automation tools, such as Ansible, Chef, Puppet, and many more. Before we go through them, we need to create a succinct, professional code management step. Working with the infrastructure deployment code While dealing with infrastructure as code, the code that abstracts, models, and builds the OpenStack infrastructure must be committed to source code management. This is required for tracking changes in our automation code and reproducibility of results. 
Eventually, we must reach a point where we shift our OpenStack infrastructure from a code base to a deployed system while following the latest software development best practices. At this stage, you should be aware of the quality of your OpenStack infrastructure deployment, which roughly depends on the quality of the code that describes it. It is important to highlight a critical point that you should keep in mind during all deployment stages: automated systems are not able to understand human error. You'll have to go through an ensemble of phases and cycles using agile methodologies to end up with a release that is a largely bug-free to be promoted to the production environment. On the other hand, if mistakes cannot be totally eradicated, you should plan for the continuous development and testing of code. The code's life cycle management is shown in the following figure: Changes can be scary! To handle changes, it is recommended that you do the following: Keep track of and monitor the changes at every stage Build flexibility into the code and make it easy to change Refactor the code when it becomes difficult to manage Test, test, and retest your code Keep checking every point that has been described previously till you start to get more confident that your OpenStack infrastructure is being managed by code that won't break. Integrating OpenStack into infrastructure code To keep the OpenStack environment working with a minimum rate of surprises and ensure that the code delivers the functionalities that are required, we must continuously track the development of our infrastructure code. We will connect the OpenStack deployment code to a toolchain, where it will be constantly monitored and tested as we continue to develop and refine our code. This toolchain is composed of a pipeline of tracking, monitoring, testing, and reporting phases and is well known as a continuous integration and continuous development (CI-CD) process. Continuous integration and delivery Let's see how continuous integration (CI) can be applied to OpenStack. The life cycle of our automation code will be managed by the following categories of tools: System Management Tool Artifact (SMTA) can be any IT automation tool juju charms. Version Control System (VCS) tracks changes to our infrastructure deployment code. Any version control system, such as CVS, Subversion, or Bazaar, that you are most familiar with can be used for this purpose. Git can be a good outfit for our VCS. Jenkins is a perfect tool that monitors to changes in the VCS and does the continuous integration testing and reporting of results. Take a look at the model in the following figure: The proposed life-cycle for infrastructure as code consists of infrastructure configuration files that are recorded in a version control system and are built continuously by the means of a CI server (Jenkins, in our case). Infrastructure configuration files can be used to set up a unit test environment (a virtual environment using Vagrant, for example) and makes use of any system management tool to provision the infrastructure (Ansible, Chef, puppet, and so on). The CI server keeps listening to changes in version control and automatically propagates any new versions to be tested, and then it listens to target environments in production. Vagrant allows you to build a virtual environment very easily; it is based on Oracle VirtualBox (https://www.virtualbox.org/) to run virtual machines, so you will need these before moving on with the installation in your test environment. 
The proposed life-cycle for infrastructure code highlights the importance of a test environment before moving on to production. You should give a lot of importance to and care a good deal about the testing stage, although this might be a very time-consuming task. Especially in our case, with infrastructure code for deploying OpenStack that is complicated and has multiple dependencies on other systems, the importance of testing cannot be overemphasized. This makes it imperative to ensure effort is made for an automated and consistent testing of the infrastructure code. The best way to do this is to keep testing thoroughly in a repeated way till you gain confidence about your code. Choosing the automation tool At first sight, you may wonder what the best automation tool is that will be useful for our OpenStack production day. We have already chosen Git and Jenkins to handle our continuous integration and testing. It is time to choose the right tool for automation. It might be difficult to select the right tool. Most likely, you'll have to choose between several of them. Therefore, giving succinct hints on different tools might be helpful in order to distinguish the best outfit for certain particular setups. Of course, we are still talking about large infrastructures, a lot of networking, and distributed services. Giving the chance for one or more tools to be selected, as system management parties can be effective and fast for our deployment. We will use Ansible for the next deployment phase. Introducing Ansible We have chosen Ansible to automate our cloud infrastructure. Ansible is an infrastructure automation engine. It is simple to get started with and yet is flexible enough to handle complex interdependent systems. The architecture of Ansible consists of the deployment system where Ansible itself is installed and the target systems that are managed by Ansible. It uses an agentless architecture to push changes to the target systems. This is due to the use of SSH protocol as its transport mechanism to push changes to the target systems. This also means that there is no extra software installation required on the target system. The agentless architecture makes setting up Ansible very simple. Ansible works by copying modules over SSH to the target systems. It then executes them to change the state of the target systems. Once executed, the Ansible modules are cleaned up, leaving no trail on the target system. Although the default mechanism for making changes to the client system is an SSH-based push-model, if you feel the push-based model for delivering changes is not scalable enough for your infrastructure, Ansible also supports an agent-based pull-model. Ansible is developed in python and comes with a huge collection of core automation modules. The configuration files for Ansible are called Playbooks and they are written in YAML, which is just a markup language. YAML is easier to understand; it's custom-made for writing configuration files. This makes learning Ansible automation much easier. The Ansible Galaxy is a collection of reusable Ansible modules that can be used for your project. Modules Ansible modules are constructs that encapsulate a system resource or action. A module models the resource and its attributes. Ansible comes with packages with a wide range of core modules to represent various system resources; for example, the file module encapsulates a file on the system and has attributes such as owner, group, mode, and so on. 
These attributes represent the state of a file in the system; by changing the attributes of the resources, we can describe the required final state of the system.

Variables

While modules can represent the resources and actions on a system, variables represent the dynamic part of the change. Variables can be used to modify the behavior of the modules. Variables can be defined from the environment of the host, for example, the hostname, IP address, version of software and hardware installed on a host, and so on. They can also be user-defined or provided as part of a module. User-defined variables can represent the classification of a host resource or its attributes.

Inventory

An inventory is a list of hosts that are managed by Ansible. The inventory list supports classifying hosts into groups. In its simplest form, an inventory can be an INI file, with the groups represented as sections of the INI file. The classification can be based on the role of the hosts or any other system management need. It is possible for a host to appear in multiple groups in an inventory file. The following example shows a simple inventory of hosts:

logserver1.example.com

[controllers]
ctl1.example.com
ctl2.example.com

[computes]
compute1.example.com
compute2.example.com
compute3.example.com
compute[20:30].example.com

The inventory file supports special patterns to represent large groups of hosts. Ansible expects to find the inventory file at /etc/ansible/hosts, but a custom location can be passed directly to the Ansible command line. Ansible also supports dynamic inventories that can be generated by executing scripts or retrieved from another management system, such as a cloud platform (a minimal sketch of such a script appears after the summary at the end of this article).

Roles

Roles are the building blocks of an Ansible-based deployment. They represent a collection of tasks that must be performed to configure a service on a group of hosts. A role encapsulates the tasks, variables, handlers, and other related functions required to deploy a service on a host. For example, to deploy a multinode web server cluster, the hosts in the infrastructure can be assigned roles such as web server, database server, load balancer, and so on.

Playbooks

Playbooks are the main configuration files in Ansible. They describe the complete system deployment plan. Playbooks are composed of a series of tasks and are executed from top to bottom. The tasks themselves refer to the groups of hosts that must be deployed with roles. Ansible playbooks are written in YAML. The following is an example of a simple Ansible playbook:

---
- hosts: webservers
  vars:
    http_port: 8080
  remote_user: root
  tasks:
  - name: ensure apache is at the latest version
    yum: name=httpd state=latest
  - name: write the apache config file
    template: src=/srv/httpd.j2 dest=/etc/httpd.conf
    notify:
    - restart apache
  handlers:
  - name: restart apache
    service: name=httpd state=restarted

Ansible for OpenStack

OpenStack Ansible (OSA) is an official OpenStack Big Tent project. It focuses on providing roles and playbooks for deploying a scalable, production-ready OpenStack setup. It has a very active community of developers and users collaborating to stabilize and bring new features to OpenStack deployment. One of the unique features of the OSA project is the use of containers to isolate and manage OpenStack services. OSA installs OpenStack services in LXC containers to provide each service with an isolated environment.
LXC is an OS-level container and it encompasses a complete OS environment that includes a separate filesystem, networking stack, and resource isolation using cgroups. OpenStack services are spawned in separate LXC containers and speak to each other using the REST APIs. The microservice-based architecture of OpenStack complements the use of containers to isolate services. It also decouples the services from the physical hardware and provides encapsulation of the service environment that forms the foundation for providing portability, high availability, and redundancy. The OpenStack Ansible deployment is initiated from a deployment host. The deployment host is installed with Ansible and it runs the OSA playbooks to orchestrate the installation of OpenStack on the target hosts: The Ansible target hosts are the ones that will run the OpenStack services. The target nodes must be installed with Ubuntu 14.04 LTS and configured with SSH-key-based authentication to allow login from the deployment host. Summary In this article, we covered several topics and terminologies on how to develop and maintain a code infrastructure using the DevOps style. Viewing your OpenStack infrastructure deployment as code will not only simplify node configuration, but also improve the automation process. You should keep in mind that DevOps is neither a project nor a goal, but it is a methodology that will make your deployment successfully empowered by the team synergy with different departments. Despite the existence of numerous system management tools to bring our OpenStack up and running in an automated way, we have chosen Ansible for automation of our infrastructure. Puppet, Chef, Salt, and others can do the job but in different ways. You should know that there isn't one way to perform automation. Both Puppet and Chef have their own OpenStack deployment projects under the OpenStack Big Tent. Resources for Article: Further resources on this subject: Introduction to Ansible [article] OpenStack Networking in a Nutshell [article] Introducing OpenStack Trove [article]
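Following up on the dynamic inventory support mentioned in the Inventory section above, here is a minimal sketch of what such a script could look like. Ansible treats any executable that prints a JSON description of groups and hosts as an inventory source; the group and host names below simply mirror the static INI example earlier, and the playbook name in the comment is only a placeholder.

#!/usr/bin/env python
# dyn_inventory.py - minimal dynamic inventory sketch
# Example use: ansible-playbook -i dyn_inventory.py site.yml
import json
import sys

def build_inventory():
    """Return the inventory in the JSON structure Ansible expects for --list."""
    return {
        "controllers": {"hosts": ["ctl1.example.com", "ctl2.example.com"]},
        "computes": {"hosts": ["compute%d.example.com" % i for i in range(1, 4)]},
        # Supplying _meta/hostvars up front saves Ansible a --host call per host.
        "_meta": {"hostvars": {"ctl1.example.com": {"ansible_user": "root"}}},
    }

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--host":
        # Per-host variables; empty because _meta already carries them.
        print(json.dumps({}))
    else:
        # --list (or no argument): dump the whole inventory.
        print(json.dumps(build_inventory()))

In a real deployment, the same script would query a CMDB or a cloud API instead of returning hard-coded names, which is exactly how Ansible is typically pointed at hosts managed by another platform.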

Getting Started with Cisco UCS and Virtual Networking

Packt
07 Feb 2017
20 min read
Introduction In this article by Anuj Modi, the author of the book Implementing CISCO UCS Solutions - Second Edition, we are given some insight into Cisco UCS products and its innovative architecture, which abstracts the underlying physical hardware and provides unified management to all devices. We will walk through the installation of some of the UCS hardware components and deployment of VSM. (For more resources related to this topic, see here.) UCS storage servers To provide a solution to meet storage requirements, Cisco introduced the new C3000 family of storage-optimized UCS rack servers in 2014. A standalone server with a capacity of 360 TB can be used for any kind of data-intensive application, such as big data, Hadoop, object-oriented storage, OpenStack, and other such enterprise applications requiring higher throughput and more efficient transfer rates. This server family can be integrated with existing any B-Series, C-Series, and M-Series servers to provide them with all the required storage and backup requirements. This is an ideal solution for customers who don't want to invest in high-end storage arrays and still want to meet the business requirements. Cisco C3000 storage servers are an optimal solution for medium and large scale-out storage requirements. Cisco UCS 3260 is the latest edition with better density, throughput, and dual-server support, whereas the earlier UCS 3160 provides the same density with a single server. These servers can be managed through CIMC like C-Series rack servers. UCS C3206 Cisco UCS C3206 is 4U rack server with Intel® Xeon® E5-2600 v2 processor family CPUs and up to 320 TB local storage with dual-server nodes and 4x40 Gig I/O throughput using Cisco VIC 1300. This storage can be shared across two compute nodes and provide HA solution inside the box. UCS C3106 Cisco UCS C3106 is a 4U rack server with Intel® Xeon® E5-2600 v2 processor family CPUs and up to 320 TB local storage with a single server node and 4x10 Gig I/O throughput. UCS M-Series modular servers Earlier in 2014, Cisco unleashed a new line of M-Series modular servers to meet the high-density, low-power demands of massively parallelized and cloud-scale applications. These modular servers separate the infrastructure components, such as network, storage, power, and cooling, from the compute nodes and deliver compute and memory to provide scalable applications. These compute nodes disaggregate resources such as processor and memory from the subsystem to provide a dense and power-efficient platform designed to increase compute capacity with unified management capabilities through UCS Manager called UCS System Link Technology. A modular chassis and compute cartridges are the building blocks of these M-Series servers. The M4308 chassis can hold up to eight compute cartridges. Each cartridge has a single or dual CPU with two memory channels and provides two nodes in a single cartridge to support up to 16 compute nodes in a single chassis. The cartridges are hot-pluggable and can be added or removed for the system. The Cisco VIC provides connectivity to UCS Fabric Interconnects for network and management. The UCS M-Series modular server includes the following components: Modular chassis M4308 Modular compute cartridges—M2814, M1414, and M142 M4308 The M4308 modular chassis is a 2U chassis that can accommodate eight compute cartridges with two servers per cartridge, and this makes for 16 servers in a single chassis. 
The chassis can be connected to a pair of Cisco Fabric Interconnects, providing network, storage, and management capabilities. M2814 The M2814 cartridge has dual-socket Intel® Xeon® E5-2600 v3 and v4 process family CPUs with 512 GB memory. This cartridge can be used for web-scale and small, in-memory databases. M1414 The M1414 cartridge has a single-socket Intel® Xeon® E3-1200 v3 and v4 process family CPU with 64 GB memory. This cartridge can be used for electronic design automation and simulation. M142 The M142 cartridge has a single-socket Intel® Xeon® E3-1200 v3 and v4 process family CPU with 128 GB memory. This cartridge can be used for content delivery, dedicated hosting, and financial modeling and business analytics. Cisco UCS Mini Cisco brings the power of UCS in a smaller form factor for smaller business, remote, and branch offices with the UCS Mini solution, a combination of blade and rack servers with unified management provided by Cisco UCS Manager. UCS Mini is compact version with specialized the 6324 Fabric Interconnect model embedded into the Cisco UCS blade chassis for a smaller server footprint. It provides servers, storage, network, and management capabilities similar to classic UCS. The 6324 Fabric Interconnect combines the Fabric Extender and Fabric Interconnect functions into one plugin module to provide direct connections to upstream switches. A pair of 6324 Fabric Interconnects will be directly inserted into the chassis, called the primary chassis, and provide internal connectivity to UCS blades. A Fabric Extender is not required for the primary chassis. The primary chassis can be connected to another chassis, called the secondary chassis, through a pair of 2208XP or 2204XP Fabric Extenders. Cisco UCS Mini supports a wide variety of UCS B-Series servers, such as B200 M3, 200 M4, B420 M3, and B420 M4. These blades can be inserted into mixed form factors. UCS Mini also supports connectivity to UCS C-Series servers C220 M3, C220 M4, C240 M3 and C240M4. A maximum of seven C-Series servers can be connected to a single chassis. It can support a maximum of 20 servers with a combination of two chassis with eight half-width blades per chassis and four C-Series servers. Here is an overview of Cisco UCS Mini: 6324 Cisco 6324 is embedded into the Cisco UCS 5108 blade server chassis. The 6324 Fabric Interconnect integrates the functions of a Fabric Interconnect and Fabric Extender and provides LAN, SAN, and management connectivity to blade servers and rack servers with embedded UCS Manager. 1G/10G SFP+ ports will be used for external network and SAN connectivity, while 40 Gig QSPF can only be used for connecting another chassis, rack server, and storage. One or two Cisco UCS 6324 FIs can be installed in UCS Mini. The following are the features of the 6324: A maximum of 20 blade/rack servers per UCS domain Four 1 Gig/10 Gig SFP+ unified ports One 40 Gig QSFP Fabric throughput of 500 Gbps Installing UCS hardware components Now, as we have a better understanding of the various components of the Cisco UCS platform, we can dive into the physical installation of UCS Fabric Interconnects and chassis, including blade servers, IOM modules, fan units, power supplies, SFP+ modules, and physical cabling. 
Before the physical installation of the UCS solution, it is also imperative to consider other data center design factors, including: Building floor load-bearing capacity Rack requirements for UCS chassis and Fabric Interconnects Rack airflow, heating, and ventilation (HVAC) Installing racks for UCS components Any standard 19-inch rack with a minimum depth of 29 inches to a maximum of 35 inches from front to rear rails with 42U height can be an ideal rack solution for Cisco UCS equipment. However, rack size can vary based on different requirements and the data center's landscape. The rack should provide sufficient space for power, cooling, and cabling for all the devices. It should be designed according to data center best practices to optimize its resources. The front and rear doors should provide sufficient space to rack and stack the equipment and adequate space for all types of cabling. Suitable cable managers can be used to manage the cables coming from all the devices. It is always recommended that heavier devices be placed at the bottom and lighter devices at the top. For example, a Cisco UCS B-Series chassis can be placed at the bottom and Fabric Interconnect at the top of the rack. The number of UCS or Nexus devices in a rack can vary based on the total power available for it. The Cisco R-Series R42610 rack is certified for Cisco UCS devices and Nexus switches, which provides better reliability, space, and structural integrity to data centers. Installing UCS chassis components Care must be taken during the installation of all components as failure to follow installation procedures may result in component malfunction and bodily injury. UCS chassis Don'ts: Do not try to lift even the empty chassis alone. At least two people are required to handle a UCS chassis. Do not handle internal components such as CPU, RAM, and mezzanine cards without an ESD field kit. Physical installation is divided into three sections: Blade server component (CPU, memory, hard drives, and mezzanine card) installation Chassis component (blade servers, IOMs, fan units, and power supplies) installation Fabric Interconnect installation and physical cabling Installing blade servers Cisco UCS blade servers are designed with industry-standard components with some enhancements. Anyone with prior server installation experience should be comfortable installing internal components using guidelines provided in the blade server's manual and following standard safety procedures. Transient ESD charges may result in thousands of volts of charge, which can degrade or permanently damage electronic components. The Cisco ESD training course can be found at http://www.cisco.com/web/learning/le31/esd/WelcomeP.html. All Cisco UCS blade servers have a similar cover design, with a button at the front top of the blade, which should be pushed down. Then, there are slight variations among models in the way that cover is slid off, which can be toward the rear and up or toward yourself and up. Installing and removing CPUs The following is the procedure to mount a CPU into a UCS B-Series blade server: Make sure you are wearing an ESD wrist wrap grounded to the blade server cover. To release the CPU clasp, slide it down and to the side. Move the lever up and remove the CPU's blank cover. Keep the blank in a safe place just in case you to remove the CPU later. Lift the CPU by its plastic edges and align it with the socket. CPU can only fit one way. Lower the mounting bracket with the side lever, and secure the CPU into the socket. 
Align the heat sink with its fins in a position allowing unobstructed airflow from front to back. Gently tighten the heat sink screws to the motherboard. CPU removal is the reverse of the installation process. It is critical to place the blank socket cover back over the CPU socket. Damage could occur to the socket without the blank cover. Installing and removing RAM The following is the procedure to install RAM modules into a UCS B-Series blade server: Make sure you are wearing an ESD wrist wrap grounded to the blade server cover. Undo the clips on the side of the memory slot. Hold the memory module from both edges in an upright position and firmly push it straight down, matching the notch of the module to the socket. Close the side clips to hold the memory module. Memory removal is the reverse of the preceding process. Memory modules must be inserted in pairs and split equally between each CPU if all the memory slots are not populated. Refer to the server manual for identifying memory slots pairs, slot-CPU relation, and optimized memory performance. Installing and removing internal hard disks UCS supports SFF, serial attached SCSI (SAS), and SATA solid-state disk (SSD) hard drives. B200 M4 blade servers also support non-volatile memory express (NVMe) SFF 2.5-inch hard drives via PCI Express. The B200 M4 also provides the option of an SD card to deploy small footprint hypervisors such as ESXi. To insert a hard disk into B200, B260, B420, and B460 blade servers, follow these steps: Make sure you are wearing an ESD wrist wrap grounded to the blade server cover. Remove the blank cover. Press the button on the catch lever on the ejector arm. Slide the hard disk completely into the slot. Push the ejector lever until it clicks to lock the hard disk. To remove a hard disk, press the button, release the catch lever, and slide the hard disk out. Do not leave a hard disk slot empty. If you do not intend to replace the hard disk, cover it with a blank plate to ensure proper airflow. Installing mezzanine cards The UCS B200 supports a single mezzanine card, B260/B420 support two cards, and B460 supports four cards. The procedure for installing these cards is the same for all servers: Make sure you are wearing an ESD wrist wrap grounded to the blade server cover. Open the server's top cover. Grab the card by the edges and align the male molex connector, the female connector, and the motherboard. Press the card gently into the slot. Once card is properly seated, secure it by tightening the screw on the top. Removing a mezzanine card is the reverse of the preceding process. Installing blade servers into a chassis The installation and removal of half-width and full-width blade servers is almost identical, with the only difference being that there's one ejector arm in a half-width blade server, whereas in a full-width blade server, there are two ejector arms. Only the B460 M4 blade server requires two full-width slots in the chassis with an additional Scalability Connector to connect two B260 M4 blade servers and allow them to function as a single server. The bottom blade in this pair serves as the master and top blade as slave. The process is as follows: Make sure you are wearing an ESD wrist wrap grounded to the chassis. Open the ejector arm for a half-width blade or both ejector arms for a full-width blade. Push the blade into the slot. Once it's firmly in, close the ejector arm on the face of server and hand-tighten the screw. The removal of a blade server is the reverse of the preceding process. 
In order to install a full-width blade, it is necessary to remove the central divider. This can be done with a Philips-head screw driver to push two clips, one downward and the other upward, and sliding the divider out of the chassis. Installing rack servers Cisco UCS rack servers are designed as standalone servers; however, these can be managed or unmanaged through UCS Managers. Unmanaged servers can be connected to any standard upstream switch or Nexus switch to provide network access. However, managed servers need to connect with Fabric Interconnects directly or indirectly through Fabric Extenders to provide LAN, SAN, and management network functionality. Based on the application requirement, the model of rack server can be selected from the various options discussed in the previous chapter. All the rack servers are designed in a similar way; however, they vary in capacity, size, weight, and dimensions. These rack servers can be fit into any standard 19-inch rack with adequate air space for servicing the server. Servers should be installed on the slide rails provided with them. At least two people or a mechanical lift should be used to place the servers in the rack. Always place the heavy rack servers at the bottom of the rack, and lighter servers can be placed at the top. Cisco UCS rack servers provide front-to-rear cooling, and the data center should maintain adequate air conditioning to dissipate the heat from these servers. A Cisco rack server's power requirement depends on the model and can be checked in the server data sheet. Installing Fabric Interconnects The Cisco UCS Fabric Interconnect is a top-of-rack switch that connects the LAN, SAN, and management to underlying physical compute servers for Ethernet and Fibre Channel access. Normally, a pair of Fabric Interconnects can be installed either in a single rack or distributed across two racks to provide rack and power redundancy. Fabric Interconnects can be installed at the bottom in case the network cabling is planned under the floor. All the components in a Fabric Interconnect come installed by default from the factory; only external components such as Ethernet modules, power supplies, and fans need to be connected. It includes two redundant hot-swappable power supply units. It is recommended that the power supply in the rack come from two different energy sources and installed with a redundant power distribution unit (PDU). It requires a maximum of 750 Watts of power. Although the Fabric Interconnect can be powered with a single power source, it is recommended you have redundant power sources. It includes four redundant hot-swappable fans to provide the most efficient front-to-rear cooling design and is designed to work in hot aisle/cold aisle environments. It dissipates approximately 2500 BTU/hour, and adequate cooling should be available in the data center for getting better performance. For detailed information on power, cooling, physical, and environmental specifications, check the data sheet. UCS FI 6300 series data sheet: http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-b-series-blade-servers/6332-specsheet.pdf UCS FI 6200 series data sheet: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-6200-series-fabric-interconnects/data_sheet_c78-675245.pdf Fabric Interconnects provide various options to connect with networks and servers. 
The 6300 Fabric Interconnect series can provide 10/40 Gig connectivity, while the FI 6200 series can provide 1/10 Gig to the upstream network. The third-generation FI 6332 has 32x40 Gig fixed ports. Ports 1-12 and 15-26 can be configured as 40 Gig QSFP+ ports or as 4x10 Gig SFP+ breakout ports, while ports 13 and 14 can be configured as 40 Gig or 1/10 Gig with QSA adapters but can't be configured with 10 Gig SFP+ breakout cables. The last six ports, 27-32, are dedicated for 40 Gig upstream switch connectivity. The third-generation FI 6332-16UP can be deployed where native Fibre Channel connectivity is essential. The first 16 unified ports, 1-16, can be configured as 1/10 Gig SFP+ Ethernet ports or 4/8/16 Gig Fibre Channel ports. Ports 17-34 can be configured as 40 Gig QSFP+ ports or 4x10 Gig SFP+ breakout ports. The last six ports, 35-40, are dedicated for 40 Gig upstream switch connectivity. The second-generation FI 6248UP has 32x10 Gig fixed unified ports and one expansion module to provide an additional 16 unified ports. All unified ports can be configured as either 1/10 Gig Ethernet or 1/2/4/8 Gig Fibre Channel. The second-generation FI 6296UP has 48x10 Gig fixed unified ports and three expansion modules to provide an additional 48 unified ports. All unified ports can be configured as either 1/10 Gig Ethernet or 1/2/4/8 Gig Fibre Channel. Deploying VSM Cisco has simplified VSM deployment with Virtual Switch Update Manager (VSUM). It is a virtual appliance that can be installed on a VMware ESXi hypervisor and registered as plugin with VMware vCenter. VSUM enables the SA to install, monitor, migrate, and upgrade Cisco Nexus 1000V in high availability or standalone mode. With VSUM, you can deploy multiple instances of Nexus 1000V and can manage them from a single appliance. VSUM only provides layer 3 connectivity with ESXi hosts. For layer 2 connectivity with ESXi, you still have to deploy the Nexus 1000V VSM manually and configure L2 mode in basic configuration. VSUM architecture VSUM has two components: backend as virtual appliance and frontend as GUI integrated into VMware vCenter Server. The virtual appliance will be configured with an IP address, similar to a VMware vCenter Management IP address subnet. Once a virtual appliance is deployed on the ESXi, it establishes connectivity with VMware vCenter Server to get access to vCenter and hosts. VSUM and VSM installation VSM deployment will be divided into two parts. First, we will install VSUM and register it as a plugin with VMware vCenter. The second part is importing the Nexus 1000V package into VSUM and deploying Nexus 1000V through the VSUM GUI interface. For installing and configuring Cisco VSUM, you require one management port group, which will communicate with vCenter. However, the Nexus 1000V VSM will require two port groups for control and management. You can use just one VLAN for all of these, but it is preferred to have them configured with separate VLANs. Download the Nexus 1000v package and VSUM from https://software.cisco.com/download/type.html?mdfid=282646785&flowid=42790, and then use the following procedure to deploy VSUM and VSM: Go to vCenter, click on the File menu, and select Deploy OVF Template. Browse to the location of the VSUM OVA file, click on Browse… to install the required OVA file, and click on Next, as shown in the following screenshot: Click on Next twice. Click on Accept for the End User License Agreement page, and then click on Next. 
Specify a Name for the VSUM and select the data center where you would like it to be installed. Then, click on Next, as shown in the following screenshot:
Select the Datastore object on which to install the VSUM.
Choose Thin Provision for the virtual disk of VSUM and click on Next.
On the Destination network page, select the network that you created earlier for management. Then, click on Next, as shown in the following screenshot:
Enter the management IP address, DNS, vCenter IP address, username, and password values, as shown in the following screenshot:
A summary of all your settings is then presented. If you are happy with these settings, click on Finish.
Once VSUM is deployed in vCenter, power on the VM. Installation and registration as a vCenter plugin takes about 5 minutes. To view the VSUM GUI, you need to log in to the vCenter server again to activate the VSUM plugin.
Click on Cisco Virtual Switch Update Manager and select the Upload option under Image Tasks. Upload the Nexus 1000V package downloaded earlier under Upload switch image. This opens the Virtual Switch Image File Uploader, where you select the appropriate package. Once the image is uploaded, it shows up under Manage Uploaded switch images.
Click on the Nexus 1000V tab under Basic Tasks, click on Install, and select the desired data center.
On the Nexus 1000V installer page, choose to install a new control plane VSM, and select the proper options for Nexus 1000V Switch Deployment Type, VSM Version, and Choose a Port Group. Select the host, VSM Domain ID, Switch Name, IP Address, Username, and Password. Then, click on Finish. A window will appear to show the progress.
Once it is deployed, select the Nexus 1000V VM and click on Power On in the Summary screen.
Go to the VSUM GUI and click on Dashboard under Nexus 1000V Basic Tasks. Verify that the VC Connection Status is green for your installed VSM.

You now need to create a port profile of the Ethernet type for uplink management. For this, log in to the Nexus 1000V switch using SSH and type the following commands:

port-profile type ethernet system-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 1-1000
  mtu 9000
  no shutdown
  system vlan 1-1000
  state enabled

Next, move over to vCenter and add your ESXi host to the Nexus environment:

Open vCenter, click on Inventory, and select Networking.
Expand Datacenter and select the Nexus switch. Right-click on it and select Add Host.
Select one of the unused VMNICs on the host, then select the uplink port group created earlier (the one carrying the system VLAN data), and click on Next.
On the Network Connectivity page, click on Next.
On the Virtual Machine Networking page, click on Next. Do not select the checkbox to migrate virtual machine networking.
On the Ready to complete page, click on Finish.

Finally, run the vemcmd show card command to confirm that the opaque data is now being received by the VEM. Log in to your Nexus 1000V switch and type show module to check whether your host has been added. Log in to your ESXi server and check the VEM status; as you can see, the VEM is now loaded on the VMNIC.

Summary

In this article, we saw the various available options for all UCS components, such as storage servers, modular servers, and UCS Mini. We also learned about rack server installation and the installation of UCS hardware components such as the chassis, blades, I/O modules, and Fabric Interconnects. We learned about the VSUM architecture and Nexus 1000V components and installed those components.
Resources for Article:
Further resources on this subject:
EMC Storage connectivity of Cisco UCS B-Series server [article]
Creating Identity and Resource Pools [article]
Securing your Elastix System [article]
Context – Understanding your Data using R

Packt
07 Feb 2017
32 min read
In this article by James D Miller, the author of the book Big Data Visualization, we will explore the idea of adding context to the data you are working with. Specifically, we'll discuss the importance of establishing data context, the practice of profiling your data for context discovery, and how big data affects this effort. The article is organized into the following main sections:

Adding Context
About R
R and Big Data
R Example 1 -- healthcare data
R Example 2 -- healthcare data

(For more resources related to this topic, see here.)

When writing a book, authors leave context clues for their readers. A context clue is a source of information about written content that may be difficult or unique and that helps readers understand. This information offers insight into the content being read or consumed (for example: "It was an idyllic day; sunny, warm and perfect…"). With data, context clues should be developed through a process referred to as profiling (we'll discuss profiling in more detail later in this article) so that the data consumer can better understand the data when it is visualized. Additionally, having context and perspective on the data you are working with is a vital step in determining what kind of data visualization should be created. Profiling examples might be calculating the average age of "patients" or subjects within the data, or segmenting the data into time periods (years or months, usually).

Another motive for adding context to data is to gain a new perspective on it. An example of this might be recognizing and examining a comparison present in the data; for instance, body fat percentages of urban high school seniors could be compared to those of rural high school seniors.

Adding context to your data before creating visualizations can certainly make the data visualization more relevant, but context still can't serve as a substitute for value. Before you consider factors such as time of day, geographic location, or average age, first and foremost your data visualization needs to benefit those who are going to consume it, so establishing appropriate context requirements is critical. For data profiling (adding context), the rule is: Before Context, Think → Value.

Generally speaking, there are several visualization contextual categories that can be used to augment or increase the value and understanding of data for visualization. These include:

Definitions and explanations
Comparisons
Contrasts
Tendencies
Dispersion

Definitions and explanations

This is providing additional information or "attributes" about a data point. For example, if the data contains a field named "patient ID" and we come to know that records describe individual patients, we may choose to calculate and add each individual patient's BMI, or body mass index.

Comparisons

This is adding a comparable value to a particular data point. For example, you might compute and add a national ranking to each "total by state".

Contrasts

This is almost like adding an "opposite" to a data point to see whether it offers a different perspective. An example might be reviewing average body weights for patients who consume alcoholic beverages versus those who do not.

Tendencies

These are the "typical" mathematical calculations (or summaries) on the data as a whole, or by other categories within the data, such as mean, median, and mode.
For example, you might add the median heart rate for the age group each patient in the data belongs to.

Dispersion

Again, these are mathematical calculations (or summaries), such as range, variance, and standard deviation, but they describe the spread of a data set (or of a group within the data) around its central tendency. For example, you may want to add the range for a selected value, such as the minimum and maximum number of hospital stays found in the data for each patient age group.

The "art" of profiling data to add context and identify new and interesting perspectives for visualization is still evolving; no doubt there are additional contextual categories that can be investigated as you continue your work with big data visualization projects.

Adding Context

So, how do we add context to data? Is it merely a matter of selecting Insert, then Data Context? No, it's not that easy (but it's not impossible either). Once you have identified (or pulled together) your big data source (or at least a significant amount of data), how do you go from mountains of raw big data to summarizations that can be used as input to valuable data visualizations, helping you further analyze that data and support your conclusions? The answer is data profiling. Data profiling involves logically "getting to know" the data you think you may want to visualize, through query, experimentation, and review. Following the profiling process, you can use the information you have collected to add context (and/or apply new perspectives) to the data. Adding context to data requires manipulating that data: reformatting it, adding calculations, aggregations, or additional columns, re-ordering it, and so on. Finally, you will be ready to visualize (or picture) your data. The complete profiling process is:

Pull together (the data, or enough of the data)
Profile (the data, through query, experimentation, and review)
add Perspective(s) (or context), and finally…
Picture (visualize) the data

About R

R is a language and environment that is easy to learn, very flexible in nature, and focused on statistical computing, making it great for manipulating, cleaning, summarizing, and producing probability statistics (as well as actually creating visualizations with your data). It is therefore a great choice for the exercises required for profiling, establishing context, and identifying additional perspectives. In addition, here are a few more reasons to use R when profiling your big data:

R is used by a large number of academic statisticians, so it's a tool that is not going away
R is pretty much platform independent; what you develop will run almost anywhere
R has awesome help resources; just Google it and you'll see

R and Big Data

Although R is free (open source), super flexible, and feature rich, you must keep in mind that R keeps everything in your machine's memory, and this can become problematic when you are working with big data (even with today's low resource costs). Thankfully, there are various options and strategies for working with this limitation, such as employing a sort of pseudo-sampling technique, which we will expound on later in this article (as part of some of the examples provided). Additionally, R libraries have been developed that can leverage hard drive space as a sort of virtual extension of your machine's memory; these, too, appear in this article's examples.
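To make the memory caveat concrete, here is a minimal sketch (mine, not one of the book's listings) of the two strategies just mentioned: profiling a random sample of the rows, and keeping the bulk of the data on disk through a package such as ff. The file path simply mirrors the one used in the examples below, and the ff call is an assumption about your environment (the package must be installed separately).

# --- strategy 1: pseudo-sampling - profile a random subset of records (base R)
tmpRTable<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")
set.seed(42)                                      # make the sample repeatable
sampleRows<-sample(nrow(tmpRTable), size=10000)   # pick 10,000 records at random
sample.df<-data.frame(tmpRTable[sampleRows,])     # profile this instead of the full file
# --- strategy 2: disk-backed storage - the ff package keeps most of the data on disk
# library(ff)
# tmpFFTable<-read.table.ffdf(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")

Sampling keeps the profiling queries fast, at the cost of counts becoming estimates rather than exact totals; note, too, that strategy 1 still reads the file once into memory, so for truly huge files the disk-backed (or chunked) route is the safer option.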
Example 1

In this article's first example, we'll use data collected from a theoretical hospital where, upon admission, patient medical history information is collected through an online survey. Information is also added to a "patients file" as treatment is provided. The file includes many fields, including:

Basic descriptive data for the patient, such as sex, date of birth, height, weight, blood type, etc.
Vital statistics, such as blood pressure, heart rate, etc.
Medical history, such as the number of hospital visits, surgeries, major illnesses or conditions, currently under a doctor's care, etc.
Demographic statistics, such as occupation, home state, educational background, etc.

Some additional information is also collected in the file in an attempt to capture patient characteristics and habits, such as the number of times the patient included beef, pork, and fowl in their weekly diet, or whether they typically use a butter replacement product, and so on. Periodically, the data is "dumped" to text files, which are comma-delimited and contain the following fields (in this order):

Patientid, recorddate_month, recorddate_day, recorddate_year, sex, age, weight, height, no_hospital_visits, heartrate, state, relationship, Insured, Bloodtype, blood_pressure, Education, DOBMonth, DOBDay, DOBYear, current_smoker, current_drinker, currently_on_medications, known_allergies, currently_under_doctors_care, ever_operated_on, occupation, Heart_attack, Rheumatic_Fever, Heart_murmur, Diseases_of_the_arteries, Varicose_veins, Arthritis, abnormal_bloodsugar, Phlebitis, Dizziness_fainting, Epilepsy_seizures, Stroke, Diphtheria, Scarlet_Fever, Infectious_mononucleosis, Nervous_emotional_problems, Anemia, hyroid_problems, Pneumonia, Bronchitis, Asthma, Abnormal_chest_Xray, lung_disease, Injuries_back_arms_legs_joints_Broken_bones, Jaundice_gallbladder_problems, Father_alive, Father_current_age, Fathers_general_health, Fathers_reason_poor_health, Fathersdeceased_age_death, mother_alive, Mother_current_age, Mother_general_health, Mothers_reason_poor_health, Mothers_deceased_age_death, No_of_brothers, No_of_sisters, age_range, siblings_health_problems, Heart_attacks_under_50, Strokes_under_50, High_blood_pressure, Elevated_cholesterol, Diabetes, Asthma_hayfever, Congenital_heart_disease, Heart_operations, Glaucoma, ever_smoked_cigs, cigars_or_pipes, no_cigs_day, no_cigars_day, no_pipefuls_day, if_stopped_smoking_when_was_it, if_still_smoke_how_long_ago_start, target_weight, most_ever_weighed, 1_year_ago_weight, age_21_weight, No_of_meals_eatten_per_day, No_of_times_per_week_eat_beef, No_of_times_per_week_eat_pork, No_of_times_per_week_eat_fish, No_of_times_per_week_eat_fowl, No_of_times_per_week_eat_desserts, No_of_times_per_week_eat_fried_foods, No_servings_per_week_wholemilk, No_servings_per_week_2%_milk, No_servings_per_week_tea, No_servings_per_week_buttermilk, No_servings_per_week_1%_milk, No_servings_per_week_regular_or_diet_soda, No_servings_per_week_skim_milk, No_servings_per_week_coffee, No_servings_per_week_water, beer_intake, wine_intake, liquor_intake, use_butter, use_extra_sugar, use_extra_salt, different_diet_weekends, activity_level, sexually_active, vision_problems, wear_glasses

Following is an image showing a portion of the file (displayed in MS Windows Notepad):

Assuming we have been given no further information about the data, other than the provided field name list and the knowledge that the data is captured by hospital personnel upon patient admission, the next step is to perform some sort of profiling of the data – investigating to start understanding the data and then to start adding context and perspectives (so that ultimately we can create some visualizations).

Initially, we start out by looking through the field or column names in our file, and some ideas start to come to mind. For example:

What is the data time frame we are dealing with? Using the record date field, can we establish a period of time (or time frame) for the data? In other words, over what period of time was this data captured?
Can we start grouping the data using fields such as sex, age, and state?

Eventually, what we should be asking is, "what can we learn from visualizing the data?" Perhaps:

What is the breakdown of those currently smoking by age group?
What is the ratio of those currently smoking to the number of hospital visits?
Do those patients currently under a doctor's care, on average, have better BMI ratios?

And so on.

Dig in with R

Using the power of R programming, we can run various queries on the data, noting that the results of those queries may spawn additional questions and queries and, eventually, yield data ready for visualizing. Let's start with a few simple profile queries. I always start my data profiling by "time boxing" the data. The following R script (although, as mentioned earlier, there are many ways to accomplish the same objective) works well for this:

# --- read our file into a temporary R table
tmpRTable4TimeBox<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")
# --- convert to an R data frame and filter it to just include
# --- the 2nd column or field of data
data.df <- data.frame(tmpRTable4TimeBox)
data.df <- data.df[,2]
# --- provides a sorted list of the years in the file
YearsInData = substr(substr(data.df[],(regexpr('/',data.df[])+1),11),(regexpr('/',substr(data.df[],(regexpr('/',data.df[])+1),11))+1),11)
# --- write a new file named ListofYears
write.csv(sort(unique(YearsInData)),file="C:/Big Data Visualization/Chapter 3/ListofYears.txt",quote = FALSE, row.names = FALSE)

The above simple R script produces a sorted list file (ListofYears.txt, shown below) containing the years found in the data we are profiling:

Now we can see that our file covers patient survey data collected during the years 1999 through 2016, and with this information we start to add context to (or gain a perspective on) our data. We could further time-box the data by breaking the years into months (we will do this later on in this article), but let's move on now to some basic "grouping profiling". Assuming that each record in our data represents a unique hospital visit, how can we determine the number of hospital visits (the number of records) by sex, age, and state?

Here I will point out that it may be worthwhile establishing the size of the file you are working with – the number of rows or records (we already know the number of columns or fields). This is important since the size of the data file will dictate the programming or scripting approach you will need to use during your profiling. Simple R functions that are valuable to know are nrow and head. These simple commands can be used to count the total rows in a table:

nrow(mydata)

Or to view the first few rows of data:

head(mydata, n=10)

So, using R, one could write a script to load the data into a table, convert it to a data frame, and then read through all the records in the file and count up (or tally) the number of hospital visits (the number of records) for males and females.
Such logic is a snap to write:

# --- assuming tmpRTable holds the data already
datas.df<-data.frame(tmpRTable)
# --- initialize 2 counter variables
NumberMaleVisits <-0;NumberFemaleVisits <-0
# --- read through the data
for(i in 1:nrow(datas.df))
{
if (datas.df[i,3] == 'Male') {NumberMaleVisits <- NumberMaleVisits + 1}
if (datas.df[i,3] == 'Female') {NumberFemaleVisits <- NumberFemaleVisits + 1}
}
# --- show me the totals
NumberMaleVisits
NumberFemaleVisits

The previous script works, but in a big data scenario there is a more efficient way, since reading (or looping) through and counting each record would take far too long. Thankfully, R provides the table function, which can be used much like the SQL "group by" command. The following script assumes that our data is already in an R data frame (named datas.df). Using the sequence number of the field in the file, if we want to see the number of hospital visits for males and for females, we can write:

# --- using R table function as "group by" field number
# --- patient sex is the 3rd field in the file
table(datas.df[,3])

Following is the output generated by running the above script. Notice that R shows "sex" with a count of 1, since the script included the file's header record as a unique value:

We can also establish the number of hospital visits by state (state is the 9th field in the file):

table(datas.df[,9])

Age (the fourth field in the file) can also be studied using the R functions sort and table:

sort(table(datas.df[,4]))

Note that, since there are quite a few more values for age within the file, I've sorted the output using the R sort function.

Moving on now, let's see whether there is a difference between the number of hospital visits for patients who are current smokers (the field name is current_smoker, and it is field number 16 in the file) and those indicating that they are not current smokers. We can use the same R scripting logic:

sort(table(datas.df[,16]))

Surprisingly (one might think), it appears from our profiling that patients who currently do not smoke have had more hospital visits (113,681) than those who currently smoke (12,561):

Another interesting R script to continue profiling our data might be:

table(datas.df[,3],datas.df[,16])

The script shown above again uses the R table function to group data, but shows how we can "group within a group"; in other words, using this script we can get totals for current and non-current smokers, grouped by sex. In the image below we see that the difference between female smokers and male smokers might be considered marginal:

So we see that, by using the simple R script examples above, we've been able to add some context to our healthcare survey data. By reviewing the list of fields provided in the file, we can come up with the R profiling queries shown (and many others) without much effort. We will continue with some more complex profiling in the next section, but for now let's use R to create a few data visualizations based upon what we've learned so far through our profiling.

Going back to the number of hospital visits by sex, we can use the R function barplot to create a visualization of visits by sex. But first, a couple of helpful hints for creating the script. Rather than using the table function, you can use the ftable function, which creates a "flat" version of the table function's output. This makes it easier to exclude the header record count of 1 that comes back from the table function.
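As a quick aside (a sketch of my own, not one of the article's listings), the header record can also be dropped once, right after the data frame is built, so that later table calls return clean totals without the stray count of 1:

# --- drop the first (header) row once; read.table(..., header=TRUE) at load time
# --- would achieve the same thing
datas.df<-data.frame(tmpRTable)
datas.df<-datas.df[-1,]
table(datas.df[,3])   # now returns only the Female/Male visit counts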
Next, we can leverage some additional arguments of the barplot function, such as col, border, and names.arg, along with the title function, to make the visualization a little nicer to look at. Below is the script:

# --- use ftable function to drop out the header record
forChart<- ftable(datas.df[,3])
# --- create bar names
barnames<-c("Female","Male")
# --- use barplot to draw bar visual
barplot(forChart[2:3], col = "brown1", border = TRUE, names.arg = barnames)
# --- add a title
title(main = list("Hospital Visits by Sex", font = 4))

The script's output (our visualization) is shown below:

We could follow the same logic to create a similar visualization of hospital visits by state:

st<-ftable(datas.df[,9])
barplot(st)
title(main = list("Hospital Visits by State", font = 2))

But the visualization generated isn't very clear. One can always experiment a bit more with this data to make the visualization a little more interesting. Using the R functions substr and regexpr, we can create an R data frame that contains a record for each hospital visit by state within each year in the file. Then we can use the function plot (rather than barplot) to generate the visualization. Below is the R script:

# --- create a data frame from our original table file
datas.df <- data.frame(tmpRTable)
# --- create a filtered data frame of records from the file
# --- using the record year and state fields from the file
dats.df<-data.frame(substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),(regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11),datas.df[,9])
# --- plot to show a visualization
plot(sort(table(dats.df[2]),decreasing = TRUE),type="o", col="blue")
title(main = list("Hospital Visits by State (Highest to Lowest)", font = 2))

Here is the different (and perhaps more interesting) version of the visualization generated by the previous script:

Another earlier perspective on the data concerned age. We grouped the hospital visits by the age of the patients (using the R table function). Since there are many different patient ages, a common practice is to establish age ranges, such as the following:

21 and under
22 to 34
35 to 44
45 to 54
55 to 64
65 and over

To implement these age ranges, we need to organize the data, and could use the following R script:

# --- initialize age range counters
a1 <-0;a2 <-0;a3 <-0;a4 <-0;a5 <-0;a6 <-0
# --- read and count visits by age range
for(i in 2:nrow(datas.df))
{
if (as.numeric(datas.df[i,4]) < 22) {a1 <- a1 + 1}
if (as.numeric(datas.df[i,4]) > 21 & as.numeric(datas.df[i,4]) < 35) {a2 <- a2 + 1}
if (as.numeric(datas.df[i,4]) > 34 & as.numeric(datas.df[i,4]) < 45) {a3 <- a3 + 1}
if (as.numeric(datas.df[i,4]) > 44 & as.numeric(datas.df[i,4]) < 55) {a4 <- a4 + 1}
if (as.numeric(datas.df[i,4]) > 54 & as.numeric(datas.df[i,4]) < 65) {a5 <- a5 + 1}
if (as.numeric(datas.df[i,4]) > 64) {a6 <- a6 + 1}
}

Big Data Note: Looping or reading through each of the records in our file isn't very practical if there are a trillion records. Later in this article we'll use a much better approach, but for now we will assume a smaller file size for convenience.
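The article's later examples build these groups with the subset function; as an alternative sketch (not taken from the book), base R's cut function can bin the age column in a single vectorized pass and produce the same six counts without any loop:

# --- vectorized alternative to the age-range counting loop above
ages<-as.numeric(as.character(datas.df[-1,4]))    # drop the header row, coerce ages to numbers
ageRange<-cut(ages, breaks=c(-Inf,21,34,44,54,64,Inf), labels=c("under 21","22-34","35-44","45-54","55-64","65 & over"))
table(ageRange)                                   # visits per age range, no loop required

The breaks reproduce the boundaries used by the if chains above (ages are whole numbers, so "under 22" and "21 and under" describe the same bucket).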
Once the counting script above has been run, we can use the R pie function and the following code to create our pie chart visualization:

# --- create Pie Chart
slices <- c(a1, a2, a3, a4, a5, a6)
lbls <- c("under 21", "22-34","35-44","45-54","55-64", "65 & over")
pie(slices, labels = lbls, main="Hospital Visits by Age Range")

Following is the generated visualization:

Finally, earlier in this section we looked at the values in field 16 of our file, which indicates whether the survey patient was a current smoker. We could build a simple visual showing the totals, but (again) the visualization isn't very interesting or all that informative. With some simple R scripts, we can instead create a visualization showing the number of hospital visits, year over year, by those patients who are current smokers. First, we can "reformat" the data in our R data frame (named datas.df) to store only the year of the record date, using the R function substr. This makes it a little easier to aggregate the data by year in the next steps. The R script using the substr function is shown below:

# --- redefine the record date field to hold just the record
# --- year value
datas.df[,2]<-substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),(regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11)

Next, we can create an R table named c to hold the record date year and the totals of non-current and current smokers for each year. Following is the R script used:

# --- create a table holding record year and total count for
# --- smokers and not smoking
c<-table(datas.df[,2],datas.df[,16])

Finally, we can use the R barplot function to create our visualization. Again, there is more than likely a cleverer way to set up the objects bars and lbls, but for now I simply hand-coded the years' data I wanted to see in my visualization:

# --- set up the values to chart and the labels for each bar
# --- in the chart
bars<-c(c[2,3], c[3,3], c[4,3],c[5,3],c[6,3],c[7,3],c[8,3],c[9,3],c[10,3],c[11,3],c[12,3],c[13,3])
lbls<-c("99","00","01","02","03","04","05","06","07","08","09","10")

Now the R script to actually produce the bar chart visualization is shown below:

# --- create the bar chart
barplot(bars, names.arg=lbls, col="red")
title(main = list("Smoking Patients Year to Year", font = 2))

Below is the generated visualization:

Example 2

In the above examples, we've presented some pretty basic and straightforward data profiling exercises. Typically, once you've become somewhat familiar with your data – having added some context (through some basic profiling) – you would extend the profiling process, trying to look at the data in additional ways using techniques such as those mentioned at the beginning of this article: defining new data points based upon the existing data, performing comparisons, looking at contrasts (between data points), identifying tendencies, and using dispersion to establish the variability of the data. Let's now review some of these options for extended profiling, using simple examples as well as the same source data as was used in the previous section's examples.

Definitions & Explanations

One method of extending your data profiling is to "add to" the existing data by creating additional definition or explanatory "attributes" (in other words, adding new fields to the file). This means that you use existing data points found in the data to create (hopefully new and interesting) perspectives on the data.
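As a generic illustration of this idea (a minimal sketch of my own, not the book's code), a derived attribute can be attached to the table as an extra column; here the visit year is pulled out of the record-date field (field 2) and appended:

# --- derive a new attribute from an existing field and add it as a new column
visitYear<-sub(".*/", "", as.character(tmpRTable[-1,2]))   # keep the text after the last '/'
newTable<-cbind(tmpRTable[-1,], recordYear=visitYear)      # all original fields plus the new one
head(newTable[,c(2,ncol(newTable))], n=5)                  # record date alongside the derived year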
In the data used in this article, a thought-provoking example might be to use the existing patient information (such as the patient's weight and height) to calculate a new point of data: body mass index (BMI). A generally accepted formula for calculating a patient's body mass index is:

BMI = (weight (lbs.) / (height (in))^2) x 703

For example: 165 lbs. / (70 in)^2 x 703 = 23.67 BMI.

Using the above formula, we can use the following R script (assuming we've already loaded the R object named tmpRTable with our file data, and that datas2.df is an empty two-column data frame) to generate a new file of BMI values and state names:

j=1
for(i in 2:nrow(tmpRTable))
{
W<-as.numeric(as.character(tmpRTable[i,5]))
H<-as.numeric(as.character(tmpRTable[i,6]))
P<-(W/(H^2)*703)
datas2.df[j,1]<-format(P,digits=3)
datas2.df[j,2]<-tmpRTable[i,9]
j=j+1
}
write.csv(datas2.df[1:(j-1),1:2],file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the generated file:

Now we have a new file of BMI values by state (one BMI record for each hospital visit in each state). Earlier in this article we touched on the concept of looping (reading) through all of the records in a file or data source and creating counts based on various field or column values. Such logic works fine for medium or smaller files, but a much better approach (especially with big data files) is to use the power of various R commands.

No Looping

Although the R script described above does work, it requires looping through each record in our file, which is slow and inefficient to say the least. So, let's consider a better approach. Again assuming we've already loaded the R object named tmpRTable with our data, the R script below can accomplish the same result (create the same file) in just two lines:

PDQ<-paste(format((as.numeric(as.character(tmpRTable[,5]))/(as.numeric(as.character(tmpRTable[,6]))^2)*703),digits=2),',',tmpRTable[,9],sep="")
write.csv(PDQ,file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE,row.names = FALSE)

We could now use this file (or one like it) as input to additional profiling exercises or to create a visualization, but let's move on.

Comparisons

Performing comparisons during data profiling can also add new and different perspectives to the data. Beyond simple record counts (like the total of smoking patients visiting a hospital versus the total of non-smoking patients), one might compare the total number of hospital visits for each state to the average number of hospital visits per state. This requires calculating the total number of hospital visits by state as well as the total number of hospital visits overall (and then computing the average).
The following two lines of code use the R functions table and write.csv to create a list (a file) of the total number of hospital visits found for each state:

# --- calculates the number of hospital visits for each
# --- state (state ID is in field 9 of the file)
StateVisitCount<-table(datas.df[9])
# --- write out a csv file of counts by state
write.csv(StateVisitCount, file="C:/Big Data Visualization/Chapter 3/visitsByStateName.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the file that is generated:

The following R command can be used to calculate the average number of hospital visits per state, by using the nrow function to obtain a count of records in the data source and then dividing it by the number of states:

# --- calculate the average
averageVisits<-nrow(datas.df)/50

Going a bit further with this line of thinking, you might consider that the nine states the U.S. Census Bureau designates as the Northeast region are Connecticut, Maine, Massachusetts, New Hampshire, New York, New Jersey, Pennsylvania, Rhode Island, and Vermont. What is the total number of hospital visits recorded in our file for the Northeast region? R makes it simple with the subset function:

# --- use subset function and the "OR" operator to only have
# --- northeast region states in our list
NERVisits<-subset(tmpRTable, as.character(V9)=="Connecticut" | as.character(V9)=="Maine" | as.character(V9)=="Massachusetts" | as.character(V9)=="New Hampshire" | as.character(V9)=="New York" | as.character(V9)=="New Jersey" | as.character(V9)=="Pennsylvania" | as.character(V9)=="Rhode Island" | as.character(V9)=="Vermont")

Extending our scripting, we can add some additional queries to calculate the average number of hospital visits for the Northeast region and for the whole country:

AvgNERVisits<-nrow(NERVisits)/9
averageVisits<-nrow(tmpRTable)/50

And let's add a visualization:

# --- the c object is the data for the barplot function to
# --- graph
c<-c(AvgNERVisits, averageVisits)
# --- use R barplot
barplot(c, ylim=c(0,3000), ylab="Average Visits", border="Black", names.arg = c("Northeast","all"))
title("Northeast Region vs Country")

The generated visualization is shown below:

Contrasts

The examination of contrasting data is another form of extended data profiling. For example, using this article's data, one could contrast the average body weight of patients who are under a doctor's care against the average body weight of patients who are not (after calculating the average body weight for each group).
To accomplish this, we can calculate the average weight of patients who fall into each category (those currently under a doctor's care and those not), as well as of all patients, using the following R script:

# --- read in our entire file
tmpRTable<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt",sep=",")
# --- use the subset function to create the 2 groups we are
# --- interested in
UCare.sub<-subset(tmpRTable, V20=="Yes")
NUCare.sub<-subset(tmpRTable, V20=="No")
# --- use the mean function to get the average body weight of all patients in the file as well as for each of our separate groups
average_undercare<-mean(as.numeric(as.character(UCare.sub[,5])))
average_notundercare<-mean(as.numeric(as.character(NUCare.sub[,5])))
averageoverall<-mean(as.numeric(as.character(tmpRTable[2:nrow(tmpRTable),5])))
average_undercare;average_notundercare;averageoverall

In short order, we can use R's ability to create subsets of the data (using the subset function) based upon the values in a certain field (or column), and then use the mean function to calculate the average patient weight for each group. The results from running the script (the calculated average weights) are shown below:

And if we use the calculated results to create a simple visualization:

# --- collect the three averages for the chart
c<-c(average_undercare, average_notundercare, averageoverall)
# --- use R barplot to create the bar graph of
# --- average patient weight
barplot(c, ylim=c(0,200), ylab="Patient Weight", border="Black", names.arg = c("under care","not under care", "all"), legend.text= c(format(c[1],digits=5),format(c[2],digits=5),format(c[3],digits=5)))
title("Average Patient Weight")

Tendencies

Identifying tendencies present within your data is also an interesting way of extending data profiling. For example, using this article's sample data, you might determine the number of servings of water consumed per week by each patient age group. Earlier in this section we created a simple R script to count visits by age group; it worked, but in a big data scenario this approach may not. A better approach is to categorize the data into the age groups (age is the fourth field or column in the file) using the following script:

# --- build subsets of each age group
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<65)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

After we have our grouped data, we can calculate water consumption. For example, to count the total weekly servings of water (which is in field or column 96) for age group 1, we can use:

# --- field 96 in the file is the number of servings of water
# --- below line counts the total number of water servings for
# --- age group 1
sum(as.numeric(agegroup1[,96]))

Or the average number of servings of water for the same age group:

mean(as.numeric(agegroup1[,96]))

Take note that R requires the explicit conversion of the value of field 96 (even though it comes in the file as a number) to a number, using the R function as.numeric. Now, let's create the visualization of this perspective of our data.
Below is the R script used to generate the visualization:

# --- group the data into age groups
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<65)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)
# --- calculate the averages by group
g1<-mean(as.numeric(agegroup1[,96]))
g2<-mean(as.numeric(agegroup2[,96]))
g3<-mean(as.numeric(agegroup3[,96]))
g4<-mean(as.numeric(agegroup4[,96]))
g5<-mean(as.numeric(agegroup5[,96]))
g6<-mean(as.numeric(agegroup6[,96]))
# --- create the visualization
barplot(c(g1,g2,g3,g4,g5,g6), axisnames=TRUE, names.arg = c("<21", "22-34", "35-44", "45-54", "55-64", ">65"))
title("Glasses of Water by Age Group")

The generated visualization is shown below:

Dispersion

Finally, dispersion is yet another method of extended data profiling. Dispersion measures how the selected elements behave with regard to some sort of central tendency, usually the mean. For example, we might look at the total number of hospital visits for each age group per calendar month, relative to the average number of hospital visits per month. For this example, we can use the R function subset (to define our age groups and then group the hospital records by those age groups), just as we did in the last example. Below is the script, showing the calculation for each group:

agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<65)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

Remember that these scripts create subsets of the entire file (which we loaded into the object tmpRTable), and that they contain all of the fields of the original file. The agegroup1 group is partially displayed as follows:

Once we have our data categorized by age group (agegroup1 through agegroup6), we can go on and calculate a count of hospital stays by month for each group (shown in the following R commands). Note that the substr function is used to look at the month code (the first 3 characters of the record date) in the file, since we (for now) don't care about the year. The table function can then be used to create an array of counts by month:

az1<-table(substr(agegroup1[,2],1,3))
az2<-table(substr(agegroup2[,2],1,3))
az3<-table(substr(agegroup3[,2],1,3))
az4<-table(substr(agegroup4[,2],1,3))
az5<-table(substr(agegroup5[,2],1,3))
az6<-table(substr(agegroup6[,2],1,3))

Using the above month totals, we can then calculate the average number of hospital visits for each month using the R function mean.
This will be the mean of the monthly totals across ALL age groups:

JanAvg<-mean(c(az1["Jan"], az2["Jan"], az3["Jan"], az4["Jan"], az5["Jan"], az6["Jan"]))

Note that the above code example can be repeated to calculate an average for each month.

Next, we can pull out the totals for each month, for each age group:

Janag1<-az1["Jan"];Febag1<-az1["Feb"];Marag1<-az1["Mar"];Aprag1<-az1["Apr"];Mayag1<-az1["May"];Junag1<-az1["Jun"]
Julag1<-az1["Jul"];Augag1<-az1["Aug"];Sepag1<-az1["Sep"];Octag1<-az1["Oct"];Novag1<-az1["Nov"];Decag1<-az1["Dec"]

The following code "stacks" the totals so we can more easily visualize them later; we would have one line for each age group (that is, Group1Visits, Group2Visits, and so on):

Monthly_Visits<-c(JanAvg, FebAvg, MarAvg, AprAvg, MayAvg, JunAvg, JulAvg, AugAvg, SepAvg, OctAvg, NovAvg, DecAvg)
Group1Visits<-c(Janag1,Febag1,Marag1,Aprag1,Mayag1,Junag1,Julag1,Augag1,Sepag1,Octag1,Novag1,Decag1)
Group2Visits<-c(Janag2,Febag2,Marag2,Aprag2,Mayag2,Junag2,Julag2,Augag2,Sepag2,Octag2,Novag2,Decag2)

Finally, we can now create the visualization:

plot(Monthly_Visits, ylim=c(1000,4000))
lines(Group1Visits, type="b", col="red")
lines(Group2Visits, type="b", col="purple")
lines(Group3Visits, type="b", col="green")
lines(Group4Visits, type="b", col="yellow")
lines(Group5Visits, type="b", col="pink")
lines(Group6Visits, type="b", col="blue")
title("Hospital Visits", sub = "Month to Month", cex.main = 2, font.main= 4, col.main= "blue", cex.sub = 0.75, font.sub = 3, col.sub = "red")

and enjoy the generated output:

Summary

In this article we went over the idea, and the importance, of establishing context and identifying perspectives on big data using data profiling with R. Additionally, we introduced and explored the R programming language as an effective means of profiling big data and used R in numerous illustrative examples. Once again, R is an extremely flexible and powerful tool that works well for data profiling; the reader would be well served by researching and experimenting with the language's vast libraries, as we have only scratched the surface of the features currently available.

Resources for Article:
Further resources on this subject:
Introduction to R Programming Language and Statistical Environment [article]
Fast Data Manipulation with R [article]
DevOps Tools and Technologies [article]
Classification using Convolutional Neural Networks

Mohammad Pezeshki
07 Feb 2017
5 min read
In this blog post, we begin with a simple classification task that the reader can readily relate to. The task is a binary classification of 25,000 images of cats and dogs, divided into 20,000 training, 2,500 validation, and 2,500 testing images. It seems reasonable to use the most promising model for object recognition, which is the convolutional neural network (CNN). As a result, we use a CNN as the baseline for the experiments, and over the course of this post we will try to improve its performance using different techniques, namely Parametric ReLU and Batch Normalization. So, in the next sections, we will first introduce the CNN and its architecture, and then explore these techniques to boost performance and speed, showing the experimental results as we go through each one. The complete code for the CNN is available online in the author's GitHub repository.

Convolutional Neural Networks

Convolutional neural networks can be seen as feedforward neural networks in which multiple copies of the same neuron are applied at different places. This means applying the same function to different patches of an image. Doing so explicitly imposes our knowledge about the data (images) onto the model structure: we already know that natural image data is translation invariant, meaning that the probability distribution over pixels is the same across all images. This structure, followed by a non-linearity and a pooling and subsampling layer, makes CNNs powerful models, especially when dealing with images. Here's a graphical illustration of a CNN from Prof. Hugo Larochelle's course on Neural Networks, which is originally from Prof. Yann LeCun's paper on ConvNets. Implementing a CNN in Theano, a GPU-based framework, is straightforward as well. So, we can create a layer like this: And then we can stack them on top of each other like this:

CNN Experiments

Armed with the CNN, we attacked the task using two baseline models: a relatively big model and a relatively small one. In the figures below, you can see the number of layers, the filter sizes, pooling sizes, strides, and the number of fully connected layers. We trained both networks with a learning rate of 0.01 and a momentum of 0.9 on a GTX 580 GPU, and we used early stopping. The small model can be trained in two hours and reaches 81 percent accuracy on the validation set. The big model can be trained in 24 hours and reaches 92 percent accuracy on the validation set.

Parametric ReLU

Parametric ReLU (aka Leaky ReLU) is an extension of the Rectified Linear Unit that allows the neuron to learn the slope of the activation function in the negative region. Unlike the original Parametric ReLU paper by Microsoft Research, I used a different parameterization that forces the slope to be between 0 and 1. As shown in the figure below, when alpha is 0 the activation function is just linear; on the other hand, if alpha is 1, then the activation function is exactly the ReLU. Interestingly, although the number of trainable parameters increases with Parametric ReLU, it improves the model both in terms of accuracy and in terms of convergence speed: using Parametric ReLU cuts the training time to about three quarters and increases the accuracy by around 1 percent. In Parametric ReLU, to make sure that alpha remains between 0 and 1, we set alpha = Sigmoid(beta) and optimize beta instead. In our experiments, we set the initial value of alpha to 0.5. After training, all alphas were between 0.5 and 0.8.
That means the model enjoys having a small gradient in the negative region. "Basically, even a small slope in the negative region of the activation function can help training a lot. Besides, it's important to let the model decide how much nonlinearity it needs."

Batch Normalization

Batch Normalization simply means normalizing the pre-activations of each batch to have zero mean and unit variance. Based on a recent paper by Google, this normalization reduces a problem called Internal Covariate Shift and consequently makes learning much faster. The equations are as follows:

Personally, I found this to be one of the most interesting and simplest techniques I've ever used. A very important point to keep in mind is to feed the whole validation set as a single batch at testing time, to get a more accurate (less biased) estimate of the mean and variance. "Batch Normalization, which means normalizing pre-activations for each batch to have zero mean and unit variance, can boost the results both in terms of accuracy and in terms of convergence speed."

Conclusion

All in all, we conclude this post with two finalized models. One of them can be trained in 10 epochs or, equivalently, 15 minutes, and can achieve 80 percent accuracy. The other is a relatively large model; in it, we did not use LDNN, but the other two techniques are used, and we achieved 94.5 percent accuracy.

About the Author

Mohammad Pezeshki is a PhD student in the MILA lab at the University of Montreal. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and his Master's in June 2016. His research interests lie in the fields of Artificial Intelligence, Machine Learning, Probabilistic Models and, specifically, Deep Learning.