How-To Tutorials - Programming


The developer-tester face-off needs to end. It's putting our projects at risk.

Aaron Lazar
21 Jul 2018
6 min read
Penny and Leonard work at the same company as a tester and developer respectively. Penny arrives home late, to find Leonard on the couch with his legs up on the table, playing his favourite video game. Leonard: Oh hi sweety, it looks like you had a long day at work. Penny, throwing him a hostile, sideways glance, heads over to the refrigerator. Penny: Did you remember to take out the garbage? Leonard: Of course, sweety. I used 2 bags so Sheldon’s Szechuan sauce from Szechuan Palace doesn’t seep through. Penny: Did you buy new shampoo for the bathroom? Leonard: Yes, I picked up your regular one from the store on the way back. Penny: And did you slap on a last minute field on the SPA at work? Leonard, pausing his video game and answering in a soft, high pitched voice: Whaaaaaat? Source: giphy If you’re a developer or a tester, you’ve probably been in this situation at least once, if not more. Even if your husband or wife might not be on the other side of the source code. The war goes on... The funny thing is that this isn’t something that’s happened in the recent past. The war between Developers and Testers is a long standing, unresolved battle, that is usually brought up in bouts of unnecessary humor. The truth is that this battle is the cause of several projects slipping deadlines, teams not respecting each other’s views, etc. Here we’ll discuss some of the main reasons for this disconnect and try and address them in hope of making the office a better place. #1 You talkin’ to me? One of the main reasons that developers and testers are not on the same page is because neither bother to communicate effectively with the other. Each individual considers informing the other about the strategy/techniques used, a waste of effort. Obviously there are bound to be issues arising with such a disjointed team. The only way to resolve this problem is to toss egos out of the window, sit down and resolve problems like professionals. While tickets might be the most professional and efficient way to resolve things, walking up to the person (if possible), and discussing the best way forward lets you build a relationship, and resolve things more effectively. Moreover, the person on the receiving end will not consider the move offensive, or demeaning. #2 Is it ‘team’ or ‘teams’? You know the answer to this one, but you’re still not willing to accept it. IT managers and team leads need to create an environment in which developers and testers are not two separate teams. Rather, consider them all as engineers working in the same team, towards the same goal! There’s no better recipe to meet success. Use modern methods like Mob or Pair Programming, where both developers and testers work together closely. The ideal scenario would be to possibly have both team members work on the same machine, addressing and strategising to achieve the goal with continuous, real-time feedback. A good pairing station, Source: ministry of testing #3 On the same page? Which book you got there? If you’re a developer, this one’s especially for you, so listen carefully! Most developers aren’t aware of what tools the testers in their teams use, which is a sin. Being aware of testing tools, methodologies and processes, goes a long way in enabling a smooth and speedy testing process. A developer will be able to understand which parts of their code can probably be a tester’s target, what changes would give testers a tough time and on the other hand, what makes it easy. #4 One goal, two paths to achieve it Well, this is true a lot of times. 
Developers aim to “build” an application. Testers on the other hand aim to “break” the application. Now while this is not wrong, it’s the vision with which the tester is actually planning on breaking things. Testers, you should always keep the customer’s or end user’s requirements clear, while approaching the application. I may not be an actual tester, and you might wonder how I can empathise with other testers. Honestly, I don’t build software, neither do I test any. But I’ve been in a very similar role, earlier on in my Publishing career. As a Commissioning Gatekeeper, I was responsible for validating book and video ideas from the Commissioning Editors. Like a tester, my job was to identify why or how something wouldn’t work in the market. Now, I could easily approach a particular book idea from the perspective of ‘trashing’ it. But when I learned to approach it from the customer’s point of view, my perspective changed, and I was able to give better constructive feedback to the editor. Don’t aim to destroy, aim to improve. If you must kill off an idea or a feature, do it firmly but with kindness. #5 Trust the Developer’s testing skills Yes! Lack of trust is one of the main reasons why there’s so much friction between developers and testers. A tester needs to understand the developer and believe that they can also write tests with a clear goal in mind. Test Driven Development is a great approach to follow. Here, the developer will know better, what angles to test from, and this can help the tester write a mutually defined test case for the developer to run. At the same time, the tester can also provide insight into how to address bugs that might creep up while running the tests. With this combined knowledge, the developers will be able to minimize the number of bugs at the first go itself! Toss in a Business-Driven Development approach, and you’ve got yourself a team that delivers user stories that are more aligned to the business requirement than ever before! In the end, developers and testers, both need to set their egos aside and make peace with each other. If you really look at it, it’s not that hard at all. It’s all about how the two collaborate to create better software, rather than working in silos. IT managers can play an important role here, and they need to understand the advantages and limitations of their team. They need to ensure the unity of the team by encouraging more engaging ways of working as well as introducing modern methodologies that would assist a peaceful, collaborative effort. Why does more than half the IT industry suffer from Burnout? Abandoning Agile Unit testing with Java frameworks: JUnit and TestNG [Tutorial]


Asynchronous Programming in F#

Packt
12 Oct 2016
15 min read
In this article by Alfonso Garcia Caro Nunez and Suhaib Fahad, author of the book Mastering F#, sheds light on how writing applications that are non-blocking or reacting to events is increasingly becoming important in this cloud world we live in. A modern application needs to carry out a rich user interaction, communicate with web services, react to notifications, and so on; the execution of reactive applications is controlled by events. Asynchronous programming is characterized by many simultaneously pending reactions to internal or external events. These reactions may or may not be processed in parallel. (For more resources related to this topic, see here.) In .NET, both C# and F# provide asynchronous programming experience through keywords and syntaxes. In this article, we will go through the asynchronous programming model in F#, with a bit of cross-referencing or comparison drawn with the C# world. In this article, you will learn about asynchronous workflows in F# Asynchronous workflows in F# Asynchronous workflows are computation expressions that are setup to run asynchronously. It means that the system runs without blocking the current computation thread when a sleep, I/O, or other asynchronous process is performed. You may be wondering why do we need asynchronous programming and why can't we just use the threading concepts that we did for so long. The problem with threads is that the operation occupies the thread for the entire time that something happens or when a computation is done. On the other hand, asynchronous programming will enable a thread only when it is required, otherwise it will be normal code. There is also lot of marshalling and unmarshalling (junk) code that we will write around to overcome the issues that we face when directly dealing with threads. Thus, asynchronous model allows the code to execute efficiently whether we are downloading a page 50 or 100 times using a single thread or if we are doing some I/O operation over the network and there are a lot of incoming requests from the other endpoint. There is a list of functions that the Async module in F# exposes to create or use these asynchronous workflows to program. The asynchronous pattern allows writing code that looks like it is written for a single-threaded program, but in the internals, it uses async blocks to execute. There are various triggering functions that provide a wide variety of ways to create the asynchronous workflow, which is either a background thread, a .NET framework task object, or running the computation in the current thread itself. In this article, we will use the example of downloading the content of a webpage and modifying the data, which is as follows: let downloadPage (url: string) = async { let req = HttpWebRequest.Create(url) use! resp = req.AsyncGetResponse() use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } downloadPage("https://www.google.com") |> Async.RunSynchronously The preceding function does the following: The async expression, { … }, generates an object of type Async<string> These values are not actual results; rather, they are specifications of tasks that need to run and return a string Async.RunSynchronously takes this object and runs synchronously We just wrote a simple function with asynchronous workflows with relative ease and reason about the code, which is much better than using code with Begin/End routines. 
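As a quick aside, error handling fits into this model with an ordinary try...with expression inside the async block. The following is a minimal sketch only, reusing the article's downloadPage pattern and assuming the same namespaces are open (System.IO and System.Net); tryDownloadPage and its Option return type are illustrative choices, not code from the book:

// Sketch: the downloadPage pattern with basic error handling.
let tryDownloadPage (url: string) =
    async {
        try
            let req = HttpWebRequest.Create(url)
            use! resp = req.AsyncGetResponse()
            use respStream = resp.GetResponseStream()
            use sr = new StreamReader(respStream)
            // Success: hand back the page body wrapped in Some
            return Some (sr.ReadToEnd())
        with ex ->
            // Failures (DNS errors, timeouts, HTTP errors) surface here,
            // just as they would in synchronous code
            printfn "Download of %s failed: %s" url ex.Message
            return None
    }

tryDownloadPage "https://www.google.com" |> Async.RunSynchronously

The Async module also offers Async.Catch, which materializes the outcome as a Choice between the result and the exception, if you prefer not to write try...with yourself.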
One of the most important point here is that the code is never blocked during the execution of the asynchronous workflow. This means that we can, in principle, have thousands of outstanding web requests—the limit being the number supported by the machine, not the number of threads that host them. Using let! In asynchronous workflows, we will use let! binding to enable execution to continue on other computations or threads, while the computation is being performed. After the execution is complete, the rest of the asynchronous workflow is executed, thus simulating a sequential execution in an asynchronous way. In addition to let!, we can also use use! to perform asynchronous bindings; basically, with use!, the object gets disposed when it loses the current scope. In our previous example, we used use! to get the HttpWebResponse object. We can also do as follows: let! resp = req.AsyncGetResponse() // process response We are using let! to start an operation and bind the result to a value, do!, which is used when the return of the async expression is a unit. do! Async.Sleep(1000) Understanding asynchronous workflows As explained earlier, asynchronous workflows are nothing but computation expressions with asynchronous patterns. It basically implements the Bind/Return pattern to implement the inner workings. This means that the let! expression is translated into a call to async. The bind and async.Return function are defined in the Async module in the F# library. This is a compiler functionality to translate the let! expression into computation workflows and, you, as a developer, will never be required to understand this in detail. The purpose of explaining this piece is to understand the internal workings of an asynchronous workflow, which is nothing but a computation expression. The following listing shows the translated version of the downloadPage function we defined earlier: async.Delay(fun() -> let req = HttpWebRequest.Create(url) async.Bind(req.AsyncGetResponse(), fun resp -> async.Using(resp, fun resp -> let respStream = resp.GetResponseStream() async.Using(new StreamReader(respStream), fun sr " -> reader.ReadToEnd() ) ) ) ) The following things are happening in the workflow: The Delay function has a deferred lambda that executes later. The body of the lambda creates an HttpWebRequest and is forwarded in a variable req to the next segment in the workflow. The AsyncGetResponse function is called and a workflow is generated, where it knows how to execute the response and invoke when the operation is completed. This happens internally with the BeginGetResponse and EndGetResponse functions already present in the HttpWebRequest class; the AsyncGetResponse is just a wrapper extension present in the F# Async module. The Using function then creates a closure to dispose the object with the IDisposable interface once the workflow is complete. Async module The Async module has a list of functions that allows writing or consuming asynchronous code. We will go through each function in detail with an example to understand it better. Async.AsBeginEnd It is very useful to expose the F# workflow functionality out of F#, say if we want to use and consume the API's in C#. The Async.AsBeginEnd method gives the possibility of exposing the asynchronous workflows as a triple of methods—Begin/End/Cancel—following the .NET Asynchronous Programming Model (APM). 
Based on our downloadPage function, we can define the Begin, End, Cancel functions as follows: type Downloader() = let beginMethod, endMethod, cancelMethod = " Async.AsBeginEnd downloadPage member this.BeginDownload(url, callback, state : obj) = " beginMethod(url, callback, state) member this.EndDownload(ar) = endMethod ar member this.CancelDownload(ar) = cancelMethod(ar) Async.AwaitEvent The Async.AwaitEvent method creates an asynchronous computation that waits for a single invocation of a .NET framework event by adding a handler to the event. type MyEvent(v : string) = inherit EventArgs() member this.Value = v; let testAwaitEvent (evt : IEvent<MyEvent>) = async { printfn "Before waiting" let! r = Async.AwaitEvent evt printfn "After waiting: %O" r.Value do! Async.Sleep(1000) return () } let runAwaitEventTest () = let evt = new Event<Handler<MyEvent>, _>() Async.Start <| testAwaitEvent evt.Publish System.Threading.Thread.Sleep(3000) printfn "Before raising" evt.Trigger(null, new MyEvent("value")) printfn "After raising" > runAwaitEventTest();; > Before waiting > Before raising > After raising > After waiting : value The testAwaitEvent function listens to the event using Async.AwaitEvent and prints the value. As the Async.Start will take some time to start up the thread, we will simply call Thread.Sleep to wait on the main thread. This is for example purpose only. We can think of scenarios where a button-click event is awaited and used inside an async block. Async.AwaitIAsyncResult Creates a computation result and waits for the IAsyncResult to complete. IAsyncResult is the asynchronous programming model interface that allows us to write asynchronous programs. It returns true if IAsyncResult issues a signal within the given timeout. The timeout parameter is optional, and its default value is -1 of Timeout.Infinite. let testAwaitIAsyncResult (url: string) = async { let req = HttpWebRequest.Create(url) let aResp = req.BeginGetResponse(null, null) let! asyncResp = Async.AwaitIAsyncResult(aResp, 1000) if asyncResp then let resp = req.EndGetResponse(aResp) use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() else return "" } > Async.RunSynchronously (testAwaitIAsyncResult "https://www.google.com") We will modify the downloadPage example with AwaitIAsyncResult, which allows a bit more flexibility where we want to add timeouts as well. In the preceding example, the AwaitIAsyncResult handle will wait for 1000 milliseconds, and then it will execute the next steps. Async.AwaitWaitHandle Creates a computation that waits on a WaitHandle—wait handles are a mechanism to control the execution of threads. The following is an example with ManualResetEvent: let testAwaitWaitHandle waitHandle = async { printfn "Before waiting" let! r = Async.AwaitWaitHandle waitHandle printfn "After waiting" } let runTestAwaitWaitHandle () = let event = new System.Threading.ManualResetEvent(false) Async.Start <| testAwaitWaitHandle event System.Threading.Thread.Sleep(3000) printfn "Before raising" event.Set() |> ignore printfn "After raising" The preceding example uses ManualResetEvent to show how to use AwaitHandle, which is very similar to the event example that we saw in the previous topic. Async.AwaitTask Returns an asynchronous computation that waits for the given task to complete and returns its result. This helps in consuming C# APIs that exposes task based asynchronous operations. let downloadPageAsTask (url: string) = async { let req = HttpWebRequest.Create(url) use! 
resp = req.AsyncGetResponse() use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } |> Async.StartAsTask let testAwaitTask (t: Task<string>) = async { let! r = Async.AwaitTask t return r } > downloadPageAsTask "https://www.google.com" |> testAwaitTask |> Async.RunSynchronously;; The preceding function is also downloading the web page as HTML content, but it starts the operation as a .NET task object. Async.FromBeginEnd The FromBeginEnd method acts as an adapter for the asynchronous workflow interface by wrapping the provided Begin/End method. Thus, it allows using large number of existing components that support an asynchronous mode of work. The IAsyncResult interface exposes the functions as Begin/End pattern for asynchronous programming. We will look at the same download page example using FromBeginEnd: let downloadPageBeginEnd (url: string) = async { let req = HttpWebRequest.Create(url) use! resp = Async.FromBeginEnd(req.BeginGetResponse, req.EndGetResponse) use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } The function accepts two parameters and automatically identifies the return type; we will use BeginGetResponse and EndGetResponse as our functions to call. Internally, Async.FromBeginEnd delegates the asynchronous operation and gets back the handle once the EndGetResponse is called. Async.FromContinuations Creates an asynchronous computation that captures the current success, exception, and cancellation continuations. To understand these three operations, let's create a sleep function similar to Async.Sleep using timer: let sleep t = Async.FromContinuations(fun (cont, erFun, _) -> let rec timer = new Timer(TimerCallback(callback)) and callback state = timer.Dispose() cont(()) timer.Change(t, Timeout.Infinite) |> ignore ) let testSleep = async { printfn "Before" do! sleep 5000 printfn "After 5000 msecs" } Async.RunSynchronously testSleep The sleep function takes an integer and returns a unit; it uses Async.FromContinuations to allow the flow of the program to continue when a timer event is raised. It does so by calling the cont(()) function, which is a continuation to allow the next step in the asynchronous flow to execute. If there is any error, we can call erFun to throw the exception and it will be handled from the place we are calling this function. Using the FromContinuation function helps us wrap and expose functionality as async, which can be used inside asynchronous workflows. It also helps to control the execution of the programming with cancelling or throwing errors using simple APIs. Async.Start Starts the asynchronous computation in the thread pool. It accepts an Async<unit> function to start the asynchronous computation. The downloadPage function can be started as follows: let asyncDownloadPage(url) = async { let! result = downloadPage(url) printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.Start   We wrap the function to another async function that returns an Async<unit> function so that it can be called by Async.Start. Async.StartChild Starts a child computation within an asynchronous workflow. This allows multiple asynchronous computations to be executed simultaneously, as follows: let subTask v = async { print "Task %d started" v Thread.Sleep (v * 1000) print "Task %d finished" v return v } let mainTask = async { print "Main task started" let! childTask1 = Async.StartChild (subTask 1) let! 
childTask2 = Async.StartChild (subTask 5) print "Subtasks started" let! child1Result = childTask1 print "Subtask1 result: %d" child1Result let! child2Result = childTask2 print "Subtask2 result: %d" child2Result print "Subtasks completed" return () } Async.RunSynchronously mainTask Async.StartAsTask Executes a computation in the thread pool and returns a task that will be completed in the corresponding state once the computation terminates. We can use the same example of starting the downloadPage function as a task. let downloadPageAsTask (url: string) = async { let req = HttpWebRequest.Create(url) use! resp = req.AsyncGetResponse() use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } |> Async.StartAsTask let task = downloadPageAsTask("http://www.google.com") prinfn "Do some work" task.Wait() printfn "done"   Async.StartChildAsTask Creates an asynchronous computation from within an asynchronous computation, which starts the given computation as a task. let testAwaitTask = async { print "Starting" let! child = Async.StartChildAsTask <| async { // " Async.StartChildAsTask shall be described later print "Child started" Thread.Sleep(5000) print "Child finished" return 100 } print "Waiting for the child task" let! result = Async.AwaitTask child print "Child result %d" result } Async.StartImmediate Runs an asynchronous computation, starting immediately on the current operating system thread. This is very similar to the Async.Start function we saw earlier: let asyncDownloadPage(url) = async { let! result = downloadPage(url) printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.StartImmediate Async.SwitchToNewThread Creates an asynchronous computation that creates a new thread and runs its continuation in it: let asyncDownloadPage(url) = async { do! Async.SwitchToNewThread() let! result = downloadPage(url) printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.Start   Async.SwitchToThreadPool Creates an asynchronous computation that queues a work item that runs its continuation, as follows: let asyncDownloadPage(url) = async { do! Async.SwitchToNewThread() let! result = downloadPage(url) do! Async.SwitchToThreadPool() printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.Start   Async.SwitchToContext Creates an asynchronous computation that runs its continuation in the Post method of the synchronization context. Let's assume that we set the text from the downloadPage function to a UI textbox, then we will do it as follows: let syncContext = System.Threading.SynchronizationContext() let asyncDownloadPage(url) = async { do! Async.SwitchToContext(syncContext) let! result = downloadPage(url) textbox.Text <- result"} asyncDownloadPage "http://www.google.com" |> Async.Start   Note that in the console applications, the context will be null. Async.Parallel The Parallel function allows you to execute individual asynchronous computations queued in the thread pool and uses the fork/join pattern. let parallel_download() = let sites = ["http://www.bing.com"; "http://www.google.com"; "http://www.yahoo.com"; "http://www.search.com"] let htmlOfSites = Async.Parallel [for site in sites -> downloadPage site ] |> Async.RunSynchronously printfn "%A" htmlOfSites   We will use the same example of downloading HTML content in a parallel way. 
The preceding example shows the essence of parallel I/O computation The async function, { … }, in the downloadPage function shows the asynchronous computation These are then composed in parallel using the fork/join combinator In this sample, the composition executed waits synchronously for overall result Async.OnCancel A cancellation interruption in the asynchronous computation when a cancellation occurs. It returns an asynchronous computation trigger before being disposed. // This is a simulated cancellable computation. It checks the " token source // to see whether the cancel signal was received. let computation " (tokenSource:System.Threading.CancellationTokenSource) = async { use! cancelHandler = Async.OnCancel(fun () -> printfn " "Canceling operation.") // Async.Sleep checks for cancellation at the end of " the sleep interval, // so loop over many short sleep intervals instead of " sleeping // for a long time. while true do do! Async.Sleep(100) } let tokenSource1 = new " System.Threading.CancellationTokenSource() let tokenSource2 = new " System.Threading.CancellationTokenSource() Async.Start(computation tokenSource1, tokenSource1.Token) Async.Start(computation tokenSource2, tokenSource2.Token) printfn "Started computations." System.Threading.Thread.Sleep(1000) printfn "Sending cancellation signal." tokenSource1.Cancel() tokenSource2.Cancel() The preceding example implements the Async.OnCancel method to catch or interrupt the process when CancellationTokenSource is cancelled. Summary In this article, we went through detail, explanations about different semantics in asynchronous programming with F#, used with asynchronous workflows. We saw a number of functions from the Async module. Resources for Article: Further resources on this subject: Creating an F# Project [article] Asynchronous Control Flow Patterns with ES2015 and beyond [article] Responsive Applications with Asynchronous Programming [article]


Introduction and Composition

Packt
19 Aug 2015
17 min read
In this article written by Diogo Resende, author of the book Node.js High Performance, we will discuss how high performance is hard, and how it depends on many factors. Best performance should be a constant goal for developers. To achieve it, a developer must know the programming language they use and, more importantly, how the language performs under heavy loads, these being disk, memory, network, and processor usage. (For more resources related to this topic, see here.) Developers will make the most out of a language if they know its weaknesses. In a perfect world, since every job is different, a developer should look for the best tool for the job. But this is not feasible and a developer wouldn't be able to know every best tool, so they have to look for the second best tool for every job. A developer will excel if they know few tools but master them. As a metaphor, a hammer is used to drive nails, and you can also use it to break objects apart or forge metals, but you shouldn't use it to drive screws. The same applies to languages and platforms. Some platforms are very good for a lot of jobs but perform really badly at other jobs. This performance can sometimes be mitigated, but at other times, can't be avoided and you should look for better tools. Node.js is not a language; it's actually a platform built on top of V8, Google's open source JavaScript engine. This engine implements ECMAScript, which itself is a simple and very flexible language. I say "simple" because it has no way of accessing the network, accessing the disk, or talking to other processes. It can't even stop execution since it has no kind of exit instruction. This language needs some kind of interface model on top of it to be useful. Node.js does this by exposing a (preferably) nonblocking I/O model using libuv. This nonblocking API allows you to access the filesystem, connect to network services and execute child processes. The API also has two other important elements: buffers and streams. Since JavaScript strings are Unicode friendly, buffers were introduced to help deal with binary data. Streams are used as simple event interfaces to pass data around. Buffers and streams are used all over the API when reading file contents or receiving network packets. A stream is a module, similar to the network module. When loaded, it provides access to some base classes that help create readable, writable, duplex, and transform streams. These can be used to perform all sorts of data manipulation in a simplified and unified format. The buffers module easily becomes your best friend when converting binary data formats to some other format, for example, JSON. Multiple read and write methods help you convert integers and floats, signed or not, big endian or little endian, from 8 bits to 8 bytes long. Most of the platform is designed to be simple, small, and stable. It's designed and ready to create some high-performance applications. Performance analysis Performance is the amount of work completed in a defined period of time and with a set of defined resources. It can be analyzed using one or more metrics that depend on the performance goal. The goal can be low latency, low memory footprint, reduced processor usage, or even reduced power consumption. The act of performance analysis is also called profiling. Profiling is very important for making optimized applications and is achieved by instrumenting either the source or the instance of the application. By instrumenting the source, developers can spot common performance weak spots. 
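For instance, a very small sketch of source instrumentation might wrap a high-resolution timer around a suspect function; the probe helper and labels below are invented for this example and use only Node.js core APIs:

// Sketch only: wrap any synchronous function with a timing probe.
function probe(label, fn) {
  return function (...args) {
    const start = process.hrtime.bigint();          // Node.js high-resolution timer
    const result = fn.apply(this, args);
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(label + ' took ' + elapsedMs.toFixed(3) + ' ms');
    return result;
  };
}

// Hypothetical usage: time the parsing of a large JSON payload.
const parseBig = probe('JSON.parse', JSON.parse);
parseBig(JSON.stringify({ items: new Array(100000).fill(42) }));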
By instrumenting an application instance, they can test the application on different environments. This type of instrumentation can also be known by the name benchmarking. Node.js is known for being fast. Actually, it's not that fast; it's just as fast as your resources allow it. What Node.js is best at is not blocking your application because of an I/O task. The perception of performance can be misleading in Node.js applications. In some other languages, when an application task gets blocked—for example, by a disk operation—all other tasks can be affected. In the case of Node.js, this doesn't happen—usually. Some people look at the platform as being single threaded, which isn't true. Your code runs on a thread, but there are a few more threads responsible for I/O operations. Since these operations are extremely slow compared to the processor's performance, they run on a separate thread and signal the platform when they have information for your application. Applications blocking I/O operations perform poorly. Since Node.js doesn't block I/O unless you want it to, other operations can be performed while waiting for I/O. This greatly improves performance. V8 is an open source Google project and is the JavaScript engine behind Node.js. It's responsible for compiling and executing JavaScript, as well as managing your application's memory needs. It is designed with performance in mind. V8 follows several design principles to improve language performance. The engine has a profiler and one of the best and fast garbage collectors that exist, which is one of the keys to its performance. It also does not compile the language into byte code; it compiles it directly into machine code on the first execution. A good background in the development environment will greatly increase the chances of success in developing high-performance applications. It's very important to know how dereferencing works, or why your variables should avoid switching types. Here are other useful tips you would want to follow. You can use a style guide like JSCS and a linter like JSHint to enforce them to for yourself and your team. Here are some of them: Write small functions, as they're more easily optimized Use monomorphic parameters and variables Prefer arrays to manipulate data, as integer-indexed elements are faster Try to have small objects and avoid long prototype chains Avoid cloning objects because big objects will slow the operations Monitoring After an application is put into production mode, performance analysis becomes even more important, as users will be more demanding than you were. Users don't accept anything that takes more than a second, and monitoring the application's behavior over time and over some specific loads will be extremely important, as it will point to you where your platform is failing or will fail next. Yes, your application may fail, and the best you can do is be prepared. Create a backup plan, have fallback hardware, and create service probes. Essentially, anticipate all the scenarios you can think of, and remember that your application will still fail. Here are some of those scenarios and aspects that you should monitor: When in production, application usage is of extreme importance to understand where your application is heading in terms of data size or memory usage. It's important that you carefully define source code probes to monitor metrics—not only performance metrics, such as requests per second or concurrent requests, but also error rate and exception percentage per request served. 
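A minimal sketch of such a probe, assuming a bare Node.js HTTP server (the counter fields and the /metrics route are invented for this example), could look like this:

const http = require('http');

// In-process counters; a real application would ship these to a monitoring system.
const stats = { requests: 0, errors: 0, started: Date.now() };

const server = http.createServer((req, res) => {
  stats.requests += 1;
  if (req.url === '/metrics') {
    const uptimeSeconds = (Date.now() - stats.started) / 1000;
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify({
      requestsPerSecond: stats.requests / uptimeSeconds,
      errorRate: stats.requests > 0 ? stats.errors / stats.requests : 0
    }));
    return;
  }
  try {
    res.end('ok'); // the real request handling would go here
  } catch (err) {
    stats.errors += 1;
    res.statusCode = 500;
    res.end('error');
  }
});

server.listen(3000);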
Your application emits errors and sometimes throws exceptions; it's normal and you shouldn't ignore them. Don't forget the rest of the infrastructure. If your application must perform at high standards, your infrastructure should too. Your server power supply should be uninterruptible and stable, as instability will degrade your hardware faster than it should. Choose your disks wisely, as faster disks are more expensive and usually come in smaller storage sizes. Sometimes, however, this is actually not a bad decision when your application doesn't need that much storage and speed is considered more important. But don't just look at the gigabytes per dollar. Sometimes, it's more important to look at the gigabits per second per dollar. Also, your server temperature and server room should be monitored. High temperatures degrades performance and your hardware has an operation temperature limit. Security, both physical and virtual, is also very important. Everything counts for the standards of high performance, as an application that stops serving its users is not performing at all. Getting high performance Planning is essential in order to achieve the best results possible. High performance is built from the ground up and starts with how you plan and develop. It obviously depends on physical resources, as you can't perform well when you don't have sufficient memory to accomplish your task, but it also depends greatly on how you plan and develop an application. Mastering tools will give much better performance chances than just using them. Setting the bar high from the beginning of development will force the planning to be more prudent. Some bad planning of the database layer can really downgrade performance. Also, cautious planning will cause developers to think more about “use cases and program more consciously. High performance is when you have to think about a new set of resources (processor, memory, storage) because all that you have is exhausted, not just because one resource is. A high-performance application shouldn't need a second server when a little processor is used and the disk is full. In such a case, you just need bigger disks. Applications can't be designed as monolithic these days. An increasing user base enforces a distributed architecture, or at least one that can distribute load by having multiple instances. This is very important to accommodate in the beginning of the planning, as it will be harder to change an application that is already in production. Most common applications will start performing worse over time, not because of deficit of processing power but because of increasing data size on databases and disks. You'll notice that the importance of memory increases and fallback disks become critical to avoiding downtime. It's very important that an application be able to scale horizontally, whether to shard data across servers or across regions. A distributed architecture also increases performance. Geographically distributed servers can be more closed to clients and give a perception of performance. Also, databases distributed by more servers will handle more traffic as a whole and allow DevOps to accomplish zero downtime goals. This is also very useful for maintenance, as nodes can be brought down for support without affecting the application. Testing and benchmarking To know whether an application performs well or not under specific environments, we have to test it. This kind of test is called a benchmark. 
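Before reaching for the dedicated tools covered next, a hand-rolled micro-benchmark is often enough to compare two implementations. This is a sketch only, and the candidate functions are placeholders:

// Sketch: time N iterations of each candidate with console.time/console.timeEnd.
function bench(label, fn, iterations) {
  console.time(label);
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  console.timeEnd(label);
}

const data = Array.from({ length: 1000 }, (unused, i) => i);
bench('for-of sum', () => { let sum = 0; for (const n of data) sum += n; return sum; }, 100000);
bench('reduce sum', () => data.reduce((sum, n) => sum + n, 0), 100000);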
Benchmarking is important, and it is specific to every application. Even on the same language and platform, different applications might perform differently, either because of the way some parts of an application were structured or the way a database was designed. Analyzing performance will indicate the bottlenecks of your application, that is, the parts that do not perform as well as others. These are the parts that need to be improved. Constantly improving the worst-performing parts will elevate the application's overall performance.

There are plenty of tools out there, some more specific to or focused on JavaScript applications, such as benchmarkjs (http://benchmarkjs.com/) and ben (https://github.com/substack/node-ben), and others more generic, such as ab (http://httpd.apache.org/docs/2.2/programs/ab.html) and httpload (https://github.com/perusio/httpload). There are several types of benchmark tests, depending on the goal:

Load testing is the simplest form of benchmarking. It is done to find out how the application performs under a specific load. You can test and find out how many connections an application accepts per second, or how much traffic in bytes an application can handle. An application's load can be checked by looking at external performance, such as traffic, and also at internal performance, such as the processor used or the memory consumed.

Soak testing is used to see how an application performs over a more extended period of time. It is done when an application tends to degrade over time and analysis is needed to see how it reacts. This type of test is important for detecting memory leaks, as some applications can perform well in basic tests, but over time memory leaks and their performance can degrade.

Spike testing is used when a load is increased very quickly to see how the application reacts and performs. This test is very useful and important for applications that can have usage spikes, where operators need to know how the application will react. Twitter is a good example of an application environment that can be affected by usage spikes (during world events such as sports or religious dates) and needs to know how its infrastructure will handle them.

All of these tests can become harder as your application grows. As your user base gets bigger, your application scales and you lose the ability to load test with the resources you have. It's good to be prepared for this moment, and especially to be prepared to monitor performance and keep track of soaks and spikes once your application's users become the ones continuously generating the load.

Composition in applications

Because of this continuous demand for performant applications, composition becomes very important. Composition is a practice where you split the application into several smaller and simpler parts, making them easier to understand, develop, and maintain. It also makes them easier to test and improve. Avoid creating big, monolithic code bases. They don't work well when you need to make a change, and they also don't work well when you need to test and analyze any part of the code to improve it and make it perform better. The Node.js platform helps you, and in some ways forces you, to compose your code. Node.js Package Manager (NPM) is a great module publishing service. You can download other people's modules and publish your own as well.
There are tens of thousands of modules published, which means that you don't have to reinvent the wheel in most cases. This is good since you can avoid wasting time on creating a module and use a module that is already in production and used by many people, which normally means that bugs will be tracked faster and improvements will be delivered even faster. The Node.js platform allows developers to easily separate code. You don't have to do this, as the platform doesn't force you to, but you should try and follow some good practices, such as the ones described in the following sections. Using NPM Don't rewrite code unless you need to. Take your time to try some available modules, and choose the one that is right for you. This reduces the probability of writing faulty code and helps published modules that have a bigger user base. Bugs will be spotted earlier, and more people in different environments will test fixes. Moreover, you will be using a more resilient module. One important and neglected task after starting to use some modules is to track changes and, whenever possible, keep using recent stable versions. If a dependency module has not been updated for a year, you can spot a problem later, but you will have a hard time figuring out what changed between two versions that are a year apart. Node.js modules tend to be improved over time and API changes are not rare. Always upgrade with caution and don't forget to test. Separating your code Again, you should always split your code into smaller parts. Node.js helps you do this in a very easy way. You should not have files bigger than 5 kB. If you have, you better think about splitting it. Also, as a good rule, each user-defined object should have its own separate file. Name your files accordingly: // MyObject.js module.exports = MyObject; function MyObject() { // … } MyObject.prototype.myMethod = function () { … }; Another good rule to check whether you have a file bigger than it should be; that is, it should be easy to read and understand in less than 5 minutes by someone new to the application. If not, it means that it's too complex and it will be harder to track and fix bugs later on. Remember that later on, when your application becomes huge, you will be like a new developer when opening a file to fix something. You can't remember all of the code of the application, and you need to absorb a file behavior fast. Garbage collection When writing applications, managing available memory is boring and difficult. When the application gets complex it's easy to start leaking memory. Many programming languages have automatic memory management, removing this management away from the developer by means of a Garbage Collector (GC). The GC is only a part of this memory management, but it's the most important one and is responsible for reclaiming memory that is no longer in use (garbage), by periodically looking at disposed referenced objects and freeing the memory associated with them. The most common technique used by GC is to monitor reference counting. This means that GC, for each object, holds the number (count) of other objects that reference it. When an object has no references to it, it can be collected, meaning, it can be disposed and its memory freed. CPU Profiling Profiling is boring but it's a good form of software analysis where you measure resource usage. This usage is measured over time and sometimes under specific workloads. Resources can mean anything the application is using, being it memory, disk, network or processor. 
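As a small sketch, two of those resources can be sampled from inside a running Node.js process with core APIs alone; the one-second interval and the megabyte formatting below are arbitrary choices for illustration:

// Sample heap and CPU usage once per second.
let lastCpu = process.cpuUsage();

setInterval(() => {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(lastCpu); // microseconds spent since the previous sample
  lastCpu = process.cpuUsage();

  console.log(
    'heap ' + (mem.heapUsed / 1048576).toFixed(1) + '/' + (mem.heapTotal / 1048576).toFixed(1) + ' MB, ' +
    'rss ' + (mem.rss / 1048576).toFixed(1) + ' MB, ' +
    'cpu ' + ((cpu.user + cpu.system) / 1000).toFixed(1) + ' ms over the last second'
  );
}, 1000);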
More specifically, CPU profiling allows you to analyze how, and how much, your functions use the processor. You can also analyze the opposite: the idle time, when the processor is not being used. When profiling the processor, we usually take samples of the call stack at a certain frequency and analyze how the stack changes (grows or shrinks) over the sampling period. If you use profilers from the operating system, you will see more items in the stack than you probably expect, as you will also get internal calls of Node.js and V8.

Summary

Together, Node.js and NPM make a very good platform for developing high-performance applications. Since the language behind them is JavaScript and most applications these days are web applications, this combination is an even more appealing choice, as it is one less server-side language to learn (such as PHP or Ruby) and it ultimately allows developers to share code between the client and server sides. Also, frontend and backend developers can share, read, and improve each other's code. Many developers pick this formula and bring with them many of their habits from the client side. Some of these habits are not applicable because, on the server side, asynchronous tasks must rule: there are many clients connected (as opposed to one) and performance becomes crucial.

You now see that the garbage collector's task is not that easy. It does a very good job of managing memory automatically, and you can help it a lot, especially if you write applications with performance in mind. Avoiding growth of the GC old space is necessary to avoid long GC cycles that pause your application and sometimes force your services to restart. Every time you create a new variable you allocate memory, and you get closer to a new GC cycle. Even when you understand how memory is managed, you sometimes need to inspect your memory usage behavior.

In today's environments, it is very important to be able to profile an application to identify bottlenecks, especially at the processor and memory level. Overall, you should focus on your code quality before going into profiling.

Resources for Article:

Further resources on this subject:
Node.js Fundamentals [Article]
So, what is Node.js? [Article]
Setting up Node [Article]


Github wants to improve Open Source sustainability; invites maintainers to talk about their OSS challenges

Sugandha Lahoti
18 Jan 2019
4 min read
Open source sustainability is an essential part of free and open software development. Open source contributors and maintainers build tools and technologies for everyone, yet they rarely get the resources, tooling, and environment they need. When something goes wrong with a project, it is usually the maintainers who are held responsible, when in reality contributors and maintainers share that responsibility equally.

Yesterday, Devon Zuegel, the open source product manager at GitHub, penned a blog post about open source sustainability and the issues maintainers currently face.

The biggest thing holding back OSS is the work overload maintainers face. The open source community largely consists of maintainers who hold jobs at other organizations and maintain their projects in their free time. This leaves little room for creators to gain economically from their projects or to cover the costs and people required to maintain them, which is why companies and individuals are being encouraged to donate to these maintainers on GitHub. As one Hacker News user points out, “I think this would be a huge incentive for people to continue their work long-term and not just "hand over" repositories to people with ulterior motives.” Another said, “Integrating bug bounties and donations into GitHub could be one of the best things to happen to Open Source. Funding new features and bug fixes could become seamless, and it would sway more devs to adopt this model for their projects.”

Another major challenge is the abuse and frustration maintainers deal with on a daily basis. As Devon writes on her blog, “No one deserves abuse. OSS contributors are often on the receiving end of harassment, demands, and general disrespect, even as they volunteer their time to the community.” What is needed is to educate people and to build some form of moderation for trolls, such as a small barrier to entry.

Maintainers should also be given expanded visibility into how their software is used; currently they only have access to download statistics. Projects also need a proper governance model that is regularly updated to reflect the decisions the team makes, delegates, and communicates.

As Adam Jacob, founder of SFOSC (Sustainable Free and Open Source Communities), points out, “I believe we need to start talking about Open Source, not in terms of licensing models, or business models (though those things matter): instead, we should be talking about whether or not we are building sustainable communities. What brings us together, as people, in this common effort around the software? What rights do we hold true for each other? What rights are we willing to trade in order to see more of the software in the world, through the investment of capital?” SFOSC was established to discuss the principles that lead to sustainable communities, to develop clear social contracts communities can use, and to educate open source companies on which business models can create true communities.

Like SFOSC, GitHub wants to better understand maintainers' struggles from their own experiences, hence the blog post. Devon wants to support the people behind OSS at GitHub and is inviting them into an open dialogue with the GitHub community to solve the nuanced and unique challenges the OSS community faces. She has created a contact form asking open source contributors and maintainers to join the conversation and share their problems.

Open Source Software: Are maintainers the only ones responsible for software sustainability?
We need to encourage the meta-conversation around open source, says Nadia Eghbal [Interview]
EU to sponsor bug bounty programs for 14 open source projects from January 2019


Working with Entity Client and Entity SQL

Packt
21 Aug 2015
11 min read
In this article by Joydip Kanjilal, author of the book Entity Framework Tutorial - Second Edition explains how Entity Framework contains a powerful client-side query engine that allows you to execute queries against the conceptual model of data, irrespective of the underlying data store in use. This query engine works with a rich functional language called Entity SQL (or E-SQL for short), a derivative of Transact SQL (T-SQL), that enables you to query entities or a collection of entities. (For more resources related to this topic, see here.) An overview of the E-SQL language Entity Framework allows you to write programs against the EDM and also add a level of abstraction on top of the relational model. This isolation of the logical view of data from the Object Model is accomplished by expressing queries in terms of abstractions using an enhanced query language called E-SQL. This language is specially designed to query data from the EDM. E-SQL was designed to address the need for a language that can query data from its conceptual view, rather than its logical view. From T-SQL to E-SQL SQL is the primary language that has been in use for years for querying databases. Remember, SQL is a standard and not owned by any particular database vendor. SQL-92 is a standard, and is the most popular SQL standard currently in use. This standard was released in 1992. The 92 in the name reflects this fact. Different database vendors implemented their own flavors of the SQL-92 standard. The T-SQL language was designed by Microsoft as an SQL Server implementation of the SQL-92 standard. Similar to other SQL languages implemented by different database vendors, the E-SQL language is Entity Framework implementation of the SQL-92 standard that can be used to query data from the EDM. E-SQL is a text-based, provider independent, query language used by Entity Framework to express queries in terms of EDM abstractions and to query data from the conceptual layer of the EDM. One of the major differences between E-SQL and T-SQL is in nested queries. Note that you should always enclose your nested queries in E-SQL using parentheses as seen here: SELECT d, (SELECT DEREF (e) FROM NAVIGATE (d, PayrollEntities.FK_Employee_Department) AS e) AS Employees FROM PayrollEntities.Department AS d; The Select VALUE... statement is used to retrieve singleton values. It is also used to retrieve values that don't have any column names. However, the Select ROW... statement is used to select one or more rows. As an example, if you want a value as a collection from an entity without the column name, you can use the VALUE keyword in the SELECT statement as shown here: SELECT VALUE emp.EmployeeName FROM PayrollEntities.Employee as emp The preceding statement will return the employee names from the Employee entity as a collection of strings. In T-SQL, you can have the ORDER BY clause at the end of the last query when using UNION ALL. SELECT EmployeeID, EmployeeName From Employee UNION ALL SELECT EmployeeID, Basic, Allowances FROM Salary ORDER BY EmployeeID On the contrary, you do not have the ORDER BY clause in the UNION ALL operator in E-SQL. Why E-SQL when I already have LINQ to Entities? LINQ to Entities is a new version of LINQ, well suited for Entity Framework. But why do you need E-SQL when you already have LINQ to Entities available to you? LINQ to Entities queries are verified at the time of compilation. Therefore, it is not at all suited for building and executing dynamic queries. 
On the contrary, E-SQL queries are verified at runtime, so they can be used for building and executing dynamic queries. You now have a new ADO.NET provider in E-SQL, which is a sophisticated query engine that can be used to query your data from the conceptual model. It should be noted, however, that both LINQ and E-SQL queries are converted into canonical command trees that are in turn translated into database-specific query statements based on the underlying database provider in use, as shown in the following diagram: We will now take a quick look at the features of E-SQL before we delve deep into this language. Features of E-SQL These are the features of E-SQL: Provider neutrality: E-SQL is independent of the underlying ADO.NET data provider in use because it works on top of the conceptual model. SQL like: The syntax of E-SQL statements resemble T-SQL. Expressive with support for entities and types: You can write your E-SQL queries in terms of EDM abstractions. Composable and orthogonal: You can use a subquery wherever you have support for an expression of that type. The subqueries are all treated uniformly regardless of where they have been used. In the sections that follow, we will take a look at the E-SQL language in depth. We will discuss the following points: Operators Expressions Identifiers Variables Parameters Canonical functions Operators in E-SQL An operator is one that operates on a particular operand to perform an operation. Operators in E-SQL can broadly be classified into the following categories: Arithmetic operators: These are used to perform arithmetic operations. Comparison operators: You can use these to compare the values of two operands. Logical operators: These are used to perform logical operations. Reference operators: These act as logical pointers to a particular entity belonging to a particular entity set. Type operators: These can operate on the type of an expression. Case operators: These operate on a set of Boolean expressions. Set operators: These operate on set operations. Arithmetic operators Here is an example of an arithmetic operator: SELECT VALUE s FROM PayrollEntities.Salary AS s where s.Basic = 5000 + 1000 The following arithmetic operators are available in E-SQL: + (add) - (subtract) / (divide) % (modulo) * (multiply) Comparison operators Here is an example of a comparison operator: SELECT VALUE e FROM PayrollEntities.Employee AS e where e.EmployeeID = 1 The following is a list of the comparison operators available in E-SQL: = (equals) != (not equal to) <> (not equal to) > (greater than) < (less than) >= (greater than or equal to) <= (less than or equal to) Logical operators Here is an example of using logical operators in E-SQL: SELECT VALUE s FROM PayrollEntities.Salary AS s where s.Basic > 5000 && s.Allowances > 3000 This is a list of the logical operators available in E-SQL: && (And) ! 
(Not) || (Or) Reference operators The following is an example of how you can use a reference operator in E-SQL: SELECT VALUE REF(e).FirstName FROM PayrollEntities.Employee as e The following is a list of the reference operators available in E-SQL: Key Ref CreateRef DeRef Type operators Here is an example of a type operator that returns a collection of employees from a collection of persons: SELECT VALUE e FROM OFTYPE(PayrollEntities.Person, PayrollEntities.Employee) AS e The following is a list of the type operators available in E-SQL: OfType Cast Is [Not] Of Treat Set operators This is an example of how you can use a set operator in E-SQL: (Select VALUE e from PayrollEntities.Employee as e where e.FirstName Like 'J%') Union All ( select VALUE s from PayrollEntities.Employee as s where s.DepartmentID = 1) Here is a list of the set operators available in E-SQL: Set Union Element AnyElement Except [Not] Exists [Not] In Overlaps Intersect Operator precedence When you have multiple operators operating in a sequence, the order in which the operators will be executed will be determined by the operator precedence. The following table shows the operator, operator type, and their precedence levels in E-SQL language: Operators Operator type Precedence level . , [] () Primary Level 1 ! not Unary Level 2 * / % Multiplicative Level 3 + and - Additive Level 4 < > <= >= Relational Level 5 = != <> Equality Level 6 && Conditional And Level 7 || Conditional Or Level 8 Expressions in E-SQL Expressions are the building blocks of the E-SQL language. Here are some examples of how expressions are represented: 1; //This represents one scalar item {2}; //This represents a collection of one element {3, 4, 5} //This represents a collection of multiple elements Query expressions in E-SQL Query expressions are used in conjunction with query operators to perform a certain operation and return a result set. Query expressions in E-SQL are actually a series of clauses that are represented using one or more of the following: SELECT: This clause is used to specify or limit the number of elements that are returned when a query is executed in E-SQL. FROM: This clause is used to specify the source or collection for retrieval of the elements in a query. WHERE: This clause is used to specify a particular expression. HAVING: This clause is used to specify a filter condition for retrieval of the result set. GROUP BY: This clause is used to group the elements returned by a query. ORDER BY: This clause is used to order the elements returned in either ascending or descending order. Here is the complete syntax of query expressions in E-SQL: SELECT VALUE [ ALL | DISTINCT ] FROM expression [ ,...n ] as C [ WHERE expression ] [ GROUP BY expression [ ,...n ] ] [ HAVING search_condition ] [ ORDER BY expression] And here is an example of a typical E-SQL query with all clause types being used: SELECT emp.FirstName FROM PayrollEntities.Employee emp, PayrollEntities.Department dept Group By dept.DepartmentName Where emp.DepartmentID = dept.DepartmentID Having emp.EmployeeID > 5 Identifiers, variables, parameters, and types in E-SQL Identifiers in E-SQL are of the following two types: Simple identifiers Quoted identifiers Simple identifiers are a sequence of alphanumeric or underscore characters. Note that an identifier should always begin with an alphabetical character. 
As an example, the following are valid identifiers:

a12_ab
M_09cd
W0001m

However, the following are invalid identifiers:

9abcd
_xyz
0_pqr

Quoted identifiers are those that are enclosed within square brackets ([]). Here are some examples of quoted identifiers:

SELECT emp.EmployeeName AS [Employee Name] FROM Employee as emp
SELECT dept.DepartmentName AS [Department Name] FROM Department as dept

Quoted identifiers cannot contain newline, tab, backspace, or carriage return characters. In E-SQL, a variable is a reference to a named expression. Note that the naming conventions for variables follow the same rules as for identifiers. In other words, a valid variable reference to a named expression in E-SQL should be a valid identifier too. Here is an example:

SELECT emp FROM Employee as emp;

In the preceding example, emp is a variable reference. Types in E-SQL fall into three broad groups: primitive types such as integers and strings, nominal types such as entity types, entity sets, and relationships, and transient types such as rows, collections, and references. The E-SQL language supports the following transient type categories: rows, collections, and references.

Row
A row, which is also known as a tuple, has no identity or behavior and cannot be inherited. The following statement returns one row that contains two elements:

ROW (1, 'Joydip');

Collections
Collections represent zero or more instances of other types. You can use SET() to retrieve unique values from a collection of values. Here is an example:

SET({1,1,2,2,3,3,4,4,5,5,6,6})

The preceding example will return the unique values from the set, specifically 1, 2, 3, 4, 5, and 6. This is equivalent to the following statement:

Select Value Distinct x from {1,1,2,2,3,3,4,4,5,5,6,6} As x;

You can create collections using MULTISET() or even using {}, as shown in the following examples:

MULTISET (1, 2, 3, 4, 5, 6)

The following represents the same as the preceding example:

{1, 2, 3, 4, 5, 6}

Here is how you can return a collection of 10 identical rows, each with two elements in them:

SELECT ROW(1,'Joydip') from {1,2,3,4,5,6,7,8,9,10}

To return a collection of all rows from the employee set, you can use the following:

Select emp from PayrollEntities.Employee as emp;

Similarly, to select all rows from the department set, you use the following:

Select dept from PayrollEntities.Department as dept;

Reference
A reference denotes a logical pointer to a particular entity. In essence, it is a foreign key to a specific entity set. Operators are used to perform operations on one or more operands. In E-SQL, the following operators are available to construct, deconstruct, and navigate through references:

KEY
REF
CREATEREF
DEREF

To create a reference to an instance of Employee, you can use REF() as shown here:

SELECT REF (emp) FROM PayrollEntities.Employee as emp

Once you have created a reference to an entity using REF(), you can also dereference the entity using DEREF() as shown:

DEREF (CREATEREF(PayrollEntities.Employee, ROW(@EmployeeID)))

Summary
In this article, we explored E-SQL and how it can be used with the Entity Client provider to perform CRUD operations in our applications. We discussed the differences between E-SQL and T-SQL, and the differences between E-SQL and LINQ. We also discussed when one should choose E-SQL instead of LINQ to query data in applications.
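As a practical complement to the discussion above: the extract mentions running E-SQL through the Entity Client provider but never shows that wiring, so here is a minimal, hypothetical sketch of how such a query could be built and parameterized at runtime (the dynamic-query scenario where E-SQL has an edge over LINQ to Entities). It assumes a connection string named PayrollEntities, matching the model used in the examples, and the System.Data.EntityClient namespace of earlier Entity Framework versions; in EF6 the same types live in System.Data.Entity.Core.EntityClient.

// A hypothetical sketch (not part of the original article) of executing a
// runtime-built E-SQL query through the EntityClient provider.
using System;
using System.Data;
using System.Data.EntityClient;

class ESqlDemo
{
    static void Main()
    {
        using (var connection = new EntityConnection("name=PayrollEntities"))
        {
            connection.Open();

            // The E-SQL text is an ordinary string, so it can be composed at
            // runtime -- the dynamic-query scenario discussed earlier.
            string esql = "SELECT VALUE emp.EmployeeName " +
                          "FROM PayrollEntities.Employee AS emp " +
                          "WHERE emp.DepartmentID = @deptId";

            using (var command = new EntityCommand(esql, connection))
            {
                command.Parameters.Add(
                    new EntityParameter("deptId", DbType.Int32) { Value = 1 });

                // EntityCommand readers must be opened with SequentialAccess.
                using (var reader =
                    command.ExecuteReader(CommandBehavior.SequentialAccess))
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}

If you prefer to stay at the Object Services layer, the same E-SQL string could likely also be issued through ObjectContext.CreateQuery<string>(esql), at the cost of going through object materialization.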

It's All About Data

Packt
14 Sep 2016
12 min read
In this article by Samuli Thomasson, the author of the book Haskell High Performance Programming, we will learn how to choose and design optimal data structures in applications. You will be able to drop the level of abstraction in slow parts of code, all the way to mutable data structures if necessary. (For more resources related to this topic, see here.)

Annotating strictness and unpacking datatype fields

We used the BangPatterns extension to make function arguments strict:

{-# LANGUAGE BangPatterns #-}
f !s (x:xs) = f (s + 1) xs
f !s _ = s

Using bangs for annotating strictness in fact predates the BangPatterns extension (and the older compiler flag -fbang-patterns in GHC 6.x). With just plain Haskell98, we are allowed to use bangs to make datatype fields strict:

> data T = T !Int

A bang in front of a field ensures that whenever the outer constructor (T) is in WHNF, the inner field is in WHNF as well. We can check this:

> T undefined `seq` ()
*** Exception: Prelude.undefined

There are no restrictions on which fields can be strict, be they recursive or polymorphic fields, although it rarely makes sense to make recursive fields strict. Consider the fully strict linked list:

data List a = List !a !(List a) | ListEnd

With this much strictness, you cannot represent parts of infinite lists without always requiring infinite space. Moreover, before accessing the head of a finite strict list you must evaluate the list all the way to the last element. Strict lists don't have the streaming property of lazy lists.

By default, all data constructor fields are pointers to other data constructors or primitives, regardless of their strictness. This applies to basic data types such as Int, Double, Char, and so on, which are not primitive in Haskell. They are data constructors over their primitive counterparts Int#, Double#, and Char#:

> :info Int
data Int = GHC.Types.I# GHC.Prim.Int#

There is a performance overhead, the size of a pointer dereference, between types such as Int and Int#, but an Int can represent lazy values (called thunks), whereas primitives cannot. Without thunks, we couldn't have lazy evaluation. Luckily, GHC is intelligent enough to unroll wrapper types as primitives in many situations, completely eliminating indirect references. The hash suffix is specific to GHC and always denotes a primitive type. The GHC modules do expose the primitive interface. Programming with primitives, you can further micro-optimize code and get C-like performance. However, several limitations and drawbacks apply.

Using anonymous tuples

Tuples may seem harmless at first; they just lump a bunch of values together. But note that the fields in a tuple aren't strict, so a two-tuple corresponds to the slowest PairP data type from our previous benchmark. If you need a strict Tuple type, you need to define one yourself. This is also one more reason to prefer custom types over nameless tuples in many situations. These two structurally similar tuple types have widely different performance semantics:

data Tuple = Tuple {-# UNPACK #-} !Int {-# UNPACK #-} !Int
data Tuple2 = Tuple2 {-# UNPACK #-} !(Int, Int)

If you really want unboxed anonymous tuples, you can enable the UnboxedTuples extension and write things with types like (# Int#, Char# #). But note that a number of restrictions apply to unboxed tuples, as to all primitives. The most important restriction is that unboxed types may not occur where polymorphic types or values are expected, because polymorphic values are always considered as pointers.
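To make the difference concrete, here is a small, self-contained illustration (it is not taken from the book's benchmark code) of why a strict, unpacked pair tends to beat an anonymous tuple as a fold accumulator: with a plain tuple, foldl' only forces the pair itself to WHNF, so thunks pile up in its fields, while the bang-annotated fields of the custom type are forced at every step.

-- Illustrative only: strict unpacked pair vs. lazy tuple as an accumulator.
import Data.List (foldl')

data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Int

-- Lazy tuple accumulator: the sums build up as thunks, because foldl'
-- forces only the pair constructor, not its fields.
sumCountLazy :: [Int] -> (Int, Int)
sumCountLazy = foldl' step (0, 0)
  where step (s, n) x = (s + x, n + 1)

-- Strict accumulator: constructing P forces both fields on every step,
-- so no thunks accumulate.
sumCountStrict :: [Int] -> (Int, Int)
sumCountStrict xs = case foldl' step (P 0 0) xs of
    P s n -> (s, n)
  where step (P s n) x = P (s + x) (n + 1)

main :: IO ()
main = print (sumCountStrict [1..1000000])

Compiled with -O2, GHC typically unboxes the fields of P entirely, which is exactly the effect the UNPACK pragmas above are asking for.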
Representing bit arrays

One way to define a bit array in Haskell that still retains the convenience of Bool is:

import Data.Array.Unboxed
type BitArray = UArray Int Bool

This representation packs 8 bits per byte, so it's space efficient. See the following section on arrays in general to learn about time efficiency – for now we only note that BitArray is an immutable data structure, like BitStruct, and that copying small BitStructs is cheaper than copying BitArrays due to overheads in UArray.

Consider a program that processes a list of integers and tells whether it contains even or odd counts of numbers divisible by 2, 3, and 5. We can implement this with simple recursion and a three-bit accumulator. Here are three alternative representations for the accumulator:

type BitTuple = (Bool, Bool, Bool)
data BitStruct = BitStruct !Bool !Bool !Bool deriving Show
type BitArray = UArray Int Bool

And the program itself is defined along these lines (shown here for the BitStruct accumulator; the other variants differ only in how the three bits are packed and unpacked):

go :: BitStruct -> [Int] -> BitStruct
go acc [] = acc
go (BitStruct two three five) (x:xs) =
    go (BitStruct (test 2 x `xor` two)
                  (test 3 x `xor` three)
                  (test 5 x `xor` five)) xs

test :: Int -> Int -> Bool
test n x = x `mod` n == 0

Here, xor is Data.Bits.xor, which also works on Bool. I've omitted the remaining details; they can be found in the bitstore.hs file. The fastest variant is BitStruct, then comes BitTuple (30% slower), and BitArray is the slowest (130% slower than BitStruct). Although BitArray is the slowest (due to making a copy of the array on every iteration), it would be easy to scale the array in size or make it dynamic. Note also that this benchmark is really on the extreme side; normally, programs do a bunch of other stuff besides updating an array in a tight loop. If you need fast array updates, you can resort to the mutable arrays discussed later on. It might also be tempting to use Data.Vector.Unboxed.Vector Bool from the vector package, due to its nice interface. But beware that this representation uses one byte for every bit stored, wasting 7 bits out of every 8.

Mutable references are slow

Data.IORef and Data.STRef are the smallest bits of mutable state: references to mutable variables, one for IO and the other for ST. There is also a Data.STRef.Lazy module, which provides a wrapper over the strict STRef for lazy ST. However, because IORef and STRef are references, they imply a level of indirection. GHC intentionally does not optimize this away, as that would cause problems in concurrent settings. For this reason, IORef and STRef shouldn't be used like variables in C, for example; performance will for sure be very bad. Let's verify the performance hit by considering the following ST-based sum-of-range implementation:

-- file: sum_mutable.hs
import Control.Monad.ST
import Data.STRef

count_st :: Int -> Int
count_st n = runST $ do
    ref <- newSTRef 0
    let go 0 = readSTRef ref
        go i = modifySTRef' ref (+ i) >> go (i - 1)
    go n

And compare it to this pure recursive implementation:

count_pure :: Int -> Int
count_pure n = go n 0
  where go 0 s = s
        go i s = go (i - 1) $! (s + i)

The ST implementation is many times slower when at least -O is enabled. Without optimizations, the two functions are more or less equivalent in performance; there is a similar amount of indirection from not unboxing arguments in the latter version. This is one example of the wonders that can be done to optimize referentially transparent code.
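For completeness, here is a small sketch (not from the book) of the most idiomatic way to express the same accumulation, a strict left fold. With optimizations, GHC compiles it to essentially the same tight loop as count_pure, with no references and no indirection.

-- file: sum_fold.hs (illustrative only)
import Data.List (foldl')

-- Sum the range 1..n with a strict left fold; no mutable state needed.
count_fold :: Int -> Int
count_fold n = foldl' (+) 0 [1 .. n]

main :: IO ()
main = print (count_fold 10000000)

Thanks to list fusion, the [1 .. n] enumeration is usually not materialized at all, so the fold runs over plain machine integers.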
Bubble sort with vectors

Bubble sort is not an efficient sort algorithm, but because it's an in-place algorithm and simple, we will implement it as a demonstration of mutable vectors:

-- file: bubblesort.hs
import Control.Monad.ST
import Data.Vector as V
import Data.Vector.Mutable as MV
import System.Random (randomIO) -- for testing

The (naive) bubble sort compares values of all adjacent indices in order, and swaps the values if necessary. After reaching the last element, it starts from the beginning or, if no swaps were made, the list is sorted and the algorithm is done:

bubblesortM :: (Ord a, PrimMonad m) => MVector (PrimState m) a -> m ()
bubblesortM v = loop
  where
    indices = V.fromList [1 .. MV.length v - 1]
    loop = do
        swapped <- V.foldM' f False indices            -- (1)
        if swapped then loop else return ()            -- (2)
    f swapped i = do                                   -- (3)
        a <- MV.read v (i-1)
        b <- MV.read v i
        if a > b
            then MV.swap v (i-1) i >> return True
            else return swapped

At (1), we fold monadically over all but the last index, keeping state about whether or not we have performed a swap in this iteration. If we had, at (2) we rerun the fold or, if not, we can return. At (3) we compare an index and possibly swap values. We can write a pure function that wraps the stateful algorithm:

bubblesort :: Ord a => Vector a -> Vector a
bubblesort v = runST $ do
    mv <- V.thaw v
    bubblesortM mv
    V.freeze mv

V.thaw and V.freeze (both O(n)) can be used to go back and forth between mutable and immutable vectors. Now, there are multiple code optimization opportunities in our implementation of bubble sort. But before tackling those, let's see how well our straightforward implementation fares using the following main:

main = do
    v <- V.generateM 10000 $ \_ -> randomIO :: IO Double
    let v_sorted = bubblesort v
        median = v_sorted ! 5000
    print median

We should remember to compile with -O2. On my machine, this program takes about 1.55s, and the Runtime System reports 99.9% productivity, 18.7 megabytes of allocated heap, and 570 kilobytes copied during GC. So now, with a baseline, let's see if we can squeeze out more performance from vectors. This is a non-exhaustive list:

Use unboxed vectors instead. This restricts the types of elements we can store, but it saves us a level of indirection. Down to 960ms and approximately halved GC traffic.

Large lists are inefficient, and they don't compose with vector stream fusion. We should change indices so that it uses V.enumFromTo instead (alternatively, turn on the OverloadedLists extension and drop V.fromList). Down to 360ms and 94% less GC traffic.

The conversion functions V.thaw and V.freeze are O(n), that is, they modify copies. Using the in-place V.unsafeThaw and V.unsafeFreeze instead is sometimes useful. V.unsafeFreeze in the bubblesort wrapper is completely safe, but V.unsafeThaw is not. In our example, however, with -O2, the program is optimized into a single loop and all those conversions get eliminated.

The vector operations (V.read, V.swap) in bubblesortM are guaranteed to never be out of bounds, so it's perfectly safe to replace them with unsafe variants (V.unsafeRead, V.unsafeSwap) that don't check bounds. Speed-up of about 25 milliseconds, or 5%.

To summarize, by applying good practices and safe usage of unsafe functions, our bubble sort just got 80% faster. These optimizations are applied in the bubblesort-optimized.hs file (omitted here; a sketch of the resulting loop follows this section). We noticed that almost all GC traffic came from a linked list, which was constructed and immediately consumed. Lists are bad for performance in that they don't fuse like vectors.
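Since bubblesort-optimized.hs is omitted from this extract, the following is a sketch of what the loop might look like with the optimizations listed above applied: unboxed vectors, V.enumFromTo for the indices, and the unchecked read/swap variants. It is an illustration under those assumptions rather than the author's exact file.

-- Sketch of the optimized loop (not the book's bubblesort-optimized.hs).
import Control.Monad.ST
import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

bubblesortM :: (Ord a, V.Unbox a, PrimMonad m)
            => MV.MVector (PrimState m) a -> m ()
bubblesortM v = loop
  where
    indices = V.enumFromTo 1 (MV.length v - 1)   -- fuses; no intermediate list
    loop = do
        swapped <- V.foldM' f False indices
        if swapped then loop else return ()
    f swapped i = do
        a <- MV.unsafeRead v (i - 1)             -- indices are known to be in bounds
        b <- MV.unsafeRead v i
        if a > b
            then MV.unsafeSwap v (i - 1) i >> return True
            else return swapped

bubblesort :: (Ord a, V.Unbox a) => V.Vector a -> V.Vector a
bubblesort v = runST $ do
    mv <- V.thaw v
    bubblesortM mv
    V.unsafeFreeze mv   -- safe here: mv is not used after freezing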
To ensure good vector performance, make sure the fusion framework can work effectively: anything that can be done with a vector should be done with a vector. As a final note, when working with vectors (and other libraries), it's a good idea to keep the Haddock documentation handy. There are several big and small performance choices to be made, and often the difference is that between O(n) and O(1).

Speedup via continuation-passing style

Implementing monads in continuation-passing style (CPS) can have very good results. Unfortunately, no widely-used or supported library I'm aware of provides drop-in replacements for the ubiquitous Maybe, List, Reader, Writer, and State monads. However, it's not that hard to implement the standard monads in CPS from scratch. For example, the State monad can be implemented using the Cont monad from mtl as follows:

-- file: cont-state-writer.hs
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FlexibleContexts #-}

import Control.Monad.State.Strict
import Control.Monad.Cont

newtype StateCPS s r a = StateCPS (Cont (s -> r) a)
    deriving (Functor, Applicative, Monad, MonadCont)

instance MonadState s (StateCPS s r) where
    get = StateCPS $ cont $ \next curState -> next curState curState
    put newState = StateCPS $ cont $ \next curState -> next () newState

runStateCPS :: StateCPS s s () -> s -> s
runStateCPS (StateCPS m) = runCont m (\_ -> id)

In case you're not familiar with the continuation-passing style and the Cont monad, the details might not make much sense: instead of just returning results from a function, a function in CPS applies its results to a continuation. So in short, to "get" the state in continuation-passing style, we pass the current state to the "next" continuation (first argument) and don't change the state (second argument). To "put", we call the continuation with unit (no return value) and change the state to the new state (second argument to next). StateCPS is used just like the State monad:

action :: MonadState Int m => m ()
action = replicateM_ 1000000 $ do
    i <- get
    put $! i + 1

main = do
    print $ (runStateCPS action 0 :: Int)
    print $ (snd $ runState action 0 :: Int)

That action operation is, in the CPS version of the state monad, about 5% faster and performs 30% less heap allocation than the state monad from mtl. This program is limited pretty much only by the speed of monadic composition, so these numbers are at least very close to the maximum speedup we can get from CPSing the state monad. Speedups of the writer monad are probably near these results. Other standard monads can be implemented similarly to StateCPS. The definitions can also be generalized to monad transformers over an arbitrary monad (a la ContT). For extra speed, you might wish to combine many monads in a single CPS monad, similarly to what RWST does.

Summary

We witnessed the performance of the bytestring, text, and vector libraries, all of which get their speed from fusion optimizations, in contrast to linked lists, which have a huge overhead despite also being subject to fusion to some degree. However, linked lists give rise to simple difference lists and zippers. The builder patterns for lists, bytestring, and text were introduced. We discovered that the array package is low-level and clumsy compared to the superior vector package, unless you must support Haskell 98. We also saw how to implement bubble sort using vectors and how to speed up code via continuation-passing style.

Image classification and feature extraction from images

Packt
25 Oct 2013
3 min read
(For more resources related to this topic, see here.) Classifying images Automated Remote Sensing ( ARS ) is rarely ever done in the visible spectrum. The most commonly available wavelengths outside of the visible spectrum are infrared and near-infrared. The following scene is a thermal image (band 10) from a fairly recent Landsat 8 flyover of the US Gulf Coast from New Orleans, Louisiana to Mobile, Alabama. Major natural features in the image are labeled so you can orient yourself: Because every pixel in that image has a reflectance value, it is information. Python can "see" those values and pick out features the same way we intuitively do by grouping related pixel values. We can colorize pixels based on their relation to each other to simplify the image and view related features. This technique is called classification. Classifying can range from fairly simple groupings based only on some value distribution algorithm derived from the histogram to complex methods involving training data sets and even computer learning and artificial intelligence. The simplest forms are called unsupervised classifications, whereas methods involving some sort of training data to guide the computer are called supervised. It should be noted that classification techniques are used across many fields, from medical doctors trying to spot cancerous cells in a patient's body scan, to casinos using facial-recognition software on security videos to automatically spot known con-artists at blackjack tables. To introduce remote sensing classification we'll just use the histogram to group pixels with similar colors and intensities and see what we get. First you'll need to download the Landsat 8 scene here: http://geospatialpython.googlecode.com/files/thermal.zip Instead of our histogram() function from previous examples, we'll use the version included with NumPy that allows you to easily specify a number of bins and returns two arrays with the frequency as well as the ranges of the bin values. We'll use the second array with the ranges as our class definitions for the image. The lut or look-up table is an arbitrary color palette used to assign colors to classes. You can use any colors you want. import gdalnumeric # Input file name (thermal image) src = "thermal.tif" # Output file name tgt = "classified.jpg" # Load the image into numpy using gdal srcArr = gdalnumeric.LoadFile(src) # Split the histogram into 20 bins as our classes classes = gdalnumeric.numpy.histogram(srcArr, bins=20)[1] # Color look-up table (LUT) - must be len(classes)+1. # Specified as R,G,B tuples lut = [[255,0,0],[191,48,48],[166,0,0],[255,64,64], [255,115,115],[255,116,0],[191,113,48],[255,178,115], [0,153,153],[29,115,115],[0,99,99],[166,75,0], [0,204,0],[51,204,204],[255,150,64],[92,204,204],[38,153,38], [0,133,0],[57,230,57],[103,230,103],[184,138,0]] # Starting value for classification start = 1 # Set up the RGB color JPEG output image rgb = gdalnumeric.numpy.zeros((3, srcArr.shape[0], srcArr.shape[1],), gdalnumeric.numpy.float32) # Process all classes and assign colors for i in range(len(classes)): mask = gdalnumeric.numpy.logical_and(start <= srcArr, srcArr <= classes[i]) for j in range(len(lut[i])): rgb[j] = gdalnumeric.numpy.choose(mask, (rgb[j], lut[i][j])) start = classes[i]+1 # Save the image gdalnumeric.SaveArray(rgb.astype(gdalnumeric.numpy.uint8), tgt, format="JPEG") The following image is our classification output, which we just saved as a JPEG. 
We didn't specify the prototype argument when saving as an image, so it has no georeferencing information. This result isn't bad for a very simple unsupervised classification. The islands and coastal flats show up as different shades of green. The clouds were isolated as shades of orange and dark blues. We did have some confusion inland where the land features were colored the same as the Gulf of Mexico. We could further refine this process by defining the class ranges manually instead of just using the histogram.
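As a concrete follow-up to that last suggestion, here is a sketch of the same script with hand-picked class breaks instead of histogram-derived ones. The break values and colors below are invented for illustration (you would choose them by inspecting the thermal band's actual value range); everything else mirrors the code shown above.

# Sketch of the manual refinement suggested above; the class breaks and LUT
# colors are placeholders chosen for illustration only.
import gdalnumeric

src = "thermal.tif"
tgt = "classified_manual.jpg"

# Load the thermal band into a numpy array via gdal
srcArr = gdalnumeric.LoadFile(src)

# Hand-picked upper bounds for each class, replacing numpy.histogram() bins
classes = [10000, 12000, 14000, 15500, 16500, 18000, 65535]

# One R,G,B color per class
lut = [[0, 0, 128], [0, 128, 255], [0, 204, 0],
       [255, 255, 0], [255, 128, 0], [255, 0, 0], [255, 255, 255]]

start = 1
rgb = gdalnumeric.numpy.zeros((3, srcArr.shape[0], srcArr.shape[1]),
                              gdalnumeric.numpy.float32)

# Assign each pixel to the first class whose upper bound it does not exceed
for i in range(len(classes)):
    mask = gdalnumeric.numpy.logical_and(start <= srcArr, srcArr <= classes[i])
    for j in range(len(lut[i])):
        rgb[j] = gdalnumeric.numpy.choose(mask, (rgb[j], lut[i][j]))
    start = classes[i] + 1

gdalnumeric.SaveArray(rgb.astype(gdalnumeric.numpy.uint8), tgt, format="JPEG")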

ASP.Net Site Performance: Improving JavaScript Loading

Packt
28 Oct 2010
11 min read
  ASP.NET Site Performance Secrets Simple and proven techniques to quickly speed up your ASP.NET website Speed up your ASP.NET website by identifying performance bottlenecks that hold back your site's performance and fixing them Tips and tricks for writing faster code and pinpointing those areas in the code that matter most, thus saving time and energy Drastically reduce page load times Configure and improve compression – the single most important way to improve your site's performance Written in a simple problem-solving manner – with a practical hands-on approach and just the right amount of theory you need to make sense of it all           Read more about this book       One approach to improving page performance is to shift functionality from the server to the browser. Instead of calculating a result or validating a form in C# on the server, you use JavaScript code on the browser. A drawback of this approach is that it involves physically moving code from the server to the browser. Because JavaScript is not compiled, it can be quite bulky. This can affect page load times, especially if you use large JavaScript libraries. You're effectively trading off increased page load times against faster response times after the page has loaded. In this article by Matt Perdeck, author of ASP.NET Site Performance Secret, you'll see how to reduce the impact on page load times by the need to load JavaScript files. It shows: How JavaScript files can block rendering of the page while they are being loaded and executed How to load JavaScript in parallel with other resources How to load JavaScript more quickly (For more resources on ASP.Net, see here.) Problem: JavaScript loading blocks page rendering JavaScript files are static files, just as images and CSS files. However, unlike images, when a JavaScript file is loaded or executed using a <script> tag, rendering of the page is suspended. This makes sense, because the page may contain script blocks after the <script> tag that are dependent on the JavaScript file. If loading of a JavaScript file didn't block page rendering, the other blocks could be executed before the file had loaded, leading to JavaScript errors. Confirming with a test site You can confirm that loading a JavaScript file blocks rendering of the page by running the website in the folder JavaScriptBlocksRendering in the downloaded code bundle. This site consists of a single page that loads a single script, script1.js. It also has a single image, chemistry.png, and a stylesheet style1.css. It uses an HTTP module that suspends the working thread for five seconds when a JavaScript file is loaded. Images and CSS files are delayed by about two seconds. When you load the page, you'll see that the page content appears after only about five seconds. Then after two seconds, the image appears, unless you use Firefox, which often loads images in parallel with the JavaScript. If you make a Waterfall chart, you can see how the image and stylesheet are loaded after the JavaScript file, instead of in parallel: To get the delays, run the test site on IIS 7 in integrated pipeline mode. Do not use the Cassini web server built into Visual Studio. If you find that there is no delay, clear the browser cache. If that doesn't work either, the files may be in kernel cache on the server—remove them by restarting IIS using Internet Information Services (IIS) Manager. 
To open IIS manager, click on Start | Control Panel, type "admin" in the search box, click on Administrative Tools, and then double-click on Internet Information Services (IIS) Manager. Integrated/Classic Pipeline Mode As in IIS 6, every website runs as part of an application pool in IIS 7. Each IIS 7 application pool can be switched between Integrated Pipeline Mode (the default) and Classic Pipeline Mode. In Integrated Pipeline Mode, the ASP.NET runtime is integrated with the core web server, so that the server can be managed for example, via web.config elements. In Classic Pipeline Mode, IIS 7 functions more like IIS 6, where ASP.NET runs within an ISAPI extension. Approaches to reduce the impact on load times Although it makes sense to suspend rendering the page while a <script> tag loads or executes JavaScript, it would still be good to minimize the time visitors have to wait for the page to appear, especially if there is a lot of JavaScript to load. Here are a few ways to do that: Start loading JavaScript after other components have started loading, such as images and CSS files. That way, the other components load in parallel with the JavaScript instead of after the JavaScript, and so are available sooner when page rendering resumes. Load JavaScript more quickly. Page rendering is still blocked, but for less time. Load JavaScript on demand. Only load the JavaScript upfront that you need to render the page. Load the JavaScript that handles button clicks, and so on, when you need it. Use specific techniques to prevent JavaScript loading from blocking rendering. This includes loading the JavaScript after the page has rendered, or in parallel with page rendering. These approaches can be combined or used on their own for the best tradeoff between development time and performance. Let's go through each approach. Approach: Start loading after other components This approach aims to render the page sooner by loading CSS stylesheets and images in parallel with the JavaScript rather than after the JavaScript. That way, when the JavaScript has finished loading, the CSS and images will have finished loading too and will be ready to use; or at least it will take less time for them to finish loading after the JavaScript has loaded. To load the CSS stylesheets and images in parallel with the JavaScript, you would start loading them before you start loading the JavaScript. In the case of CSS stylesheets that is easy—simply place their <link> tags before the <script> tags: <link rel="Stylesheet" type="text/css" href="css/style1.css" /><script type="text/javascript" src="js/script1.js"></script> Starting the loading of images is slightly trickier because images are normally loaded when the page body is evaluated, not as a part of the page head. In the test page you just saw with the image chemistry.png, you can use a bit of simple JavaScript to get the browser to start loading the image before it starts loading the JavaScript file. This is referred to as "image preloading" (page PreLoadWithJavaScript.aspx in the folder PreLoadImages in the downloaded code bundle): <script type="text/javascript"> var img1 = new Image(); img1.src = "images/chemistry.png";</script><link rel="Stylesheet" type="text/css" href="css/style1.css" /><script type="text/javascript" src="js/script1.js"></script> Run the page now and you'll get the following Waterfall chart: When the page is rendered after the JavaScript has loaded, the image and CSS files have already been loaded; so the image shows up right away. 
A second option is to use invisible image tags at the start of the page body that preload the images. You can make the image tags invisible by using the style display:none. You would have to move the <script> tags from the page head to the page body, after the invisible image tags, as shown (page PreLoadWithCss.aspx in folder PreLoadImages in the downloaded code bundle):

<body>
    <div style="display:none">
        <img src="images/chemistry.png" />
    </div>
    <script type="text/javascript" src="js/script1.js"></script>

Although the examples we've seen so far preload only one image, chemistry.png, you could easily preload multiple images. When you do, it makes sense to preload the most important images first, so that they are most likely to appear right away when the page renders. The browser loads components, such as images, in the order in which they appear in the HTML, so you'd wind up with something similar to the following code:

<script type="text/javascript">
    var img1 = new Image(); img1.src = "images/important.png";
    var img2 = new Image(); img2.src = "images/notsoimportant.png";
    var img3 = new Image(); img3.src = "images/unimportant.png";
</script>

Approach: Loading JavaScript more quickly

The second approach is to simply spend less time loading the same JavaScript, so that visitors spend less time waiting for the page to render. There are a number of ways to achieve just that:

Techniques used with images, such as caching and parallel download
Free Content Delivery Networks
GZIP compression
Minification
Combining or breaking up JavaScript files
Removing unused code

Techniques used with images

JavaScript files are static files, just like images and CSS files. This means that many techniques that apply to images apply to JavaScript files as well, including the use of cookie-free domains, caching, and boosting parallel loading.

Free Content Delivery Networks

Serving static files from a Content Delivery Network (CDN) can greatly reduce download times by serving the files from a server that is close to the visitor. A CDN also saves you bandwidth, because the files are no longer served from your own server. A number of companies now serve popular JavaScript libraries from their CDNs for free. Here are their details:

Google AJAX Libraries API (http://code.google.com/apis/ajaxlibs/): serves a wide range of libraries, including jQuery, jQuery UI, Prototype, Dojo, and the Yahoo! User Interface Library (YUI)
Microsoft Ajax Content Delivery Network (http://www.asp.net/ajaxlibrary/cdn.ashx): serves libraries used by the ASP.NET and ASP.NET MVC frameworks, including the jQuery library and the jQuery Validation plugin
jQuery CDN (http://docs.jquery.com/Downloading_jQuery): serves the jQuery library

In ASP.NET 4.0 and later, you can get the ScriptManager control to load the ASP.NET AJAX script files from the Microsoft AJAX CDN instead of your web server, by setting the EnableCdn property to true:

<asp:ScriptManager ID="ScriptManager1" EnableCdn="true" runat="server" />

One issue with loading libraries from a CDN is that it creates another point of failure: if the CDN goes down, your site is crippled.

GZIP compression

IIS has the ability to compress content sent to the browser, including JavaScript and CSS files. Compression can make a dramatic difference to a JavaScript file as it goes over the wire from the server to the browser. Take, for example, the production version of the jQuery library: uncompressed it weighs in at 78 KB, while compressed it is only 26 KB. Compression for static files is enabled by default in IIS 7.
This immediately benefits CSS files. It should also immediately benefit JavaScript files, but it doesn't because of a quirk in the default configuration of IIS 7. Not all static files benefit from compression; for example JPEG, PNG, and GIF files are already inherently compressed because of their format. To cater to this, the IIS 7 configuration file applicationHost.config contains a list of mime types that get compressed when static compression is enabled: <staticTypes> <add mimeType="text/*" enabled="true" /> <add mimeType="message/*" enabled="true" /> <add mimeType="application/javascript" enabled="true" /> <add mimeType="*/*" enabled="false" /></staticTypes> To allow IIS to figure out what mime type a particular file has, applicationHost.config also contains default mappings from file extensions to mime types, including this one: <staticContent lockAttributes="isDocFooterFileName"> ... <mimeMap fileExtension=".js" mimeType="application/x-javascript" /> ...</staticContent> If you look closely, you'll see that the .js extension is mapped by default to a mime type that isn't in the list of mime types to be compressed when static file compression is enabled. The easiest way to solve this is to modify your site's web.config, so that it maps the extension .js to mime type text/javascript. This matches text/* in the list of mime types to be compressed. So, IIS 7 will now compress JavaScript files with the extension .js (folder Minify in the downloaded code bundle): <system.webServer> <staticContent> <remove fileExtension=".js" /> <mimeMap fileExtension=".js" mimeType="text/javascript" /> </staticContent></system.webServer> Keep in mind that IIS 7 only applies static compression to files that are "frequently" requested. This means that the first time you request a file, it won't be compressed! Refresh the page a couple of times and compression will kick in.
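One small addition on the CDN point made earlier: the single point of failure is commonly handled with a local fallback, along the following lines. This snippet is not from the book; the jQuery version and the local path are placeholders.

<!-- Try the CDN first, then fall back to a copy on your own server. -->
<script type="text/javascript"
        src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"></script>
<script type="text/javascript">
    // If the CDN request failed, window.jQuery is still undefined,
    // so load the local copy instead.
    if (typeof jQuery === "undefined") {
        document.write('<script type="text/javascript" src="js/jquery.min.js"><\/script>');
    }
</script>

This way a CDN outage costs you only the failed request; the page still gets its library, just without the CDN's caching and proximity benefits.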

Getting Places

Packt
13 Oct 2015
8 min read
In this article by Nafiul Islam, the author of Mastering Pycharm, we'll learn all about navigation. It is divided into three parts. The first part is called Omni, which deals with getting to anywhere from any place. The second is called Macro, which deals with navigating to places of significance. The third and final part is about moving within a file and it is called Micro. By the end of this article, you should be able to navigate freely and quickly within PyCharm, and use the right tool for the job to do so. Veteran PyCharm users may not find their favorite navigation tool mentioned or explained. This is because the methods of navigation described throughout this article will lead readers to discover their own tools that they prefer over others. (For more resources related to this topic, see here.) Omni In this section, we will discuss the tools that PyCharm provides for a user to go from anywhere to any place. You could be in your project directory one second, the next, you could be inside the Python standard library or a class in your file. These tools are generally slow or at least slower than more precise tools of navigation provided. Back and Forward The Back and Forward actions allow you to move your cursor back to the place where it was previously for more than a few seconds or where you've made edits. This information persists throughout sessions, so even if you exit the IDE, you can still get back to the positions that you were in before you quit. This falls into the Omni category because these two actions could potentially get you from any place within a file to any place within a file in your directory (that you have been to) to even parts of the standard library that you've looked into as well as your third-party Python packages. The Back and Forward actions are perhaps two of my most used navigation actions, and you can use Keymap. Or, one can simply click on the Navigate menu to see the keyboard shortcuts: Macro The difference between Macro and Omni is subtle. Omni allows you to go to the exact location of a place, even a place of no particular significance (say, the third line of a documentation string) in any file. Macro, on the other hand, allows you to navigate anywhere of significance, such as a function definition, class declaration, or particular class method. Go to definition or navigate to declaration Go to definition is the old name for Navigate to Declaration in PyCharm. This action, like the one previously discussed, could lead you anywhere—a class inside your project or a third party library function. What this action does is allow you to go to the source file declaration of a module, package, class, function, and so on. Keymap is once again useful in finding the shortcut for this particular action. Using this action will move your cursor to the file where the class or function is declared, may it be in your project or elsewhere. Just place your cursor on the function or class and invoke the action. Your cursor will now be directly where the function or class was declared. There is, however, a slight problem with this. If one tries to go to the declaration of a .so object, such as the datetime module or the select module, what one will encounter is a stub file (discussed in detail later). These are helper files that allow PyCharm to give you the code completion that it does. Modules that are .so files are indicated by a terminal icon, as shown here: Search Everywhere The action speaks for itself. You search for classes, files, methods, and even actions. 
Universally invoked using double Shift (pressing Shift twice in quick succession), this nifty action looks similar to any other search bar. Search Everywhere searches only inside your project, by default; however, one can also use it to search non-project items as well. Not using this option leads to faster search and a lower memory footprint. Search Everywhere is a gateway to other search actions available in PyCharm. In the preceding screenshot, one can see that Search Everywhere has separate parts, such as Recent Files and Classes. Each of these parts has a shortcut next to their section name. If you find yourself using Search Everywhere for Classes all the time, you might start using the Navigate Class action instead which is much faster. The Switcher tool The Switcher tool allows you to quickly navigate through your currently open tabs, recently opened files as well as all of your panels. This tool is essential since you always navigate between tabs. A star to the left indicates open tabs; everything else is a recently opened or edited file. If you just have one file open, Switcher will show more of your recently opened files. It's really handy this way since almost always the files that you want to go to are options in Switcher. The Project panel The Project panel is what I use to see the structure of my project as well as search for files that I can't find with Switcher. This panel is by far the most used panel of all, and for good reason. The Project panel also supports search; just open it up and start typing to find your file. However, the Project panel can give you even more of an understanding of what your code looks similar to if you have Show Members enabled. Once this is enabled, you can see the classes as well as the declared methods inside your files. Note that search works just like before, meaning that your search is limited to only the files/objects that you can see; if you collapse everything, you won't be able to search either your files or the classes and methods in them. Micro Micro deals with getting places within a file. These tools are perhaps what I end up using the most in my development. The Structure panel The Structure panel gives you a bird's eye view of the file that you are currently have your cursor on. This panel is indispensable when trying to understand a project that one is not familiar with. The yellow arrow indicates the option to show inherited fields and methods. The red arrow indicates the option to show field names, meaning if that it is turned off, you will only see properties and methods. The orange arrow indicates the option to scroll to and from the source. If both are turned on (scroll to and scroll from), where your cursor is will be synchronized with what method, field, or property is highlighted in the structure panel. Inherited fields are grayed out in the display. Ace Jump This is my favorite navigation plugin, and was made by John Lindquist who is a developer at JetBrains (creators of PyCharm). Ace Jump is inspired from the Emacs mode with the same name. It allows you to jump from one place to another within the same file. Before one can use Ace Jump, one has to install the plugin for it. Ace Jump is usually invoked using Ctrl or command + ; (semicolon). You can search for Ace Jump in Keymap as well, and is called Ace Jump. Once invoked, you get a small box in which you can input a letter. Choose a letter from the word that you want to navigate to, and you will see letters on that letter pop up immediately. 
If we were to hit D, the cursor would move to the position indicated by D. This might seem long-winded, but it actually leads to really fast navigation. If we wanted to select the word indicated by the letter, then we'd invoke Ace Jump twice before entering a letter. This turns the Ace Jump box red. Upon hitting B, the named parameter rounding will be selected. Often, we don't want to go to a word, but rather to the beginning or the end of a line. To do this, just invoke Ace Jump and then hit the left arrow for line beginnings or the right arrow for line endings. In this case, we'd just hit V to jump to the beginning of the line that starts with num_type. The second example shows the opposite case, where we hit the right arrow instead of the left one and get line-ending options.

Summary

In this article, I discussed some of the best tools for navigation. This is by no means an exhaustive list. However, these tools will serve as a gateway to the more precise tools available for navigation in PyCharm. I generally use Ace Jump, Back, Forward, and Switcher the most when I write code. The Project panel is always open for me, with the most used files having their classes and methods expanded for quick search.

Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book Python High Performance - Second Edition, we will see that, although Python is a mature and widely used language, there is a large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects--Numba and PyPy--that approach compilation in a slightly different way. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically. (For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly, which can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution. Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.

Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i]**2
    return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

import numba as nb

@nb.jit
def sum_sq(a):
    ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function. To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute.
The timings for the two functions are as follows: import numpy as np x = np.random.rand(10000) # Original %timeit sum_sq.py_func(x) 100 loops, best of 3: 6.11 ms per loop # Numba %timeit sum_sq(x) 100000 loops, best of 3: 11.7 µs per loop You can see how the Numba version is order of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators: %timeit (x**2).sum() 10000 loops, best of 3: 14.8 µs per loop In this case, the Numba compiled function is marginally faster than NumPy vectorized operations. The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum in comparison with the in-place operations performed by our sum_sq function. As we didn't use array-specific methods in sum_sq, we can also try to apply the same function on a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension: x_list = x.tolist() %timeit sum_sq(x_list) 1000 loops, best of 3: 199 µs per loop %timeit sum([x**2 for x in x_list]) 1000 loops, best of 3: 1.28 ms per loop Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic. In the following sections, we will dig deeper and understand how Numba works and evaluate the benefits and limitations of the Numba compiler. Type specializations As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example. Numba exposes the specialized types using the signatures attribute. Right after the sum_sq definition, we can inspect the available specialization by accessing the sum_sq.signatures, as follows: sum_sq.signatures # Output: # [] If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly. If we also apply the function on an array of float32, we can see how a new entry is added to the sum_sq.signatures list: x = np.random.rand(1000).astype('float64') sum_sq(x) sum_sq.signatures # Result: # [(array(float64, 1d, C),)] x = np.random.rand(1000).astype('float32') sum_sq(x) sum_sq.signatures # Result: # [(array(float64, 1d, C),), (array(float32, 1d, C),)] It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the type we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument: @nb.jit((nb.float64[:],)) def sum_sq(a): Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass an array, x, as float32, Numba will raise a TypeError: sum_sq(x.astype('float32')) # TypeError: No matching definition for argument type(s) array(float32, 1d, C) Another way to declare signatures is through type strings. 
For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using a [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows: @nb.jit("float64(float64[:])") def sum_sq(a): You can also pass multiple signatures by passing a list: @nb.jit(["float64(float64[:])", "float64(float32[:])"]) def sum_sq(a): Object mode versus native mode So far, we have shown how Numba behaves when handling a fairly simple function. In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists.The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython. When Numba cannot infer variable types, it will still try and compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the intepreter-free scenario, called native mode. Numba provides a function, called inspect_types, that helps understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function: sum_sq.inspect_types() When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and types associated with them. For example, we can examine the N = len(a) line: # --- LINE 4 --- # a = arg(0, name=a) :: array(float64, 1d, A) # $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>) # $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64 # N = $0.3 :: int64 N = len(a) For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return type of the len function is also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type. Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode. As a counter example, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings, and compiles it as follows: @nb.jit def concatenate(strings): result = '' for s in strings: result += s return result Now, we can invoke this function with a list of strings and inspect the types: concatenate(['hello', 'world']) concatenate.signatures # Output: [(reflected list(str),)] concatenate.inspect_types() Numba will return the output of the function for the reflected list (str) type. We can, for instance, examine how line 3 gets inferred. 
The output of concatenate.inspect_types() is reproduced here:

# --- LINE 3 ---
# strings = arg(0, name=strings) :: pyobject
# $const0.1 = const(str, ) :: pyobject
# result = $const0.1 :: pyobject
# jump 6
# label 6

result = ''

You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than its pure Python counterpart:

x = ['hello'] * 1000

%timeit concatenate.py_func(x)
10000 loops, best of 3: 111 µs per loop

%timeit concatenate(x)
1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaint even though it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for the remaining parts. This compilation strategy is called object mode.

It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on the first invocation:

@nb.jit(nopython=True)
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result

concatenate(x)
# Exception:
# TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and for ensuring that all the code is fast and correctly typed.

Numba and NumPy

Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast ufuncs. We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Universal functions that take multiple arguments also work according to the broadcasting rules; examples of such functions are np.add and np.maximum.

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance it with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations. The Cantor pairing function can be written as follows:

import numpy as np

def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

@np.vectorize
def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

cantor(np.array([1, 2]), 2)
# Result:
# array([ 8, 12])

Except for the convenience, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead.
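The comparison that follows refers to a Numba-compiled cantor ufunc, but its definition is not reproduced in this excerpt. The sketch below shows one plausible way to write it with nb.vectorize; the explicit int64 signature and the sample input arrays are assumptions made purely for illustration:

import numpy as np
import numba as nb

@nb.vectorize(['int64(int64, int64)'])
def cantor(a, b):
    # Integer form of the same pairing formula; (a + b) * (a + b + 1) is always
    # even, so the floor division is exact.
    return (a + b) * (a + b + 1) // 2 + b

# Illustrative inputs; the x1 and x2 arrays used in the benchmark are not shown in the text.
x1 = np.random.randint(0, 1000, size=100000)
x2 = np.random.randint(0, 1000, size=100000)
result = cantor(x1, x2)  # behaves like any other NumPy ufunc, broadcasting included

Because the decorated function is compiled as a true NumPy ufunc, it gains broadcasting for free, along with the optional target keyword discussed next.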
For this reason, ufunc implementations are usually written in C or Cython, but Numba beats both of these methods in convenience. All that is needed to perform the conversion is the equivalent decorator, nb.vectorize. In the following code, we compare the speed of the standard np.vectorize version (called cantor_py), the Numba ufunc version (cantor), and the same computation implemented using standard NumPy operations:

# Pure Python
%timeit cantor_py(x1, x2)
100 loops, best of 3: 6.06 ms per loop

# Numba
%timeit cantor(x1, x2)
100000 loops, best of 3: 15 µs per loop

# NumPy
%timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well here because the function is simple and type inference is possible. An additional advantage of universal functions is that, since they operate on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions through the target keyword argument of the nb.vectorize decorator (for example, target="parallel" for multithreaded CPU execution, or target="cuda" for GPUs).

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication. In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

a = np.random.rand(3, 3)
b = np.random.rand(3, 3)

c = np.matmul(a, b)
c.shape
# Result:
# (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over arrays of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match up the matrices pairwise and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

a = np.random.rand(10, 3, 3)
b = np.random.rand(10, 3, 3)

c = np.matmul(a, b)
c.shape
# Output:
# (10, 3, 3)

The usual rules for broadcasting work in a similar way. For example, if we have an array of (3, 3) matrices with a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix is repeated to obtain a shape of (10, 3, 3):

a = np.random.rand(10, 3, 3)
b = np.random.rand(3, 3)  # Broadcast to shape (10, 3, 3)

c = np.matmul(a, b)
c.shape
# Result:
# (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
The nb.guvectorize decorator requires two arguments:

- The types of the inputs and output: two 1D arrays as input and a scalar as output
- The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n) and output a scalar

In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator:

@nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n),(n)->()')
def euclidean(a, b, out):
    N = a.shape[0]
    out[0] = 0.0
    for i in range(N):
        out[0] += (a[i] - b[i])**2

There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:].

Similarly, the layout string indicates that we have two arrays of size (n) and that the output is a scalar, denoted by empty brackets, (). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written to the out array.

The letter n in the layout string is completely arbitrary; you may choose to use k or any other letter of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings such as (n, m).

Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example:

a = np.random.rand(2)
b = np.random.rand(2)
c = euclidean(a, b)  # Shape: (1,)

a = np.random.rand(10, 2)
b = np.random.rand(10, 2)
c = euclidean(a, b)  # Shape: (10,)

a = np.random.rand(10, 2)
b = np.random.rand(2)
c = euclidean(a, b)  # Shape: (10,)

How does the speed of euclidean compare to standard NumPy? In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function:

a = np.random.rand(10000, 2)
b = np.random.rand(10000, 2)

%timeit ((a - b)**2).sum(axis=1)
1000 loops, best of 3: 288 µs per loop

%timeit euclidean(a, b)
10000 loops, best of 3: 35.6 µs per loop

The Numba version, again, beats the NumPy version by a large margin!

Summary

Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application.

Resources for Article:

Further resources on this subject: Getting Started with Python Packages [article], Python for Driving Hardware [article], Python Data Science Up and Running [article]

An Overview of Tomcat 6 Servlet Container: Part 1

Packt
18 Jan 2010
11 min read
In practice, it is highly unlikely that you will interface an EJB container from WebSphere and a JMS implementation from WebLogic with the Tomcat servlet container from the Apache foundation, but it is at least theoretically possible. Note that the term 'interface', as it is used here, also encompasses abstract classes. The specification's API might provide a template implementation whose operations are defined in terms of some basic set of primitives that are kept abstract for the service provider to implement. A service provider is required to make available concrete implementations of these interfaces and abstract classes. For example, the HttpSession interface is implemented by Tomcat in the form of org.apache.catalina.session.StandardSession.

Let's examine the image of the Tomcat container:

The objective of this article is to cover the primary request processing components that are present in this image. Advanced topics, such as clustering and security, are shown as shaded in this image and are not covered. In this image, the '+' symbol after the Service, Host, Context, and Wrapper instances indicates that there can be one or more of these elements. For instance, a Service may have a single Engine, but an Engine can contain one or more Hosts. In addition, the whirling circle represents a pool of request processor threads. Here, we will fly over the architecture of Tomcat from a 10,000-foot perspective, taking in the sights as we go.

Component taxonomy

Tomcat's architecture follows the construction of a Matrushka doll from Russia. In other words, it is all about containment, where one entity contains another, and that entity in turn contains yet another. In Tomcat, a 'container' is a generic term that refers to any component that can contain another, such as a Server, Service, Engine, Host, or Context.

Of these, the Server and Service components are special containers, designated as Top Level Elements, as they represent aspects of the running Tomcat instance. All the other Tomcat components are subordinate to these top level elements. The Engine, Host, and Context components are officially termed Containers, and refer to components that process incoming requests and generate an appropriate outgoing response.

Nested Components can be thought of as sub-elements that can be nested inside either Top Level Elements or other Containers to configure how they function. Examples of nested components include the Valve, which represents a reusable unit of work; the Pipeline, which represents a chain of Valves strung together; and a Realm, which helps set up container-managed security for a particular container. Other nested components include the Loader, which is used to enforce the specification's guidelines for servlet class loading; the Manager, which supports session management for each web application; the Resources component, which represents the web application's static resources and a mechanism to access these resources; and the Listener, which allows you to insert custom processing at important points in a container's life cycle, such as when a component is being started or stopped. Not all nested components can be nested within every container.

A final major component, which falls into its own category, is the Connector. It represents the connection end point that an external client (such as a web browser) can use to connect to the Tomcat container. Before we go on to examine these components, let's take a quick look at how they are organized structurally.
Note that this diagram only shows the key properties of each container.

When Tomcat is started, the Java Virtual Machine (JVM) instance in which it runs will contain a singleton Server top level element, which represents the entire Tomcat server. A Server will usually contain just one Service object, which is a structural element that combines one or more Connectors (for example, an HTTP and an HTTPS connector) that funnel incoming requests through to a single Catalina servlet Engine. The Engine represents the core request processing code within Tomcat and supports the definition of multiple Virtual Hosts within it. A virtual host allows a single running Tomcat engine to make it seem to the outside world that there are multiple separate domains (for example, www.my-site.com and www.your-site.com) being hosted on a single machine.

Each virtual host can, in turn, support multiple web applications known as Contexts that are deployed to it. A context is represented using the web application format specified by the servlet specification, either as a single compressed WAR (Web Application Archive) file or as an uncompressed directory. In addition, a context is configured using a web.xml file, as defined by the servlet specification. A context can, in turn, contain multiple servlets that are deployed into it, each of which is wrapped in a Wrapper component. The Server, Service, Connector, Engine, Host, and Context elements that will be present in a particular running Tomcat instance are configured using the server.xml configuration file.

Architectural benefits

This architecture has a couple of useful features. It not only makes it easy to manage component life cycles (each component manages the life cycle notifications for its children), but also to dynamically assemble a running Tomcat server instance based on the information that has been read from configuration files at startup. In particular, the server.xml file is parsed at startup, and its contents are used to instantiate and configure the defined elements, which are then assembled into a running Tomcat instance. The server.xml file is read only once, and edits to it will not be picked up until Tomcat is restarted.

This architecture also eases the configuration burden by allowing child containers to inherit the configuration of their parent containers. For instance, a Realm defines a data store that can be used for authentication and authorization of users who are attempting to access protected resources within a web application. For ease of configuration, a realm that is defined for an engine applies to all its child hosts and contexts. At the same time, a particular child, such as a given context, may override its inherited realm by specifying its own realm to be used in place of its parent's realm.

Top Level Components

The Server and Service container components exist largely as structural conveniences. A Server represents the running instance of Tomcat and contains one or more Service children, each of which represents a collection of request processing components.

Server

A Server represents the entire Tomcat instance and is a singleton within a Java Virtual Machine, and it is responsible for managing the life cycle of its contained services. The following image depicts the key aspects of the Server component. As shown, a Server instance is configured using the server.xml configuration file. The root element of this file is <Server> and represents the Tomcat instance.
Its default implementation is provided by org.apache.catalina.core.StandardServer, but you can specify your own custom implementation through the className attribute of the <Server> element.

A key aspect of the Server is that it opens a server socket on port 8005 (the default) to listen for a shutdown command (by default, this command is the text string SHUTDOWN). When this shutdown command is received, the server gracefully shuts itself down. For security reasons, the connection requesting the shutdown must be initiated from the same machine that is running this instance of Tomcat.

A Server also provides an implementation of the Java Naming and Directory Interface (JNDI) service, allowing you to register arbitrary objects (such as data sources) or environment variables by name. At runtime, individual components (such as servlets) can retrieve this information by looking up the desired object name in the server's JNDI bindings. While a JNDI implementation is not integral to the functioning of a servlet container, it is part of the Java EE specification and is a service that servlets have a right to expect from their application servers or servlet containers. Implementing this service makes for easy portability of web applications across containers.

While there is always just one server instance within a JVM, it is entirely possible to have multiple server instances running on a single physical machine, each encased in its own JVM. Doing so insulates web applications that are running on one VM from errors in applications that are running on others, and simplifies maintenance by allowing a JVM to be restarted independently of the others. This is one of the mechanisms used in a shared hosting environment (the other is virtual hosting, which we will see shortly) where you need isolation from other web applications that are running on the same physical server.

Service

While the Server represents the Tomcat instance itself, a Service represents the set of request processing components within Tomcat. A Server can contain more than one Service, where each service associates a group of Connector components with a single Engine. Requests from clients are received on a connector, which in turn funnels them through into the engine, which is the key request processing component within Tomcat. The image shows connectors for HTTP, HTTPS, and the Apache JServ Protocol (AJP).

There is very little reason to modify this element, and the default Service instance is usually sufficient. A hint as to when you might need more than one Service instance can be found in the above image. As shown, a service aggregates connectors, each of which monitors a given IP address and port, and responds in a given protocol. An example use case for having multiple services, therefore, is when you want to partition your services (and their contained engines, hosts, and web applications) by IP address and/or port number. For instance, you might configure your firewall to expose the connectors for one service to an external audience, while restricting your other service to hosting intranet applications that are visible only to internal users. This would ensure that an external user could never access your intranet application, as that access would be blocked by the firewall. The Service, therefore, is nothing more than a grouping construct. It does not currently add any other value to the proceedings.

Connectors

A Connector is a service endpoint on which a client connects to the Tomcat container.
It serves to insulate the engine from the various communication protocols that are used by clients, such as HTTP, HTTPS, or the Apache JServ Protocol (AJP).

Tomcat can be configured to work in two modes: standalone, or in conjunction with a separate web server. In standalone mode, Tomcat is configured with HTTP and HTTPS connectors, which make it act like a full-fledged web server by serving up static content when requested, as well as by delegating to the Catalina engine for dynamic content. Out of the box, Tomcat provides three possible implementations of the HTTP/1.1 and HTTPS connectors for this mode of operation. The most common are the standard connectors, known as Coyote, which are implemented using standard Java I/O mechanisms. You may also make use of a couple of newer implementations: one that uses the non-blocking NIO features of Java 1.4, and another that takes advantage of native code optimized for a particular operating system through the Apache Portable Runtime (APR). Note that both the Connector and the Engine run in the same JVM. In fact, they run within the same Server instance.

In conjunction mode, Tomcat plays a supporting role to a web server, such as Apache httpd or Microsoft's IIS. The client here is the web server, communicating with Tomcat either through an Apache module or an ISAPI DLL. When this module determines that a request must be routed to Tomcat for processing, it will communicate the request to Tomcat using AJP, a binary protocol that is designed to be more efficient than the text-based HTTP when communicating between a web server and Tomcat. On the Tomcat side, an AJP connector accepts this communication and translates it into a form that the Catalina engine can process. In this mode, Tomcat runs in its own JVM as a separate process from the web server.

In either mode, the primary attributes of a Connector are the IP address and port on which it will listen for incoming requests, and the protocol that it supports. Another key attribute is the maximum number of request processing threads that can be created to concurrently handle incoming requests. Once all these threads are busy, any incoming request will have to wait until a thread becomes available. By default, a connector listens on all the IP addresses of the given physical machine (its address attribute defaults to 0.0.0.0). However, a connector can be configured to listen on just one of the IP addresses of a machine. This will constrain it to accept connections from only that specified IP address.

Any request that is received by any one of a service's connectors is passed on to the service's single engine. This engine, known as Catalina, is responsible for processing the request and generating the response. The engine returns the response to the connector, which then transmits it back to the client using the appropriate communication protocol.
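As a closing aside, the shutdown listener described in the Server section above is easy to exercise by hand. The following short Python sketch sends the shutdown command to a locally running Tomcat; port 8005 and the SHUTDOWN string are the defaults quoted earlier and must match whatever your server.xml actually configures (in practice, you would normally use the shutdown scripts that ship with Tomcat):

import socket

# Tomcat's documented defaults; adjust these if server.xml defines a
# different shutdown port or command string.
SHUTDOWN_PORT = 8005
SHUTDOWN_COMMAND = "SHUTDOWN"

# The connection must originate from the same machine that runs Tomcat.
with socket.create_connection(("localhost", SHUTDOWN_PORT), timeout=5) as conn:
    conn.sendall(SHUTDOWN_COMMAND.encode("ascii"))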

Functional Testing with JMeter

Packt
22 Oct 2009
5 min read
JMeter is a 100% pure Java desktop application. JMeter is found to be very useful and convenient in support of functional testing. Although JMeter is known more as a performance testing tool, functional testing elements can be integrated within the Test Plan, which was originally designed to support load testing. Many other load-testing tools provide little or none of this feature, restricting themselves to performance-testing purposes. Besides integrating functional-testing elements along with load-testing elements in the Test Plan, you can also create a Test Plan that runs these exclusively. In other words, aside from creating a Load Test Plan, JMeter also allows you to create a Functional Test Plan. This flexibility is certainly resource-efficient for the testing project. In this article by Emily H. Halili, we will give you a walkthrough on how to create a Test Plan as we incorporate and configure JMeter elements to support functional testing.

Preparing for Functional Testing

JMeter does not have a built-in browser, unlike many functional-test tools. It tests at the protocol layer, not the client layer (that is, JavaScript, applets, and so on), and it does not render the page for viewing. Although embedded resources can be downloaded by default, rendering them in the Listener | View Results Tree may not yield a 100% browser-like rendering. In fact, it may not be able to render large HTML files at all. This makes it difficult to test the GUI of an application under test. However, to compensate for these shortcomings, JMeter allows the tester to create assertions based on the tags and text of the page as the HTML file is received by the client. With some knowledge of HTML tags, you can test and verify any elements as you would expect them in the browser.

It is unnecessary to select a specific workload time to perform a functional test. In fact, the application you want to test may even reside locally, with your own machine acting as the "localhost" server for your web application. For this article, we will limit ourselves to selected functional aspects of the page that we seek to verify or assert.

Using JMeter Components

We will create a Test Plan in order to demonstrate how we can configure the Test Plan to include functional testing capabilities. The modified Test Plan will include these scenarios:

1. Create Account: a new visitor creating an account
2. Login User: a user logging in to an account

Following these scenarios, we will simulate various entries and form submissions as a request to a page is made, while checking for the correct page response to these user entries. We will add assertions to the samples following these scenarios to verify the 'correctness' of a requested page. In this manner, we can see whether the pages respond correctly to invalid data. For example, we would like to check that the page responds with the correct warning message when a user enters an invalid password, or whether a request returns the correct page.

First of all, we will create a series of test cases following the various user actions in each scenario. The test cases may be designed as follows:

CREATE ACCOUNT

Test 1: Go to Home page.
Data: www.packtpub.com
Expected: Home page loads and renders with no page error.

Test 2: Click the Your Account link (top right).
Data: User action
Expected: 1. Your Account page loads and renders with no page error. 2. Logout link is not found.

Test 3: No Password. Enter email address in the Email text field; click the Create Account and Continue button.
Data: email=EMAIL
Expected: 1. Your Account page resets with the warning message "Please enter password". 2. Logout link is not found.

Test 4: Short Password. Enter email address in the Email text field; enter password in the Password text field; enter password in the Confirm Password text field; click the Create Account and Continue button.
Data: email=EMAIL, password=SHORT_PWD, confirm password=SHORT_PWD
Expected: 1. Your Account page resets with the warning message "Your password must be 8 characters or longer". 2. Logout link is not found.

Test 5: Unconfirmed Password. Enter email address in the Email text field; enter password in the Password text field; enter password in the Confirm Password text field; click the Create Account and Continue button.
Data: email=EMAIL, password=VALID_PWD, confirm password=INVALID_PWD
Expected: 1. Your Account page resets with the warning message "Password does not match". 2. Logout link is not found.

Test 6: Register Valid User. Enter email address in the Email text field; enter password in the Password text field; enter password in the Confirm Password text field; click the Create Account and Continue button.
Data: email=EMAIL, password=VALID_PWD, confirm password=VALID_PWD
Expected: 1. Logout link is found. 2. Page redirects to the User Account page. 3. Message found: You are registered as: e:<EMAIL>.

Test 7: Click the Logout link.
Data: User action
Expected: 1. Logout link is NOT found.

LOGIN USER

Test 1: Click Home page.
Data: User action
Expected: 1. WELCOME tab is active.

Test 2: Log in with Wrong Password. Enter email in the Email text field; enter password in the Password text field; click the Login button.
Data: email=EMAIL, password=INVALID_PWD
Expected: 1. Logout link is NOT found. 2. Page refreshes. 3. Warning message "Sorry your password was incorrect" appears.

Test 3: Log in with Non-Existent Account. Enter email in the Email text field; enter password in the Password text field; click the Login button.
Data: email=INVALID_EMAIL, password=INVALID_PWD
Expected: 1. Logout link is NOT found. 2. Page refreshes. 3. Warning message "Sorry, this does not match any existing accounts. Please check your details and try again or open a new account below" appears.

Test 4: Log in with Valid Account. Enter email in the Email text field; enter password in the Password text field; click the Login button.
Data: email=EMAIL, password=VALID_PWD
Expected: 1. Logout link is found. 2. Page reloads. 3. Login successful message "You are logged in as:" appears.

Test 5: Click the Logout link.
Data: User action
Expected: 1. Logout link is NOT found.
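JMeter expresses checks like the ones in these tables as assertions inside a Test Plan rather than as code. Still, the underlying idea of testing at the protocol layer (fetch the page over HTTP and look for expected text in the raw response) can be prototyped in a few lines of Python using only the standard library. The function below is not JMeter; it is a hedged sketch of the same kind of response-text assertion, and the URL and expected string are placeholders standing in for the pages and warning messages listed above:

import urllib.request

def page_contains(url, expected_text):
    # Fetch the page at the protocol level (no rendering, just like JMeter)
    # and check whether the expected text appears in the response body.
    with urllib.request.urlopen(url) as response:
        body = response.read().decode("utf-8", errors="replace")
    return expected_text in body

# Placeholder URL and text; substitute the real page and the warning message
# from the test case (for example, "Please enter password").
ok = page_contains("https://www.example.com/", "Example Domain")
print("Assertion passed" if ok else "Assertion failed")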

QGIS Feature Selection Tools

Packt
05 Dec 2014
4 min read
In this article by Anita Graser, the author of Learning QGIS, Third Edition, we will cover the following topics:

Selecting features with the mouse
Selecting features using expressions
Selecting features using spatial queries

(For more resources related to this topic, see here.)

Selecting features with the mouse

The first group of tools in the Attributes toolbar allows us to select features on the map using the mouse. The following screenshot shows the Select Feature(s) tool. We can select a single feature by clicking on it, or select multiple features by drawing a rectangle. The other tools can be used to select features by drawing different shapes: polygons, freehand areas, or circles around the features. All features that intersect with the drawn shape are selected. Holding down the Ctrl key will add the new selection to an existing one. Similarly, holding down Ctrl + Shift will remove the new selection from the existing selection.

Selecting features by expression

The second type of select tool is called Select by Expression, and it is also available in the Attributes toolbar. It selects features based on expressions that can contain references and functions using feature attributes and/or geometry. The list of available functions is pretty long, but we can use the search box to filter the list by name to find the function we are looking for faster. On the right-hand side of the window, we will find Selected Function Help, which explains the functionality and how to use the function in an expression. The Function List option also shows the layer attribute fields, and by clicking on Load all unique values or Load 10 sample values, we can easily access their content. As with the mouse tools, we can choose between creating a new selection or adding to or deleting from an existing selection. Additionally, we can choose to only select features from within an existing selection.

Let's have a look at some example expressions that you can build on and use in your own work:

Using the lakes.shp file in our sample data, we can, for example, select big lakes with an area bigger than 1,000 square miles using a simple attribute query, "AREA_MI" > 1000.0, or using geometry functions such as $area > (1000.0 * 27878400). Note that the lakes.shp CRS uses feet, and we therefore have to multiply by 27,878,400 to convert from square feet to square miles. The dialog will look like the one shown in the following screenshot.

We can also work with string functions, for example, to find lakes with long names, such as length("NAMES") > 12, or lakes with names that contain the s or S character, such as lower("NAMES") LIKE '%s%', which first converts the names to lowercase and then looks for any appearance of s.

Selecting features using spatial queries

The third type of tool is called Spatial Query and allows us to select features in one layer based on their location relative to the features in a second layer. These tools can be accessed by going to Vector | Research Tools | Select by location, and by going to Vector | Spatial Query | Spatial Query. Enable it in Plugin Manager if you cannot find it in the Vector menu. In general, we want to use the Spatial Query plugin, as it supports a variety of spatial operations such as crosses, equals, intersects, is disjoint, overlaps, touches, and contains, depending on the layer's geometry type.

Let's test the Spatial Query plugin using railroads.shp and pipelines.shp from the sample data.
For example, we might want to find all the railroad features that cross a pipeline; we will, therefore, select the railroads layer, the Crosses operation, and the pipelines layer. After clicking on Apply, the plugin presents us with the query results. There is a list of IDs of the result features on the right-hand side of the window, as you can see in the following screenshot. Below this list, we can select the Zoom to item checkbox, and QGIS will zoom to the feature that belongs to the selected ID. Additionally, the plugin offers buttons to directly save all the resulting features to a new layer.

Summary

This article introduced you to three ways to select features in QGIS: selecting features with the mouse, using expressions, and using spatial queries.

Resources for Article:

Further resources on this subject: Editing attributes [article], Server Logs [article], Improving proximity filtering with KNN [article]
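As a final note for readers who like to script their workflows, the expression-based selections described earlier can also be driven from the QGIS Python console. The snippet below is a rough sketch rather than part of this article's walkthrough: it assumes QGIS 2.16 or newer, where the selectByExpression() method is available (older releases need QgsFeatureRequest with a filter expression instead), and it assumes the lakes layer from the sample data is the active layer:

# Run from the QGIS Python console with the lakes layer active.
from qgis.utils import iface  # already defined in the console; shown for completeness

layer = iface.activeLayer()

# Same expression as in the Select by Expression dialog:
# select lakes with an area larger than 1,000 square miles.
layer.selectByExpression('"AREA_MI" > 1000.0')
print('{} features selected'.format(layer.selectedFeatureCount()))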

Running Your Applications with AWS

Cheryl Adams
17 Aug 2016
4 min read
If you've ever been told not to run with scissors, you should not have the same concern when running with AWS. It is neither dangerous nor unsafe when you know what you are doing and where to look when you don't. Amazon's current service offering, AWS (Amazon Web Services), is a collection of services, applications, and tools that can be used to deploy your infrastructure and application environment to the cloud. Amazon gives you the option to start with a 'free tier' and then move toward a pay-as-you-go model. We will highlight a few of the features you will see when you open your account with AWS.

One of the first things you will notice is that Amazon offers a bulk of information regarding cloud computing right up front. Whether you are a novice, an amateur, or an expert in cloud computing, Amazon offers documented information before you create your account. This type of information is essential if you are exploring this tool for a project or doing some self-study on your own. If you are a pre-existing Amazon customer, you can use the same account to get started with AWS. If you want to keep your personal account separate from your development or business account, it would be best to create a separate one.

Amazon Web Services Landing Page

The Free Tier is one of the most attractive features of AWS. As a new account holder, you are entitled to twelve months within the Free Tier. In addition to this span of time, there are services that can continue after the free tier is over. This gives the user ample time to explore the offerings within this free-tier period. The caution here is not to exceed the free service limits, as doing so will incur charges. Setting up the free tier still requires a credit card. Fee-based services will be offered throughout the free tier, so it is important not to select a fee-based service unless you are ready to start paying for it. Actual paid use will vary based on what you have selected.

AWS Service and Offerings (shown on an open account)

AWS overview of services available on the landing page

Amazon's service list is very robust. If you are already considering AWS, hopefully this means you are aware of what you need, or at least what you would like to use. If not, this would be a good time to press pause and look at some resource-based materials. Before the clock starts ticking on your free tier, I would recommend a slow walk through the introductory information on this site to ensure that you are selecting the right mix of services before creating your account.

Amazon's technical resources include a 10-minute tutorial that gives you a complete overview of the services. Topics like 'AWS Training and Introduction' and 'Get Started with AWS' include a list of 10-minute videos as well as a short list of 'how to' instructions for some of the more commonly used features. If you are a techie by trade or hobby, this may be something you want to dive into immediately. In a company, generally there is a predefined need or issue that the organization may feel can be resolved by the cloud. If it is a team initiative, it would be good to review the resources mentioned in this article so that everyone is on the same page as to what this solution can do. It's recommended that before you start any trial, subscription, or new service, you have a set goal or expectation of why you are doing it. Simply stated, a cloud solution is not the perfect solution for everyone. There is so much information here on the AWS site.
It's also great if you are comparing competing cloud service vendors in the same space. You will be able to do a complete assessment of most services within the free tier. You can map use case scenarios to determine whether AWS is the right fit for your project. AWS First Project is a great place to get started if you are new to AWS. If you are wondering how to get started, these technical resources will set you in the right direction. By reviewing this information during your setup, or before you start, you will be able to make good use of your first few months and your introduction to AWS.

About the author

Cheryl Adams is a senior cloud data and infrastructure architect in the healthcare data realm. She is also the co-author of Professional Hadoop by Wrox.

Cassandra Design Patterns

Packt
22 Sep 2015
18 min read
In this article by Rajanarayanan Thottuvaikkatumana, author of the book Cassandra Design Patterns, Second Edition, the author discusses how Apache Cassandra is one of the most popular NoSQL data stores. He states this based on the research paper Dynamo: Amazon's Highly Available Key-Value Store and the research paper Bigtable: A Distributed Storage System for Structured Data. Cassandra is implemented with the best features from both of these research papers. In general, NoSQL data stores can be classified into the following groups:

Key-value data store
Column family data store
Document data store
Graph data store

Cassandra belongs to the column family data store group. Cassandra's peer-to-peer architecture avoids single points of failure in the cluster of Cassandra nodes and gives the ability to distribute the nodes across racks or data centres. This makes Cassandra a linearly scalable data store. In other words, the more processing you need, the more Cassandra nodes you can add to your cluster. Cassandra's multi data centre support makes it a perfect choice to replicate the data stores across data centres for disaster recovery, high availability, separating transaction processing and analytical environments, and for building resiliency into the data store infrastructure.

Design patterns in Cassandra

The term "design patterns" is a highly misinterpreted term in the software development community. In an extremely general sense, it is a set of solutions for some known problems in quite a specific context. It is used in this book to describe a pattern of using certain features of Cassandra to solve some real-world problems. This book is a collection of such design patterns with real-world examples.

Coexistence patterns

Cassandra is one of the most successful NoSQL data stores, and it is greatly similar to the traditional RDBMS. Cassandra column families (also known as Cassandra tables), in a logical perspective, have a similarity with RDBMS-based tables in the view of the users, even though the underlying structures of these tables are totally different. Because of this, Cassandra is a good fit to be deployed along with a traditional RDBMS to solve some of the problems that the RDBMS is not able to handle. The caveat here is that, because of the similarity of RDBMS tables and Cassandra column families in the view of the end users, many users and data modelers try to use Cassandra in exactly the same way as an RDBMS schema is modeled and used, and they get into serious deployment issues. How do you prevent such pitfalls? The key here is to understand the differences from a theoretical perspective as well as a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.

Where do you start with Cassandra? The best place to look is the new application development requirements, and take it from there. Look at the cases where there is a need to de-normalize the RDBMS tables and keep all the data items together, which would have got distributed if you were to design the same solution in an RDBMS. Instead of thinking from the pure data model perspective, start thinking in terms of the application's perspective: how the data is generated by the application, what the read requirements are, what the write requirements are, what response time is expected for some of the use cases, and so on. Depending on these aspects, design the data model.
In the big data world, the application becomes the first class citizen, and the data model leaves the driving seat in the application design. Design the data model to serve the needs of the applications.

In any organization, new reporting requirements come up all the time. The major challenge in generating reports is the underlying data store. In the RDBMS world, reporting is always a challenge. You may have to join multiple tables to generate even simple reports. Even though RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data for the reports, when the report is being generated, the query plan is going to be very complex most of the time. The consumption of processing power is another factor to consider when generating such reports on the fly. Because of these complexities, for reporting requirements it is common to keep separate tables containing data exported from the transactional tables. This is a great opportunity to start with NoSQL stores like Cassandra as a reporting data store.

Data aggregation and summarization are common requirements in any organization. They help to control data growth by storing only the summary statistics and moving the transactional data into archives. Many times, this aggregated and summarized data is used for statistical analysis. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting go hand in hand. The aggregated data is heavily used in reports, and the aggregation process speeds up the queries to a great extent. This is another place where you can start with NoSQL stores like Cassandra.

The coexistence of RDBMS and NoSQL data stores like Cassandra is very much possible, feasible, and sensible, and this is the only way to get started with the NoSQL movement, unless you embark on a totally new product development from scratch. In summary, this section of the book discusses some design patterns related to de-normalization, reporting, and aggregation of data, using Cassandra as the preferred NoSQL data store.

RDBMS migration patterns

A big bang approach to any kind of technology migration is not advisable. A series of deliberations have to happen before the eventual and complete changeover. Migration from RDBMS to Cassandra is no different. Any new technology replacing an old one must coexist harmoniously with it, at least for a short period of time. This gives the stakeholders a lot of confidence in the new technology. Many technology pundits suggest various approaches to RDBMS to NoSQL migration. Many such guidelines are specific to particular NoSQL data stores, giving attention to specific areas, and most of the time they end up focusing on the process rather than the technology. The migration from RDBMS to Cassandra is not an easy task, mainly because RDBMS-based systems are time tested and trustworthy in most organizations. So, migrating from such a robust RDBMS-based system to Cassandra is not going to be easy for anyone.

One of the best approaches to achieve this goal is to exploit some of the new or unique features in Cassandra that many of the traditional RDBMSes don't have. This also prevents Cassandra from being used just like any other RDBMS. Cassandra is unique. Cassandra is not an RDBMS. The approach of banking on the unique features is not only applicable to the RDBMS to Cassandra migration, but also to any migration from one paradigm to another.
Some of the design patterns that are discussed in this section of the book revolve around very simple but important features of Cassandra, which have profound application potential when designing the next generation of NoSQL data stores using Cassandra. A wise usage of these unique features of Cassandra will give a head start on the eventual and complete migration from RDBMS.

The modeling of collection objects in an RDBMS is a real pain, because multiple tables have to be defined and a join is required to access the data. Many RDBMSes address this by providing the capability to define user-defined data types, but there is absolutely no standardization in this space. Collection objects are very commonly seen in real-world applications. A list of actions, a tuple of related values, a set of objects, dictionaries, and things like that come up quite often in applications. Cassandra has elegant ways to model these, because collections are data types of the column family.

Counting is a very commonly required operation in many business processes and applications. In an RDBMS, this has to be modeled with integers or long numbers, and many times applications make big mistakes by using them in the wrong ways. Cassandra has a counter data type in the column family that alleviates this problem.

Getting rid of unwanted records from an RDBMS table is not an automatic process. When some application events occur, the records have to be removed by application programs or through some other means. But in many situations, many data items will have a preallocated time to live, and they should go away without the intervention of any external event. Cassandra has a way to assign a time-to-live (TTL) attribute to data items. By making use of TTL, data items get removed without any external event's intervention.

All the design patterns covered in this section of the book revolve around some of the new features of Cassandra that will make the migration from RDBMS to Cassandra an easy task.

Cache migration pattern

Database access, whether it is from an RDBMS or from other highly distributed NoSQL data stores, is always an input/output (I/O) intensive operation. It makes perfect sense to cache the frequently used, but reasonably static, data for fast access by the applications consuming it. In such situations, an in-memory cache is preferred to repeated database access for each request. Using a cache is not always a pleasant experience, though. Getting into really weird problems, such as data loss, data getting out of sync with its source, and other data integrity problems, is very common. It is also very common to see the wrong components coming into the enterprise solution stack for various reasons. Overlooking some of the features and adopting a technology without much background work is a very common pitfall.

Many times, a cache comes into the solution stack to reduce the latency of the responses. Once the initial results are favorable, more and more data gets tossed into the cache. Slowly, pushing more and more data into the cache becomes the practice, and this is when problems start popping up one by one. Pure in-memory cache solutions are favored by everybody, by virtue of their ability to serve data quickly, until you start losing data because of faults in the system, along with application and node crashes. A cache serves data much faster than it can be served from other data stores.
But if the caching solution in use is giving data integrity problems, it is better to migrate to NoSQL data stores like Cassandra. Is Cassandra faster than in-memory caching solutions? The obvious answer is no, but it is not as bad as many think. Cassandra can be configured to serve fast reads, and a bonus comes in the form of high data integrity with strong replication capabilities. A cache is good as long as it serves its purpose without any data loss or other data integrity issues. This section of the book emphasizes the use case of a key/value type cache and discusses various methods of cache-to-NoSQL migration. Cassandra cannot be used as a replacement for a cache in terms of the speed of data access. But when it comes to data integrity, Cassandra shines all the time with its tunable consistency feature. With continual tuning, and by manipulating data with clean and well-written application code, data access can be improved to a great extent, and it will be much better than many other data stores. The design pattern covered in this section of the book gives some guidance on migrating from caching solutions to Cassandra, if this is a must.

CAP patterns

When it comes to large-scale Internet applications or web services, popularly known as Internet of Things (IoT) applications, the number of components is huge and the way they are distributed is beyond imagination. There will be hundreds of application servers, hundreds of data store nodes, and many other components in the whole ecosystem. In such a scenario, performing an atomic transaction by getting an agreement from all the components involved is, for all practical purposes, impossible. Consistency, availability, and partition tolerance are three important guarantees, popularly known as the CAP guarantees, that any distributed computing system should offer, even though it is not possible to provide all of them simultaneously.

In IoT applications, the distribution of the application nodes is unavoidable. This means that the possibility of a network partition is pretty much there. So, it is mandatory to give the P guarantee. Now, the question is whether to forfeit the C guarantee or the A guarantee. At this stage, the situation is not as grave as portrayed in the CAP theorem conjectured by Eric Brewer. For the use cases in a given IoT application, there is no need to have a 100% C guarantee and a 100% A guarantee at the same time. So, depending on the level of the A guarantee needed, the C guarantee can be tuned. In other words, this is called tunable consistency. Depending on the way data is ingested into Cassandra, and the way it is consumed from Cassandra, tuning is possible to give the best results for the read and write requirements of the applications.

In some applications, the speed at which the data is written will be very high. In other words, the velocity of the data ingestion into Cassandra is very high. These fall into the category of write-heavy applications. In some applications, the need to read data quickly will be an important requirement. This is mainly needed in applications where there is a lot of data processing required. Data analytics applications, batch processing applications, and so on fall under this category; these are the read-heavy applications. Now, there is a third category of applications where there is equal importance for fast writes as well as fast reads.
These are the kind of applications where there is a constant inflow of data and, at the same time, a need for clients to read the data for various purposes. These fall into the category of read-write balanced applications. The consistency level requirements for the three types of applications described previously are totally different. There is no single way of tuning that is optimal for all three; the consistency levels have to be tuned differently from use case to use case. In this section of the book, various design patterns related to applications with needs for fast writes, fast reads, and balanced writes and reads are discussed. All these design patterns revolve around using the tunable consistency parameters of Cassandra. Whether for writes or reads, if the consistency levels are set high, the availability levels will be low, and vice versa. So, by making use of the consistency level knob, the Cassandra data store can be used for various types of writing and reading use cases.

Temporal patterns

In any application, the usage of data that varies over a period of time, called temporal data, is very important. Temporal data is needed wherever there is a need to maintain chronology. There are many applications in which there is a huge need for the storage, retrieval, and processing of data that is tied to time. The biggest challenge in dealing with temporal data stored in a data store is that it is heavily used for analytical purposes and is retrieved based on various time-based sort orders. So, the data stores that are used to capture temporal data should be capable of storing the data while strictly adhering to the chronology.

Many usage patterns seen in the real world show temporal behavior. For classification purposes in this book, they are bucketed into three categories. The first is the general time series category. The second is the log category, such as an audit log, a transaction log, and so on. The third is the conversation category, such as the conversation messages of a chat application. There is relevance in this classification, because these categories are commonly seen across many applications. In many applications, these are really cross-cutting concerns; designers underestimate this aspect, and in the end many applications have different data stores capturing this temporal data. There is a need for a common strategy for dealing with temporal data that falls into these three commonly seen categories in an enterprise-wide solution architecture. In other words, there should be a uniform way of capturing temporal data, a uniform way of processing temporal data, and a commonly used set of tools and libraries to manage the temporal data.

Out of the three design patterns that are discussed in this section of the book, the first, the Time Series pattern, is a general design pattern that covers the most general behavior of any kind of temporal data. The next two design patterns, namely the Log pattern and the Conversation pattern, are special cases of the first. This section of the book covers the general nature of temporal data, some specific instances of such data items in real-world applications, and why Cassandra is the best fit as a NoSQL data store to persist temporal data. Temporal data comes up quite often in the use cases of many applications.
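Before looking at how these temporal patterns are modeled, a small sketch may help fix the idea. It uses the DataStax Python driver against a single local node; the keyspace, table, and column names are invented purely for illustration, and the TTL on the insert simply echoes the expiry feature mentioned earlier:

from datetime import datetime
from cassandra.cluster import Cluster

# Node address, keyspace, and table are illustrative assumptions.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sensor_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per sensor per day keeps each wide row bounded; the clustering
# column keeps the readings in reverse chronological order within the partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data.readings (
        sensor_id text,
        day text,
        event_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# Optionally let a reading expire after 30 days by attaching a TTL (in seconds).
session.execute(
    "INSERT INTO sensor_data.readings (sensor_id, day, event_time, value) "
    "VALUES (%s, %s, %s, %s) USING TTL 2592000",
    ('s1', '2015-09-22', datetime.utcnow(), 21.5)
)

Partitioning by (sensor_id, day) and clustering by event_time are exactly the kinds of choices the next paragraph calls out as the levers for building high-performing temporal models.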
Data modeling of temporal data is very important from the Cassandra perspective, for optimal storage and quick access to the data. Some common design patterns to model temporal data have been covered in this section of the book. By focusing on a very few aspects, such as the partition key, primary key, clustering columns, and the number of records that get stored in a wide row of Cassandra, very effective and high-performing temporal data models can be built.

Analytical patterns

The 3Vs of big data, namely Volume, Variety, and Velocity, pose another big challenge, which is the analysis of the data stored in NoSQL data stores such as Cassandra. What are the analytics use cases? How can the distributed data be processed? What are the data transformations that are typically seen in applications? These are the topics covered in this section of the book. Unlike the other sections of the book, the focus shifts from Cassandra to other technologies, such as Apache Hadoop, Hadoop MapReduce, and Apache Spark, to introduce the big data analytics tool space. Design patterns such as the Map/Reduce pattern and the Transformation pattern are very commonly seen in the data analytics world. Cassandra has good compatibility with Apache Spark, and together they form an ideal tool set for data analysis use cases.

This section of the book covers some data analysis aspects and mainly discusses data processing. Data transformation is one of the major activities in data processing. Out of the many data processing patterns, the Map/Reduce pattern deserves a special mention, because it is used in so many batch processing and analysis use cases dealing with big data. Spark has been chosen as the tool of choice to explain the data processing activities. This section explains how a Map/Reduce kind of data processing task can be done using Cassandra. Spark, which is very powerful for performing online data analysis, has also been discussed. This section of the book also covers some of the commonly seen data transformations that are used in data processing applications.

Summary

Many Cassandra design patterns have been covered in this book. If a design pattern is not used in any real-world application, it has only theoretical value. To give a practical approach to the applicability of these design patterns, an end-to-end application is taken as a case in point and described in the last chapter of the book, which is used as a vehicle to explain the applicability of the Cassandra design patterns discussed in the earlier sections of the book.

Users love Cassandra because of its SQL-like interface, CQL, and because its features are very closely related to those of an RDBMS, even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market, which let them write applications in their preferred programming language. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administrators love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And Cassandra works! An application based on Cassandra will be perfect only if its features are used in the right way, and this book is an attempt to guide the Cassandra community in this direction.
Resources for Article:

Further resources on this subject: Cassandra Architecture [article], Getting Up and Running with Cassandra [article], Getting Started with Apache Cassandra [article]