How-To Tutorials - Programming


The developer-tester face-off needs to end. It's putting our projects at risk.

Aaron Lazar
21 Jul 2018
6 min read
Penny and Leonard work at the same company as a tester and developer respectively. Penny arrives home late, to find Leonard on the couch with his legs up on the table, playing his favourite video game. Leonard: Oh hi sweety, it looks like you had a long day at work. Penny, throwing him a hostile, sideways glance, heads over to the refrigerator. Penny: Did you remember to take out the garbage? Leonard: Of course, sweety. I used 2 bags so Sheldon’s Szechuan sauce from Szechuan Palace doesn’t seep through. Penny: Did you buy new shampoo for the bathroom? Leonard: Yes, I picked up your regular one from the store on the way back. Penny: And did you slap on a last minute field on the SPA at work? Leonard, pausing his video game and answering in a soft, high pitched voice: Whaaaaaat? Source: giphy If you’re a developer or a tester, you’ve probably been in this situation at least once, if not more. Even if your husband or wife might not be on the other side of the source code. The war goes on... The funny thing is that this isn’t something that’s happened in the recent past. The war between Developers and Testers is a long standing, unresolved battle, that is usually brought up in bouts of unnecessary humor. The truth is that this battle is the cause of several projects slipping deadlines, teams not respecting each other’s views, etc. Here we’ll discuss some of the main reasons for this disconnect and try and address them in hope of making the office a better place. #1 You talkin’ to me? One of the main reasons that developers and testers are not on the same page is because neither bother to communicate effectively with the other. Each individual considers informing the other about the strategy/techniques used, a waste of effort. Obviously there are bound to be issues arising with such a disjointed team. The only way to resolve this problem is to toss egos out of the window, sit down and resolve problems like professionals. While tickets might be the most professional and efficient way to resolve things, walking up to the person (if possible), and discussing the best way forward lets you build a relationship, and resolve things more effectively. Moreover, the person on the receiving end will not consider the move offensive, or demeaning. #2 Is it ‘team’ or ‘teams’? You know the answer to this one, but you’re still not willing to accept it. IT managers and team leads need to create an environment in which developers and testers are not two separate teams. Rather, consider them all as engineers working in the same team, towards the same goal! There’s no better recipe to meet success. Use modern methods like Mob or Pair Programming, where both developers and testers work together closely. The ideal scenario would be to possibly have both team members work on the same machine, addressing and strategising to achieve the goal with continuous, real-time feedback. A good pairing station, Source: ministry of testing #3 On the same page? Which book you got there? If you’re a developer, this one’s especially for you, so listen carefully! Most developers aren’t aware of what tools the testers in their teams use, which is a sin. Being aware of testing tools, methodologies and processes, goes a long way in enabling a smooth and speedy testing process. A developer will be able to understand which parts of their code can probably be a tester’s target, what changes would give testers a tough time and on the other hand, what makes it easy. #4 One goal, two paths to achieve it Well, this is true a lot of times. 
Developers aim to “build” an application. Testers on the other hand aim to “break” the application. Now while this is not wrong, it’s the vision with which the tester is actually planning on breaking things. Testers, you should always keep the customer’s or end user’s requirements clear, while approaching the application. I may not be an actual tester, and you might wonder how I can empathise with other testers. Honestly, I don’t build software, neither do I test any. But I’ve been in a very similar role, earlier on in my Publishing career. As a Commissioning Gatekeeper, I was responsible for validating book and video ideas from the Commissioning Editors. Like a tester, my job was to identify why or how something wouldn’t work in the market. Now, I could easily approach a particular book idea from the perspective of ‘trashing’ it. But when I learned to approach it from the customer’s point of view, my perspective changed, and I was able to give better constructive feedback to the editor. Don’t aim to destroy, aim to improve. If you must kill off an idea or a feature, do it firmly but with kindness. #5 Trust the Developer’s testing skills Yes! Lack of trust is one of the main reasons why there’s so much friction between developers and testers. A tester needs to understand the developer and believe that they can also write tests with a clear goal in mind. Test Driven Development is a great approach to follow. Here, the developer will know better, what angles to test from, and this can help the tester write a mutually defined test case for the developer to run. At the same time, the tester can also provide insight into how to address bugs that might creep up while running the tests. With this combined knowledge, the developers will be able to minimize the number of bugs at the first go itself! Toss in a Business-Driven Development approach, and you’ve got yourself a team that delivers user stories that are more aligned to the business requirement than ever before! In the end, developers and testers, both need to set their egos aside and make peace with each other. If you really look at it, it’s not that hard at all. It’s all about how the two collaborate to create better software, rather than working in silos. IT managers can play an important role here, and they need to understand the advantages and limitations of their team. They need to ensure the unity of the team by encouraging more engaging ways of working as well as introducing modern methodologies that would assist a peaceful, collaborative effort. Why does more than half the IT industry suffer from Burnout? Abandoning Agile Unit testing with Java frameworks: JUnit and TestNG [Tutorial]


Asynchronous Programming in F#

Packt
12 Oct 2016
15 min read
In this article by Alfonso Garcia Caro Nunez and Suhaib Fahad, author of the book Mastering F#, sheds light on how writing applications that are non-blocking or reacting to events is increasingly becoming important in this cloud world we live in. A modern application needs to carry out a rich user interaction, communicate with web services, react to notifications, and so on; the execution of reactive applications is controlled by events. Asynchronous programming is characterized by many simultaneously pending reactions to internal or external events. These reactions may or may not be processed in parallel. (For more resources related to this topic, see here.) In .NET, both C# and F# provide asynchronous programming experience through keywords and syntaxes. In this article, we will go through the asynchronous programming model in F#, with a bit of cross-referencing or comparison drawn with the C# world. In this article, you will learn about asynchronous workflows in F# Asynchronous workflows in F# Asynchronous workflows are computation expressions that are setup to run asynchronously. It means that the system runs without blocking the current computation thread when a sleep, I/O, or other asynchronous process is performed. You may be wondering why do we need asynchronous programming and why can't we just use the threading concepts that we did for so long. The problem with threads is that the operation occupies the thread for the entire time that something happens or when a computation is done. On the other hand, asynchronous programming will enable a thread only when it is required, otherwise it will be normal code. There is also lot of marshalling and unmarshalling (junk) code that we will write around to overcome the issues that we face when directly dealing with threads. Thus, asynchronous model allows the code to execute efficiently whether we are downloading a page 50 or 100 times using a single thread or if we are doing some I/O operation over the network and there are a lot of incoming requests from the other endpoint. There is a list of functions that the Async module in F# exposes to create or use these asynchronous workflows to program. The asynchronous pattern allows writing code that looks like it is written for a single-threaded program, but in the internals, it uses async blocks to execute. There are various triggering functions that provide a wide variety of ways to create the asynchronous workflow, which is either a background thread, a .NET framework task object, or running the computation in the current thread itself. In this article, we will use the example of downloading the content of a webpage and modifying the data, which is as follows: let downloadPage (url: string) = async { let req = HttpWebRequest.Create(url) use! resp = req.AsyncGetResponse() use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } downloadPage("https://www.google.com") |> Async.RunSynchronously The preceding function does the following: The async expression, { … }, generates an object of type Async<string> These values are not actual results; rather, they are specifications of tasks that need to run and return a string Async.RunSynchronously takes this object and runs synchronously We just wrote a simple function with asynchronous workflows with relative ease and reason about the code, which is much better than using code with Begin/End routines. 
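As a quick aside, error handling fits into this model with an ordinary try...with expression inside the async block. The following is a minimal sketch only, reusing the article's downloadPage pattern and assuming the same namespaces are open (System.IO and System.Net); tryDownloadPage and its Option return type are illustrative choices, not code from the book:

// Sketch: the downloadPage pattern with basic error handling.
let tryDownloadPage (url: string) =
    async {
        try
            let req = HttpWebRequest.Create(url)
            use! resp = req.AsyncGetResponse()
            use respStream = resp.GetResponseStream()
            use sr = new StreamReader(respStream)
            // Success: hand back the page body wrapped in Some
            return Some (sr.ReadToEnd())
        with ex ->
            // Failures (DNS errors, timeouts, HTTP errors) surface here,
            // just as they would in synchronous code
            printfn "Download of %s failed: %s" url ex.Message
            return None
    }

tryDownloadPage "https://www.google.com" |> Async.RunSynchronously

The Async module also offers Async.Catch, which materializes the outcome as a Choice between the result and the exception, if you prefer not to write try...with yourself.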
One of the most important point here is that the code is never blocked during the execution of the asynchronous workflow. This means that we can, in principle, have thousands of outstanding web requests—the limit being the number supported by the machine, not the number of threads that host them. Using let! In asynchronous workflows, we will use let! binding to enable execution to continue on other computations or threads, while the computation is being performed. After the execution is complete, the rest of the asynchronous workflow is executed, thus simulating a sequential execution in an asynchronous way. In addition to let!, we can also use use! to perform asynchronous bindings; basically, with use!, the object gets disposed when it loses the current scope. In our previous example, we used use! to get the HttpWebResponse object. We can also do as follows: let! resp = req.AsyncGetResponse() // process response We are using let! to start an operation and bind the result to a value, do!, which is used when the return of the async expression is a unit. do! Async.Sleep(1000) Understanding asynchronous workflows As explained earlier, asynchronous workflows are nothing but computation expressions with asynchronous patterns. It basically implements the Bind/Return pattern to implement the inner workings. This means that the let! expression is translated into a call to async. The bind and async.Return function are defined in the Async module in the F# library. This is a compiler functionality to translate the let! expression into computation workflows and, you, as a developer, will never be required to understand this in detail. The purpose of explaining this piece is to understand the internal workings of an asynchronous workflow, which is nothing but a computation expression. The following listing shows the translated version of the downloadPage function we defined earlier: async.Delay(fun() -> let req = HttpWebRequest.Create(url) async.Bind(req.AsyncGetResponse(), fun resp -> async.Using(resp, fun resp -> let respStream = resp.GetResponseStream() async.Using(new StreamReader(respStream), fun sr " -> reader.ReadToEnd() ) ) ) ) The following things are happening in the workflow: The Delay function has a deferred lambda that executes later. The body of the lambda creates an HttpWebRequest and is forwarded in a variable req to the next segment in the workflow. The AsyncGetResponse function is called and a workflow is generated, where it knows how to execute the response and invoke when the operation is completed. This happens internally with the BeginGetResponse and EndGetResponse functions already present in the HttpWebRequest class; the AsyncGetResponse is just a wrapper extension present in the F# Async module. The Using function then creates a closure to dispose the object with the IDisposable interface once the workflow is complete. Async module The Async module has a list of functions that allows writing or consuming asynchronous code. We will go through each function in detail with an example to understand it better. Async.AsBeginEnd It is very useful to expose the F# workflow functionality out of F#, say if we want to use and consume the API's in C#. The Async.AsBeginEnd method gives the possibility of exposing the asynchronous workflows as a triple of methods—Begin/End/Cancel—following the .NET Asynchronous Programming Model (APM). 
Based on our downloadPage function, we can define the Begin, End, Cancel functions as follows: type Downloader() = let beginMethod, endMethod, cancelMethod = " Async.AsBeginEnd downloadPage member this.BeginDownload(url, callback, state : obj) = " beginMethod(url, callback, state) member this.EndDownload(ar) = endMethod ar member this.CancelDownload(ar) = cancelMethod(ar) Async.AwaitEvent The Async.AwaitEvent method creates an asynchronous computation that waits for a single invocation of a .NET framework event by adding a handler to the event. type MyEvent(v : string) = inherit EventArgs() member this.Value = v; let testAwaitEvent (evt : IEvent<MyEvent>) = async { printfn "Before waiting" let! r = Async.AwaitEvent evt printfn "After waiting: %O" r.Value do! Async.Sleep(1000) return () } let runAwaitEventTest () = let evt = new Event<Handler<MyEvent>, _>() Async.Start <| testAwaitEvent evt.Publish System.Threading.Thread.Sleep(3000) printfn "Before raising" evt.Trigger(null, new MyEvent("value")) printfn "After raising" > runAwaitEventTest();; > Before waiting > Before raising > After raising > After waiting : value The testAwaitEvent function listens to the event using Async.AwaitEvent and prints the value. As the Async.Start will take some time to start up the thread, we will simply call Thread.Sleep to wait on the main thread. This is for example purpose only. We can think of scenarios where a button-click event is awaited and used inside an async block. Async.AwaitIAsyncResult Creates a computation result and waits for the IAsyncResult to complete. IAsyncResult is the asynchronous programming model interface that allows us to write asynchronous programs. It returns true if IAsyncResult issues a signal within the given timeout. The timeout parameter is optional, and its default value is -1 of Timeout.Infinite. let testAwaitIAsyncResult (url: string) = async { let req = HttpWebRequest.Create(url) let aResp = req.BeginGetResponse(null, null) let! asyncResp = Async.AwaitIAsyncResult(aResp, 1000) if asyncResp then let resp = req.EndGetResponse(aResp) use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() else return "" } > Async.RunSynchronously (testAwaitIAsyncResult "https://www.google.com") We will modify the downloadPage example with AwaitIAsyncResult, which allows a bit more flexibility where we want to add timeouts as well. In the preceding example, the AwaitIAsyncResult handle will wait for 1000 milliseconds, and then it will execute the next steps. Async.AwaitWaitHandle Creates a computation that waits on a WaitHandle—wait handles are a mechanism to control the execution of threads. The following is an example with ManualResetEvent: let testAwaitWaitHandle waitHandle = async { printfn "Before waiting" let! r = Async.AwaitWaitHandle waitHandle printfn "After waiting" } let runTestAwaitWaitHandle () = let event = new System.Threading.ManualResetEvent(false) Async.Start <| testAwaitWaitHandle event System.Threading.Thread.Sleep(3000) printfn "Before raising" event.Set() |> ignore printfn "After raising" The preceding example uses ManualResetEvent to show how to use AwaitHandle, which is very similar to the event example that we saw in the previous topic. Async.AwaitTask Returns an asynchronous computation that waits for the given task to complete and returns its result. This helps in consuming C# APIs that exposes task based asynchronous operations. let downloadPageAsTask (url: string) = async { let req = HttpWebRequest.Create(url) use! 
resp = req.AsyncGetResponse() use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } |> Async.StartAsTask let testAwaitTask (t: Task<string>) = async { let! r = Async.AwaitTask t return r } > downloadPageAsTask "https://www.google.com" |> testAwaitTask |> Async.RunSynchronously;; The preceding function is also downloading the web page as HTML content, but it starts the operation as a .NET task object. Async.FromBeginEnd The FromBeginEnd method acts as an adapter for the asynchronous workflow interface by wrapping the provided Begin/End method. Thus, it allows using large number of existing components that support an asynchronous mode of work. The IAsyncResult interface exposes the functions as Begin/End pattern for asynchronous programming. We will look at the same download page example using FromBeginEnd: let downloadPageBeginEnd (url: string) = async { let req = HttpWebRequest.Create(url) use! resp = Async.FromBeginEnd(req.BeginGetResponse, req.EndGetResponse) use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } The function accepts two parameters and automatically identifies the return type; we will use BeginGetResponse and EndGetResponse as our functions to call. Internally, Async.FromBeginEnd delegates the asynchronous operation and gets back the handle once the EndGetResponse is called. Async.FromContinuations Creates an asynchronous computation that captures the current success, exception, and cancellation continuations. To understand these three operations, let's create a sleep function similar to Async.Sleep using timer: let sleep t = Async.FromContinuations(fun (cont, erFun, _) -> let rec timer = new Timer(TimerCallback(callback)) and callback state = timer.Dispose() cont(()) timer.Change(t, Timeout.Infinite) |> ignore ) let testSleep = async { printfn "Before" do! sleep 5000 printfn "After 5000 msecs" } Async.RunSynchronously testSleep The sleep function takes an integer and returns a unit; it uses Async.FromContinuations to allow the flow of the program to continue when a timer event is raised. It does so by calling the cont(()) function, which is a continuation to allow the next step in the asynchronous flow to execute. If there is any error, we can call erFun to throw the exception and it will be handled from the place we are calling this function. Using the FromContinuation function helps us wrap and expose functionality as async, which can be used inside asynchronous workflows. It also helps to control the execution of the programming with cancelling or throwing errors using simple APIs. Async.Start Starts the asynchronous computation in the thread pool. It accepts an Async<unit> function to start the asynchronous computation. The downloadPage function can be started as follows: let asyncDownloadPage(url) = async { let! result = downloadPage(url) printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.Start   We wrap the function to another async function that returns an Async<unit> function so that it can be called by Async.Start. Async.StartChild Starts a child computation within an asynchronous workflow. This allows multiple asynchronous computations to be executed simultaneously, as follows: let subTask v = async { print "Task %d started" v Thread.Sleep (v * 1000) print "Task %d finished" v return v } let mainTask = async { print "Main task started" let! childTask1 = Async.StartChild (subTask 1) let! 
childTask2 = Async.StartChild (subTask 5) print "Subtasks started" let! child1Result = childTask1 print "Subtask1 result: %d" child1Result let! child2Result = childTask2 print "Subtask2 result: %d" child2Result print "Subtasks completed" return () } Async.RunSynchronously mainTask Async.StartAsTask Executes a computation in the thread pool and returns a task that will be completed in the corresponding state once the computation terminates. We can use the same example of starting the downloadPage function as a task. let downloadPageAsTask (url: string) = async { let req = HttpWebRequest.Create(url) use! resp = req.AsyncGetResponse() use respStream = resp.GetResponseStream() use sr = new StreamReader(respStream) return sr.ReadToEnd() } |> Async.StartAsTask let task = downloadPageAsTask("http://www.google.com") prinfn "Do some work" task.Wait() printfn "done"   Async.StartChildAsTask Creates an asynchronous computation from within an asynchronous computation, which starts the given computation as a task. let testAwaitTask = async { print "Starting" let! child = Async.StartChildAsTask <| async { // " Async.StartChildAsTask shall be described later print "Child started" Thread.Sleep(5000) print "Child finished" return 100 } print "Waiting for the child task" let! result = Async.AwaitTask child print "Child result %d" result } Async.StartImmediate Runs an asynchronous computation, starting immediately on the current operating system thread. This is very similar to the Async.Start function we saw earlier: let asyncDownloadPage(url) = async { let! result = downloadPage(url) printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.StartImmediate Async.SwitchToNewThread Creates an asynchronous computation that creates a new thread and runs its continuation in it: let asyncDownloadPage(url) = async { do! Async.SwitchToNewThread() let! result = downloadPage(url) printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.Start   Async.SwitchToThreadPool Creates an asynchronous computation that queues a work item that runs its continuation, as follows: let asyncDownloadPage(url) = async { do! Async.SwitchToNewThread() let! result = downloadPage(url) do! Async.SwitchToThreadPool() printfn "%s" result"} asyncDownloadPage "http://www.google.com" |> Async.Start   Async.SwitchToContext Creates an asynchronous computation that runs its continuation in the Post method of the synchronization context. Let's assume that we set the text from the downloadPage function to a UI textbox, then we will do it as follows: let syncContext = System.Threading.SynchronizationContext() let asyncDownloadPage(url) = async { do! Async.SwitchToContext(syncContext) let! result = downloadPage(url) textbox.Text <- result"} asyncDownloadPage "http://www.google.com" |> Async.Start   Note that in the console applications, the context will be null. Async.Parallel The Parallel function allows you to execute individual asynchronous computations queued in the thread pool and uses the fork/join pattern. let parallel_download() = let sites = ["http://www.bing.com"; "http://www.google.com"; "http://www.yahoo.com"; "http://www.search.com"] let htmlOfSites = Async.Parallel [for site in sites -> downloadPage site ] |> Async.RunSynchronously printfn "%A" htmlOfSites   We will use the same example of downloading HTML content in a parallel way. 
The preceding example shows the essence of parallel I/O computation The async function, { … }, in the downloadPage function shows the asynchronous computation These are then composed in parallel using the fork/join combinator In this sample, the composition executed waits synchronously for overall result Async.OnCancel A cancellation interruption in the asynchronous computation when a cancellation occurs. It returns an asynchronous computation trigger before being disposed. // This is a simulated cancellable computation. It checks the " token source // to see whether the cancel signal was received. let computation " (tokenSource:System.Threading.CancellationTokenSource) = async { use! cancelHandler = Async.OnCancel(fun () -> printfn " "Canceling operation.") // Async.Sleep checks for cancellation at the end of " the sleep interval, // so loop over many short sleep intervals instead of " sleeping // for a long time. while true do do! Async.Sleep(100) } let tokenSource1 = new " System.Threading.CancellationTokenSource() let tokenSource2 = new " System.Threading.CancellationTokenSource() Async.Start(computation tokenSource1, tokenSource1.Token) Async.Start(computation tokenSource2, tokenSource2.Token) printfn "Started computations." System.Threading.Thread.Sleep(1000) printfn "Sending cancellation signal." tokenSource1.Cancel() tokenSource2.Cancel() The preceding example implements the Async.OnCancel method to catch or interrupt the process when CancellationTokenSource is cancelled. Summary In this article, we went through detail, explanations about different semantics in asynchronous programming with F#, used with asynchronous workflows. We saw a number of functions from the Async module. Resources for Article: Further resources on this subject: Creating an F# Project [article] Asynchronous Control Flow Patterns with ES2015 and beyond [article] Responsive Applications with Asynchronous Programming [article]


Introduction and Composition

Packt
19 Aug 2015
17 min read
In this article written by Diogo Resende, author of the book Node.js High Performance, we will discuss how high performance is hard, and how it depends on many factors. Best performance should be a constant goal for developers. To achieve it, a developer must know the programming language they use and, more importantly, how the language performs under heavy loads, these being disk, memory, network, and processor usage. (For more resources related to this topic, see here.) Developers will make the most out of a language if they know its weaknesses. In a perfect world, since every job is different, a developer should look for the best tool for the job. But this is not feasible and a developer wouldn't be able to know every best tool, so they have to look for the second best tool for every job. A developer will excel if they know few tools but master them. As a metaphor, a hammer is used to drive nails, and you can also use it to break objects apart or forge metals, but you shouldn't use it to drive screws. The same applies to languages and platforms. Some platforms are very good for a lot of jobs but perform really badly at other jobs. This performance can sometimes be mitigated, but at other times, can't be avoided and you should look for better tools. Node.js is not a language; it's actually a platform built on top of V8, Google's open source JavaScript engine. This engine implements ECMAScript, which itself is a simple and very flexible language. I say "simple" because it has no way of accessing the network, accessing the disk, or talking to other processes. It can't even stop execution since it has no kind of exit instruction. This language needs some kind of interface model on top of it to be useful. Node.js does this by exposing a (preferably) nonblocking I/O model using libuv. This nonblocking API allows you to access the filesystem, connect to network services and execute child processes. The API also has two other important elements: buffers and streams. Since JavaScript strings are Unicode friendly, buffers were introduced to help deal with binary data. Streams are used as simple event interfaces to pass data around. Buffers and streams are used all over the API when reading file contents or receiving network packets. A stream is a module, similar to the network module. When loaded, it provides access to some base classes that help create readable, writable, duplex, and transform streams. These can be used to perform all sorts of data manipulation in a simplified and unified format. The buffers module easily becomes your best friend when converting binary data formats to some other format, for example, JSON. Multiple read and write methods help you convert integers and floats, signed or not, big endian or little endian, from 8 bits to 8 bytes long. Most of the platform is designed to be simple, small, and stable. It's designed and ready to create some high-performance applications. Performance analysis Performance is the amount of work completed in a defined period of time and with a set of defined resources. It can be analyzed using one or more metrics that depend on the performance goal. The goal can be low latency, low memory footprint, reduced processor usage, or even reduced power consumption. The act of performance analysis is also called profiling. Profiling is very important for making optimized applications and is achieved by instrumenting either the source or the instance of the application. By instrumenting the source, developers can spot common performance weak spots. 
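For instance, a very small sketch of source instrumentation might wrap a high-resolution timer around a suspect function; the probe helper and labels below are invented for this example and use only Node.js core APIs:

// Sketch only: wrap any synchronous function with a timing probe.
function probe(label, fn) {
  return function (...args) {
    const start = process.hrtime.bigint();          // Node.js high-resolution timer
    const result = fn.apply(this, args);
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(label + ' took ' + elapsedMs.toFixed(3) + ' ms');
    return result;
  };
}

// Hypothetical usage: time the parsing of a large JSON payload.
const parseBig = probe('JSON.parse', JSON.parse);
parseBig(JSON.stringify({ items: new Array(100000).fill(42) }));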
By instrumenting an application instance, they can test the application on different environments. This type of instrumentation can also be known by the name benchmarking. Node.js is known for being fast. Actually, it's not that fast; it's just as fast as your resources allow it. What Node.js is best at is not blocking your application because of an I/O task. The perception of performance can be misleading in Node.js applications. In some other languages, when an application task gets blocked—for example, by a disk operation—all other tasks can be affected. In the case of Node.js, this doesn't happen—usually. Some people look at the platform as being single threaded, which isn't true. Your code runs on a thread, but there are a few more threads responsible for I/O operations. Since these operations are extremely slow compared to the processor's performance, they run on a separate thread and signal the platform when they have information for your application. Applications blocking I/O operations perform poorly. Since Node.js doesn't block I/O unless you want it to, other operations can be performed while waiting for I/O. This greatly improves performance. V8 is an open source Google project and is the JavaScript engine behind Node.js. It's responsible for compiling and executing JavaScript, as well as managing your application's memory needs. It is designed with performance in mind. V8 follows several design principles to improve language performance. The engine has a profiler and one of the best and fast garbage collectors that exist, which is one of the keys to its performance. It also does not compile the language into byte code; it compiles it directly into machine code on the first execution. A good background in the development environment will greatly increase the chances of success in developing high-performance applications. It's very important to know how dereferencing works, or why your variables should avoid switching types. Here are other useful tips you would want to follow. You can use a style guide like JSCS and a linter like JSHint to enforce them to for yourself and your team. Here are some of them: Write small functions, as they're more easily optimized Use monomorphic parameters and variables Prefer arrays to manipulate data, as integer-indexed elements are faster Try to have small objects and avoid long prototype chains Avoid cloning objects because big objects will slow the operations Monitoring After an application is put into production mode, performance analysis becomes even more important, as users will be more demanding than you were. Users don't accept anything that takes more than a second, and monitoring the application's behavior over time and over some specific loads will be extremely important, as it will point to you where your platform is failing or will fail next. Yes, your application may fail, and the best you can do is be prepared. Create a backup plan, have fallback hardware, and create service probes. Essentially, anticipate all the scenarios you can think of, and remember that your application will still fail. Here are some of those scenarios and aspects that you should monitor: When in production, application usage is of extreme importance to understand where your application is heading in terms of data size or memory usage. It's important that you carefully define source code probes to monitor metrics—not only performance metrics, such as requests per second or concurrent requests, but also error rate and exception percentage per request served. 
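A minimal sketch of such a probe, assuming a bare Node.js HTTP server (the counter fields and the /metrics route are invented for this example), could look like this:

const http = require('http');

// In-process counters; a real application would ship these to a monitoring system.
const stats = { requests: 0, errors: 0, started: Date.now() };

const server = http.createServer((req, res) => {
  stats.requests += 1;
  if (req.url === '/metrics') {
    const uptimeSeconds = (Date.now() - stats.started) / 1000;
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify({
      requestsPerSecond: stats.requests / uptimeSeconds,
      errorRate: stats.requests > 0 ? stats.errors / stats.requests : 0
    }));
    return;
  }
  try {
    res.end('ok'); // the real request handling would go here
  } catch (err) {
    stats.errors += 1;
    res.statusCode = 500;
    res.end('error');
  }
});

server.listen(3000);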
Your application emits errors and sometimes throws exceptions; it's normal and you shouldn't ignore them. Don't forget the rest of the infrastructure. If your application must perform at high standards, your infrastructure should too. Your server power supply should be uninterruptible and stable, as instability will degrade your hardware faster than it should. Choose your disks wisely, as faster disks are more expensive and usually come in smaller storage sizes. Sometimes, however, this is actually not a bad decision when your application doesn't need that much storage and speed is considered more important. But don't just look at the gigabytes per dollar. Sometimes, it's more important to look at the gigabits per second per dollar. Also, your server temperature and server room should be monitored. High temperatures degrades performance and your hardware has an operation temperature limit. Security, both physical and virtual, is also very important. Everything counts for the standards of high performance, as an application that stops serving its users is not performing at all. Getting high performance Planning is essential in order to achieve the best results possible. High performance is built from the ground up and starts with how you plan and develop. It obviously depends on physical resources, as you can't perform well when you don't have sufficient memory to accomplish your task, but it also depends greatly on how you plan and develop an application. Mastering tools will give much better performance chances than just using them. Setting the bar high from the beginning of development will force the planning to be more prudent. Some bad planning of the database layer can really downgrade performance. Also, cautious planning will cause developers to think more about “use cases and program more consciously. High performance is when you have to think about a new set of resources (processor, memory, storage) because all that you have is exhausted, not just because one resource is. A high-performance application shouldn't need a second server when a little processor is used and the disk is full. In such a case, you just need bigger disks. Applications can't be designed as monolithic these days. An increasing user base enforces a distributed architecture, or at least one that can distribute load by having multiple instances. This is very important to accommodate in the beginning of the planning, as it will be harder to change an application that is already in production. Most common applications will start performing worse over time, not because of deficit of processing power but because of increasing data size on databases and disks. You'll notice that the importance of memory increases and fallback disks become critical to avoiding downtime. It's very important that an application be able to scale horizontally, whether to shard data across servers or across regions. A distributed architecture also increases performance. Geographically distributed servers can be more closed to clients and give a perception of performance. Also, databases distributed by more servers will handle more traffic as a whole and allow DevOps to accomplish zero downtime goals. This is also very useful for maintenance, as nodes can be brought down for support without affecting the application. Testing and benchmarking To know whether an application performs well or not under specific environments, we have to test it. This kind of test is called a benchmark. 
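Before reaching for the dedicated tools covered next, a hand-rolled micro-benchmark is often enough to compare two implementations. This is a sketch only, and the candidate functions are placeholders:

// Sketch: time N iterations of each candidate with console.time/console.timeEnd.
function bench(label, fn, iterations) {
  console.time(label);
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  console.timeEnd(label);
}

const data = Array.from({ length: 1000 }, (unused, i) => i);
bench('for-of sum', () => { let sum = 0; for (const n of data) sum += n; return sum; }, 100000);
bench('reduce sum', () => data.reduce((sum, n) => sum + n, 0), 100000);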
Benchmarking is important, and it is specific to every application. Even on the same language and platform, different applications might perform differently, either because of the way some parts of an application were structured or the way a database was designed. Analyzing performance will indicate the bottlenecks of your application, that is, the parts that do not perform as well as others. These are the parts that need to be improved. Constantly improving the worst-performing parts will elevate the application's overall performance.

There are plenty of tools out there, some more specific to or focused on JavaScript applications, such as benchmarkjs (http://benchmarkjs.com/) and ben (https://github.com/substack/node-ben), and others more generic, such as ab (http://httpd.apache.org/docs/2.2/programs/ab.html) and httpload (https://github.com/perusio/httpload). There are several types of benchmark tests, depending on the goal:

Load testing is the simplest form of benchmarking. It is done to find out how the application performs under a specific load. You can test and find out how many connections an application accepts per second, or how much traffic in bytes an application can handle. An application's load can be checked by looking at external performance, such as traffic, and also at internal performance, such as the processor used or the memory consumed.

Soak testing is used to see how an application performs over a more extended period of time. It is done when an application tends to degrade over time and analysis is needed to see how it reacts. This type of test is important for detecting memory leaks, as some applications can perform well in basic tests, but over time memory leaks and their performance can degrade.

Spike testing is used when a load is increased very quickly to see how the application reacts and performs. This test is very useful and important for applications that can have usage spikes, where operators need to know how the application will react. Twitter is a good example of an application environment that can be affected by usage spikes (during world events such as sports or religious dates) and needs to know how its infrastructure will handle them.

All of these tests can become harder as your application grows. As your user base gets bigger, your application scales and you lose the ability to load test with the resources you have. It's good to be prepared for this moment, and especially to be prepared to monitor performance and keep track of soaks and spikes once your application's users become the ones continuously generating the load.

Composition in applications

Because of this continuous demand for performant applications, composition becomes very important. Composition is a practice where you split the application into several smaller and simpler parts, making them easier to understand, develop, and maintain. It also makes them easier to test and improve. Avoid creating big, monolithic code bases. They don't work well when you need to make a change, and they also don't work well when you need to test and analyze any part of the code to improve it and make it perform better. The Node.js platform helps you, and in some ways forces you, to compose your code. Node.js Package Manager (NPM) is a great module publishing service. You can download other people's modules and publish your own as well.
There are tens of thousands of modules published, which means that you don't have to reinvent the wheel in most cases. This is good since you can avoid wasting time on creating a module and use a module that is already in production and used by many people, which normally means that bugs will be tracked faster and improvements will be delivered even faster. The Node.js platform allows developers to easily separate code. You don't have to do this, as the platform doesn't force you to, but you should try and follow some good practices, such as the ones described in the following sections. Using NPM Don't rewrite code unless you need to. Take your time to try some available modules, and choose the one that is right for you. This reduces the probability of writing faulty code and helps published modules that have a bigger user base. Bugs will be spotted earlier, and more people in different environments will test fixes. Moreover, you will be using a more resilient module. One important and neglected task after starting to use some modules is to track changes and, whenever possible, keep using recent stable versions. If a dependency module has not been updated for a year, you can spot a problem later, but you will have a hard time figuring out what changed between two versions that are a year apart. Node.js modules tend to be improved over time and API changes are not rare. Always upgrade with caution and don't forget to test. Separating your code Again, you should always split your code into smaller parts. Node.js helps you do this in a very easy way. You should not have files bigger than 5 kB. If you have, you better think about splitting it. Also, as a good rule, each user-defined object should have its own separate file. Name your files accordingly: // MyObject.js module.exports = MyObject; function MyObject() { // … } MyObject.prototype.myMethod = function () { … }; Another good rule to check whether you have a file bigger than it should be; that is, it should be easy to read and understand in less than 5 minutes by someone new to the application. If not, it means that it's too complex and it will be harder to track and fix bugs later on. Remember that later on, when your application becomes huge, you will be like a new developer when opening a file to fix something. You can't remember all of the code of the application, and you need to absorb a file behavior fast. Garbage collection When writing applications, managing available memory is boring and difficult. When the application gets complex it's easy to start leaking memory. Many programming languages have automatic memory management, removing this management away from the developer by means of a Garbage Collector (GC). The GC is only a part of this memory management, but it's the most important one and is responsible for reclaiming memory that is no longer in use (garbage), by periodically looking at disposed referenced objects and freeing the memory associated with them. The most common technique used by GC is to monitor reference counting. This means that GC, for each object, holds the number (count) of other objects that reference it. When an object has no references to it, it can be collected, meaning, it can be disposed and its memory freed. CPU Profiling Profiling is boring but it's a good form of software analysis where you measure resource usage. This usage is measured over time and sometimes under specific workloads. Resources can mean anything the application is using, being it memory, disk, network or processor. 
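As a small sketch, two of those resources can be sampled from inside a running Node.js process with core APIs alone; the one-second interval and the megabyte formatting below are arbitrary choices for illustration:

// Sample heap and CPU usage once per second.
let lastCpu = process.cpuUsage();

setInterval(() => {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(lastCpu); // microseconds spent since the previous sample
  lastCpu = process.cpuUsage();

  console.log(
    'heap ' + (mem.heapUsed / 1048576).toFixed(1) + '/' + (mem.heapTotal / 1048576).toFixed(1) + ' MB, ' +
    'rss ' + (mem.rss / 1048576).toFixed(1) + ' MB, ' +
    'cpu ' + ((cpu.user + cpu.system) / 1000).toFixed(1) + ' ms over the last second'
  );
}, 1000);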
More specifically, CPU profiling allows you to analyze how, and how much, your functions use the processor. You can also analyze the opposite: the idle time, when the processor is not being used. When profiling the processor, we usually take samples of the call stack at a certain frequency and analyze how the stack changes (grows or shrinks) over the sampling period. If you use profilers from the operating system, you will see more items in the stack than you probably expect, as you will also get internal calls of Node.js and V8.

Summary

Together, Node.js and NPM make a very good platform for developing high-performance applications. Since the language behind them is JavaScript and most applications these days are web applications, this combination is an even more appealing choice, as it is one less server-side language to learn (such as PHP or Ruby) and it ultimately allows developers to share code between the client and server sides. Also, frontend and backend developers can share, read, and improve each other's code. Many developers pick this formula and bring with them many of their habits from the client side. Some of these habits are not applicable because, on the server side, asynchronous tasks must rule: there are many clients connected (as opposed to one) and performance becomes crucial.

You now see that the garbage collector's task is not that easy. It does a very good job of managing memory automatically, and you can help it a lot, especially if you write applications with performance in mind. Avoiding growth of the GC old space is necessary to avoid long GC cycles that pause your application and sometimes force your services to restart. Every time you create a new variable you allocate memory, and you get closer to a new GC cycle. Even when you understand how memory is managed, you sometimes need to inspect your memory usage behavior.

In today's environments, it is very important to be able to profile an application to identify bottlenecks, especially at the processor and memory level. Overall, you should focus on your code quality before going into profiling.

Resources for Article:

Further resources on this subject:
Node.js Fundamentals [Article]
So, what is Node.js? [Article]
Setting up Node [Article]


Github wants to improve Open Source sustainability; invites maintainers to talk about their OSS challenges

Sugandha Lahoti
18 Jan 2019
4 min read
Open source sustainability is an essential part of free and open software development. Open source contributors and maintainers build tools and technologies for everyone, yet they rarely get the resources, tooling, and environment they need. When something goes wrong with a project, it is usually the maintainers who are held responsible, when in reality contributors and maintainers share that responsibility equally.

Yesterday, Devon Zuegel, the open source product manager at GitHub, penned a blog post about open source sustainability and the issues maintainers currently face.

The biggest thing holding back OSS is the work overload maintainers face. The open source community largely consists of maintainers who hold jobs at other organizations and maintain their projects in their free time. This leaves little room for creators to gain economically from their projects or to cover the costs and people required to maintain them, which is why companies and individuals are being encouraged to donate to these maintainers on GitHub. As one Hacker News user points out, “I think this would be a huge incentive for people to continue their work long-term and not just "hand over" repositories to people with ulterior motives.” Another said, “Integrating bug bounties and donations into GitHub could be one of the best things to happen to Open Source. Funding new features and bug fixes could become seamless, and it would sway more devs to adopt this model for their projects.”

Another major challenge is the abuse and frustration maintainers deal with on a daily basis. As Devon writes on her blog, “No one deserves abuse. OSS contributors are often on the receiving end of harassment, demands, and general disrespect, even as they volunteer their time to the community.” What is needed is to educate people and to build some form of moderation for trolls, such as a small barrier to entry.

Maintainers should also be given expanded visibility into how their software is used; currently they only have access to download statistics. Projects also need a proper governance model that is regularly updated to reflect the decisions the team makes, delegates, and communicates.

As Adam Jacob, founder of SFOSC (Sustainable Free and Open Source Communities), points out, “I believe we need to start talking about Open Source, not in terms of licensing models, or business models (though those things matter): instead, we should be talking about whether or not we are building sustainable communities. What brings us together, as people, in this common effort around the software? What rights do we hold true for each other? What rights are we willing to trade in order to see more of the software in the world, through the investment of capital?” SFOSC was established to discuss the principles that lead to sustainable communities, to develop clear social contracts communities can use, and to educate open source companies on which business models can create true communities.

Like SFOSC, GitHub wants to better understand maintainers' struggles from their own experiences, hence the blog post. Devon wants to support the people behind OSS at GitHub and is inviting them into an open dialogue with the GitHub community to solve the nuanced and unique challenges the OSS community faces. She has created a contact form asking open source contributors and maintainers to join the conversation and share their problems.

Open Source Software: Are maintainers the only ones responsible for software sustainability?
We need to encourage the meta-conversation around open source, says Nadia Eghbal [Interview]
EU to sponsor bug bounty programs for 14 open source projects from January 2019


Working with Entity Client and Entity SQL

Packt
21 Aug 2015
11 min read
In this article by Joydip Kanjilal, author of the book Entity Framework Tutorial - Second Edition explains how Entity Framework contains a powerful client-side query engine that allows you to execute queries against the conceptual model of data, irrespective of the underlying data store in use. This query engine works with a rich functional language called Entity SQL (or E-SQL for short), a derivative of Transact SQL (T-SQL), that enables you to query entities or a collection of entities. (For more resources related to this topic, see here.) An overview of the E-SQL language Entity Framework allows you to write programs against the EDM and also add a level of abstraction on top of the relational model. This isolation of the logical view of data from the Object Model is accomplished by expressing queries in terms of abstractions using an enhanced query language called E-SQL. This language is specially designed to query data from the EDM. E-SQL was designed to address the need for a language that can query data from its conceptual view, rather than its logical view. From T-SQL to E-SQL SQL is the primary language that has been in use for years for querying databases. Remember, SQL is a standard and not owned by any particular database vendor. SQL-92 is a standard, and is the most popular SQL standard currently in use. This standard was released in 1992. The 92 in the name reflects this fact. Different database vendors implemented their own flavors of the SQL-92 standard. The T-SQL language was designed by Microsoft as an SQL Server implementation of the SQL-92 standard. Similar to other SQL languages implemented by different database vendors, the E-SQL language is Entity Framework implementation of the SQL-92 standard that can be used to query data from the EDM. E-SQL is a text-based, provider independent, query language used by Entity Framework to express queries in terms of EDM abstractions and to query data from the conceptual layer of the EDM. One of the major differences between E-SQL and T-SQL is in nested queries. Note that you should always enclose your nested queries in E-SQL using parentheses as seen here: SELECT d, (SELECT DEREF (e) FROM NAVIGATE (d, PayrollEntities.FK_Employee_Department) AS e) AS Employees FROM PayrollEntities.Department AS d; The Select VALUE... statement is used to retrieve singleton values. It is also used to retrieve values that don't have any column names. However, the Select ROW... statement is used to select one or more rows. As an example, if you want a value as a collection from an entity without the column name, you can use the VALUE keyword in the SELECT statement as shown here: SELECT VALUE emp.EmployeeName FROM PayrollEntities.Employee as emp The preceding statement will return the employee names from the Employee entity as a collection of strings. In T-SQL, you can have the ORDER BY clause at the end of the last query when using UNION ALL. SELECT EmployeeID, EmployeeName From Employee UNION ALL SELECT EmployeeID, Basic, Allowances FROM Salary ORDER BY EmployeeID On the contrary, you do not have the ORDER BY clause in the UNION ALL operator in E-SQL. Why E-SQL when I already have LINQ to Entities? LINQ to Entities is a new version of LINQ, well suited for Entity Framework. But why do you need E-SQL when you already have LINQ to Entities available to you? LINQ to Entities queries are verified at the time of compilation. Therefore, it is not at all suited for building and executing dynamic queries. 
On the contrary, E-SQL queries are verified at runtime, so they can be used for building and executing dynamic queries. You now have a new ADO.NET provider in E-SQL, which is a sophisticated query engine that can be used to query your data from the conceptual model. It should be noted, however, that both LINQ and E-SQL queries are converted into canonical command trees that are in turn translated into database-specific query statements based on the underlying database provider in use, as shown in the following diagram: We will now take a quick look at the features of E-SQL before we delve deep into this language. Features of E-SQL These are the features of E-SQL: Provider neutrality: E-SQL is independent of the underlying ADO.NET data provider in use because it works on top of the conceptual model. SQL like: The syntax of E-SQL statements resemble T-SQL. Expressive with support for entities and types: You can write your E-SQL queries in terms of EDM abstractions. Composable and orthogonal: You can use a subquery wherever you have support for an expression of that type. The subqueries are all treated uniformly regardless of where they have been used. In the sections that follow, we will take a look at the E-SQL language in depth. We will discuss the following points: Operators Expressions Identifiers Variables Parameters Canonical functions Operators in E-SQL An operator is one that operates on a particular operand to perform an operation. Operators in E-SQL can broadly be classified into the following categories: Arithmetic operators: These are used to perform arithmetic operations. Comparison operators: You can use these to compare the values of two operands. Logical operators: These are used to perform logical operations. Reference operators: These act as logical pointers to a particular entity belonging to a particular entity set. Type operators: These can operate on the type of an expression. Case operators: These operate on a set of Boolean expressions. Set operators: These operate on set operations. Arithmetic operators Here is an example of an arithmetic operator: SELECT VALUE s FROM PayrollEntities.Salary AS s where s.Basic = 5000 + 1000 The following arithmetic operators are available in E-SQL: + (add) - (subtract) / (divide) % (modulo) * (multiply) Comparison operators Here is an example of a comparison operator: SELECT VALUE e FROM PayrollEntities.Employee AS e where e.EmployeeID = 1 The following is a list of the comparison operators available in E-SQL: = (equals) != (not equal to) <> (not equal to) > (greater than) < (less than) >= (greater than or equal to) <= (less than or equal to) Logical operators Here is an example of using logical operators in E-SQL: SELECT VALUE s FROM PayrollEntities.Salary AS s where s.Basic > 5000 && s.Allowances > 3000 This is a list of the logical operators available in E-SQL: && (And) ! 
(Not) || (Or) Reference operators The following is an example of how you can use a reference operator in E-SQL: SELECT VALUE REF(e).FirstName FROM PayrollEntities.Employee as e The following is a list of the reference operators available in E-SQL: Key Ref CreateRef DeRef Type operators Here is an example of a type operator that returns a collection of employees from a collection of persons: SELECT VALUE e FROM OFTYPE(PayrollEntities.Person, PayrollEntities.Employee) AS e The following is a list of the type operators available in E-SQL: OfType Cast Is [Not] Of Treat Set operators This is an example of how you can use a set operator in E-SQL: (Select VALUE e from PayrollEntities.Employee as e where e.FirstName Like 'J%') Union All ( select VALUE s from PayrollEntities.Employee as s where s.DepartmentID = 1) Here is a list of the set operators available in E-SQL: Set Union Element AnyElement Except [Not] Exists [Not] In Overlaps Intersect Operator precedence When you have multiple operators operating in a sequence, the order in which the operators will be executed will be determined by the operator precedence. The following table shows the operator, operator type, and their precedence levels in E-SQL language: Operators Operator type Precedence level . , [] () Primary Level 1 ! not Unary Level 2 * / % Multiplicative Level 3 + and - Additive Level 4 < > <= >= Relational Level 5 = != <> Equality Level 6 && Conditional And Level 7 || Conditional Or Level 8 Expressions in E-SQL Expressions are the building blocks of the E-SQL language. Here are some examples of how expressions are represented: 1; //This represents one scalar item {2}; //This represents a collection of one element {3, 4, 5} //This represents a collection of multiple elements Query expressions in E-SQL Query expressions are used in conjunction with query operators to perform a certain operation and return a result set. Query expressions in E-SQL are actually a series of clauses that are represented using one or more of the following: SELECT: This clause is used to specify or limit the number of elements that are returned when a query is executed in E-SQL. FROM: This clause is used to specify the source or collection for retrieval of the elements in a query. WHERE: This clause is used to specify a particular expression. HAVING: This clause is used to specify a filter condition for retrieval of the result set. GROUP BY: This clause is used to group the elements returned by a query. ORDER BY: This clause is used to order the elements returned in either ascending or descending order. Here is the complete syntax of query expressions in E-SQL: SELECT VALUE [ ALL | DISTINCT ] FROM expression [ ,...n ] as C [ WHERE expression ] [ GROUP BY expression [ ,...n ] ] [ HAVING search_condition ] [ ORDER BY expression] And here is an example of a typical E-SQL query with all clause types being used: SELECT emp.FirstName FROM PayrollEntities.Employee emp, PayrollEntities.Department dept Group By dept.DepartmentName Where emp.DepartmentID = dept.DepartmentID Having emp.EmployeeID > 5 Identifiers, variables, parameters, and types in E-SQL Identifiers in E-SQL are of the following two types: Simple identifiers Quoted identifiers Simple identifiers are a sequence of alphanumeric or underscore characters. Note that an identifier should always begin with an alphabetical character. 
As an example, the following are valid identifiers:

a12_ab
M_09cd
W0001m

However, the following are invalid identifiers:

9abcd
_xyz
0_pqr

Quoted identifiers are those that are enclosed within square brackets ([]). Here are some examples of quoted identifiers:

SELECT emp.EmployeeName AS [Employee Name] FROM Employee as emp
SELECT dept.DepartmentName AS [Department Name] FROM Department as dept

Quoted identifiers cannot contain newline, tab, backspace, or carriage return characters. In E-SQL, a variable is a reference to a named expression. Note that the naming conventions for variables follow the same rules as for identifiers. In other words, a valid variable reference to a named expression in E-SQL should be a valid identifier too. Here is an example:

SELECT emp FROM Employee as emp;

In the preceding example, emp is a variable reference. Types in E-SQL fall into three broad groups: primitive types such as integers and strings, nominal types such as entity types, entity sets, and relationships, and transient types such as rows, collections, and references. The E-SQL language supports the following transient type categories: rows, collections, and references.

Row
A row, which is also known as a tuple, has no identity or behavior and cannot be inherited. The following statement returns one row that contains two elements:

ROW (1, 'Joydip');

Collections
Collections represent zero or more instances of other types. You can use SET() to retrieve unique values from a collection of values. Here is an example:

SET({1,1,2,2,3,3,4,4,5,5,6,6})

The preceding example will return the unique values from the set, specifically 1, 2, 3, 4, 5, and 6. This is equivalent to the following statement:

Select Value Distinct x from {1,1,2,2,3,3,4,4,5,5,6,6} As x;

You can create collections using MULTISET() or even using {}, as shown in the following examples:

MULTISET (1, 2, 3, 4, 5, 6)

The following represents the same as the preceding example:

{1, 2, 3, 4, 5, 6}

Here is how you can return a collection of 10 identical rows, each with two elements in them:

SELECT ROW(1,'Joydip') from {1,2,3,4,5,6,7,8,9,10}

To return a collection of all rows from the employee set, you can use the following:

Select emp from PayrollEntities.Employee as emp;

Similarly, to select all rows from the department set, you use the following:

Select dept from PayrollEntities.Department as dept;

Reference
A reference denotes a logical pointer to a particular entity. In essence, it is a foreign key to a specific entity set. Operators are used to perform operations on one or more operands. In E-SQL, the following operators are available to construct, deconstruct, and navigate through references:

KEY
REF
CREATEREF
DEREF

To create a reference to an instance of Employee, you can use REF() as shown here:

SELECT REF (emp) FROM PayrollEntities.Employee as emp

Once you have created a reference to an entity using REF(), you can also dereference the entity using DEREF() as shown:

DEREF (CREATEREF(PayrollEntities.Employee, ROW(@EmployeeID)))

Summary
In this article, we explored E-SQL and how it can be used with the Entity Client provider to perform CRUD operations in our applications. We discussed the differences between E-SQL and T-SQL, and the differences between E-SQL and LINQ. We also discussed when one should choose E-SQL instead of LINQ to query data in applications.
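As a practical complement to the discussion above: the extract mentions running E-SQL through the Entity Client provider but never shows that wiring, so here is a minimal, hypothetical sketch of how such a query could be built and parameterized at runtime (the dynamic-query scenario where E-SQL has an edge over LINQ to Entities). It assumes a connection string named PayrollEntities, matching the model used in the examples, and the System.Data.EntityClient namespace of earlier Entity Framework versions; in EF6 the same types live in System.Data.Entity.Core.EntityClient.

// A hypothetical sketch (not part of the original article) of executing a
// runtime-built E-SQL query through the EntityClient provider.
using System;
using System.Data;
using System.Data.EntityClient;

class ESqlDemo
{
    static void Main()
    {
        using (var connection = new EntityConnection("name=PayrollEntities"))
        {
            connection.Open();

            // The E-SQL text is an ordinary string, so it can be composed at
            // runtime -- the dynamic-query scenario discussed earlier.
            string esql = "SELECT VALUE emp.EmployeeName " +
                          "FROM PayrollEntities.Employee AS emp " +
                          "WHERE emp.DepartmentID = @deptId";

            using (var command = new EntityCommand(esql, connection))
            {
                command.Parameters.Add(
                    new EntityParameter("deptId", DbType.Int32) { Value = 1 });

                // EntityCommand readers must be opened with SequentialAccess.
                using (var reader =
                    command.ExecuteReader(CommandBehavior.SequentialAccess))
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}

If you prefer to stay at the Object Services layer, the same E-SQL string could likely also be issued through ObjectContext.CreateQuery<string>(esql), at the cost of going through object materialization.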

It's All About Data

Packt
14 Sep 2016
12 min read
In this article by Samuli Thomasson, the author of the book Haskell High Performance Programming, we will learn how to choose and design optimal data structures in applications. You will be able to drop the level of abstraction in slow parts of code, all the way to mutable data structures if necessary. (For more resources related to this topic, see here.)

Annotating strictness and unpacking datatype fields

We used the BangPatterns extension to make function arguments strict:

{-# LANGUAGE BangPatterns #-}
f !s (x:xs) = f (s + 1) xs
f !s _ = s

Using bangs for annotating strictness in fact predates the BangPatterns extension (and the older compiler flag -fbang-patterns in GHC 6.x). With just plain Haskell98, we are allowed to use bangs to make datatype fields strict:

> data T = T !Int

A bang in front of a field ensures that whenever the outer constructor (T) is in WHNF, the inner field is in WHNF as well. We can check this:

> T undefined `seq` ()
*** Exception: Prelude.undefined

There are no restrictions on which fields can be strict, be they recursive or polymorphic fields, although it rarely makes sense to make recursive fields strict. Consider the fully strict linked list:

data List a = List !a !(List a) | ListEnd

With this much strictness, you cannot represent parts of infinite lists without always requiring infinite space. Moreover, before accessing the head of a finite strict list you must evaluate the list all the way to the last element. Strict lists don't have the streaming property of lazy lists.

By default, all data constructor fields are pointers to other data constructors or primitives, regardless of their strictness. This applies to basic data types such as Int, Double, Char, and so on, which are not primitive in Haskell. They are data constructors over their primitive counterparts Int#, Double#, and Char#:

> :info Int
data Int = GHC.Types.I# GHC.Prim.Int#

There is a performance overhead, the size of a pointer dereference, between types such as Int and Int#, but an Int can represent lazy values (called thunks), whereas primitives cannot. Without thunks, we couldn't have lazy evaluation. Luckily, GHC is intelligent enough to unroll wrapper types as primitives in many situations, completely eliminating indirect references. The hash suffix is specific to GHC and always denotes a primitive type. The GHC modules do expose the primitive interface. Programming with primitives, you can further micro-optimize code and get C-like performance. However, several limitations and drawbacks apply.

Using anonymous tuples

Tuples may seem harmless at first; they just lump a bunch of values together. But note that the fields in a tuple aren't strict, so a two-tuple corresponds to the slowest PairP data type from our previous benchmark. If you need a strict Tuple type, you need to define one yourself. This is also one more reason to prefer custom types over nameless tuples in many situations. These two structurally similar tuple types have widely different performance semantics:

data Tuple = Tuple {-# UNPACK #-} !Int {-# UNPACK #-} !Int
data Tuple2 = Tuple2 {-# UNPACK #-} !(Int, Int)

If you really want unboxed anonymous tuples, you can enable the UnboxedTuples extension and write things with types like (# Int#, Char# #). But note that a number of restrictions apply to unboxed tuples, as to all primitives. The most important restriction is that unboxed types may not occur where polymorphic types or values are expected, because polymorphic values are always considered as pointers.
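To make the difference concrete, here is a small, self-contained illustration (it is not taken from the book's benchmark code) of why a strict, unpacked pair tends to beat an anonymous tuple as a fold accumulator: with a plain tuple, foldl' only forces the pair itself to WHNF, so thunks pile up in its fields, while the bang-annotated fields of the custom type are forced at every step.

-- Illustrative only: strict unpacked pair vs. lazy tuple as an accumulator.
import Data.List (foldl')

data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Int

-- Lazy tuple accumulator: the sums build up as thunks, because foldl'
-- forces only the pair constructor, not its fields.
sumCountLazy :: [Int] -> (Int, Int)
sumCountLazy = foldl' step (0, 0)
  where step (s, n) x = (s + x, n + 1)

-- Strict accumulator: constructing P forces both fields on every step,
-- so no thunks accumulate.
sumCountStrict :: [Int] -> (Int, Int)
sumCountStrict xs = case foldl' step (P 0 0) xs of
    P s n -> (s, n)
  where step (P s n) x = P (s + x) (n + 1)

main :: IO ()
main = print (sumCountStrict [1..1000000])

Compiled with -O2, GHC typically unboxes the fields of P entirely, which is exactly the effect the UNPACK pragmas above are asking for.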
Representing bit arrays

One way to define a bit array in Haskell that still retains the convenience of Bool is:

import Data.Array.Unboxed
type BitArray = UArray Int Bool

This representation packs 8 bits per byte, so it's space efficient. See the following section on arrays in general to learn about time efficiency – for now we only note that BitArray is an immutable data structure, like BitStruct, and that copying small BitStructs is cheaper than copying BitArrays due to overheads in UArray.

Consider a program that processes a list of integers and tells whether it contains even or odd counts of numbers divisible by 2, 3, and 5. We can implement this with simple recursion and a three-bit accumulator. Here are three alternative representations for the accumulator:

type BitTuple = (Bool, Bool, Bool)
data BitStruct = BitStruct !Bool !Bool !Bool deriving Show
type BitArray = UArray Int Bool

And the program itself is defined along these lines (shown here for the BitStruct accumulator; the other variants differ only in how the three bits are packed and unpacked):

go :: BitStruct -> [Int] -> BitStruct
go acc [] = acc
go (BitStruct two three five) (x:xs) =
    go (BitStruct (test 2 x `xor` two)
                  (test 3 x `xor` three)
                  (test 5 x `xor` five)) xs

test :: Int -> Int -> Bool
test n x = x `mod` n == 0

Here, xor is Data.Bits.xor, which also works on Bool. I've omitted the remaining details; they can be found in the bitstore.hs file. The fastest variant is BitStruct, then comes BitTuple (30% slower), and BitArray is the slowest (130% slower than BitStruct). Although BitArray is the slowest (due to making a copy of the array on every iteration), it would be easy to scale the array in size or make it dynamic. Note also that this benchmark is really on the extreme side; normally, programs do a bunch of other stuff besides updating an array in a tight loop. If you need fast array updates, you can resort to the mutable arrays discussed later on. It might also be tempting to use Data.Vector.Unboxed.Vector Bool from the vector package, due to its nice interface. But beware that this representation uses one byte for every bit stored, wasting 7 bits out of every 8.

Mutable references are slow

Data.IORef and Data.STRef are the smallest bits of mutable state: references to mutable variables, one for IO and the other for ST. There is also a Data.STRef.Lazy module, which provides a wrapper over the strict STRef for lazy ST. However, because IORef and STRef are references, they imply a level of indirection. GHC intentionally does not optimize this away, as that would cause problems in concurrent settings. For this reason, IORef and STRef shouldn't be used like variables in C, for example; performance will for sure be very bad. Let's verify the performance hit by considering the following ST-based sum-of-range implementation:

-- file: sum_mutable.hs
import Control.Monad.ST
import Data.STRef

count_st :: Int -> Int
count_st n = runST $ do
    ref <- newSTRef 0
    let go 0 = readSTRef ref
        go i = modifySTRef' ref (+ i) >> go (i - 1)
    go n

And compare it to this pure recursive implementation:

count_pure :: Int -> Int
count_pure n = go n 0
  where go 0 s = s
        go i s = go (i - 1) $! (s + i)

The ST implementation is many times slower when at least -O is enabled. Without optimizations, the two functions are more or less equivalent in performance; there is a similar amount of indirection from not unboxing arguments in the latter version. This is one example of the wonders that can be done to optimize referentially transparent code.
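For completeness, here is a small sketch (not from the book) of the most idiomatic way to express the same accumulation, a strict left fold. With optimizations, GHC compiles it to essentially the same tight loop as count_pure, with no references and no indirection.

-- file: sum_fold.hs (illustrative only)
import Data.List (foldl')

-- Sum the range 1..n with a strict left fold; no mutable state needed.
count_fold :: Int -> Int
count_fold n = foldl' (+) 0 [1 .. n]

main :: IO ()
main = print (count_fold 10000000)

Thanks to list fusion, the [1 .. n] enumeration is usually not materialized at all, so the fold runs over plain machine integers.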
Bubble sort with vectors

Bubble sort is not an efficient sort algorithm, but because it's an in-place algorithm and simple, we will implement it as a demonstration of mutable vectors:

-- file: bubblesort.hs
import Control.Monad.ST
import Data.Vector as V
import Data.Vector.Mutable as MV
import System.Random (randomIO) -- for testing

The (naive) bubble sort compares values of all adjacent indices in order, and swaps the values if necessary. After reaching the last element, it starts from the beginning or, if no swaps were made, the list is sorted and the algorithm is done:

bubblesortM :: (Ord a, PrimMonad m) => MVector (PrimState m) a -> m ()
bubblesortM v = loop
  where
    indices = V.fromList [1 .. MV.length v - 1]
    loop = do
        swapped <- V.foldM' f False indices            -- (1)
        if swapped then loop else return ()            -- (2)
    f swapped i = do                                   -- (3)
        a <- MV.read v (i-1)
        b <- MV.read v i
        if a > b
            then MV.swap v (i-1) i >> return True
            else return swapped

At (1), we fold monadically over all but the last index, keeping state about whether or not we have performed a swap in this iteration. If we had, at (2) we rerun the fold or, if not, we can return. At (3) we compare an index and possibly swap values. We can write a pure function that wraps the stateful algorithm:

bubblesort :: Ord a => Vector a -> Vector a
bubblesort v = runST $ do
    mv <- V.thaw v
    bubblesortM mv
    V.freeze mv

V.thaw and V.freeze (both O(n)) can be used to go back and forth between mutable and immutable vectors. Now, there are multiple code optimization opportunities in our implementation of bubble sort. But before tackling those, let's see how well our straightforward implementation fares using the following main:

main = do
    v <- V.generateM 10000 $ \_ -> randomIO :: IO Double
    let v_sorted = bubblesort v
        median = v_sorted ! 5000
    print median

We should remember to compile with -O2. On my machine, this program takes about 1.55s, and the Runtime System reports 99.9% productivity, 18.7 megabytes of allocated heap, and 570 kilobytes copied during GC. So now, with a baseline, let's see if we can squeeze out more performance from vectors. This is a non-exhaustive list:

Use unboxed vectors instead. This restricts the types of elements we can store, but it saves us a level of indirection. Down to 960ms and approximately halved GC traffic.

Large lists are inefficient, and they don't compose with vector stream fusion. We should change indices so that it uses V.enumFromTo instead (alternatively, turn on the OverloadedLists extension and drop V.fromList). Down to 360ms and 94% less GC traffic.

The conversion functions V.thaw and V.freeze are O(n), that is, they modify copies. Using the in-place V.unsafeThaw and V.unsafeFreeze instead is sometimes useful. V.unsafeFreeze in the bubblesort wrapper is completely safe, but V.unsafeThaw is not. In our example, however, with -O2, the program is optimized into a single loop and all those conversions get eliminated.

The vector operations (V.read, V.swap) in bubblesortM are guaranteed to never be out of bounds, so it's perfectly safe to replace them with unsafe variants (V.unsafeRead, V.unsafeSwap) that don't check bounds. Speed-up of about 25 milliseconds, or 5%.

To summarize, by applying good practices and safe usage of unsafe functions, our bubble sort just got 80% faster. These optimizations are applied in the bubblesort-optimized.hs file (omitted here; a sketch of the resulting loop follows this section). We noticed that almost all GC traffic came from a linked list, which was constructed and immediately consumed. Lists are bad for performance in that they don't fuse like vectors.
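Since bubblesort-optimized.hs is omitted from this extract, the following is a sketch of what the loop might look like with the optimizations listed above applied: unboxed vectors, V.enumFromTo for the indices, and the unchecked read/swap variants. It is an illustration under those assumptions rather than the author's exact file.

-- Sketch of the optimized loop (not the book's bubblesort-optimized.hs).
import Control.Monad.ST
import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

bubblesortM :: (Ord a, V.Unbox a, PrimMonad m)
            => MV.MVector (PrimState m) a -> m ()
bubblesortM v = loop
  where
    indices = V.enumFromTo 1 (MV.length v - 1)   -- fuses; no intermediate list
    loop = do
        swapped <- V.foldM' f False indices
        if swapped then loop else return ()
    f swapped i = do
        a <- MV.unsafeRead v (i - 1)             -- indices are known to be in bounds
        b <- MV.unsafeRead v i
        if a > b
            then MV.unsafeSwap v (i - 1) i >> return True
            else return swapped

bubblesort :: (Ord a, V.Unbox a) => V.Vector a -> V.Vector a
bubblesort v = runST $ do
    mv <- V.thaw v
    bubblesortM mv
    V.unsafeFreeze mv   -- safe here: mv is not used after freezing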
To ensure good vector performance, make sure the fusion framework can work effectively: anything that can be done with a vector should be done with a vector. As a final note, when working with vectors (and other libraries), it's a good idea to keep the Haddock documentation handy. There are several big and small performance choices to be made, and often the difference is that between O(n) and O(1).

Speedup via continuation-passing style

Implementing monads in continuation-passing style (CPS) can have very good results. Unfortunately, no widely-used or supported library I'm aware of provides drop-in replacements for the ubiquitous Maybe, List, Reader, Writer, and State monads. However, it's not that hard to implement the standard monads in CPS from scratch. For example, the State monad can be implemented using the Cont monad from mtl as follows:

-- file: cont-state-writer.hs
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FlexibleContexts #-}

import Control.Monad.State.Strict
import Control.Monad.Cont

newtype StateCPS s r a = StateCPS (Cont (s -> r) a)
    deriving (Functor, Applicative, Monad, MonadCont)

instance MonadState s (StateCPS s r) where
    get = StateCPS $ cont $ \next curState -> next curState curState
    put newState = StateCPS $ cont $ \next curState -> next () newState

runStateCPS :: StateCPS s s () -> s -> s
runStateCPS (StateCPS m) = runCont m (\_ -> id)

In case you're not familiar with the continuation-passing style and the Cont monad, the details might not make much sense: instead of just returning results from a function, a function in CPS applies its results to a continuation. So in short, to "get" the state in continuation-passing style, we pass the current state to the "next" continuation (first argument) and don't change the state (second argument). To "put", we call the continuation with unit (no return value) and change the state to the new state (second argument to next). StateCPS is used just like the State monad:

action :: MonadState Int m => m ()
action = replicateM_ 1000000 $ do
    i <- get
    put $! i + 1

main = do
    print $ (runStateCPS action 0 :: Int)
    print $ (snd $ runState action 0 :: Int)

That action operation is, in the CPS version of the state monad, about 5% faster and performs 30% less heap allocation than the state monad from mtl. This program is limited pretty much only by the speed of monadic composition, so these numbers are at least very close to the maximum speedup we can get from CPSing the state monad. Speedups of the writer monad are probably near these results. Other standard monads can be implemented similarly to StateCPS. The definitions can also be generalized to monad transformers over an arbitrary monad (a la ContT). For extra speed, you might wish to combine many monads in a single CPS monad, similarly to what RWST does.

Summary

We witnessed the performance of the bytestring, text, and vector libraries, all of which get their speed from fusion optimizations, in contrast to linked lists, which have a huge overhead despite also being subject to fusion to some degree. However, linked lists give rise to simple difference lists and zippers. The builder patterns for lists, bytestring, and text were introduced. We discovered that the array package is low-level and clumsy compared to the superior vector package, unless you must support Haskell 98. We also saw how to implement bubble sort using vectors and how to speed up code via continuation-passing style.

Image classification and feature extraction from images

Packt
25 Oct 2013
3 min read
(For more resources related to this topic, see here.) Classifying images Automated Remote Sensing ( ARS ) is rarely ever done in the visible spectrum. The most commonly available wavelengths outside of the visible spectrum are infrared and near-infrared. The following scene is a thermal image (band 10) from a fairly recent Landsat 8 flyover of the US Gulf Coast from New Orleans, Louisiana to Mobile, Alabama. Major natural features in the image are labeled so you can orient yourself: Because every pixel in that image has a reflectance value, it is information. Python can "see" those values and pick out features the same way we intuitively do by grouping related pixel values. We can colorize pixels based on their relation to each other to simplify the image and view related features. This technique is called classification. Classifying can range from fairly simple groupings based only on some value distribution algorithm derived from the histogram to complex methods involving training data sets and even computer learning and artificial intelligence. The simplest forms are called unsupervised classifications, whereas methods involving some sort of training data to guide the computer are called supervised. It should be noted that classification techniques are used across many fields, from medical doctors trying to spot cancerous cells in a patient's body scan, to casinos using facial-recognition software on security videos to automatically spot known con-artists at blackjack tables. To introduce remote sensing classification we'll just use the histogram to group pixels with similar colors and intensities and see what we get. First you'll need to download the Landsat 8 scene here: http://geospatialpython.googlecode.com/files/thermal.zip Instead of our histogram() function from previous examples, we'll use the version included with NumPy that allows you to easily specify a number of bins and returns two arrays with the frequency as well as the ranges of the bin values. We'll use the second array with the ranges as our class definitions for the image. The lut or look-up table is an arbitrary color palette used to assign colors to classes. You can use any colors you want. import gdalnumeric # Input file name (thermal image) src = "thermal.tif" # Output file name tgt = "classified.jpg" # Load the image into numpy using gdal srcArr = gdalnumeric.LoadFile(src) # Split the histogram into 20 bins as our classes classes = gdalnumeric.numpy.histogram(srcArr, bins=20)[1] # Color look-up table (LUT) - must be len(classes)+1. # Specified as R,G,B tuples lut = [[255,0,0],[191,48,48],[166,0,0],[255,64,64], [255,115,115],[255,116,0],[191,113,48],[255,178,115], [0,153,153],[29,115,115],[0,99,99],[166,75,0], [0,204,0],[51,204,204],[255,150,64],[92,204,204],[38,153,38], [0,133,0],[57,230,57],[103,230,103],[184,138,0]] # Starting value for classification start = 1 # Set up the RGB color JPEG output image rgb = gdalnumeric.numpy.zeros((3, srcArr.shape[0], srcArr.shape[1],), gdalnumeric.numpy.float32) # Process all classes and assign colors for i in range(len(classes)): mask = gdalnumeric.numpy.logical_and(start <= srcArr, srcArr <= classes[i]) for j in range(len(lut[i])): rgb[j] = gdalnumeric.numpy.choose(mask, (rgb[j], lut[i][j])) start = classes[i]+1 # Save the image gdalnumeric.SaveArray(rgb.astype(gdalnumeric.numpy.uint8), tgt, format="JPEG") The following image is our classification output, which we just saved as a JPEG. 
We didn't specify the prototype argument when saving as an image, so it has no georeferencing information. This result isn't bad for a very simple unsupervised classification. The islands and coastal flats show up as different shades of green. The clouds were isolated as shades of orange and dark blues. We did have some confusion inland where the land features were colored the same as the Gulf of Mexico. We could further refine this process by defining the class ranges manually instead of just using the histogram.
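As a concrete follow-up to that last suggestion, here is a sketch of the same script with hand-picked class breaks instead of histogram-derived ones. The break values and colors below are invented for illustration (you would choose them by inspecting the thermal band's actual value range); everything else mirrors the code shown above.

# Sketch of the manual refinement suggested above; the class breaks and LUT
# colors are placeholders chosen for illustration only.
import gdalnumeric

src = "thermal.tif"
tgt = "classified_manual.jpg"

# Load the thermal band into a numpy array via gdal
srcArr = gdalnumeric.LoadFile(src)

# Hand-picked upper bounds for each class, replacing numpy.histogram() bins
classes = [10000, 12000, 14000, 15500, 16500, 18000, 65535]

# One R,G,B color per class
lut = [[0, 0, 128], [0, 128, 255], [0, 204, 0],
       [255, 255, 0], [255, 128, 0], [255, 0, 0], [255, 255, 255]]

start = 1
rgb = gdalnumeric.numpy.zeros((3, srcArr.shape[0], srcArr.shape[1]),
                              gdalnumeric.numpy.float32)

# Assign each pixel to the first class whose upper bound it does not exceed
for i in range(len(classes)):
    mask = gdalnumeric.numpy.logical_and(start <= srcArr, srcArr <= classes[i])
    for j in range(len(lut[i])):
        rgb[j] = gdalnumeric.numpy.choose(mask, (rgb[j], lut[i][j]))
    start = classes[i] + 1

gdalnumeric.SaveArray(rgb.astype(gdalnumeric.numpy.uint8), tgt, format="JPEG")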

ASP.Net Site Performance: Improving JavaScript Loading

Packt
28 Oct 2010
11 min read
  ASP.NET Site Performance Secrets Simple and proven techniques to quickly speed up your ASP.NET website Speed up your ASP.NET website by identifying performance bottlenecks that hold back your site's performance and fixing them Tips and tricks for writing faster code and pinpointing those areas in the code that matter most, thus saving time and energy Drastically reduce page load times Configure and improve compression – the single most important way to improve your site's performance Written in a simple problem-solving manner – with a practical hands-on approach and just the right amount of theory you need to make sense of it all           Read more about this book       One approach to improving page performance is to shift functionality from the server to the browser. Instead of calculating a result or validating a form in C# on the server, you use JavaScript code on the browser. A drawback of this approach is that it involves physically moving code from the server to the browser. Because JavaScript is not compiled, it can be quite bulky. This can affect page load times, especially if you use large JavaScript libraries. You're effectively trading off increased page load times against faster response times after the page has loaded. In this article by Matt Perdeck, author of ASP.NET Site Performance Secret, you'll see how to reduce the impact on page load times by the need to load JavaScript files. It shows: How JavaScript files can block rendering of the page while they are being loaded and executed How to load JavaScript in parallel with other resources How to load JavaScript more quickly (For more resources on ASP.Net, see here.) Problem: JavaScript loading blocks page rendering JavaScript files are static files, just as images and CSS files. However, unlike images, when a JavaScript file is loaded or executed using a <script> tag, rendering of the page is suspended. This makes sense, because the page may contain script blocks after the <script> tag that are dependent on the JavaScript file. If loading of a JavaScript file didn't block page rendering, the other blocks could be executed before the file had loaded, leading to JavaScript errors. Confirming with a test site You can confirm that loading a JavaScript file blocks rendering of the page by running the website in the folder JavaScriptBlocksRendering in the downloaded code bundle. This site consists of a single page that loads a single script, script1.js. It also has a single image, chemistry.png, and a stylesheet style1.css. It uses an HTTP module that suspends the working thread for five seconds when a JavaScript file is loaded. Images and CSS files are delayed by about two seconds. When you load the page, you'll see that the page content appears after only about five seconds. Then after two seconds, the image appears, unless you use Firefox, which often loads images in parallel with the JavaScript. If you make a Waterfall chart, you can see how the image and stylesheet are loaded after the JavaScript file, instead of in parallel: To get the delays, run the test site on IIS 7 in integrated pipeline mode. Do not use the Cassini web server built into Visual Studio. If you find that there is no delay, clear the browser cache. If that doesn't work either, the files may be in kernel cache on the server—remove them by restarting IIS using Internet Information Services (IIS) Manager. 
To open IIS manager, click on Start | Control Panel, type "admin" in the search box, click on Administrative Tools, and then double-click on Internet Information Services (IIS) Manager. Integrated/Classic Pipeline Mode As in IIS 6, every website runs as part of an application pool in IIS 7. Each IIS 7 application pool can be switched between Integrated Pipeline Mode (the default) and Classic Pipeline Mode. In Integrated Pipeline Mode, the ASP.NET runtime is integrated with the core web server, so that the server can be managed for example, via web.config elements. In Classic Pipeline Mode, IIS 7 functions more like IIS 6, where ASP.NET runs within an ISAPI extension. Approaches to reduce the impact on load times Although it makes sense to suspend rendering the page while a <script> tag loads or executes JavaScript, it would still be good to minimize the time visitors have to wait for the page to appear, especially if there is a lot of JavaScript to load. Here are a few ways to do that: Start loading JavaScript after other components have started loading, such as images and CSS files. That way, the other components load in parallel with the JavaScript instead of after the JavaScript, and so are available sooner when page rendering resumes. Load JavaScript more quickly. Page rendering is still blocked, but for less time. Load JavaScript on demand. Only load the JavaScript upfront that you need to render the page. Load the JavaScript that handles button clicks, and so on, when you need it. Use specific techniques to prevent JavaScript loading from blocking rendering. This includes loading the JavaScript after the page has rendered, or in parallel with page rendering. These approaches can be combined or used on their own for the best tradeoff between development time and performance. Let's go through each approach. Approach: Start loading after other components This approach aims to render the page sooner by loading CSS stylesheets and images in parallel with the JavaScript rather than after the JavaScript. That way, when the JavaScript has finished loading, the CSS and images will have finished loading too and will be ready to use; or at least it will take less time for them to finish loading after the JavaScript has loaded. To load the CSS stylesheets and images in parallel with the JavaScript, you would start loading them before you start loading the JavaScript. In the case of CSS stylesheets that is easy—simply place their <link> tags before the <script> tags: <link rel="Stylesheet" type="text/css" href="css/style1.css" /><script type="text/javascript" src="js/script1.js"></script> Starting the loading of images is slightly trickier because images are normally loaded when the page body is evaluated, not as a part of the page head. In the test page you just saw with the image chemistry.png, you can use a bit of simple JavaScript to get the browser to start loading the image before it starts loading the JavaScript file. This is referred to as "image preloading" (page PreLoadWithJavaScript.aspx in the folder PreLoadImages in the downloaded code bundle): <script type="text/javascript"> var img1 = new Image(); img1.src = "images/chemistry.png";</script><link rel="Stylesheet" type="text/css" href="css/style1.css" /><script type="text/javascript" src="js/script1.js"></script> Run the page now and you'll get the following Waterfall chart: When the page is rendered after the JavaScript has loaded, the image and CSS files have already been loaded; so the image shows up right away. 
A second option is to use invisible image tags at the start of the page body that preload the images. You can make the image tags invisible by using the style display:none. You would have to move the <script> tags from the page head to the page body, after the invisible image tags, as shown (page PreLoadWithCss.aspx in folder PreLoadImages in the downloaded code bundle):

<body>
    <div style="display:none">
        <img src="images/chemistry.png" />
    </div>
    <script type="text/javascript" src="js/script1.js"></script>

Although the examples we've seen so far preload only one image, chemistry.png, you could easily preload multiple images. When you do, it makes sense to preload the most important images first, so that they are most likely to appear right away when the page renders. The browser loads components, such as images, in the order in which they appear in the HTML, so you'd wind up with something similar to the following code:

<script type="text/javascript">
    var img1 = new Image(); img1.src = "images/important.png";
    var img2 = new Image(); img2.src = "images/notsoimportant.png";
    var img3 = new Image(); img3.src = "images/unimportant.png";
</script>

Approach: Loading JavaScript more quickly

The second approach is to simply spend less time loading the same JavaScript, so that visitors spend less time waiting for the page to render. There are a number of ways to achieve just that:

Techniques used with images, such as caching and parallel download
Free Content Delivery Networks
GZIP compression
Minification
Combining or breaking up JavaScript files
Removing unused code

Techniques used with images

JavaScript files are static files, just like images and CSS files. This means that many techniques that apply to images apply to JavaScript files as well, including the use of cookie-free domains, caching, and boosting parallel loading.

Free Content Delivery Networks

Serving static files from a Content Delivery Network (CDN) can greatly reduce download times by serving the files from a server that is close to the visitor. A CDN also saves you bandwidth, because the files are no longer served from your own server. A number of companies now serve popular JavaScript libraries from their CDNs for free. Here are their details:

Google AJAX Libraries API (http://code.google.com/apis/ajaxlibs/): serves a wide range of libraries, including jQuery, jQuery UI, Prototype, Dojo, and the Yahoo! User Interface Library (YUI)
Microsoft Ajax Content Delivery Network (http://www.asp.net/ajaxlibrary/cdn.ashx): serves libraries used by the ASP.NET and ASP.NET MVC frameworks, including the jQuery library and the jQuery Validation plugin
jQuery CDN (http://docs.jquery.com/Downloading_jQuery): serves the jQuery library

In ASP.NET 4.0 and later, you can get the ScriptManager control to load the ASP.NET AJAX script files from the Microsoft AJAX CDN instead of your web server, by setting the EnableCdn property to true:

<asp:ScriptManager ID="ScriptManager1" EnableCdn="true" runat="server" />

One issue with loading libraries from a CDN is that it creates another point of failure: if the CDN goes down, your site is crippled.

GZIP compression

IIS has the ability to compress content sent to the browser, including JavaScript and CSS files. Compression can make a dramatic difference to a JavaScript file as it goes over the wire from the server to the browser. Take, for example, the production version of the jQuery library: uncompressed it weighs in at 78 KB, while compressed it is only 26 KB. Compression for static files is enabled by default in IIS 7.
This immediately benefits CSS files. It should also immediately benefit JavaScript files, but it doesn't because of a quirk in the default configuration of IIS 7. Not all static files benefit from compression; for example JPEG, PNG, and GIF files are already inherently compressed because of their format. To cater to this, the IIS 7 configuration file applicationHost.config contains a list of mime types that get compressed when static compression is enabled: <staticTypes> <add mimeType="text/*" enabled="true" /> <add mimeType="message/*" enabled="true" /> <add mimeType="application/javascript" enabled="true" /> <add mimeType="*/*" enabled="false" /></staticTypes> To allow IIS to figure out what mime type a particular file has, applicationHost.config also contains default mappings from file extensions to mime types, including this one: <staticContent lockAttributes="isDocFooterFileName"> ... <mimeMap fileExtension=".js" mimeType="application/x-javascript" /> ...</staticContent> If you look closely, you'll see that the .js extension is mapped by default to a mime type that isn't in the list of mime types to be compressed when static file compression is enabled. The easiest way to solve this is to modify your site's web.config, so that it maps the extension .js to mime type text/javascript. This matches text/* in the list of mime types to be compressed. So, IIS 7 will now compress JavaScript files with the extension .js (folder Minify in the downloaded code bundle): <system.webServer> <staticContent> <remove fileExtension=".js" /> <mimeMap fileExtension=".js" mimeType="text/javascript" /> </staticContent></system.webServer> Keep in mind that IIS 7 only applies static compression to files that are "frequently" requested. This means that the first time you request a file, it won't be compressed! Refresh the page a couple of times and compression will kick in.
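One small addition on the CDN point made earlier: the single point of failure is commonly handled with a local fallback, along the following lines. This snippet is not from the book; the jQuery version and the local path are placeholders.

<!-- Try the CDN first, then fall back to a copy on your own server. -->
<script type="text/javascript"
        src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"></script>
<script type="text/javascript">
    // If the CDN request failed, window.jQuery is still undefined,
    // so load the local copy instead.
    if (typeof jQuery === "undefined") {
        document.write('<script type="text/javascript" src="js/jquery.min.js"><\/script>');
    }
</script>

This way a CDN outage costs you only the failed request; the page still gets its library, just without the CDN's caching and proximity benefits.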

Getting Places

Packt
13 Oct 2015
8 min read
In this article by Nafiul Islam, the author of Mastering Pycharm, we'll learn all about navigation. It is divided into three parts. The first part is called Omni, which deals with getting to anywhere from any place. The second is called Macro, which deals with navigating to places of significance. The third and final part is about moving within a file and it is called Micro. By the end of this article, you should be able to navigate freely and quickly within PyCharm, and use the right tool for the job to do so. Veteran PyCharm users may not find their favorite navigation tool mentioned or explained. This is because the methods of navigation described throughout this article will lead readers to discover their own tools that they prefer over others. (For more resources related to this topic, see here.) Omni In this section, we will discuss the tools that PyCharm provides for a user to go from anywhere to any place. You could be in your project directory one second, the next, you could be inside the Python standard library or a class in your file. These tools are generally slow or at least slower than more precise tools of navigation provided. Back and Forward The Back and Forward actions allow you to move your cursor back to the place where it was previously for more than a few seconds or where you've made edits. This information persists throughout sessions, so even if you exit the IDE, you can still get back to the positions that you were in before you quit. This falls into the Omni category because these two actions could potentially get you from any place within a file to any place within a file in your directory (that you have been to) to even parts of the standard library that you've looked into as well as your third-party Python packages. The Back and Forward actions are perhaps two of my most used navigation actions, and you can use Keymap. Or, one can simply click on the Navigate menu to see the keyboard shortcuts: Macro The difference between Macro and Omni is subtle. Omni allows you to go to the exact location of a place, even a place of no particular significance (say, the third line of a documentation string) in any file. Macro, on the other hand, allows you to navigate anywhere of significance, such as a function definition, class declaration, or particular class method. Go to definition or navigate to declaration Go to definition is the old name for Navigate to Declaration in PyCharm. This action, like the one previously discussed, could lead you anywhere—a class inside your project or a third party library function. What this action does is allow you to go to the source file declaration of a module, package, class, function, and so on. Keymap is once again useful in finding the shortcut for this particular action. Using this action will move your cursor to the file where the class or function is declared, may it be in your project or elsewhere. Just place your cursor on the function or class and invoke the action. Your cursor will now be directly where the function or class was declared. There is, however, a slight problem with this. If one tries to go to the declaration of a .so object, such as the datetime module or the select module, what one will encounter is a stub file (discussed in detail later). These are helper files that allow PyCharm to give you the code completion that it does. Modules that are .so files are indicated by a terminal icon, as shown here: Search Everywhere The action speaks for itself. You search for classes, files, methods, and even actions. 
Universally invoked using double Shift (pressing Shift twice in quick succession), this nifty action looks similar to any other search bar. Search Everywhere searches only inside your project, by default; however, one can also use it to search non-project items as well. Not using this option leads to faster search and a lower memory footprint. Search Everywhere is a gateway to other search actions available in PyCharm. In the preceding screenshot, one can see that Search Everywhere has separate parts, such as Recent Files and Classes. Each of these parts has a shortcut next to their section name. If you find yourself using Search Everywhere for Classes all the time, you might start using the Navigate Class action instead which is much faster. The Switcher tool The Switcher tool allows you to quickly navigate through your currently open tabs, recently opened files as well as all of your panels. This tool is essential since you always navigate between tabs. A star to the left indicates open tabs; everything else is a recently opened or edited file. If you just have one file open, Switcher will show more of your recently opened files. It's really handy this way since almost always the files that you want to go to are options in Switcher. The Project panel The Project panel is what I use to see the structure of my project as well as search for files that I can't find with Switcher. This panel is by far the most used panel of all, and for good reason. The Project panel also supports search; just open it up and start typing to find your file. However, the Project panel can give you even more of an understanding of what your code looks similar to if you have Show Members enabled. Once this is enabled, you can see the classes as well as the declared methods inside your files. Note that search works just like before, meaning that your search is limited to only the files/objects that you can see; if you collapse everything, you won't be able to search either your files or the classes and methods in them. Micro Micro deals with getting places within a file. These tools are perhaps what I end up using the most in my development. The Structure panel The Structure panel gives you a bird's eye view of the file that you are currently have your cursor on. This panel is indispensable when trying to understand a project that one is not familiar with. The yellow arrow indicates the option to show inherited fields and methods. The red arrow indicates the option to show field names, meaning if that it is turned off, you will only see properties and methods. The orange arrow indicates the option to scroll to and from the source. If both are turned on (scroll to and scroll from), where your cursor is will be synchronized with what method, field, or property is highlighted in the structure panel. Inherited fields are grayed out in the display. Ace Jump This is my favorite navigation plugin, and was made by John Lindquist who is a developer at JetBrains (creators of PyCharm). Ace Jump is inspired from the Emacs mode with the same name. It allows you to jump from one place to another within the same file. Before one can use Ace Jump, one has to install the plugin for it. Ace Jump is usually invoked using Ctrl or command + ; (semicolon). You can search for Ace Jump in Keymap as well, and is called Ace Jump. Once invoked, you get a small box in which you can input a letter. Choose a letter from the word that you want to navigate to, and you will see letters on that letter pop up immediately. 
If we were to hit D, the cursor would move to the position indicated by D. This might seem long-winded, but it actually leads to really fast navigation. If we wanted to select the word indicated by the letter, then we'd invoke Ace Jump twice before entering a letter. This turns the Ace Jump box red. Upon hitting B, the named parameter rounding will be selected. Often, we don't want to go to a word, but rather to the beginning or the end of a line. To do this, just invoke Ace Jump and then hit the left arrow for line beginnings or the right arrow for line endings. In this case, we'd just hit V to jump to the beginning of the line that starts with num_type. The second example shows the opposite case, where we hit the right arrow instead of the left one and get line-ending options.

Summary

In this article, I discussed some of the best tools for navigation. This is by no means an exhaustive list. However, these tools will serve as a gateway to the more precise tools available for navigation in PyCharm. I generally use Ace Jump, Back, Forward, and Switcher the most when I write code. The Project panel is always open for me, with the most used files having their classes and methods expanded for quick search.

Exploring Compilers

Packt
23 Jun 2017
17 min read
In this article by Gabriele Lanaro, author of the book Python High Performance - Second Edition, we will see that, although Python is a mature and widely used language, there is a large interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter. In this article, we will explore two projects--Numba and PyPy--that approach compilation in a slightly different way. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically. (For more resources related to this topic, see here.)

Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain. LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly, which can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution. Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.

Numba is evolving really fast and can have substantial improvements between releases and, sometimes, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid any errors. The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i]**2
    return result

To set up this function with Numba, it is sufficient to apply the nb.jit decorator:

import numba as nb

@nb.jit
def sum_sq(a):
    ...

The nb.jit decorator won't do much when applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, performant version of the original function. To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute.
The timings for the two functions are as follows: import numpy as np x = np.random.rand(10000) # Original %timeit sum_sq.py_func(x) 100 loops, best of 3: 6.11 ms per loop # Numba %timeit sum_sq(x) 100000 loops, best of 3: 11.7 µs per loop You can see how the Numba version is order of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators: %timeit (x**2).sum() 10000 loops, best of 3: 14.8 µs per loop In this case, the Numba compiled function is marginally faster than NumPy vectorized operations. The reason for the extra speed of the Numba version is likely that the NumPy version allocates an extra array before performing the sum in comparison with the in-place operations performed by our sum_sq function. As we didn't use array-specific methods in sum_sq, we can also try to apply the same function on a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension: x_list = x.tolist() %timeit sum_sq(x_list) 1000 loops, best of 3: 199 µs per loop %timeit sum([x**2 for x in x_list]) 1000 loops, best of 3: 1.28 ms per loop Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it's no wonder that what Numba does looks like magic. In the following sections, we will dig deeper and understand how Numba works and evaluate the benefits and limitations of the Numba compiler. Type specializations As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example. Numba exposes the specialized types using the signatures attribute. Right after the sum_sq definition, we can inspect the available specialization by accessing the sum_sq.signatures, as follows: sum_sq.signatures # Output: # [] If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly. If we also apply the function on an array of float32, we can see how a new entry is added to the sum_sq.signatures list: x = np.random.rand(1000).astype('float64') sum_sq(x) sum_sq.signatures # Result: # [(array(float64, 1d, C),)] x = np.random.rand(1000).astype('float32') sum_sq(x) sum_sq.signatures # Result: # [(array(float64, 1d, C),), (array(float32, 1d, C),)] It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function. An individual signature can be passed as a tuple that contains the type we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument: @nb.jit((nb.float64[:],)) def sum_sq(a): Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass an array, x, as float32, Numba will raise a TypeError: sum_sq(x.astype('float32')) # TypeError: No matching definition for argument type(s) array(float32, 1d, C) Another way to declare signatures is through type strings. 
For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using a [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows: @nb.jit("float64(float64[:])") def sum_sq(a): You can also pass multiple signatures by passing a list: @nb.jit(["float64(float64[:])", "float64(float32[:])"]) def sum_sq(a): Object mode versus native mode So far, we have shown how Numba behaves when handling a fairly simple function. In this case, Numba worked exceptionally well, and we obtained great performance on arrays and lists.The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython. When Numba cannot infer variable types, it will still try and compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode and is in contrast to the intepreter-free scenario, called native mode. Numba provides a function, called inspect_types, that helps understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function: sum_sq.inspect_types() When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and types associated with them. For example, we can examine the N = len(a) line: # --- LINE 4 --- # a = arg(0, name=a) :: array(float64, 1d, A) # $0.1 = global(len: <built-in function len>) :: Function(<built-in function len>) # $0.3 = call $0.1(a) :: (array(float64, 1d, A),) -> int64 # N = $0.3 :: int64 N = len(a) For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return type of the len function is also correctly identified (and likely optimized) as taking an array of float64 numbers and returning an int64. If you scroll through the output, you can see how all the variables have a well-defined type. Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode. As a counter example, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations. We can implement a function that concatenates a series of strings, and compiles it as follows: @nb.jit def concatenate(strings): result = '' for s in strings: result += s return result Now, we can invoke this function with a list of strings and inspect the types: concatenate(['hello', 'world']) concatenate.signatures # Output: [(reflected list(str),)] concatenate.inspect_types() Numba will return the output of the function for the reflected list (str) type. We can, for instance, examine how line 3 gets inferred. 
The output of concatenate.inspect_types() is reproduced here:

# --- LINE 3 ---
# strings = arg(0, name=strings) :: pyobject
# $const0.1 = const(str, ) :: pyobject
# result = $const0.1 :: pyobject
# jump 6
# label 6

result = ''

You can see how, this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled functions, we note that the compiled function is about three times slower than its pure Python counterpart:

x = ['hello'] * 1000

%timeit concatenate.py_func(x)
10000 loops, best of 3: 111 µs per loop

%timeit concatenate(x)
1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call. As you may have noted, Numba compiled the code without complaint even though it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for the remaining parts. This compilation strategy is called object mode.

It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on the first invocation:

@nb.jit(nopython=True)
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result

concatenate(x)
# Exception:
# TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and for ensuring that all the code is fast and correctly typed.

Numba and NumPy

Numba was originally developed to easily increase the performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast ufuncs. We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Universal functions that take multiple arguments also work according to the broadcasting rules; examples of such functions are np.add and np.maximum.

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance it with the broadcasting feature. As an example, we will see how to write the Cantor pairing function. A pairing function encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations. The Cantor pairing function can be written as follows:

import numpy as np

def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

@np.vectorize
def cantor(a, b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

cantor(np.array([1, 2]), 2)
# Result:
# array([ 8, 12])

Except for the convenience, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead.
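The comparison that follows refers to a Numba-compiled cantor ufunc, but its definition is not reproduced in this excerpt. The sketch below shows one plausible way to write it with nb.vectorize; the explicit int64 signature and the sample input arrays are assumptions made purely for illustration:

import numpy as np
import numba as nb

@nb.vectorize(['int64(int64, int64)'])
def cantor(a, b):
    # Integer form of the same pairing formula; (a + b) * (a + b + 1) is always
    # even, so the floor division is exact.
    return (a + b) * (a + b + 1) // 2 + b

# Illustrative inputs; the x1 and x2 arrays used in the benchmark are not shown in the text.
x1 = np.random.randint(0, 1000, size=100000)
x2 = np.random.randint(0, 1000, size=100000)
result = cantor(x1, x2)  # behaves like any other NumPy ufunc, broadcasting included

Because the decorated function is compiled as a true NumPy ufunc, it gains broadcasting for free, along with the optional target keyword discussed next.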
For this reason, ufunc implementations are usually written in C or Cython, but Numba beats both of these methods in convenience. All that is needed to perform the conversion is the equivalent decorator, nb.vectorize. In the following code, we compare the speed of the standard np.vectorize version (called cantor_py), the Numba ufunc version (cantor), and the same computation implemented using standard NumPy operations:

# Pure Python
%timeit cantor_py(x1, x2)
100 loops, best of 3: 6.06 ms per loop

# Numba
%timeit cantor(x1, x2)
100000 loops, best of 3: 15 µs per loop

# NumPy
%timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well here because the function is simple and type inference is possible. An additional advantage of universal functions is that, since they operate on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions through the target keyword argument of the nb.vectorize decorator (for example, target="parallel" for multithreaded CPU execution, or target="cuda" for GPUs).

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays. A classic example is matrix multiplication. In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

a = np.random.rand(3, 3)
b = np.random.rand(3, 3)

c = np.matmul(a, b)
c.shape
# Result:
# (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over arrays of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match up the matrices pairwise and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

a = np.random.rand(10, 3, 3)
b = np.random.rand(10, 3, 3)

c = np.matmul(a, b)
c.shape
# Output:
# (10, 3, 3)

The usual rules for broadcasting work in a similar way. For example, if we have an array of (3, 3) matrices with a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix is repeated to obtain a shape of (10, 3, 3):

a = np.random.rand(10, 3, 3)
b = np.random.rand(3, 3)  # Broadcast to shape (10, 3, 3)

c = np.matmul(a, b)
c.shape
# Result:
# (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.
The nb.guvectorize decorator requires two arguments:

- The types of the inputs and output: two 1D arrays as input and a scalar as output
- The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n) and output a scalar

In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator:

@nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n),(n)->()')
def euclidean(a, b, out):
    N = a.shape[0]
    out[0] = 0.0
    for i in range(N):
        out[0] += (a[i] - b[i])**2

There are a few very important points to be made. Predictably, we declared the types of the inputs a and b as float64[:], because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:].

Similarly, the layout string indicates that we have two arrays of size (n) and that the output is a scalar, denoted by empty brackets, (). However, the array out will be passed as an array of size 1. Also, note that we don't return anything from the function; all the output has to be written to the out array.

The letter n in the layout string is completely arbitrary; you may choose to use k or any other letter of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings such as (n, m).

Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example:

a = np.random.rand(2)
b = np.random.rand(2)
c = euclidean(a, b)  # Shape: (1,)

a = np.random.rand(10, 2)
b = np.random.rand(10, 2)
c = euclidean(a, b)  # Shape: (10,)

a = np.random.rand(10, 2)
b = np.random.rand(2)
c = euclidean(a, b)  # Shape: (10,)

How does the speed of euclidean compare to standard NumPy? In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function:

a = np.random.rand(10000, 2)
b = np.random.rand(10000, 2)

%timeit ((a - b)**2).sum(axis=1)
1000 loops, best of 3: 288 µs per loop

%timeit euclidean(a, b)
10000 loops, best of 3: 35.6 µs per loop

The Numba version, again, beats the NumPy version by a large margin!

Summary

Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application.

Resources for Article:

Further resources on this subject: Getting Started with Python Packages [article], Python for Driving Hardware [article], Python Data Science Up and Running [article]

An Overview of Tomcat 6 Servlet Container: Part 1

Packt
18 Jan 2010
11 min read
In practice, it is highly unlikely that you will interface an EJB container from WebSphere and a JMS implementation from WebLogic with the Tomcat servlet container from the Apache foundation, but it is at least theoretically possible. Note that the term 'interface', as it is used here, also encompasses abstract classes. The specification's API might provide a template implementation whose operations are defined in terms of some basic set of primitives that are kept abstract for the service provider to implement. A service provider is required to make available concrete implementations of these interfaces and abstract classes. For example, the HttpSession interface is implemented by Tomcat in the form of org.apache.catalina.session.StandardSession.

Let's examine the image of the Tomcat container:

The objective of this article is to cover the primary request processing components that are present in this image. Advanced topics, such as clustering and security, are shown as shaded in this image and are not covered. In this image, the '+' symbol after the Service, Host, Context, and Wrapper instances indicates that there can be one or more of these elements. For instance, a Service may have a single Engine, but an Engine can contain one or more Hosts. In addition, the whirling circle represents a pool of request processor threads. Here, we will fly over the architecture of Tomcat from a 10,000-foot perspective, taking in the sights as we go.

Component taxonomy

Tomcat's architecture follows the construction of a Matrushka doll from Russia. In other words, it is all about containment, where one entity contains another, and that entity in turn contains yet another. In Tomcat, a 'container' is a generic term that refers to any component that can contain another, such as a Server, Service, Engine, Host, or Context.

Of these, the Server and Service components are special containers, designated as Top Level Elements, as they represent aspects of the running Tomcat instance. All the other Tomcat components are subordinate to these top level elements. The Engine, Host, and Context components are officially termed Containers, and refer to components that process incoming requests and generate an appropriate outgoing response.

Nested Components can be thought of as sub-elements that can be nested inside either Top Level Elements or other Containers to configure how they function. Examples of nested components include the Valve, which represents a reusable unit of work; the Pipeline, which represents a chain of Valves strung together; and a Realm, which helps set up container-managed security for a particular container. Other nested components include the Loader, which is used to enforce the specification's guidelines for servlet class loading; the Manager, which supports session management for each web application; the Resources component, which represents the web application's static resources and a mechanism to access these resources; and the Listener, which allows you to insert custom processing at important points in a container's life cycle, such as when a component is being started or stopped. Not all nested components can be nested within every container.

A final major component, which falls into its own category, is the Connector. It represents the connection end point that an external client (such as a web browser) can use to connect to the Tomcat container. Before we go on to examine these components, let's take a quick look at how they are organized structurally.
Note that this diagram only shows the key properties of each container.

When Tomcat is started, the Java Virtual Machine (JVM) instance in which it runs will contain a singleton Server top level element, which represents the entire Tomcat server. A Server will usually contain just one Service object, which is a structural element that combines one or more Connectors (for example, an HTTP and an HTTPS connector) that funnel incoming requests through to a single Catalina servlet Engine. The Engine represents the core request processing code within Tomcat and supports the definition of multiple Virtual Hosts within it. A virtual host allows a single running Tomcat engine to make it seem to the outside world that there are multiple separate domains (for example, www.my-site.com and www.your-site.com) being hosted on a single machine.

Each virtual host can, in turn, support multiple web applications known as Contexts that are deployed to it. A context is represented using the web application format specified by the servlet specification, either as a single compressed WAR (Web Application Archive) file or as an uncompressed directory. In addition, a context is configured using a web.xml file, as defined by the servlet specification. A context can, in turn, contain multiple servlets that are deployed into it, each of which is wrapped in a Wrapper component. The Server, Service, Connector, Engine, Host, and Context elements that will be present in a particular running Tomcat instance are configured using the server.xml configuration file.

Architectural benefits

This architecture has a couple of useful features. It not only makes it easy to manage component life cycles (each component manages the life cycle notifications for its children), but also to dynamically assemble a running Tomcat server instance based on the information that has been read from configuration files at startup. In particular, the server.xml file is parsed at startup, and its contents are used to instantiate and configure the defined elements, which are then assembled into a running Tomcat instance. The server.xml file is read only once, and edits to it will not be picked up until Tomcat is restarted.

This architecture also eases the configuration burden by allowing child containers to inherit the configuration of their parent containers. For instance, a Realm defines a data store that can be used for authentication and authorization of users who are attempting to access protected resources within a web application. For ease of configuration, a realm that is defined for an engine applies to all its child hosts and contexts. At the same time, a particular child, such as a given context, may override its inherited realm by specifying its own realm to be used in place of its parent's realm.

Top Level Components

The Server and Service container components exist largely as structural conveniences. A Server represents the running instance of Tomcat and contains one or more Service children, each of which represents a collection of request processing components.

Server

A Server represents the entire Tomcat instance and is a singleton within a Java Virtual Machine, and it is responsible for managing the life cycle of its contained services. The following image depicts the key aspects of the Server component. As shown, a Server instance is configured using the server.xml configuration file. The root element of this file is <Server> and represents the Tomcat instance.
Its default implementation is provided by org.apache.catalina.core.StandardServer, but you can specify your own custom implementation through the className attribute of the <Server> element.

A key aspect of the Server is that it opens a server socket on port 8005 (the default) to listen for a shutdown command (by default, this command is the text string SHUTDOWN). When this shutdown command is received, the server gracefully shuts itself down. For security reasons, the connection requesting the shutdown must be initiated from the same machine that is running this instance of Tomcat.

A Server also provides an implementation of the Java Naming and Directory Interface (JNDI) service, allowing you to register arbitrary objects (such as data sources) or environment variables by name. At runtime, individual components (such as servlets) can retrieve this information by looking up the desired object name in the server's JNDI bindings. While a JNDI implementation is not integral to the functioning of a servlet container, it is part of the Java EE specification and is a service that servlets have a right to expect from their application servers or servlet containers. Implementing this service makes for easy portability of web applications across containers.

While there is always just one server instance within a JVM, it is entirely possible to have multiple server instances running on a single physical machine, each encased in its own JVM. Doing so insulates web applications that are running on one VM from errors in applications that are running on others, and simplifies maintenance by allowing a JVM to be restarted independently of the others. This is one of the mechanisms used in a shared hosting environment (the other is virtual hosting, which we will see shortly) where you need isolation from other web applications that are running on the same physical server.

Service

While the Server represents the Tomcat instance itself, a Service represents the set of request processing components within Tomcat. A Server can contain more than one Service, where each service associates a group of Connector components with a single Engine. Requests from clients are received on a connector, which in turn funnels them through into the engine, which is the key request processing component within Tomcat. The image shows connectors for HTTP, HTTPS, and the Apache JServ Protocol (AJP).

There is very little reason to modify this element, and the default Service instance is usually sufficient. A hint as to when you might need more than one Service instance can be found in the above image. As shown, a service aggregates connectors, each of which monitors a given IP address and port, and responds in a given protocol. An example use case for having multiple services, therefore, is when you want to partition your services (and their contained engines, hosts, and web applications) by IP address and/or port number. For instance, you might configure your firewall to expose the connectors for one service to an external audience, while restricting your other service to hosting intranet applications that are visible only to internal users. This would ensure that an external user could never access your intranet application, as that access would be blocked by the firewall. The Service, therefore, is nothing more than a grouping construct. It does not currently add any other value to the proceedings.

Connectors

A Connector is a service endpoint on which a client connects to the Tomcat container.
It serves to insulate the engine from the various communication protocols that are used by clients, such as HTTP, HTTPS, or the Apache JServ Protocol (AJP).

Tomcat can be configured to work in two modes: standalone, or in conjunction with a separate web server. In standalone mode, Tomcat is configured with HTTP and HTTPS connectors, which make it act like a full-fledged web server by serving up static content when requested, as well as by delegating to the Catalina engine for dynamic content. Out of the box, Tomcat provides three possible implementations of the HTTP/1.1 and HTTPS connectors for this mode of operation. The most common are the standard connectors, known as Coyote, which are implemented using standard Java I/O mechanisms. You may also make use of a couple of newer implementations: one that uses the non-blocking NIO features of Java 1.4, and another that takes advantage of native code optimized for a particular operating system through the Apache Portable Runtime (APR). Note that both the Connector and the Engine run in the same JVM. In fact, they run within the same Server instance.

In conjunction mode, Tomcat plays a supporting role to a web server, such as Apache httpd or Microsoft's IIS. The client here is the web server, communicating with Tomcat either through an Apache module or an ISAPI DLL. When this module determines that a request must be routed to Tomcat for processing, it will communicate the request to Tomcat using AJP, a binary protocol that is designed to be more efficient than the text-based HTTP when communicating between a web server and Tomcat. On the Tomcat side, an AJP connector accepts this communication and translates it into a form that the Catalina engine can process. In this mode, Tomcat runs in its own JVM as a separate process from the web server.

In either mode, the primary attributes of a Connector are the IP address and port on which it will listen for incoming requests, and the protocol that it supports. Another key attribute is the maximum number of request processing threads that can be created to concurrently handle incoming requests. Once all these threads are busy, any incoming request will have to wait until a thread becomes available. By default, a connector listens on all the IP addresses of the given physical machine (its address attribute defaults to 0.0.0.0). However, a connector can be configured to listen on just one of the IP addresses of a machine. This will constrain it to accept connections from only that specified IP address.

Any request that is received by any one of a service's connectors is passed on to the service's single engine. This engine, known as Catalina, is responsible for processing the request and generating the response. The engine returns the response to the connector, which then transmits it back to the client using the appropriate communication protocol.
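As a closing aside, the shutdown listener described in the Server section above is easy to exercise by hand. The following short Python sketch sends the shutdown command to a locally running Tomcat; port 8005 and the SHUTDOWN string are the defaults quoted earlier and must match whatever your server.xml actually configures (in practice, you would normally use the shutdown scripts that ship with Tomcat):

import socket

# Tomcat's documented defaults; adjust these if server.xml defines a
# different shutdown port or command string.
SHUTDOWN_PORT = 8005
SHUTDOWN_COMMAND = "SHUTDOWN"

# The connection must originate from the same machine that runs Tomcat.
with socket.create_connection(("localhost", SHUTDOWN_PORT), timeout=5) as conn:
    conn.sendall(SHUTDOWN_COMMAND.encode("ascii"))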

Functional Testing with JMeter

Packt
22 Oct 2009
5 min read
JMeter is a 100% pure Java desktop application. JMeter is found to be very useful and convenient in support of functional testing. Although JMeter is known more as a performance testing tool, functional testing elements can be integrated within the Test Plan, which was originally designed to support load testing. Many other load-testing tools provide little or none of this feature, restricting themselves to performance-testing purposes. Besides integrating functional-testing elements along with load-testing elements in the Test Plan, you can also create a Test Plan that runs these exclusively. In other words, aside from creating a Load Test Plan, JMeter also allows you to create a Functional Test Plan. This flexibility is certainly resource-efficient for the testing project. In this article by Emily H. Halili, we will give you a walkthrough on how to create a Test Plan as we incorporate and configure JMeter elements to support functional testing.

Preparing for Functional Testing

JMeter does not have a built-in browser, unlike many functional-test tools. It tests at the protocol layer, not the client layer (that is, JavaScript, applets, and so on), and it does not render the page for viewing. Although embedded resources can be downloaded by default, rendering them in the Listener | View Results Tree may not yield a 100% browser-like rendering. In fact, it may not be able to render large HTML files at all. This makes it difficult to test the GUI of an application under test. However, to compensate for these shortcomings, JMeter allows the tester to create assertions based on the tags and text of the page as the HTML file is received by the client. With some knowledge of HTML tags, you can test and verify any elements as you would expect them in the browser.

It is unnecessary to select a specific workload time to perform a functional test. In fact, the application you want to test may even reside locally, with your own machine acting as the "localhost" server for your web application. For this article, we will limit ourselves to selected functional aspects of the page that we seek to verify or assert.

Using JMeter Components

We will create a Test Plan in order to demonstrate how we can configure the Test Plan to include functional testing capabilities. The modified Test Plan will include these scenarios:

1. Create Account: a new visitor creating an account
2. Login User: a user logging in to an account

Following these scenarios, we will simulate various entries and form submissions as a request to a page is made, while checking for the correct page response to these user entries. We will add assertions to the samples following these scenarios to verify the 'correctness' of a requested page. In this manner, we can see whether the pages respond correctly to invalid data. For example, we would like to check that the page responds with the correct warning message when a user enters an invalid password, or whether a request returns the correct page.

First of all, we will create a series of test cases following the various user actions in each scenario. The test cases may be designed as follows:

CREATE ACCOUNT

Test 1: Go to Home page.
Data: www.packtpub.com
Expected: Home page loads and renders with no page error.

Test 2: Click the Your Account link (top right).
Data: User action
Expected: 1. Your Account page loads and renders with no page error. 2. Logout link is not found.

Test 3: No Password. Enter email address in the Email text field; click the Create Account and Continue button.
Data: email=EMAIL
Expected: 1. Your Account page resets with the warning message "Please enter password". 2. Logout link is not found.

Test 4: Short Password. Enter email address in the Email text field; enter password in the Password text field; enter password in the Confirm Password text field; click the Create Account and Continue button.
Data: email=EMAIL, password=SHORT_PWD, confirm password=SHORT_PWD
Expected: 1. Your Account page resets with the warning message "Your password must be 8 characters or longer". 2. Logout link is not found.

Test 5: Unconfirmed Password. Enter email address in the Email text field; enter password in the Password text field; enter password in the Confirm Password text field; click the Create Account and Continue button.
Data: email=EMAIL, password=VALID_PWD, confirm password=INVALID_PWD
Expected: 1. Your Account page resets with the warning message "Password does not match". 2. Logout link is not found.

Test 6: Register Valid User. Enter email address in the Email text field; enter password in the Password text field; enter password in the Confirm Password text field; click the Create Account and Continue button.
Data: email=EMAIL, password=VALID_PWD, confirm password=VALID_PWD
Expected: 1. Logout link is found. 2. Page redirects to the User Account page. 3. Message found: You are registered as: e:<EMAIL>.

Test 7: Click the Logout link.
Data: User action
Expected: 1. Logout link is NOT found.

LOGIN USER

Test 1: Click Home page.
Data: User action
Expected: 1. WELCOME tab is active.

Test 2: Log in with Wrong Password. Enter email in the Email text field; enter password in the Password text field; click the Login button.
Data: email=EMAIL, password=INVALID_PWD
Expected: 1. Logout link is NOT found. 2. Page refreshes. 3. Warning message "Sorry your password was incorrect" appears.

Test 3: Log in with Non-Existent Account. Enter email in the Email text field; enter password in the Password text field; click the Login button.
Data: email=INVALID_EMAIL, password=INVALID_PWD
Expected: 1. Logout link is NOT found. 2. Page refreshes. 3. Warning message "Sorry, this does not match any existing accounts. Please check your details and try again or open a new account below" appears.

Test 4: Log in with Valid Account. Enter email in the Email text field; enter password in the Password text field; click the Login button.
Data: email=EMAIL, password=VALID_PWD
Expected: 1. Logout link is found. 2. Page reloads. 3. Login successful message "You are logged in as:" appears.

Test 5: Click the Logout link.
Data: User action
Expected: 1. Logout link is NOT found.
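JMeter expresses checks like the ones in these tables as assertions inside a Test Plan rather than as code. Still, the underlying idea of testing at the protocol layer (fetch the page over HTTP and look for expected text in the raw response) can be prototyped in a few lines of Python using only the standard library. The function below is not JMeter; it is a hedged sketch of the same kind of response-text assertion, and the URL and expected string are placeholders standing in for the pages and warning messages listed above:

import urllib.request

def page_contains(url, expected_text):
    # Fetch the page at the protocol level (no rendering, just like JMeter)
    # and check whether the expected text appears in the response body.
    with urllib.request.urlopen(url) as response:
        body = response.read().decode("utf-8", errors="replace")
    return expected_text in body

# Placeholder URL and text; substitute the real page and the warning message
# from the test case (for example, "Please enter password").
ok = page_contains("https://www.example.com/", "Example Domain")
print("Assertion passed" if ok else "Assertion failed")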

QGIS Feature Selection Tools

Packt
05 Dec 2014
4 min read
In this article by Anita Graser, the author of Learning QGIS, Third Edition, we will cover the following topics:

Selecting features with the mouse
Selecting features using expressions
Selecting features using spatial queries

(For more resources related to this topic, see here.)

Selecting features with the mouse

The first group of tools in the Attributes toolbar allows us to select features on the map using the mouse. The following screenshot shows the Select Feature(s) tool. We can select a single feature by clicking on it, or select multiple features by drawing a rectangle. The other tools can be used to select features by drawing different shapes: polygons, freehand areas, or circles around the features. All features that intersect with the drawn shape are selected. Holding down the Ctrl key will add the new selection to an existing one. Similarly, holding down Ctrl + Shift will remove the new selection from the existing selection.

Selecting features by expression

The second type of select tool is called Select by Expression, and it is also available in the Attributes toolbar. It selects features based on expressions that can contain references and functions using feature attributes and/or geometry. The list of available functions is pretty long, but we can use the search box to filter the list by name to find the function we are looking for faster. On the right-hand side of the window, we will find Selected Function Help, which explains the functionality and how to use the function in an expression. The Function List option also shows the layer attribute fields, and by clicking on Load all unique values or Load 10 sample values, we can easily access their content. As with the mouse tools, we can choose between creating a new selection or adding to or deleting from an existing selection. Additionally, we can choose to only select features from within an existing selection.

Let's have a look at some example expressions that you can build on and use in your own work:

Using the lakes.shp file in our sample data, we can, for example, select big lakes with an area bigger than 1,000 square miles using a simple attribute query, "AREA_MI" > 1000.0, or using geometry functions such as $area > (1000.0 * 27878400). Note that the lakes.shp CRS uses feet, and we therefore have to multiply by 27,878,400 to convert from square feet to square miles. The dialog will look like the one shown in the following screenshot.

We can also work with string functions, for example, to find lakes with long names, such as length("NAMES") > 12, or lakes with names that contain the s or S character, such as lower("NAMES") LIKE '%s%', which first converts the names to lowercase and then looks for any appearance of s.

Selecting features using spatial queries

The third type of tool is called Spatial Query and allows us to select features in one layer based on their location relative to the features in a second layer. These tools can be accessed by going to Vector | Research Tools | Select by location, and by going to Vector | Spatial Query | Spatial Query. Enable it in Plugin Manager if you cannot find it in the Vector menu. In general, we want to use the Spatial Query plugin, as it supports a variety of spatial operations such as crosses, equals, intersects, is disjoint, overlaps, touches, and contains, depending on the layer's geometry type.

Let's test the Spatial Query plugin using railroads.shp and pipelines.shp from the sample data.
For example, we might want to find all the railroad features that cross a pipeline; we will, therefore, select the railroads layer, the Crosses operation, and the pipelines layer. After clicking on Apply, the plugin presents us with the query results. There is a list of IDs of the result features on the right-hand side of the window, as you can see in the following screenshot. Below this list, we can select the Zoom to item checkbox, and QGIS will zoom to the feature that belongs to the selected ID. Additionally, the plugin offers buttons to directly save all the resulting features to a new layer.

Summary

This article introduced you to three ways to select features in QGIS: selecting features with the mouse, using expressions, and using spatial queries.

Resources for Article:

Further resources on this subject: Editing attributes [article], Server Logs [article], Improving proximity filtering with KNN [article]
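As a final note for readers who like to script their workflows, the expression-based selections described earlier can also be driven from the QGIS Python console. The snippet below is a rough sketch rather than part of this article's walkthrough: it assumes QGIS 2.16 or newer, where the selectByExpression() method is available (older releases need QgsFeatureRequest with a filter expression instead), and it assumes the lakes layer from the sample data is the active layer:

# Run from the QGIS Python console with the lakes layer active.
from qgis.utils import iface  # already defined in the console; shown for completeness

layer = iface.activeLayer()

# Same expression as in the Select by Expression dialog:
# select lakes with an area larger than 1,000 square miles.
layer.selectByExpression('"AREA_MI" > 1000.0')
print('{} features selected'.format(layer.selectedFeatureCount()))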

Running Your Applications with AWS

Cheryl Adams
17 Aug 2016
4 min read
If you've ever been told not to run with scissors, you should not have the same concern when running with AWS. It is neither dangerous nor unsafe when you know what you are doing and where to look when you don't. Amazon's current service offering, AWS (Amazon Web Services), is a collection of services, applications, and tools that can be used to deploy your infrastructure and application environment to the cloud. Amazon gives you the option to start with a 'free tier' and then move toward a pay-as-you-go model. We will highlight a few of the features you will see when you open your account with AWS.

One of the first things you will notice is that Amazon offers a bulk of information regarding cloud computing right up front. Whether you are a novice, an amateur, or an expert in cloud computing, Amazon offers documented information before you create your account. This type of information is essential if you are exploring this tool for a project or doing some self-study on your own. If you are a pre-existing Amazon customer, you can use the same account to get started with AWS. If you want to keep your personal account separate from your development or business account, it would be best to create a separate one.

Amazon Web Services Landing Page

The Free Tier is one of the most attractive features of AWS. As a new account holder, you are entitled to twelve months within the Free Tier. In addition to this span of time, there are services that can continue after the free tier is over. This gives the user ample time to explore the offerings within this free-tier period. The caution here is not to exceed the free service limits, as doing so will incur charges. Setting up the free tier still requires a credit card. Fee-based services will be offered throughout the free tier, so it is important not to select a fee-based service unless you are ready to start paying for it. Actual paid use will vary based on what you have selected.

AWS Service and Offerings (shown on an open account)

AWS overview of services available on the landing page

Amazon's service list is very robust. If you are already considering AWS, hopefully this means you are aware of what you need, or at least what you would like to use. If not, this would be a good time to press pause and look at some resource-based materials. Before the clock starts ticking on your free tier, I would recommend a slow walk through the introductory information on this site to ensure that you are selecting the right mix of services before creating your account.

Amazon's technical resources include a 10-minute tutorial that gives you a complete overview of the services. Topics like 'AWS Training and Introduction' and 'Get Started with AWS' include a list of 10-minute videos as well as a short list of 'how to' instructions for some of the more commonly used features. If you are a techie by trade or hobby, this may be something you want to dive into immediately. In a company, generally there is a predefined need or issue that the organization may feel can be resolved by the cloud. If it is a team initiative, it would be good to review the resources mentioned in this article so that everyone is on the same page as to what this solution can do. It's recommended that before you start any trial, subscription, or new service, you have a set goal or expectation of why you are doing it. Simply stated, a cloud solution is not the perfect solution for everyone. There is so much information here on the AWS site.
It's also great if you are comparing competing cloud service vendors in the same space. You will be able to do a complete assessment of most services within the free tier. You can map use case scenarios to determine whether AWS is the right fit for your project. AWS First Project is a great place to get started if you are new to AWS. If you are wondering how to get started, these technical resources will set you in the right direction. By reviewing this information during your setup, or before you start, you will be able to make good use of your first few months and your introduction to AWS.

About the author

Cheryl Adams is a senior cloud data and infrastructure architect in the healthcare data realm. She is also the co-author of Professional Hadoop by Wrox.

Cassandra Design Patterns

Packt
22 Sep 2015
18 min read
In this article by Rajanarayanan Thottuvaikkatumana, author of the book Cassandra Design Patterns, Second Edition, the author discusses how Apache Cassandra is one of the most popular NoSQL data stores. He states this based on the research paper Dynamo: Amazon's Highly Available Key-Value Store and the research paper Bigtable: A Distributed Storage System for Structured Data. Cassandra is implemented with the best features from both of these research papers. In general, NoSQL data stores can be classified into the following groups:

Key-value data store
Column family data store
Document data store
Graph data store

Cassandra belongs to the column family data store group. Cassandra's peer-to-peer architecture avoids single points of failure in the cluster of Cassandra nodes and gives the ability to distribute the nodes across racks or data centres. This makes Cassandra a linearly scalable data store. In other words, the more processing you need, the more Cassandra nodes you can add to your cluster. Cassandra's multi data centre support makes it a perfect choice to replicate the data stores across data centres for disaster recovery, high availability, separating transaction processing and analytical environments, and for building resiliency into the data store infrastructure.

Design patterns in Cassandra

The term "design patterns" is a highly misinterpreted term in the software development community. In an extremely general sense, it is a set of solutions for some known problems in quite a specific context. It is used in this book to describe a pattern of using certain features of Cassandra to solve some real-world problems. This book is a collection of such design patterns with real-world examples.

Coexistence patterns

Cassandra is one of the most successful NoSQL data stores, and it is greatly similar to the traditional RDBMS. Cassandra column families (also known as Cassandra tables), in a logical perspective, have a similarity with RDBMS-based tables in the view of the users, even though the underlying structures of these tables are totally different. Because of this, Cassandra is a good fit to be deployed along with a traditional RDBMS to solve some of the problems that the RDBMS is not able to handle. The caveat here is that, because of the similarity of RDBMS tables and Cassandra column families in the view of the end users, many users and data modelers try to use Cassandra in exactly the same way as an RDBMS schema is modeled and used, and they get into serious deployment issues. How do you prevent such pitfalls? The key here is to understand the differences from a theoretical perspective as well as a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.

Where do you start with Cassandra? The best place to look is the new application development requirements, and take it from there. Look at the cases where there is a need to de-normalize the RDBMS tables and keep all the data items together, which would have got distributed if you were to design the same solution in an RDBMS. Instead of thinking from the pure data model perspective, start thinking in terms of the application's perspective: how the data is generated by the application, what the read requirements are, what the write requirements are, what response time is expected for some of the use cases, and so on. Depending on these aspects, design the data model.
In the big data world, the application becomes the first class citizen, and the data model leaves the driving seat in the application design. Design the data model to serve the needs of the applications.

In any organization, new reporting requirements come up all the time. The major challenge in generating reports is the underlying data store. In the RDBMS world, reporting is always a challenge. You may have to join multiple tables to generate even simple reports. Even though RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data for the reports, when the report is being generated, the query plan is going to be very complex most of the time. The consumption of processing power is another factor to consider when generating such reports on the fly. Because of these complexities, for reporting requirements it is common to keep separate tables containing data exported from the transactional tables. This is a great opportunity to start with NoSQL stores like Cassandra as a reporting data store.

Data aggregation and summarization are common requirements in any organization. They help to control data growth by storing only the summary statistics and moving the transactional data into archives. Many times, this aggregated and summarized data is used for statistical analysis. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting go hand in hand. The aggregated data is heavily used in reports, and the aggregation process speeds up the queries to a great extent. This is another place where you can start with NoSQL stores like Cassandra.

The coexistence of RDBMS and NoSQL data stores like Cassandra is very much possible, feasible, and sensible, and this is the only way to get started with the NoSQL movement, unless you embark on a totally new product development from scratch. In summary, this section of the book discusses some design patterns related to de-normalization, reporting, and aggregation of data, using Cassandra as the preferred NoSQL data store.

RDBMS migration patterns

A big bang approach to any kind of technology migration is not advisable. A series of deliberations have to happen before the eventual and complete changeover. Migration from RDBMS to Cassandra is no different. Any new technology replacing an old one must coexist harmoniously with it, at least for a short period of time. This gives the stakeholders a lot of confidence in the new technology. Many technology pundits suggest various approaches to RDBMS to NoSQL migration. Many such guidelines are specific to particular NoSQL data stores, giving attention to specific areas, and most of the time they end up focusing on the process rather than the technology. The migration from RDBMS to Cassandra is not an easy task, mainly because RDBMS-based systems are time tested and trustworthy in most organizations. So, migrating from such a robust RDBMS-based system to Cassandra is not going to be easy for anyone.

One of the best approaches to achieve this goal is to exploit some of the new or unique features in Cassandra that many of the traditional RDBMSes don't have. This also prevents Cassandra from being used just like any other RDBMS. Cassandra is unique. Cassandra is not an RDBMS. The approach of banking on the unique features is not only applicable to the RDBMS to Cassandra migration, but also to any migration from one paradigm to another.
Some of the design patterns that are discussed in this section of the book revolve around very simple but important features of Cassandra, which have profound application potential when designing the next generation of NoSQL data stores using Cassandra. A wise usage of these unique features of Cassandra will give a head start on the eventual and complete migration from RDBMS.

The modeling of collection objects in an RDBMS is a real pain, because multiple tables have to be defined and a join is required to access the data. Many RDBMSes address this by providing the capability to define user-defined data types, but there is absolutely no standardization in this space. Collection objects are very commonly seen in real-world applications. A list of actions, a tuple of related values, a set of objects, dictionaries, and things like that come up quite often in applications. Cassandra has elegant ways to model these, because collections are data types of the column family.

Counting is a very commonly required operation in many business processes and applications. In an RDBMS, this has to be modeled with integers or long numbers, and many times applications make big mistakes by using them in the wrong ways. Cassandra has a counter data type in the column family that alleviates this problem.

Getting rid of unwanted records from an RDBMS table is not an automatic process. When some application events occur, the records have to be removed by application programs or through some other means. But in many situations, many data items will have a preallocated time to live, and they should go away without the intervention of any external event. Cassandra has a way to assign a time-to-live (TTL) attribute to data items. By making use of TTL, data items get removed without any external event's intervention.

All the design patterns covered in this section of the book revolve around some of the new features of Cassandra that will make the migration from RDBMS to Cassandra an easy task.

Cache migration pattern

Database access, whether it is from an RDBMS or from other highly distributed NoSQL data stores, is always an input/output (I/O) intensive operation. It makes perfect sense to cache the frequently used, but reasonably static, data for fast access by the applications consuming it. In such situations, an in-memory cache is preferred to repeated database access for each request. Using a cache is not always a pleasant experience, though. Getting into really weird problems, such as data loss, data getting out of sync with its source, and other data integrity problems, is very common. It is also very common to see the wrong components coming into the enterprise solution stack for various reasons. Overlooking some of the features and adopting a technology without much background work is a very common pitfall.

Many times, a cache comes into the solution stack to reduce the latency of the responses. Once the initial results are favorable, more and more data gets tossed into the cache. Slowly, pushing more and more data into the cache becomes the practice, and this is when problems start popping up one by one. Pure in-memory cache solutions are favored by everybody, by virtue of their ability to serve data quickly, until you start losing data because of faults in the system, along with application and node crashes. A cache serves data much faster than it can be served from other data stores.
But if the caching solution in use is giving data integrity problems, it is better to migrate to NoSQL data stores like Cassandra. Is Cassandra faster than in-memory caching solutions? The obvious answer is no, but it is not as bad as many think. Cassandra can be configured to serve fast reads, and a bonus comes in the form of high data integrity with strong replication capabilities. A cache is good as long as it serves its purpose without any data loss or other data integrity issues. This section of the book emphasizes the use case of a key/value type cache and discusses various methods of cache-to-NoSQL migration. Cassandra cannot be used as a replacement for a cache in terms of the speed of data access. But when it comes to data integrity, Cassandra shines all the time with its tunable consistency feature. With continual tuning, and by manipulating data with clean and well-written application code, data access can be improved to a great extent, and it will be much better than many other data stores. The design pattern covered in this section of the book gives some guidance on migrating from caching solutions to Cassandra, if this is a must.

CAP patterns

When it comes to large-scale Internet applications or web services, popularly known as Internet of Things (IoT) applications, the number of components is huge and the way they are distributed is beyond imagination. There will be hundreds of application servers, hundreds of data store nodes, and many other components in the whole ecosystem. In such a scenario, performing an atomic transaction by getting an agreement from all the components involved is, for all practical purposes, impossible. Consistency, availability, and partition tolerance are three important guarantees, popularly known as the CAP guarantees, that any distributed computing system should offer, even though it is not possible to provide all of them simultaneously.

In IoT applications, the distribution of the application nodes is unavoidable. This means that the possibility of a network partition is pretty much there. So, it is mandatory to give the P guarantee. Now, the question is whether to forfeit the C guarantee or the A guarantee. At this stage, the situation is not as grave as portrayed in the CAP theorem conjectured by Eric Brewer. For the use cases in a given IoT application, there is no need to have a 100% C guarantee and a 100% A guarantee at the same time. So, depending on the level of the A guarantee needed, the C guarantee can be tuned. In other words, this is called tunable consistency. Depending on the way data is ingested into Cassandra, and the way it is consumed from Cassandra, tuning is possible to give the best results for the read and write requirements of the applications.

In some applications, the speed at which the data is written will be very high. In other words, the velocity of the data ingestion into Cassandra is very high. These fall into the category of write-heavy applications. In some applications, the need to read data quickly will be an important requirement. This is mainly needed in applications where there is a lot of data processing required. Data analytics applications, batch processing applications, and so on fall under this category; these are the read-heavy applications. Now, there is a third category of applications where there is equal importance for fast writes as well as fast reads.
These are the kind of applications where there is a constant inflow of data and, at the same time, a need for clients to read the data for various purposes. These fall into the category of read-write balanced applications. The consistency level requirements for the three types of applications described previously are totally different. There is no single way of tuning that is optimal for all three; the consistency levels have to be tuned differently from use case to use case. In this section of the book, various design patterns related to applications with needs for fast writes, fast reads, and balanced writes and reads are discussed. All these design patterns revolve around using the tunable consistency parameters of Cassandra. Whether for writes or reads, if the consistency levels are set high, the availability levels will be low, and vice versa. So, by making use of the consistency level knob, the Cassandra data store can be used for various types of writing and reading use cases.

Temporal patterns

In any application, the usage of data that varies over a period of time, called temporal data, is very important. Temporal data is needed wherever there is a need to maintain chronology. There are many applications in which there is a huge need for the storage, retrieval, and processing of data that is tied to time. The biggest challenge in dealing with temporal data stored in a data store is that it is heavily used for analytical purposes and is retrieved based on various time-based sort orders. So, the data stores that are used to capture temporal data should be capable of storing the data while strictly adhering to the chronology.

Many usage patterns seen in the real world show temporal behavior. For classification purposes in this book, they are bucketed into three categories. The first is the general time series category. The second is the log category, such as an audit log, a transaction log, and so on. The third is the conversation category, such as the conversation messages of a chat application. There is relevance in this classification, because these categories are commonly seen across many applications. In many applications, these are really cross-cutting concerns; designers underestimate this aspect, and in the end many applications have different data stores capturing this temporal data. There is a need for a common strategy for dealing with temporal data that falls into these three commonly seen categories in an enterprise-wide solution architecture. In other words, there should be a uniform way of capturing temporal data, a uniform way of processing temporal data, and a commonly used set of tools and libraries to manage the temporal data.

Out of the three design patterns that are discussed in this section of the book, the first, the Time Series pattern, is a general design pattern that covers the most general behavior of any kind of temporal data. The next two design patterns, namely the Log pattern and the Conversation pattern, are special cases of the first. This section of the book covers the general nature of temporal data, some specific instances of such data items in real-world applications, and why Cassandra is the best fit as a NoSQL data store to persist temporal data. Temporal data comes up quite often in the use cases of many applications.
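Before looking at how these temporal patterns are modeled, a small sketch may help fix the idea. It uses the DataStax Python driver against a single local node; the keyspace, table, and column names are invented purely for illustration, and the TTL on the insert simply echoes the expiry feature mentioned earlier:

from datetime import datetime
from cassandra.cluster import Cluster

# Node address, keyspace, and table are illustrative assumptions.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sensor_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per sensor per day keeps each wide row bounded; the clustering
# column keeps the readings in reverse chronological order within the partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data.readings (
        sensor_id text,
        day text,
        event_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# Optionally let a reading expire after 30 days by attaching a TTL (in seconds).
session.execute(
    "INSERT INTO sensor_data.readings (sensor_id, day, event_time, value) "
    "VALUES (%s, %s, %s, %s) USING TTL 2592000",
    ('s1', '2015-09-22', datetime.utcnow(), 21.5)
)

Partitioning by (sensor_id, day) and clustering by event_time are exactly the kinds of choices the next paragraph calls out as the levers for building high-performing temporal models.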
Data modeling of temporal data is very important from the Cassandra perspective, for optimal storage and quick access to the data. Some common design patterns to model temporal data have been covered in this section of the book. By focusing on a very few aspects, such as the partition key, primary key, clustering columns, and the number of records that get stored in a wide row of Cassandra, very effective and high-performing temporal data models can be built.

Analytical patterns

The 3Vs of big data, namely Volume, Variety, and Velocity, pose another big challenge, which is the analysis of the data stored in NoSQL data stores such as Cassandra. What are the analytics use cases? How can the distributed data be processed? What are the data transformations that are typically seen in applications? These are the topics covered in this section of the book. Unlike the other sections of the book, the focus shifts from Cassandra to other technologies, such as Apache Hadoop, Hadoop MapReduce, and Apache Spark, to introduce the big data analytics tool space. Design patterns such as the Map/Reduce pattern and the Transformation pattern are very commonly seen in the data analytics world. Cassandra has good compatibility with Apache Spark, and together they form an ideal tool set for data analysis use cases.

This section of the book covers some data analysis aspects and mainly discusses data processing. Data transformation is one of the major activities in data processing. Out of the many data processing patterns, the Map/Reduce pattern deserves a special mention, because it is used in so many batch processing and analysis use cases dealing with big data. Spark has been chosen as the tool of choice to explain the data processing activities. This section explains how a Map/Reduce kind of data processing task can be done using Cassandra. Spark, which is very powerful for performing online data analysis, has also been discussed. This section of the book also covers some of the commonly seen data transformations that are used in data processing applications.

Summary

Many Cassandra design patterns have been covered in this book. If a design pattern is not used in any real-world application, it has only theoretical value. To give a practical approach to the applicability of these design patterns, an end-to-end application is taken as a case in point and described in the last chapter of the book, which is used as a vehicle to explain the applicability of the Cassandra design patterns discussed in the earlier sections of the book.

Users love Cassandra because of its SQL-like interface, CQL, and because its features are very closely related to those of an RDBMS, even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market, which let them write applications in their preferred programming language. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administrators love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And Cassandra works! An application based on Cassandra will be perfect only if its features are used in the right way, and this book is an attempt to guide the Cassandra community in this direction.
Resources for Article:

Further resources on this subject: Cassandra Architecture [article], Getting Up and Running with Cassandra [article], Getting Started with Apache Cassandra [article]