
How-To Tutorials

7018 Articles
Packt
14 Sep 2015
9 min read

Apache Spark

In this article by Mike, author of the book Mastering Apache Spark, many Hadoop-based tools built on a Hadoop CDH cluster are introduced.

His premise, when approaching any big data system, is that none of the components exists in isolation. A big data system has many functions to address, with components passing data along an ETL (Extract, Transform, and Load) chain or calling subcomponents to carry out processing. Some of these functions are:

• Data movement
• Scheduling
• Storage
• Data acquisition
• Real-time data processing
• Batch data processing
• Monitoring
• Reporting

This list is not exhaustive, but it gives you an idea of the functional areas that are involved. For instance, HDFS (Hadoop Distributed File System) might be used for storage, Oozie for scheduling, Hue for monitoring, and Spark for real-time processing. His point, though, is that none of these systems exists in isolation: they either sit in an ETL chain when processing data and rely on other subcomponents, as Oozie does, or depend on other components to provide functionality that they do not have.

His contention is that integration between big data systems is an important factor. One needs to consider where the data is coming from, how it will be processed, and where it is going next. Given this, the integration options for a big data component need to be investigated both in terms of what is available now and what might be available in the future.

In the book, the author has divided the system functionality by chapter and tried to determine what tools might be available to carry out each function. Then, with the help of simple examples using code and data, he has shown how the systems might be used together.
The book is based upon Apache Spark, so as you might expect, it investigates the four main functional modules of Spark:

• MLlib for machine learning
• Streaming for data stream processing
• SQL for data processing in a tabular format
• GraphX for graph-based processing

However, the book attempts to extend these common, real-time big data processing areas by examining extra areas such as graph-based storage and real-time cloud-based processing via Databricks. It provides examples of integration with external tools, such as Kafka and Flume, as well as Scala-based development examples. In order to spark your interest and prepare you for the book's contents, he has described the contents of the book by subject and given you a sample of the content.

Overview

The introduction sets the scene for the book by examining topics such as Spark cluster design and the choice of cluster managers. It considers the issues affecting cluster performance and explains how real-time big data processing can be carried out in the cloud. The following diagram describes the topics that are explained in the book:

Spark Streaming examples are provided, along with details of checkpointing to avoid data loss. Installation and integration examples are provided for Kafka (messaging) and Flume (data movement). The functionality of Spark MLlib is extended via 0xdata H2O, and an example deep learning neural system is created and tested. Spark SQL is investigated and integrated with Hive to show that Spark can become a real-time processing engine for Hive. Spark storage is considered by example, using Aurelius (Datastax) Titan along with underlying storage in HBase and Cassandra. The use of Tinkerpop and the Gremlin shell is explained by example for graph processing. Finally, of course, many methods of integrating Spark with HDFS are shown by example.

This gives you a flavor of what is in the book, but it doesn't give you the detail.
Keep reading to find out what is in each area.

Spark MLlib

The Spark MLlib chapter examines data classification with Naïve Bayes, data clustering with K-Means, and neural processing with an ANN (Artificial Neural Network). If these terms do not mean anything to you, don't worry: they are explained both in theory and in practice, with examples. The author has always been interested in neural networks and was pleased to be able to base the ANN section on the work of Bert Greevenbosch (www.bertgreevenbosch.nl). This allows him to show how Apache Spark can be built from source code and extended with extra functionality in the same process. The following diagram shows a real, biological neuron on the left and a simulated neuron on the right. The chapter also explains, step by step, how computational neurons are modeled on the real neurons in your head. It then goes on to describe how neural networks are created and how processing takes place. The integration of big data systems and neural processing is an interesting topic.

Spark Streaming

An important issue when processing stream-based data is failure recovery. The book examines error recovery and checkpointing with the help of an Apache Spark example. It also provides examples of TCP, file, Flume, and Kafka-based stream processing using Spark. Even with step-by-step, code-based examples, data stream processing can become complicated, so he has tried to reduce complexity so that learning does not become a challenge. For example, when introducing a Kafka-based example, the following diagram is used to explain the test components, the data flow, and the component setup in a logical, step-by-step manner:

Spark SQL

When introducing Spark SQL, he has described the data file formats that might be used to assist with data integration. He then moves on to describe, with the help of an example, the use of data frames, followed closely by practical SQL examples.
Finally, integration with Apache Hive is introduced to provide big data warehouse real-time processing by example. User-defined functions are also explained, showing how they can be defined in multiple ways and used with Spark SQL.

Spark GraphX

Graph processing is examined by showing how a simple graph can be created in Scala. Then, sample graph algorithms such as PageRank and Triangles are introduced. With permission from Kenny Bastani (http://www.kennybastani.com/), the Mazerunner prototype application is discussed. A step-by-step approach is described by which Docker, Neo4j, and Mazerunner can be installed. Then, the functionality of both Neo4j and Mazerunner is used to move data between Neo4j and HDFS. The following diagram gives an overview of the architecture that will be introduced:

Spark storage

Apache Spark is a highly functional, real-time, distributed big data processing system. However, it does not provide any data storage. In many places within the book, examples are provided using HDFS-based storage, but what if you want graph-based storage? What if you want to process and store data as a graph? The Aurelius (Datastax) Titan graph database is examined in the book. The underlying storage options, Cassandra and HBase, are used with Scala examples. Graph-based processing is examined using Tinkerpop and Gremlin-based scripts. Using a simple, example-based approach, both the architecture involved and multiple ways of using the Gremlin shell are introduced in the following diagram:

Spark H2O

While Apache Spark is highly functional and agile, allowing data to move easily between its modules, how might we extend it? By considering the H2O product from http://h2o.ai/, the machine learning functionality of Apache Spark can be extended. H2O plus Spark equals Sparkling Water. Sparkling Water is used to create a deep learning neural processing example for data processing.
The H2O web-based Flow application is also introduced for analytics and data investigation.

Spark Databricks

Having created big data processing clusters on physical machines, the next logical step is to move processing into the cloud. This might be carried out by obtaining cloud-based storage, using Spark as a cloud-based service, or using a Spark-based management system. The people who designed Apache Spark have created a Spark cloud-based processing platform called Databricks (https://databricks.com/). He has dedicated two chapters of the book to this service because he feels that it is important to investigate future trends. All aspects of Databricks are examined, from user and cluster management to the use of Notebooks for data processing. The languages that can be used are investigated, as are ways of developing code on local machines and then moving it to the cloud in order to save money. Data import is examined with examples, as is the DbUtils package for data processing. The REST interface for Spark cloud instance management is investigated because it offers integration options between your potential cloud instance and external systems. Finally, options for moving data and functionality are investigated in terms of data and folder import/export, along with library import and cluster creation on demand.

Databricks visualisation

The various options for cloud-based big data visualization using Databricks are investigated. Multiple ways of creating reports with the help of tables and SQL bar graphs are described. Pie charts and world maps are used to present data. Databricks allows geolocation data to be combined with your raw data to create geographical real-time charts. The following figure, taken from the book, shows the result of a worked example that combines GeoNames data with geolocation data. The color-coded, country-based data counts are the result.
It's difficult to demonstrate this in a book, but imagine this map based upon stream-based data and continuously updating in real time. In a similar way, it is possible to create dashboards from your Databricks reports and make them available to your external customers via a web-based URL.

Summary

Mike hopes that this article has given you an idea of the book's contents, and that it has intrigued you enough to seek out a copy of Mastering Apache Spark and try out all of these examples for yourself. The book comes with a code package that provides the example-based sample code, as well as build and execution scripts. This should provide you with an easy start and a platform on which to build your own Spark-based code.
Packt
14 Sep 2015
10 min read

PostgreSQL in Action

In this article by Salahadin Juba, Achim Vannahme, and Andrey Volkov, authors of the book Learning PostgreSQL, we will discuss PostgreSQL. PostgreSQL (pronounced Post-Gres-Q-L), or Postgres, is an open source, object-relational database management system. It emphasizes extensibility, creativity, and compatibility. It competes with major relational database systems, such as Oracle, MySQL, SQL Server, and others. It is used by different sectors, including government agencies and the public and private sectors. It is cross-platform and runs on most modern operating systems, including Windows, Mac, and Linux flavors. It conforms to SQL standards and is ACID compliant.

An overview of PostgreSQL

PostgreSQL has many rich features. It provides enterprise-level services, including performance and scalability. It has a very supportive community and very good documentation.

The history of PostgreSQL

The name PostgreSQL comes from the post-Ingres database. The history of PostgreSQL can be summarized as follows:

Academia: University of California at Berkeley (UC Berkeley)

• 1977-1985, Ingres project: Michael Stonebraker created an RDBMS according to the formal relational model.
• 1986-1994, postgres: Michael Stonebraker created postgres in order to support complex data types and the object-relational model.
• 1995, Postgres95: Andrew Yu and Jolly Chen replaced the postgres query language (PostQUEL) with an extended subset of SQL.

Industry

• 1996, PostgreSQL: Several developers dedicated a lot of labor and time to stabilizing Postgres95. The first open source version was released on January 29, 1997. With the introduction of new features and enhancements, and with the start of the open source project, the Postgres95 name was changed to PostgreSQL.

PostgreSQL began at version 6, a very strong starting point that took advantage of several years of research and development.
Being open source and having a very good reputation, PostgreSQL has attracted hundreds of developers. Currently, PostgreSQL has innumerable extensions and a very active community.

Advantages of PostgreSQL

PostgreSQL provides many features that attract developers, administrators, architects, and companies.

Business advantages of PostgreSQL

• PostgreSQL is free, open source software (OSS); it has been released under the PostgreSQL license, which is similar to the BSD and MIT licenses. The PostgreSQL license is highly permissive, and PostgreSQL is not subject to monopoly or acquisition. This gives a company the following advantages: there is no licensing cost associated with PostgreSQL, the number of PostgreSQL deployments is unlimited, and the business model is more profitable.
• PostgreSQL is SQL standards compliant, so finding professional developers is not very difficult. PostgreSQL is easy to learn, and porting code from another database vendor to PostgreSQL is cost efficient. PostgreSQL administrative tasks are also easy to automate. Thus, the staffing cost is significantly reduced.
• PostgreSQL is cross-platform, and it has drivers for all modern programming languages, so there is no need to change the company policy about the software stack in order to use PostgreSQL.
• PostgreSQL is scalable and has high performance.
• PostgreSQL is very reliable; it rarely crashes. PostgreSQL is also ACID compliant, which means that it can tolerate some hardware failure. In addition, it can be configured and installed as a cluster to ensure high availability (HA).

User advantages of PostgreSQL

PostgreSQL is very attractive for developers, administrators, and architects; it has rich features that enable developers to perform tasks in an agile way. The following are some attractive features for the developer:

• There is a new release almost every year; starting from Postgres95, there have been 23 major releases so far.
• Very good documentation and an active community enable developers to find and solve problems quickly. The PostgreSQL manual is over 2,500 pages long.
• A rich extension repository enables developers to focus on the business logic. It also enables developers to meet requirement changes easily.
• The source code is available free of charge, and it can be customized and extended without a huge effort.
• Rich clients and administrative tools enable developers to perform routine tasks, such as describing database objects, exporting and importing data, and dumping and restoring databases, very quickly.
• Database administration tasks do not require a lot of time and can be automated.
• PostgreSQL can be integrated easily with other database management systems, giving software architects good flexibility when designing software.

Applications of PostgreSQL

PostgreSQL can be used for a variety of applications. The main PostgreSQL application domains can be classified into two categories:

• Online transactional processing (OLTP): OLTP is characterized by a large number of CRUD operations, very fast processing of operations, and maintaining data integrity in a multiaccess environment. The performance is measured in the number of transactions per second.
• Online analytical processing (OLAP): OLAP is characterized by a small number of requests, complex queries that involve data aggregation, and a huge amount of data from different sources and in different formats, as well as data mining and historical data analysis.

OLTP is used to model business operations, such as customer relationship management (CRM). OLAP applications are used for business intelligence, decision support, reporting, and planning. An OLTP database is relatively small compared to an OLAP database. OLTP normally follows relational model concepts, such as normalization, when designing the database, while OLAP is less relational and the schema is often star shaped.
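To make the OLTP/OLAP contrast concrete, here is a minimal sketch. It is an illustration only: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a PostgreSQL server, and the table, columns, and function names (orders, customer, amount, place_order, revenue_by_customer) are made up for this example rather than taken from the book.

```python
import sqlite3

# In-memory SQLite database, used purely as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)


def place_order(customer, amount):
    """OLTP-style work: a short transaction that touches a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
    return cur.lastrowid


def revenue_by_customer():
    """OLAP-style work: one read-only query that aggregates many rows."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return dict(rows.fetchall())


# Many small writes (OLTP) followed by one aggregate read (OLAP).
for customer, amount in [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]:
    place_order(customer, amount)

report = revenue_by_customer()
print(report)
```

The write path is measured by how many such small transactions complete per second, whereas the report query is judged by how well it scans and aggregates a large volume of rows, which is why the two workloads tend to be tuned differently.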
Unlike OLTP, the main operation of OLAP is data retrieval. OLAP data is often generated by a process called Extract, Transform, and Load (ETL). ETL is used to load data into the OLAP database from different data sources and different formats. PostgreSQL can be used out of the box for OLTP applications. For OLAP, there are many extensions and tools to support it, such as the PostgreSQL COPY command and Foreign Data Wrappers (FDW).

Success stories

PostgreSQL is used in many application domains, including communication, media, geographical, and e-commerce applications. Many companies provide consultation as well as commercial services, such as migrating proprietary RDBMSs to PostgreSQL in order to cut licensing costs. These companies often influence and enhance PostgreSQL by developing and submitting new features. The following are a few companies that use PostgreSQL:

• Skype uses PostgreSQL to store user chats and activities. Skype has also contributed to PostgreSQL by developing a set of tools called Skytools.
• Instagram is a social networking service that enables its users to share pictures and photos. Instagram has more than 100 million active users.
• The American Chemical Society (ACS): More than one terabyte of data for their journal archive is stored using PostgreSQL.

In addition to the preceding list of companies, PostgreSQL is used by HP, VMware, and Heroku. PostgreSQL is used by many scientific communities and organizations, such as NASA, due to its extensibility and rich data types.

Forks

There are more than 20 PostgreSQL forks; PostgreSQL's extensible APIs make it a great candidate to fork. Over the years, many groups have forked PostgreSQL and contributed their findings back to PostgreSQL. The following is a list of popular PostgreSQL forks:

• HadoopDB is a hybrid between the PostgreSQL RDBMS and MapReduce technologies that targets analytical workloads.
• Greenplum is a proprietary DBMS that was built on the foundation of PostgreSQL. It utilizes the shared-nothing and massively parallel processing (MPP) architectures. It is used as a data warehouse and for analytical workloads.
• The EnterpriseDB advanced server is a proprietary DBMS that provides Oracle capabilities to cap Oracle fees.
• Postgres-XC (eXtensible Cluster) is a multi-master PostgreSQL cluster based on the shared-nothing architecture. It emphasizes write-scalability and provides the same APIs to applications that PostgreSQL provides.
• Vertica is a column-oriented database system, which was started by Michael Stonebraker in 2005 and acquired by HP in 2011. Vertica reused the SQL parser, semantic analyzer, and standard SQL rewrites from the PostgreSQL implementation.
• Netezza is a popular data warehouse appliance solution that was started as a PostgreSQL fork.
• Amazon Redshift is a popular data warehouse management system based on PostgreSQL 8.0.2. It is mainly designed for OLAP applications.

The PostgreSQL architecture

PostgreSQL uses the client/server model; the client and server programs could be on different hosts. The communication between the client and server is normally done via TCP/IP protocols or Unix domain sockets. PostgreSQL can handle multiple connections from a client. A common PostgreSQL program consists of the following operating system processes:

• Client process or program (frontend): The database frontend application performs a database action. The frontend could be a web server that wants to display a web page or a command-line tool that performs maintenance tasks. PostgreSQL provides frontend tools, such as psql, createdb, dropdb, and createuser.
• Server process (backend): The server process manages database files, accepts connections from client applications, and performs actions on behalf of the client; the server process name is postgres.
PostgreSQL forks a new process for each new connection; thus, the client and server processes communicate with each other without the intervention of the main server process (postgres). Each such process has a lifetime that begins when a client connection is accepted and ends when that connection is terminated.

The abstract architecture of PostgreSQL

The abstract, conceptual PostgreSQL architecture gives an overview of PostgreSQL's capabilities and its interactions with the client as well as the operating system. The PostgreSQL server can be divided roughly into four subsystems, as follows:

• Process manager: The process manager manages client connections, such as forking and terminating processes.
• Query processor: When a client sends a query to PostgreSQL, the query is parsed by the parser, and then the traffic cop determines the query type. A utility query is passed to the utilities subsystem. SELECT, INSERT, UPDATE, and DELETE queries are rewritten by the rewriter, and then an execution plan is generated by the planner; finally, the query is executed, and the result is returned to the client.
• Utilities: The utilities subsystem provides the means to maintain the database, such as reclaiming storage, updating statistics, exporting and importing data in a certain format, and logging.
• Storage manager: The storage manager handles the memory cache, disk buffers, and storage allocation.

Almost all PostgreSQL components can be configured, including the logger, planner, statistical analyzer, and storage manager. PostgreSQL configuration is governed by the nature of the application, such as OLAP or OLTP.

The following diagram shows the PostgreSQL abstract, conceptual architecture:

[Figure: PostgreSQL's abstract, conceptual architecture]

The PostgreSQL community

PostgreSQL has a very cooperative, active, and organized community. In the last eight years, the PostgreSQL community has published eight major releases. Announcements are brought to developers via the PostgreSQL weekly newsletter.
There are dozens of mailing lists organized into categories, such as users, developers, and associations. Examples of user mailing lists are pgsql-general, pgsql-docs, and pgsql-bugs. pgsql-general is a very important mailing list for beginners. All non-bug-related questions about PostgreSQL installation, tuning, basic administration, PostgreSQL features, and general discussions are submitted to this list. The PostgreSQL community runs a blog aggregation service called Planet PostgreSQL (https://planet.postgresql.org/). Several PostgreSQL developers and companies use this service to share their experience and knowledge.

Summary

PostgreSQL is an open source, object-relational database system. It supports many advanced features and complies with the ANSI-SQL:2008 standard. It has won industry recognition and user appreciation. The PostgreSQL slogan, "The world's most advanced open source database," reflects the sophistication of the PostgreSQL features. PostgreSQL is the result of many years of research and collaboration between academia and industry. Companies in their infancy often favor PostgreSQL because it carries no licensing costs. PostgreSQL can aid profitable business models. PostgreSQL is also favored by many developers because of its capabilities and advantages.
Ankit Patial
11 Sep 2015
5 min read

How to Run Code in the Cloud with AWS Lambda

AWS Lambda is a new compute service introduced by AWS to run a piece of code in response to events. The source of these events can be AWS S3, AWS SNS, AWS Kinesis, AWS Cognito, or a user application using the AWS SDK. The idea behind it is to create backend services that are cost-effective and highly scalable. If you believe in the Unix philosophy and you build your applications as components, then AWS Lambda is a nice feature that you can make use of.

Some of Its Benefits

• Cost-effective: Lambda functions are not always executing; they are triggered by certain events and have a maximum execution time of 60 seconds (plenty of time for many operations, but not all). There is zero wastage and maximum savings on the resources used.
• No hassle of maintaining infrastructure: Create a Lambda and forget about it. There is no need to worry about scaling infrastructure as load increases. It is all done automatically by AWS.
• Integration with other AWS services: A Lambda function can be triggered in response to various events from other AWS services. The following services can trigger a Lambda:
    • AWS S3
    • AWS SNS (Publish)
    • AWS Kinesis
    • AWS Cognito
    • Custom call using aws-sdk

Creating a Lambda function

First, log in to your AWS account (create one if you haven't got one). Under Compute Services, click on the Lambda option. You will see a screen with a "Get Started Now" button. Click on it, and you will land on a screen where you can write your first Lambda function. Choose a name that describes it best, give it a good description, and move on to the code. We can provide the code in one of two ways: inline code or an uploaded zip file.

Inline Code

Inline code is very helpful for writing simple scripts, such as image editing. The AMI (Amazon Machine Image) that Lambda runs on comes with the Ghostscript and ImageMagick libraries preinstalled, along with Node.js packages such as aws-sdk and imagemagick. Let's create a Lambda that lists the packages installed on the AMI that runs Lambda.
• I will name it ls-packages.
• The description will be "list installed packages on AMI".
• For Code entry type, choose Edit code inline.
• For Code template, choose None, and paste in the following code:

    // child_process lets the function run a shell command on the Lambda host
    var cp = require('child_process');

    exports.handler = function(event, context) {
      // 'rpm -qa' lists every RPM package installed on the AMI
      cp.exec('rpm -qa', function (err, stdout, stderr) {
        if (err) {
          return context.fail(err);
        }
        // The package list ends up in the function's execution log
        console.log(stdout);
        context.succeed('Done');
      });
    };

• Handler name: handler. This is the name of the entry point function. You can change it as you like.
• Role: select Create new role, then Basic execution role. You will be prompted to create an IAM role with the required permission, that is, access to create logs. Press "Allow."
• Memory (MB): I am going to keep it low, at 128.
• Timeout (s): keep the default of 3.
• Press Create Lambda function.

You will see your first Lambda created and showing up in the Lambda: Function list. Select it if it is not already selected, and click on the Actions drop-down at the top, then select the Edit/Test option. You will see your Lambda function in edit mode. Ignore the Sample event section on the left and just click the Invoke button at the bottom right. Wait a few seconds, and you will see the details under Execution result. The "Execution logs" section is where you will find the list of packages installed on the machine that you can make use of.

I wish there was a way to install custom packages, or at least to have the latest versions of the installed packages. I mean, look at ghostscript-8.70-19.23.amzn1.x86_64. It is an old version published in 2009. Maybe AWS will add such features in the future. I certainly hope so.

Upload a zip file

Suppose you have now created something more complicated that spans multiple code files and uses NPM packages that are not available on the Lambda AMI. No worries: just create a simple Node.js app, install your packages, write your code, and it is ready to deploy. A few things need to be taken care of:

• Zip the node_modules folder along with your code; don't exclude it while zipping.
• The steps are the same as for inline code, with one addition: File name. File name is the path to the entry file, so if you have a lib directory in your code with an index.js file, you can enter it as lib/index.js.

Monitoring

On the Lambda Dashboard, you will see a graph of various metrics, such as Invocation Count, Invocation Duration, Invocation Failures, and Throttled Invocations. You can also view the logs created by Lambda functions in AWS CloudWatch (under Administration & Security).

Conclusion

AWS Lambda is a unique and very useful service. It can help us build scalable backends for mobile applications. It can also help you centralize components that are shared across applications running on and off the AWS infrastructure.

About the author

Ankit Patial has a Masters in Computer Applications and nine years of experience with custom APIs and web and desktop applications using .NET technologies, ROR, and Node.js. As CTO of SimSaw Inc and Pink Hand Technologies, his job is to learn, and to help his team implement, the best practices of cloud computing and JavaScript technologies.
Packt
11 Sep 2015
12 min read

Deploying a Zabbix proxy

In this article by Andrea Dalle Vacche, author of the book Mastering Zabbix, Second Edition, you will learn the basics of deploying a Zabbix proxy for a Zabbix server.

A Zabbix proxy is compiled together with the main server if you add --enable-proxy to the compilation options. The proxy can use any kind of database backend, just as the server does, but if you don't specify an existing DB, it will automatically create a local SQLite database to store its data. If you intend to rely on SQLite, just remember to add --with-sqlite3 to the options as well.

When it comes to proxies, it's usually advisable to keep things as light and simple as we can; of course, this is valid only if the network design permits it. A proxy DB will just contain configuration and measurement data that, under normal circumstances, is almost immediately synchronized with the main server. Dedicating a full-blown database to it is usually overkill, so unless you have very specific requirements, the SQLite option will provide the best balance between performance and ease of management.

If you didn't compile the proxy executable the first time you deployed Zabbix, just run configure again with the options you need for the proxies:

    $ ./configure --enable-proxy --enable-static --with-sqlite3 --with-net-snmp --with-libcurl --with-ssh2 --with-openipmi

In order to build the proxy statically, you must have a static version of every external library needed. The configure script doesn't perform this kind of check.

Compile everything again using the following command:

    $ make

Be aware that this will compile the main server as well; just remember not to run make install, and not to copy the new Zabbix server executable over the old one in the destination directory. The only files you need to copy over to the proxy machine are the proxy executable and its configuration file.
The $PREFIX variable should resolve to the same path you used in the configuration command (/usr/local by default):

    # cp src/zabbix_proxy/zabbix_proxy $PREFIX/sbin/zabbix_proxy
    # cp conf/zabbix_proxy.conf $PREFIX/etc/zabbix_proxy.conf

Next, you need to fill out the relevant information in the proxy's configuration file. The default values should be fine in most cases, but you definitely need to make sure that the following options reflect your requirements and network status.

    ProxyMode=0

This means that the proxy machine is in active mode. Remember that you need at least as many Zabbix trappers on the main server as the number of proxies you deploy. Set the value to 1 if you need or prefer a proxy in passive mode.

    Server=n.n.n.n

This should be the IP address of the main Zabbix server or of the Zabbix node that this proxy should report to.

    Hostname=Zabbix proxy

This must be a unique, case-sensitive name that will be used in the main Zabbix server's configuration to refer to the proxy.

    LogFile=/tmp/zabbix_proxy.log
    LogFileSize=1
    DebugLevel=2

If you are using a small, embedded machine, you may not have much disk space to spare. In that case, you may want to comment out all the options regarding the log file and let syslog send the proxy's log to another server on the network.

    # DBHost=
    # DBSchema=
    # DBUser=
    # DBPassword=
    # DBSocket=
    # DBPort=

We now need to create the SQLite database; this can be done with the following commands:

    $ mkdir -p /var/lib/sqlite/
    $ sqlite3 /var/lib/sqlite/zabbix.db < /usr/share/doc/zabbix-proxy-sqlite3-2.4.4/create/schema.sql

Now, in the DBName parameter, we need to specify the full path to our SQLite database:

    DBName=/var/lib/sqlite/zabbix.db

The proxy will automatically populate and use a local SQLite database.
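The options covered so far can be sanity-checked before the proxy is started. The following is only a sketch under stated assumptions: parse_conf, check_conf, and REVIEWED are hypothetical helpers written for this article, not part of Zabbix, and the Server address 192.0.2.10 is a documentation placeholder. The set of options it insists on is simply the set the text above asks you to review, not an official list of mandatory parameters.

```python
# Hypothetical helper, not shipped with Zabbix: parse "Key=Value" lines from
# a zabbix_proxy.conf-style text and check the options discussed above.

SAMPLE_CONF = """\
# zabbix_proxy.conf (excerpt; the Server address is a placeholder)
ProxyMode=0
Server=192.0.2.10
Hostname=Zabbix proxy
LogFile=/tmp/zabbix_proxy.log
DBName=/var/lib/sqlite/zabbix.db
"""

# Options the text above says to review (an assumption of this sketch).
REVIEWED = ("Server", "Hostname", "DBName")


def parse_conf(text):
    """Return a dict of options, skipping blank lines and # comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key.strip()] = value.strip()
    return options


def check_conf(options):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = [
        f"missing option: {name}" for name in REVIEWED if not options.get(name)
    ]
    if options.get("ProxyMode", "0") not in ("0", "1"):
        problems.append("ProxyMode must be 0 (active) or 1 (passive)")
    return problems


options = parse_conf(SAMPLE_CONF)
print(check_conf(options))
```

A check like this could run as part of a deployment script before the proxy executable and its configuration file are copied to the proxy machine; it only inspects the text and never contacts the Zabbix server.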
Fill out the relevant information if you are using a dedicated, external database.

    ProxyOfflineBuffer=1

This is the number of hours that a proxy will keep monitored measurements if communications with the Zabbix server go down. Once the limit has been reached, the proxy will housekeep away the old data. You may want to double or triple it if you know that you have a faulty, unreliable link between the proxy and server.

    CacheSize=8M

This is the size of the configuration cache. Make it bigger if you have a large number of hosts and items to monitor.

Zabbix's runtime proxy commands

There is a set of commands that you can run against the proxy to change runtime parameters. This set of commands is really useful if your proxy is struggling with items, in the sense that it is taking longer to deliver them, and you need to keep the Zabbix proxy up and running.

You can force the configuration cache to get refreshed from the Zabbix server with the following:

    $ zabbix_proxy -c /usr/local/etc/zabbix_proxy.conf -R config_cache_reload

This command will invalidate the configuration cache on the proxy side and will force the proxy to ask our Zabbix server for the current configuration.

We can also increase or decrease the log level quite easily at runtime with log_level_increase and log_level_decrease:

    $ zabbix_proxy -c /usr/local/etc/zabbix_proxy.conf -R log_level_increase

This command will increase the log level for the proxy process; the same command also supports a target, which can be a PID, a process type, or a process type and number (written as type,number). What follow are a few examples.
Increase the log level of the third poller process:

    $ zabbix_proxy -c /usr/local/etc/zabbix_proxy.conf -R log_level_increase=poller,3

Increase the log level of the process with PID 27425:

    $ zabbix_proxy -c /usr/local/etc/zabbix_proxy.conf -R log_level_increase=27425

Increase or decrease the log level of icmp pinger or any other proxy processes with:

    $ zabbix_proxy -c /usr/local/etc/zabbix_proxy.conf -R log_level_increase="icmp pinger"
    zabbix_proxy [28064]: command sent successfully
    $ zabbix_proxy -c /usr/local/etc/zabbix_proxy.conf -R log_level_decrease="icmp pinger"
    zabbix_proxy [28070]: command sent successfully

We can quickly see the changes reflected in the log file here:

    28049:20150412:021435.841 log level has been increased to 4 (debug)
    28049:20150412:021443.129 Got signal [signal:10(SIGUSR1),sender_pid:28034,sender_uid:501,value_int:770(0x00000302)].
    28049:20150412:021443.129 log level has been decreased to 3 (warning)

Deploying a Zabbix proxy using RPMs

Deploying a Zabbix proxy using the RPM is a very simple task. Fewer steps are required here, as Zabbix itself distributes a prepackaged Zabbix proxy that is ready to use.
What you need to do is simply add the official Zabbix repository with the following command, which must be run as root:

    # rpm -ivh http://repo.zabbix.com/zabbix/2.4/rhel/6/x86_64/zabbix-2.4.4-1.el6.x86_64.rpm

Now, you can quickly list all the available zabbix-proxy packages with the following command, again as root:

    # yum search zabbix-proxy
    ============== N/S Matched: zabbix-proxy ================
    zabbix-proxy.x86_64 : Zabbix Proxy common files
    zabbix-proxy-mysql.x86_64 : Zabbix proxy compiled to use MySQL
    zabbix-proxy-pgsql.x86_64 : Zabbix proxy compiled to use PostgreSQL
    zabbix-proxy-sqlite3.x86_64 : Zabbix proxy compiled to use SQLite3

In this example, the command is followed by its output, which lists all the available zabbix-proxy packages; all you have to do is choose one of them and install it:

    # yum install zabbix-proxy-sqlite3

Now that the Zabbix proxy is installed, it can be started with the following command:

    # service zabbix-proxy start
    Starting Zabbix proxy: [ OK ]

Please also ensure that your Zabbix proxy is started when the server boots, using the following command:

    # chkconfig zabbix-proxy on
That done, if you're using iptables, it is important to add a rule to allow incoming traffic on port 10051 (the standard Zabbix proxy port) or, in any case, on the port that is specified in the configuration file:

    ListenPort=10051

To do that, you simply need to edit the iptables configuration file /etc/sysconfig/iptables and add the following line among the INPUT rules, ahead of any rule that rejects traffic:

    -A INPUT -m state --state NEW -m tcp -p tcp --dport 10051 -j ACCEPT

Then, you need to restart your local firewall as root using the following command:

    # service iptables restart

The log file is generated at /var/log/zabbix/zabbix_proxy.log:

    $ tail -n 40 /var/log/zabbix/zabbix_proxy.log
    62521:20150411:003816.801 **** Enabled features ****
    62521:20150411:003816.801 SNMP monitoring: YES
    62521:20150411:003816.801 IPMI monitoring: YES
    62521:20150411:003816.801 WEB monitoring: YES
    62521:20150411:003816.801 VMware monitoring: YES
    62521:20150411:003816.801 ODBC: YES
    62521:20150411:003816.801 SSH2 support: YES
    62521:20150411:003816.801 IPv6 support: YES
    62521:20150411:003816.801 **************************
    62521:20150411:003816.801 using configuration file: /etc/zabbix/zabbix_proxy.conf

As you can quickly spot, the default configuration file is located at /etc/zabbix/zabbix_proxy.conf.

The only thing left to do is make the proxy known to the server and add monitoring objects to it. All these tasks are performed through the Zabbix frontend by just clicking on Admin | Proxies and then Create. This is shown in the following screenshot:

Please take care to use the same proxy name that you've used in the configuration file, which, in this case, is ZabbixProxy; you can quickly check with:

    $ grep Hostname= /etc/zabbix/zabbix_proxy.conf
    # Hostname=
    Hostname=ZabbixProxy

Note how, in the case of an Active proxy, you just need to specify the proxy's name as already set in zabbix_proxy.conf. It will be the proxy's job to contact the main server.
On the other hand, a Passive proxy will need an IP address or a hostname for the main server to connect to, as shown in the following screenshot:

You don't have to assign hosts to proxies at creation time or only in the proxy's edit screen. You can also do that from a host configuration screen, as follows:

One of the advantages of proxies is that they don't need much configuration or maintenance; once they are deployed and you have assigned some hosts to one of them, the rest of the monitoring activities are fairly transparent. Just remember to check the number of values per second that every proxy has to guarantee, as expressed by the Required performance column in the proxies' list page:

Values per second (VPS) is the number of measurements per second that a single Zabbix server or proxy has to collect. It's an average value that depends on the number of items and the polling frequency for every item. The higher the value, the more powerful the Zabbix machine must be.

Depending on your hardware configuration, you may need to redistribute the hosts among proxies or add new ones if you notice degraded performance coupled with high VPS.

Considering a different Zabbix proxy database

As of Zabbix 2.4, support for nodes has been discontinued, and the only distributed scenario available is the Zabbix proxy; proxies now play a truly critical role. Also, with proxies deployed in many different geographic locations, the infrastructure is more exposed to network outages. That said, it is worth considering which database we want to use for those critical remote proxies.
SQLite3 is a good product for a standalone, lightweight setup, but if, in our scenario, the proxy we've deployed needs to retain a considerable amount of metrics, we need to consider the fact that SQLite3 has certain weak spots:

• The atomic-locking mechanism on SQLite3 is not the most robust ever
• SQLite3 suffers during high-volume writes
• SQLite3 does not implement any kind of user authentication mechanism

Apart from the fact that SQLite3 does not implement any kind of authentication mechanism, the database files are created with the standard umask, due to which they are readable by everyone. In the event of a crash during high load, it is also not the best database to use.

Here is an example of the sqlite3 database and how to access it using a third-party account:

    $ ls -la /tmp/zabbix_proxy.db
    -rw-r--r--. 1 zabbix zabbix 867328 Apr 12 09:52 /tmp/zabbix_proxy.db
    # su - adv
    [adv@localhost ~]$ sqlite3 /tmp/zabbix_proxy.db
    SQLite version 3.6.20
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"
    sqlite>

For all the critical proxies, then, it is advisable to use a different database. Here, we will use MySQL, which is a well-known database.
To install the Zabbix proxy with MySQL, if you're compiling it from source, you need to use the following command line:

    $ ./configure --enable-proxy --enable-static --with-mysql --with-net-snmp --with-libcurl --with-ssh2 --with-openipmi

This should be followed by the usual:

    $ make

Instead, if you're using the precompiled rpm, you can simply run as root:

    # yum install zabbix-proxy-mysql

Now, you need to start up your MySQL database and create the required database for your proxy:

    $ mysql -uroot -p<password>
    mysql> create database zabbix_proxy character set utf8 collate utf8_bin;
    mysql> grant all privileges on zabbix_proxy.* to zabbix@localhost identified by '<password>';
    mysql> quit;
    $ mysql -uzabbix -p<password> zabbix_proxy < database/mysql/schema.sql

If you've installed using rpm, the last command will be:

    $ mysql -uzabbix -p<password> zabbix_proxy < /usr/share/doc/zabbix-proxy-mysql-2.4.4/create/schema.sql

Now, we need to configure zabbix_proxy.conf and set the proper values for these parameters:

    DBName=zabbix_proxy
    DBUser=zabbix
    DBPassword=<password>

Please note that there is no need to specify DBHost, as the local socket is used for MySQL.

Finally, we can start up our Zabbix proxy with the following command as root:

    # service zabbix-proxy start
    Starting Zabbix proxy: [ OK ]

Summary

In this article, you learned how to deploy a Zabbix proxy for a Zabbix server.

Resources for Article:

Further resources on this subject:
• Zabbix Configuration [article]
• Bar Reports in Zabbix 1.8 [article]
• Going beyond Zabbix agents [article]

Ellison Leao
10 Sep 2015
4 min read

Creating slash commands for Slack using Bottle

In this post I will show you how to make a custom Slack command for your organizational chat using Python's microframework Bottle. This post is not a Bottle tutorial and I will assume that you have at least a basic amount of Python knowledge. We will deploy our app on Heroku, so you will need git installed as well. In our application, we will create a simple "Hello World!" command to be output on Slack when typing the /hello command.

Installing and Creating the Application

We will need to install Bottle inside a Python virtualenv. Make sure you have virtualenvwrapper installed and configured on your system. After installing virtualenvwrapper, create a new virtualenv called slash by typing the following:

    mkvirtualenv slash

After that, install Bottle using Python's pip command:

    pip install bottle

Bottle was chosen because it lets you create web applications with a few lines of code. You can use another web framework if you want, like Flask, web.py, web2py or even Django.

Now, moving to the app. First let's create its structure.

    mkdir myslash
    touch myslash/app.py

Open your favorite editor, and add the following lines to the app.py file. We will explain step by step how they work and what they are doing.

    #!/usr/bin/env python
    # encoding: utf-8
    from bottle import run, post


    @post('/hello')
    def hello():
        return 'Hello World!'


    if __name__ == '__main__':
        run(host='0.0.0.0', port=5000)

Explaining what this code does:

    from bottle import run, post

Here, we import the functions we will need for our app. The run function starts a web server that will run our application. The post function is a Python decorator that creates a POST route, which will be used for outputting the "Hello World!" message.

    @post('/hello')
    def hello():
        return 'Hello World!'

This is our app's main function.
You can see the post decorator creating a /hello route, which will be handled by the hello() function.

    if __name__ == '__main__':
        run(host='0.0.0.0', port=5000)

The run function will be called when we run the python app.py command. For the host we need to listen on all addresses, which is why we pass 0.0.0.0 as the param. You can change the port param if you want; this example uses 5000 (Bottle's own default is 8080).

Now open another terminal in the app folder and type:

    python app.py

To test if the app is running okay, use the cURL command to make a POST test request:

    curl -X POST localhost:5000/hello

You should see the Hello World! message printed out.

Deploying

If you don't have a Heroku account yet, please go to https://signup.heroku.com/www-header. After that, go to https://dashboard.heroku.com/new to create a new application. Type your favorite app name and click on Create App.

We will need to create a Procfile so the app can run on the Heroku side. Create a file called Procfile in your app's main directory and add the following:

    web: python app.py

Now, in the app's main directory, create a git repository, commit the files, and send them to the new application you just created. Heroku will know this is a Python app and will make the proper configuration to run it.

    git init
    git add .
    git commit -m "Initial commit"
    git remote add heroku git@heroku.com:YOURAPPNAME.git
    git push heroku master

Make sure your public key is configured on your account's SSH Keys (https://dashboard.heroku.com/account). If everything went well you should see the app running on YOURAPPNAME.herokuapp.com

Configuring Slack

Now to the Slack part. We will need to add a custom slash command in our organization settings. Go to https://YOURORGNAME.slack.com/services/new/slash-commands and in the Choose your command input, type hello. For the configuration we will have:

• Command: /hello
• URL: http://YOURAPPNAME.herokuapp.com/hello (Important: WITHOUT TRAILING SLASH!)
• Method: POST
• Check Show this command in the autocomplete list and add a Description and usage hint
• Click on Save integration

Testing

Go to your Slack org chat and type /hello in any chat. You should see the "Hello World!" message printed out. And that's it! You can see the app code here. If you have any questions or suggestions you can reach me on Twitter @ellisonleao.

About The Author

Ellison Leao is a passionate software engineer with more than 6 years of experience in web projects and a contributor to the MelonJS framework and other open source projects. When he is not writing games, he loves to play drums.
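A note on the Deploying step above: Heroku tells each web process which port to listen on through the PORT environment variable, so an app that always binds to a fixed port such as 5000 may fail to start there. The following is a minimal sketch of one way to handle this, using only the standard library; resolve_port is a hypothetical helper name, not part of Bottle or of the original app.

```python
import os


def resolve_port(environ=os.environ, default=5000):
    """Return the port to bind to: $PORT when it is set, else a local default.

    `resolve_port` is an illustrative helper, not a Bottle API.
    """
    return int(environ.get("PORT", default))


# Simulated environments, so the sketch runs anywhere.
print(resolve_port({"PORT": "8123"}))  # environment provides a port
print(resolve_port({}))                # no PORT set: fall back to 5000

# In app.py the call would then look like this (assumption, not shown in
# the original post):
#     run(host='0.0.0.0', port=resolve_port())
```

With this in place the same app.py can run locally on port 5000 and pick up whatever port the hosting platform assigns.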

Packt
10 Sep 2015
8 min read
Save for later

Introduction to Spring Web Application in No Time

Many official Spring tutorials have both a Gradle build and a Maven build, so you will find examples easily if you decide to stick with Maven. Spring 4 is fully compatible with Java 8, so it would be a shame not to take advantage of lambdas to simplify our code base. In this article by Geoffroy Warin, author of the book Mastering Spring MVC 4, we will see some Git commands. It's a good idea to keep track of your progress and commit when you are in a stable state.

Getting started with Spring Tool Suite

One of the best ways to get started with Spring and discover the numerous tutorials and starter projects that the Spring community offers is to download Spring Tool Suite (STS). STS is a custom version of Eclipse designed to work with various Spring projects, as well as Groovy and Gradle. Even if, like me, you have another IDE that you would rather work with, we recommend that you give STS a shot because it gives you the opportunity to explore Spring's vast ecosystem in a matter of minutes with the "Getting Started" projects.

So, let's visit https://Spring.io/tools/sts/all and download the latest release of STS. Before we generate our first Spring Boot project we will need to install the Gradle support for STS. You can find a Manage IDE Extensions button on the dashboard. You will then need to download the Gradle Support software in the Language and framework tooling section.

We also recommend installing the Groovy Eclipse plugin along with the Groovy 2.4 compiler, as shown in the following screenshot. These will be needed later in this article when we set up acceptance tests with Geb:

We now have two main options to get started. The first option is to navigate to File | New | Spring Starter Project, as shown in the following screenshot. This will give you the same options as http://start.Spring.io, embedded in your IDE:

The second way is to navigate to File | New | Import Getting Started Content.
This will give you access to all the tutorials available on Spring.io. You will have the choice of working with either Gradle or Maven, as shown in the following screenshot:

You can also check out the starter code to follow along with the tutorial, or get the complete code directly. There is a lot of very interesting content available in the Getting Started Content. It will demonstrate the integration of Spring with various technologies that you might be interested in.

For the moment, we will generate a web project as shown in the preceding image. It will be a Gradle application, producing a JAR file and using Java 8. Here is the configuration we want to use:

    Property      Value
    Name          masterSpringMvc
    Type          Gradle project
    Packaging     Jar
    Java version  1.8
    Language      Java
    Group         masterSpringMvc
    Artifact      masterSpringMvc
    Version       0.0.1-SNAPSHOT
    Description   Be creative!
    Package       masterSpringMvc

On the second screen you will be asked for the Spring Boot version you want to use and the dependencies that should be added to the project. At the time of writing this, the latest version of Spring Boot was 1.2.5. Ensure that you always check out the latest release. The latest snapshot version of Spring Boot will also be available by the time you read this. If Spring Boot 1.3 isn't released by then, you can probably give it a shot. One of its big features is the awesome developer tools. Refer to https://spring.io/blog/2015/06/17/devtools-in-spring-boot-1-3 for more details.

At the bottom of the configuration window you will see a number of checkboxes representing the various boot starter libraries. These are dependencies that can be appended to your build file. They provide autoconfigurations for various Spring projects. We are only interested in Spring MVC for the moment, so we will check only the Web checkbox.

A JAR for a web application?

Some of you might find it odd to package your web application as a JAR file.
While it is still possible to use WAR files for packaging, it is not always the recommended practice. By default, Spring Boot will create a fat JAR, which will include all the application's dependencies and provide a convenient way to start a web server using java -jar. Our application will be packaged as a JAR file. If you want to create a WAR file, refer to http://spring.io/guides/gs/convert-jar-to-war/.

Have you clicked on Finish yet? If you have, you should get the following project structure:

We can see our main class MasterSpringMvcApplication and its test suite MasterSpringMvcApplicationTests. There are also two empty folders, static and templates, where we will put our static web assets (images, styles, and so on) and our templates (JSP, FreeMarker, Thymeleaf). There is also an empty application.properties file, which is the default Spring Boot configuration file. It's a very handy file and we'll see how Spring Boot uses it throughout this article. Finally, there is the build.gradle file, the build file that we will detail in a moment.

If you feel ready to go, run the main method of the application. This will launch a web server for us. To do this, go to the main method of the application and navigate to Run as | Spring Application, either by right-clicking on the class or by clicking on the green play button in the toolbar. Doing so and navigating to http://localhost:8080 will produce an error. Don't worry, and read on.

Now we will show you how to generate the same project without STS, and we will come back to all these files.

Getting started with IntelliJ

IntelliJ IDEA is a very popular tool among Java developers. For the past few years I've been very pleased to pay JetBrains a yearly fee for this awesome editor. IntelliJ also has a way of creating Spring Boot projects very quickly. Go to the new project menu and select the Spring Initializr project type:

This will give us exactly the same options as STS.
You will need to import the Gradle project into IntelliJ. We recommend generating the Gradle wrapper first (refer to the following Gradle build section). If needed, you can reimport the project by opening its build.gradle file again.

Getting started with start.Spring.io

Go to http://start.Spring.io to get started with start.Spring.io. The system behind this remarkable Bootstrap-like website should be familiar to you! You will see the following screenshot when you go to the previously mentioned link:

Indeed, the same options available with STS can be found here. Clicking on Generate Project will download a ZIP file containing our starter project.

Getting started with the command line

For those of you who are addicted to the console, it is possible to curl http://start.Spring.io. Doing so will display instructions on how to structure your curl request. For instance, to generate the same project as earlier, you can issue the following command:

    $ curl http://start.Spring.io/starter.tgz \
        -d name=masterSpringMvc \
        -d dependencies=web \
        -d language=java \
        -d javaVersion=1.8 \
        -d type=gradle-project \
        -d packageName=masterSpringMvc \
        -d packaging=jar \
        -d baseDir=app | tar -xzvf -
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1255  100  1119  100   136   1014    123  0:00:01  0:00:01 --:--:--  1015
    x app/
    x app/src/
    x app/src/main/
    x app/src/main/java/
    x app/src/main/java/com/
    x app/src/main/java/com/geowarin/
    x app/src/main/resources/
    x app/src/main/resources/static/
    x app/src/main/resources/templates/
    x app/src/test/
    x app/src/test/java/
    x app/src/test/java/com/
    x app/src/test/java/com/geowarin/
    x app/build.gradle
    x app/src/main/java/com/geowarin/AppApplication.java
    x app/src/main/resources/application.properties
    x app/src/test/java/com/geowarin/AppApplicationTests.java

And voilà! You are now ready to get started with Spring without leaving the console, a dream come true.
You might consider creating an alias for the previous command; it will help you prototype Spring applications very quickly.

Summary

In this article, we leveraged Spring Boot's autoconfiguration capabilities to build an application with zero boilerplate or configuration files. We saw how to generate the starter project with Spring Tool Suite, IntelliJ, start.spring.io, and the command line.

Resources for Article:

Further resources on this subject:
• Welcome to the Spring Framework [article]
• Mailing with Spring Mail [article]
• Creating a Spring Application [article]
Packt
09 Sep 2015
22 min read

Sabermetrics with Apache Spark

In this article by Rindra Ramamonjison, the author of the book Apache Spark Graph Processing, we will gain useful insights that are required to quickly process big data and handle its complexities.

It is no secret that analytics has made a big impact in sports. The quest for an objective understanding of the game even has a name: "sabermetrics". Analytics has proven invaluable in many aspects, from building dream teams under tight cap constraints, to selecting game-specific strategies, to actively engaging with fans, and so on. In the following sections, we will analyze NCAA Men's college basketball game stats, gathered during a single season. As sports data experts, we are going to leverage Spark's graph processing library to answer several questions for retrospection.

Apache Spark is a fast, general-purpose technology, which greatly simplifies the parallel processing of large data that is distributed over a computing cluster. While Spark handles different types of processing, here, we will focus on its graph-processing capability. In particular, our goal is to expose the powerful yet generic graph-aggregation operator of Spark, aggregateMessages. We can think of this operator as a version of MapReduce for aggregating the neighborhood information in graphs. In fact, many graph-processing algorithms, such as PageRank, rely on iteratively accessing the properties of neighboring vertices and adjacent edges. By applying aggregateMessages on the NCAA College Basketball datasets, we will:

• Identify the basic mechanisms and understand the patterns for using aggregateMessages
• Apply aggregateMessages to create custom graph aggregation operations
• Optimize the performance and efficiency of aggregateMessages

NCAA College Basketball datasets

As an illustrative example, the NCAA College Basketball datasets consist of two CSV datasets.
The first one, called teams.csv, contains the list of all the college teams that played in NCAA Division I competition. Each team is associated with a 4-digit ID number. The second dataset, called stats.csv, contains the score and statistics of every game played during the 2014-2015 regular season.

Loading team data into RDDs

To start with, we parse and load these datasets into RDDs (Resilient Distributed Datasets), which are the core Spark abstraction for any data that is distributed and stored over a cluster. First, we create a class called GameStats that records a team's statistics during a game:

    case class GameStats(
      val score: Int,
      val fieldGoalMade: Int,
      val fieldGoalAttempt: Int,
      val threePointerMade: Int,
      val threePointerAttempt: Int,
      val threeThrowsMade: Int,
      val threeThrowsAttempt: Int,
      val offensiveRebound: Int,
      val defensiveRebound: Int,
      val assist: Int,
      val turnOver: Int,
      val steal: Int,
      val block: Int,
      val personalFoul: Int
    )

Loading game stats into RDDs

We also add the following methods to GameStats in order to know how efficient a team's offense was:

    // Field Goal percentage
    def fgPercent: Double = 100.0 * fieldGoalMade / fieldGoalAttempt

    // Three Point percentage
    def tpPercent: Double = 100.0 * threePointerMade / threePointerAttempt

    // Free throws percentage
    def ftPercent: Double = 100.0 * threeThrowsMade / threeThrowsAttempt

    override def toString: String = "Score: " + score

Next, we create a couple of classes for the games' result:

    abstract class GameResult(
      val season: Int,
      val day: Int,
      val loc: String
    )

    case class FullResult(
      override val season: Int,
      override val day: Int,
      override val loc: String,
      val winnerStats: GameStats,
      val loserStats: GameStats
    ) extends GameResult(season, day, loc)

FullResult has the year and day of the season, the location where the game was played, and the game statistics of both the winning and losing teams.

Next, we will create a statistics graph of the regular seasons.
In this graph, the nodes are the teams, whereas each edge corresponds to a specific game. To create the graph, let's parse the CSV file called teams.csv into the RDD teams:

    val teams: RDD[(VertexId, String)] =
      sc.textFile("./data/teams.csv").
        filter(! _.startsWith("#")).
        map { line =>
          val row = line split ','
          (row(0).toInt, row(1))
        }

We can check the first few teams in this new RDD:

    scala> teams.take(3).foreach{println}
    (1101,Abilene Chr)
    (1102,Air Force)
    (1103,Akron)

We do the same thing to obtain an RDD of the game results, which will have the type RDD[Edge[FullResult]]. We just parse stats.csv, and record the fields that we need:

• The ID of the winning team
• The ID of the losing team
• The game statistics of both the teams

    val detailedStats: RDD[Edge[FullResult]] =
      sc.textFile("./data/stats.csv").
        filter(! _.startsWith("#")).
        map { line =>
          val row = line split ','
          Edge(row(2).toInt, row(4).toInt,
            FullResult(
              row(0).toInt, row(1).toInt, row(6),
              // Winning team: score in column 3, box score in columns 8 to 20
              GameStats(
                score = row(3).toInt,
                fieldGoalMade = row(8).toInt,
                fieldGoalAttempt = row(9).toInt,
                threePointerMade = row(10).toInt,
                threePointerAttempt = row(11).toInt,
                threeThrowsMade = row(12).toInt,
                threeThrowsAttempt = row(13).toInt,
                offensiveRebound = row(14).toInt,
                defensiveRebound = row(15).toInt,
                assist = row(16).toInt,
                turnOver = row(17).toInt,
                steal = row(18).toInt,
                block = row(19).toInt,
                personalFoul = row(20).toInt
              ),
              // Losing team: score in column 5, box score in columns 21 to 33
              GameStats(
                score = row(5).toInt,
                fieldGoalMade = row(21).toInt,
                fieldGoalAttempt = row(22).toInt,
                threePointerMade = row(23).toInt,
                threePointerAttempt = row(24).toInt,
                threeThrowsMade = row(25).toInt,
                threeThrowsAttempt = row(26).toInt,
                offensiveRebound = row(27).toInt,
                defensiveRebound = row(28).toInt,
                assist = row(29).toInt,
                turnOver = row(30).toInt,
                steal = row(31).toInt,
                block = row(32).toInt,
                personalFoul = row(33).toInt
              )
            )
          )
        }

We can avoid typing all this by using the nice spark-csv package that reads CSV files into SchemaRDD.
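The column layout used by the stats.csv parser above (a score column plus a block of 13 consecutive box-score columns per team) can also be expressed with offsets rather than literal indices, which makes it harder to mistype a single column number. The following is a minimal plain-Python sketch of that idea, not Spark code; parse_stats, parse_game, and the synthetic sample row are hypothetical and exist only for illustration.

```python
# Box-score fields in the order they appear in each team's column block.
# Names mirror the fields of the GameStats case class.
BOX_SCORE_FIELDS = (
    "fieldGoalMade", "fieldGoalAttempt",
    "threePointerMade", "threePointerAttempt",
    "threeThrowsMade", "threeThrowsAttempt",
    "offensiveRebound", "defensiveRebound",
    "assist", "turnOver", "steal", "block", "personalFoul",
)


def parse_stats(row, score_col, first_stat_col):
    """Build one team's stats from its score column and its block of columns."""
    stats = {"score": int(row[score_col])}
    for offset, name in enumerate(BOX_SCORE_FIELDS):
        stats[name] = int(row[first_stat_col + offset])
    return stats


def parse_game(line):
    """Parse one CSV line into winner/loser ids plus both teams' stats."""
    row = line.split(",")
    return {
        "season": int(row[0]),
        "day": int(row[1]),
        "winner_id": int(row[2]),
        "loser_id": int(row[4]),
        "loc": row[6],
        "winner": parse_stats(row, score_col=3, first_stat_col=8),
        "loser": parse_stats(row, score_col=5, first_stat_col=21),
    }


# Synthetic row (an assumption, not real data): every cell holds its own
# column index, except the location column, so the mapping is easy to inspect.
sample = ",".join("N" if i == 6 else str(i) for i in range(34))
game = parse_game(sample)
print(game["winner"]["assist"], game["loser"]["assist"])
```

Because every cell of the synthetic row holds its own index, the printed values show directly which columns the assist field is read from for each team (16 for the winner, 29 for the loser).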
Let's check what we got:

    scala> detailedStats.take(3).foreach(println)
    Edge(1165,1384,FullResult(2006,8,N,Score: 75-54))
    Edge(1393,1126,FullResult(2006,8,H,Score: 68-37))
    Edge(1107,1324,FullResult(2006,9,N,Score: 90-73))

We then create our score graph using the collection of teams (of the type RDD[(VertexId, String)]) as vertices, and the collection called detailedStats (of the type RDD[Edge[FullResult]]) as edges:

    scala> val scoreGraph = Graph(teams, detailedStats)

Out of curiosity, let's see which teams won against the 2015 NCAA national champion Duke during the regular season. It seems Duke lost only four games during the regular season:

    scala> scoreGraph.triplets.filter(_.dstAttr == "Duke").foreach(println)
    ((1274,Miami FL),(1181,Duke),FullResult(2015,71,A,Score: 90-74))
    ((1301,NC State),(1181,Duke),FullResult(2015,69,H,Score: 87-75))
    ((1323,Notre Dame),(1181,Duke),FullResult(2015,86,H,Score: 77-73))
    ((1323,Notre Dame),(1181,Duke),FullResult(2015,130,N,Score: 74-64))

Aggregating game stats

Now that we have our graph ready, let's start aggregating the stats data in scoreGraph. In Spark, aggregateMessages is the operator for this kind of job. For example, let's find out the average field goals made per game by the winners. In other words, the games that a team has lost will not be counted. To get the average for each team, we first need to have the number of games won by the team, and the total field goals that the team made in these games:

    // Aggregate the total field goals made by winning teams
    type Msg = (Int, Int)
    type Context = EdgeContext[String, FullResult, Msg]
    val winningFieldGoalMade: VertexRDD[Msg] = scoreGraph aggregateMessages(
      // sendMsg
      (ec: Context) => ec.sendToSrc((1, ec.attr.winnerStats.fieldGoalMade)),
      // mergeMsg
      (x: Msg, y: Msg) => (x._1 + y._1, x._2 + y._2)
    )

The aggregateMessages operator

There is a lot going on in the previous call to aggregateMessages. So, let's see it working in slow motion.
When we called aggregateMessages on the scoreGraph, we had to pass two functions as arguments.

SendMsg

The first function has the signature EdgeContext[VD, ED, Msg] => Unit. It takes an EdgeContext as input. Since it does not return anything, its return type is Unit. This function is needed for sending messages between the nodes.

Okay, but what is the EdgeContext type? EdgeContext represents an edge along with its neighboring nodes. It can access both the edge attribute, and the source and destination nodes' attributes. In addition, EdgeContext has two methods to send messages along the edge to its source node, or to its destination node. These methods are called sendToSrc and sendToDst respectively. Then, the type of messages being sent through the graph is defined by Msg. Similar to vertex and edge types, we can define the concrete type that Msg takes as we wish.

Merge

In addition to sendMsg, the second function that we need to pass to aggregateMessages is a mergeMsg function with the (Msg, Msg) => Msg signature. As its name implies, mergeMsg is used to merge two messages received at a node into a new one. Its output must also be of the Msg type. Using these two functions, aggregateMessages returns the aggregated messages inside VertexRDD[Msg].

Example

In our example, we need to aggregate the number of games played and the number of field goals made. Therefore, Msg is simply a pair of Int. Furthermore, each edge context needs to send a message to only its source node, that is, the winning team. This is because we want to compute the total field goals made by each team for only the games that it has won. The actual message sent to each "winner" node is the pair of integers (1, ec.attr.winnerStats.fieldGoalMade). Here, 1 serves as a counter for the number of games won by the source node. The second integer, which is the number of field goals in one game, is extracted from the edge attribute.
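The send-then-merge pattern described above can be mimicked on a small in-memory edge list. The following is a plain-Python sketch of the idea only, not the GraphX implementation: the three games are made-up data, and the sketch assumes every edge sends exactly one message to its source vertex, whereas a real sendMsg may send messages to either end of the edge or none at all.

```python
# Hypothetical edges: (winner_id, loser_id, field goals made by the winner).
games = [(1, 2, 25), (1, 3, 31), (2, 3, 20)]


def send_msg(edge):
    """Counterpart of sendToSrc: address a (1, fieldGoalMade) pair to the winner."""
    winner, _loser, field_goals = edge
    return winner, (1, field_goals)


def merge_msg(x, y):
    """Counterpart of mergeMsg: add the counters and the totals pairwise."""
    return (x[0] + y[0], x[1] + y[1])


def aggregate_messages(edges, send, merge):
    """Collect one message per edge and merge the messages that reach the same vertex."""
    inbox = {}
    for edge in edges:
        vertex, msg = send(edge)
        inbox[vertex] = merge(inbox[vertex], msg) if vertex in inbox else msg
    return inbox


totals = aggregate_messages(games, send_msg, merge_msg)
averages = {team: total / count for team, (count, total) in totals.items()}
print(totals)    # {1: (2, 56), 2: (1, 20)}
print(averages)  # {1: 28.0, 2: 20.0}
```

Team 3 never wins in this toy data, so it receives no message and does not appear in the result; only vertices that received at least one message show up in the aggregated output.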
As we set out to compute the average field goals per winning game for all teams, we need to apply the mapValues operator to the output of aggregateMessages, as follows:

    // Average field goals made per game by the winning teams
    val avgWinningFieldGoalMade: VertexRDD[Double] = winningFieldGoalMade mapValues (
      (id: VertexId, x: Msg) => x match {
        case (count: Int, total: Int) => total.toDouble/count
      })

Here is the output:

    scala> avgWinningFieldGoalMade.take(5).foreach(println)
    (1260,24.71641791044776)
    (1410,23.56578947368421)
    (1426,26.239436619718308)
    (1166,26.137614678899084)
    (1434,25.34285714285714)

Abstracting out the aggregation

This was kind of cool! We can surely do the same thing for the average points per game scored by the winning teams:

    // Aggregate the points scored by winning teams
    val winnerTotalPoints: VertexRDD[(Int, Int)] = scoreGraph.aggregateMessages(
      // sendMsg
      triplet => triplet.sendToSrc((1, triplet.attr.winnerStats.score)),
      // mergeMsg
      (x, y) => (x._1 + y._1, x._2 + y._2)
    )

    // Average points scored per game by winning teams
    val winnersPPG: VertexRDD[Double] = winnerTotalPoints mapValues (
      (id: VertexId, x: (Int, Int)) => x match {
        case (count: Int, total: Int) => total.toDouble/count
      })

Let's check the output:

    scala> winnersPPG.take(5).foreach(println)
    (1260,71.19402985074628)
    (1410,71.11842105263158)
    (1426,76.30281690140845)
    (1166,76.89449541284404)
    (1434,74.28571428571429)

What if the coach wants to know the top five teams with the highest average three pointers made per winning game? By the way, he might also ask about the teams that are the most efficient in three pointers.

Keeping things DRY

We can copy and modify the previous code, but that would be quite repetitive. Instead, let's abstract out the average aggregation operator so that it can work on any statistics that the coach needs. Luckily, Scala's higher-order functions are there to help in this task.
Let's define the functions that take a team's GameStats as input, and return the specific statistic that we are interested in. For now, we will need the number of three pointers made, and the three-point percentage:

    // Getting individual stats
    def threePointMade(stats: GameStats) = stats.threePointerMade
    def threePointPercent(stats: GameStats) = stats.tpPercent

Then, we create a generic function that takes as input a stats graph and one of the functions defined previously, which has the signature GameStats => Double:

    // Generic function for stats averaging
    def averageWinnerStat(graph: Graph[String, FullResult])(getStat: GameStats => Double): VertexRDD[Double] = {
      type Msg = (Int, Double)
      val winningScore: VertexRDD[Msg] = graph.aggregateMessages[Msg](
        // sendMsg
        triplet => triplet.sendToSrc((1, getStat(triplet.attr.winnerStats))),
        // mergeMsg
        (x, y) => (x._1 + y._1, x._2 + y._2)
      )
      winningScore mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Double) => total/count
        })
    }

Now, we can get the average stats by passing the threePointMade and threePointPercent functions to averageWinnerStat:

    val winnersThreePointMade = averageWinnerStat(scoreGraph)(threePointMade)
    val winnersThreePointPercent = averageWinnerStat(scoreGraph)(threePointPercent)

With little effort, we can tell the coach which five winning teams score the highest number of threes per game:

    scala> winnersThreePointMade.sortBy(_._2,false).take(5).foreach(println)
    (1440,11.274336283185841)
    (1125,9.521929824561404)
    (1407,9.008849557522124)
    (1172,8.967441860465117)
    (1248,8.915384615384616)

While we are at it, let's find out the five most efficient teams in three pointers:

    scala> winnersThreePointPercent.sortBy(_._2,false).take(5).foreach(println)
    (1101,46.90555728464225)
    (1147,44.224282479431224)
    (1294,43.754532434101534)
    (1339,43.52308905887638)
    (1176,43.080814169045105)

Interestingly, the teams that made the most three pointers per winning game are not always the ones that
are the most efficient ones at it. But it is okay because at least they have won these games.

Coach wants more numbers

The coach does not seem convinced by this argument. He asks us to get the same statistics, but he wants the average over all the games that each team has played. We then have to aggregate the information at all the nodes, and not only at the source nodes (the winners). To make our previous abstraction more flexible, let's create the following types:

    trait Teams
    case class Winners() extends Teams
    case class Losers() extends Teams
    case class AllTeams() extends Teams

We modify the previous higher-order function to take an extra argument of type Teams, which will help us specify those nodes where we want to collect and aggregate the required game stats. The new function becomes the following:

    def averageStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, tms: Teams): VertexRDD[Double] = {
      type Msg = (Int, Double)
      val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg](
        // sendMsg
        tms match {
          case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.winnerStats)))
          case _ : Losers => t => t.sendToDst((1, getStat(t.attr.loserStats)))
          case _ => t => {
            t.sendToSrc((1, getStat(t.attr.winnerStats)))
            t.sendToDst((1, getStat(t.attr.loserStats)))
          }
        },
        // mergeMsg
        (x, y) => (x._1 + y._1, x._2 + y._2)
      )
      aggrStats mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Double) => total/count
        })
    }

Now, averageStat allows us to choose whether we want to aggregate the stats for winners only, for losers only, or for all teams. Since the coach wants the overall stats averaged over all the games played, we aggregate the stats by passing the AllTeams() flag to averageStat. In this case, we define the sendMsg argument in aggregateMessages to send the required stats to both the source (the winner) and the destination (the loser) using the EdgeContext class's sendToSrc and sendToDst functions respectively. This mechanism is pretty straightforward.
We just need to make sure that we send the right information to the right node. In this case, we send winnerStats to the winner, and loserStats to the loser. Okay, you get the idea now. So, let's apply it to please our coach. Here are the teams with the overall highest three pointers per game:

    // Average Three Point Made Per Game for All Teams
    val allThreePointMade = averageStat(scoreGraph)(threePointMade, AllTeams())

    scala> allThreePointMade.sortBy(_._2, false).take(5).foreach(println)
    (1440,10.180811808118081)
    (1125,9.098412698412698)
    (1172,8.575657894736842)
    (1184,8.428571428571429)
    (1407,8.411149825783973)

And here are the five most efficient teams overall in three pointers per game:

    // Average Three Point Percent for All Teams
    val allThreePointPercent = averageStat(scoreGraph)(threePointPercent, AllTeams())

Let's check the output:

    scala> allThreePointPercent.sortBy(_._2,false).take(5).foreach(println)
    (1429,38.8351815824302)
    (1323,38.522819895594)
    (1181,38.43052051444854)
    (1294,38.41227053353959)
    (1101,38.097896464168954)

Actually, there is only a 2 percent difference between the most efficient team and the one in the fiftieth position. Most NCAA teams are therefore pretty efficient behind the line. I bet the coach knew this already!

Average points per game

We can also reuse the averageStat function to get the average points per game for the winners. In particular, let's take a look at the two teams that won games with the highest and lowest scores:

    // Winning teams
    val winnerAvgPPG = averageStat(scoreGraph)(score, Winners())

Let's check the output:

    scala> winnerAvgPPG.max()(Ordering.by(_._2))
    res36: (org.apache.spark.graphx.VertexId, Double) = (1322,90.73333333333333)

    scala> winnerAvgPPG.min()(Ordering.by(_._2))
    res39: (org.apache.spark.graphx.VertexId, Double) = (1197,60.5)

Apparently, the most defensive team can win games while scoring only about 60 points on average, whereas the most offensive team scores an average of about 90 points in its wins.
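The recipient-selection idea behind averageStat (send to winners, to losers, or to both) can also be modelled without Spark. The sketch below is a hedged illustration only: the RecipientSketch object, the simplified Game class with one statistic per side, and the messages helper are assumptions made for this example. It also uses case objects under a sealed trait instead of the empty case classes in the article, which is a common Scala alternative rather than the author's choice.

```scala
// Plain-Scala model of choosing message recipients (assumption: no Spark).
object RecipientSketch {
  sealed trait Teams
  case object Winners extends Teams
  case object Losers extends Teams
  case object AllTeams extends Teams

  // (games counted, running total of the statistic)
  type Msg = (Int, Double)

  // Hypothetical edge: winner -> loser with one statistic for each side
  final case class Game(winner: Long, loser: Long, winnerStat: Double, loserStat: Double)

  // Decide which endpoints of an edge receive a message
  def messages(g: Game, tms: Teams): Seq[(Long, Msg)] = tms match {
    case Winners  => Seq((g.winner, (1, g.winnerStat)))
    case Losers   => Seq((g.loser, (1, g.loserStat)))
    case AllTeams => Seq((g.winner, (1, g.winnerStat)), (g.loser, (1, g.loserStat)))
  }

  def mergeMsg(x: Msg, y: Msg): Msg = (x._1 + y._1, x._2 + y._2)

  // Average of the statistic per team over the selected games
  def averageStat(games: Seq[Game], tms: Teams): Map[Long, Double] =
    games.flatMap(g => messages(g, tms))
      .groupBy(_._1)
      .map { case (id, msgs) =>
        val merged = msgs.map(_._2).reduce(mergeMsg)
        (id, merged._2 / merged._1)
      }
}
```

Because the trait is sealed, the compiler can warn when a match forgets one of the three cases, which the catch-all `case _` in the article's version would hide.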
Next, let's average the points per game for all games played and look at the two teams with the best and worst offense during the 2015 season:

    // Average Points Per Game of All Teams
    val allAvgPPG = averageStat(scoreGraph)(score, AllTeams())

Let's see the output:

    scala> allAvgPPG.max()(Ordering.by(_._2))
    res42: (org.apache.spark.graphx.VertexId, Double) = (1322,83.81481481481481)

    scala> allAvgPPG.min()(Ordering.by(_._2))
    res43: (org.apache.spark.graphx.VertexId, Double) = (1212,51.111111111111114)

To no one's surprise, the team with the best offense overall is also the one that scores the most in its winning games. At the other end, averaging about 50 points per game is not enough to win games.

Defense stats – the D matters as in direction

Previously, we obtained some statistics, such as field goals or three-point percentage, that a team achieves. What if we want to aggregate instead the average points or rebounds that each team concedes to its opponents? To compute this, we define a new higher-order function called averageConcededStat. Compared to averageStat, this function needs to send loserStats to the winning team, and winnerStats to the losing team.
To make things more interesting, we are going to make the team name a part of the message Msg:

    def averageConcededStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, rxs: Teams): VertexRDD[(String, Double)] = {
      type Msg = (Int, Double, String)
      val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg](
        // sendMsg
        rxs match {
          case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.loserStats), t.srcAttr))
          case _ : Losers => t => t.sendToDst((1, getStat(t.attr.winnerStats), t.dstAttr))
          case _ => t => {
            t.sendToSrc((1, getStat(t.attr.loserStats), t.srcAttr))
            t.sendToDst((1, getStat(t.attr.winnerStats), t.dstAttr))
          }
        },
        // mergeMsg
        (x, y) => (x._1 + y._1, x._2 + y._2, x._3)
      )
      aggrStats mapValues (
        (id: VertexId, x: Msg) => x match {
          case (count: Int, total: Double, name: String) => (name, total/count)
        })
    }

With this, we can calculate the average points conceded by the winning and losing teams as follows:

    val winnersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Winners())
    val losersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Losers())

Let's check the output:

    scala> losersAvgConcededPoints.min()(Ordering.by(_._2))
    res: (VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905))

    scala> winnersAvgConcededPoints.min()(Ordering.by(_._2))
    res: (org.apache.spark.graphx.VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905))

    scala> losersAvgConcededPoints.max()(Ordering.by(_._2))
    res: (VertexId, (String, Double)) = (1464,(Youngstown St,78.85714285714286))

    scala> winnersAvgConcededPoints.max()(Ordering.by(_._2))
    res: (VertexId, (String, Double)) = (1464,(Youngstown St,71.125))

The previous output tells us that Abilene Christian University is the most defensive team. They concede the fewest points whether they win a game or not. On the other hand, Youngstown has the worst defense.

Joining aggregated stats into graphs

The previous example shows us how flexible the aggregateMessages operator is.
We can define the Msg type of the messages to be aggregated to fit our needs. Moreover, we can select which nodes receive the messages. Finally, we can also define how we want to merge the messages. As a final example, let's aggregate many statistics about each team, and join this information into the nodes of the graph. To start, we create a dedicated class for the team stats:

    // Average Stats of All Teams
    case class TeamStat(
        wins: Int = 0,    // Number of wins
        losses: Int = 0,  // Number of losses
        ppg: Int = 0,     // Points per game
        pcg: Int = 0,     // Points conceded per game
        fgp: Double = 0,  // Field goal percentage
        tpp: Double = 0,  // Three point percentage
        ftp: Double = 0   // Free Throw percentage
    ){
      override def toString = wins + "-" + losses
    }

Then, we collect the average stats for all teams using aggregateMessages. For this, we define the type of the message to be an 8-element tuple that holds the counter for games played, wins, losses, and the other statistics that will be stored in TeamStat as listed previously:

    type Msg = (Int, Int, Int, Int, Int, Double, Double, Double)

    val aggrStats: VertexRDD[Msg] = scoreGraph.aggregateMessages(
      // sendMsg
      t => {
        t.sendToSrc((
          1, 1, 0,
          t.attr.winnerStats.score, t.attr.loserStats.score,
          t.attr.winnerStats.fgPercent, t.attr.winnerStats.tpPercent, t.attr.winnerStats.ftPercent
        ))
        t.sendToDst((
          1, 0, 1,
          t.attr.loserStats.score, t.attr.winnerStats.score,
          t.attr.loserStats.fgPercent, t.attr.loserStats.tpPercent, t.attr.loserStats.ftPercent
        ))
      },
      // mergeMsg
      (x, y) => (
        x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4,
        x._5 + y._5, x._6 + y._6, x._7 + y._7, x._8 + y._8
      )
    )

Given the aggregated messages called aggrStats, we map them into a collection of TeamStat:

    val teamStats: VertexRDD[TeamStat] = aggrStats mapValues {
      (id: VertexId, m: Msg) => m match {
        case (
          count: Int, wins: Int, losses: Int,
          totPts: Int, totConcPts: Int,
          totFG: Double, totTP: Double, totFT: Double) =>
            TeamStat(wins, losses, totPts/count,
              totConcPts/count, totFG/count, totTP/count, totFT/count)
      }
    }

Next, let's join teamStats into the graph. For this, we first create a class called Team as a new type for the vertex attribute. Team will have a name and a TeamStat:

    case class Team(name: String, stats: Option[TeamStat]) {
      override def toString = name + ": " + stats
    }

Next, we use the joinVertices operator that we have seen in the previous chapter:

    // Joining the average stats to vertex attributes
    def addTeamStat(id: VertexId, t: Team, stats: TeamStat) = Team(t.name, Some(stats))

    val statsGraph: Graph[Team, FullResult] =
      scoreGraph.mapVertices((_, name) => Team(name, None)).
        joinVertices(teamStats)(addTeamStat)

We can see that the join has worked well by printing the first three vertices in the new graph called statsGraph:

    scala> statsGraph.vertices.take(3).foreach(println)
    (1260,Loyola-Chicago: Some(17-13))
    (1410,TX Pan American: Some(7-21))
    (1426,UT Arlington: Some(15-15))

To conclude this task, let's find out the top 10 teams in the regular season. To do so, we define an ordering for Option[TeamStat] as follows:

    import scala.math.Ordering

    object winsOrdering extends Ordering[Option[TeamStat]] {
      def compare(x: Option[TeamStat], y: Option[TeamStat]) = (x, y) match {
        case (None, None) => 0
        case (Some(a), None) => 1
        case (None, Some(b)) => -1
        case (Some(a), Some(b)) =>
          if (a.wins == b.wins) a.losses compare b.losses
          else a.wins compare b.wins
      }
    }

Finally, we get the following:

    import scala.reflect.classTag
    import scala.reflect.ClassTag

    scala> statsGraph.vertices.sortBy(v => v._2.stats,false)(winsOrdering, classTag[Option[TeamStat]]).
         | take(10).foreach(println)
    (1246,Kentucky: Some(34-0))
    (1437,Villanova: Some(32-2))
    (1112,Arizona: Some(31-3))
    (1458,Wisconsin: Some(31-3))
    (1211,Gonzaga: Some(31-2))
    (1320,Northern Iowa: Some(30-3))
    (1323,Notre Dame: Some(29-5))
    (1181,Duke: Some(29-4))
    (1438,Virginia: Some(29-3))
    (1268,Maryland: Some(27-6))

Note that the ClassTag parameter is required in sortBy to make use of Scala's reflection. This is why we had the previous imports.

Performance optimization with tripletFields

In addition to sendMsg and mergeMsg, aggregateMessages can also take an optional argument called tripletFields, which indicates what data is accessed in the EdgeContext. The main reason for explicitly specifying such information is to help optimize the performance of the aggregateMessages operation. In fact, TripletFields represents a subset of the fields of EdgeTriplet, and it enables GraphX to populate only those fields when necessary. The default value is TripletFields.All, which means that the sendMsg function may access any of the fields in the EdgeContext. Otherwise, the tripletFields argument is used to tell GraphX that only part of the EdgeContext will be required so that an efficient join strategy can be used. All the possible options for tripletFields are listed here:

- TripletFields.All: Expose all the fields (source, edge, and destination)
- TripletFields.Dst: Expose the destination and edge fields, but not the source field
- TripletFields.EdgeOnly: Expose only the edge field
- TripletFields.None: None of the triplet fields are exposed
- TripletFields.Src: Expose the source and edge fields, but not the destination field

Using our previous example, if we are interested in computing the total number of wins and losses for each team, we will not need to access any field of the EdgeContext. In this case, we should use TripletFields.None to indicate so:

    // Number of wins of the teams
    val numWins: VertexRDD[Int] = scoreGraph.aggregateMessages(
      triplet => {
        triplet.sendToSrc(1) // No attribute is passed but an integer
      },
      (x, y) => x + y,
      TripletFields.None
    )

    // Number of losses of the teams
    val numLosses: VertexRDD[Int] = scoreGraph.aggregateMessages(
      triplet => {
        triplet.sendToDst(1) // No attribute is passed but an integer
      },
      (x, y) => x + y,
      TripletFields.None
    )

To see that this works, let's print the five teams with the most wins and the five teams with the most losses:

    scala> numWins.sortBy(_._2,false).take(5).foreach(println)
    (1246,34)
    (1437,32)
    (1112,31)
    (1458,31)
    (1211,31)

    scala> numLosses.sortBy(_._2, false).take(5).foreach(println)
    (1363,28)
    (1146,27)
    (1212,27)
    (1197,27)
    (1263,27)

Should you want the names of the top five teams, you need to access the srcAttr attribute. In this case, we need to set tripletFields to TripletFields.Src:

Kentucky as undefeated team in regular season:

    val numWinsOfTeams: VertexRDD[(String, Int)] = scoreGraph.aggregateMessages(
      t => {
        t.sendToSrc((t.srcAttr, 1)) // Pass source attribute only
      },
      (x, y) => (x._1, x._2 + y._2),
      TripletFields.Src
    )

Et voila!

    scala> numWinsOfTeams.sortBy(_._2._2, false).take(5).foreach(println)
    (1246,(Kentucky,34))
    (1437,(Villanova,32))
    (1112,(Arizona,31))
    (1458,(Wisconsin,31))
    (1211,(Gonzaga,31))

    scala> numWinsOfTeams.sortBy(_._2._2).take(5).foreach(println)
    (1146,(Cent Arkansas,2))
    (1197,(Florida A&M,2))
    (1398,(Tennessee St,3))
    (1263,(Maine,3))
    (1420,(UMBC,4))

Kentucky has not lost any of its 34 games during the regular season. Too bad that they could not make it into the championship final.

Warning about the mapReduceTriplets operator

Prior to Spark 1.2, there was no aggregateMessages method in Graph. Instead, the now deprecated mapReduceTriplets was the primary aggregation operator.
The API for mapReduceTriplets is:

    class Graph[VD, ED] {
      def mapReduceTriplets[Msg](
          map: EdgeTriplet[VD, ED] => Iterator[(VertexId, Msg)],
          reduce: (Msg, Msg) => Msg)
        : VertexRDD[Msg]
    }

Compared to mapReduceTriplets, the new operator called aggregateMessages is more expressive, as it employs a message-passing mechanism instead of returning an iterator of messages as mapReduceTriplets does. In addition, aggregateMessages lets the user state explicitly, through the tripletFields argument, which fields are needed, which improves performance as we explained previously. In addition to the API improvements, aggregateMessages is optimized for performance. Because mapReduceTriplets is now deprecated, we will not discuss it further. If you have to use it with earlier versions of Spark, you can refer to the Spark programming guide.

Summary

In brief, aggregateMessages is a useful and generic operator that provides a functional abstraction for aggregating neighborhood information in Spark graphs. Its definition is summarized here:

    class Graph[VD, ED] {
      def aggregateMessages[Msg: ClassTag](
          sendMsg: EdgeContext[VD, ED, Msg] => Unit,
          mergeMsg: (Msg, Msg) => Msg,
          tripletFields: TripletFields = TripletFields.All)
        : VertexRDD[Msg]
    }

This operator applies a user-defined sendMsg function to each edge in the graph using an EdgeContext. Each EdgeContext accesses the required information about the edge and passes this information to its source node and/or destination node using sendToSrc and/or sendToDst respectively. After all the messages are received by the nodes, the mergeMsg function is used to aggregate these messages at each node.
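The difference between the two API styles described above (returning an iterator of messages versus calling sendToSrc and sendToDst on a context) can be illustrated with a small plain-Scala model. This is only a sketch under stated assumptions: StyleSketch, Edge, Context, mapReduce, and aggregate are hypothetical stand-ins written for this example, they run on an in-memory Seq, and they are not the GraphX implementations.

```scala
// Plain-Scala model contrasting the two aggregation styles (no Spark).
object StyleSketch {
  type VertexId = Long

  // Hypothetical edge with an integer attribute
  final case class Edge(src: VertexId, dst: VertexId, attr: Int)

  // Iterator style: the map function returns the messages it wants delivered
  def mapReduce[Msg](edges: Seq[Edge])(
      mapFn: Edge => Iterator[(VertexId, Msg)],
      reduceFn: (Msg, Msg) => Msg): Map[VertexId, Msg] =
    edges.flatMap(e => mapFn(e).toSeq)
      .groupBy(_._1)
      .map { case (id, msgs) => (id, msgs.map(_._2).reduce(reduceFn)) }

  // Callback style: the send function is handed a context it can send through
  final class Context[Msg](val edge: Edge, sink: (VertexId, Msg) => Unit) {
    def sendToSrc(m: Msg): Unit = sink(edge.src, m)
    def sendToDst(m: Msg): Unit = sink(edge.dst, m)
  }

  def aggregate[Msg](edges: Seq[Edge])(
      sendMsg: Context[Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg): Map[VertexId, Msg] = {
    val acc = scala.collection.mutable.Map.empty[VertexId, Msg]
    // Merge each incoming message into the running value for its vertex
    def sink(id: VertexId, m: Msg): Unit =
      acc(id) = acc.get(id).map(old => mergeMsg(old, m)).getOrElse(m)
    edges.foreach(e => sendMsg(new Context[Msg](e, sink)))
    acc.toMap
  }
}
```

In the callback style, no intermediate collection of messages has to be built per edge, because each message is merged as soon as it is sent. That is one plausible reading of why the message-passing form is described as the more efficient one.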
Some interesting reads

- Six keys to sports analytics
- Moneyball: The Art Of Winning An Unfair Game
- Golden State Warriors at the forefront of NBA data analysis
- How Data and Analytics Have Changed 'The Beautiful Game'
- NHL, SAP partnership to lead statistical revolution

Resources for Article

Further resources on this subject:

- The Spark programming model [article]
- Apache Karaf – Provisioning and Clusters [article]
- Machine Learning Using Spark MLlib [article]

Packt
08 Sep 2015
17 min read

The Symfony Framework – Installation and Configuration

In this article by Wojciech Bancer, author of the book Symfony2 Essentials, we will learn the basics of Symfony, its installation, configuration, and use.

The Symfony framework is currently one of the most popular frameworks in the PHP developer community. Version 2, which was released a few years ago, has been a great improvement, and in my opinion was one of the key elements in making the PHP ecosystem suitable for larger enterprise projects. Version 2.0 of the framework not only required a modern PHP version (the minimal version required for Symfony is PHP 5.3.8), but also used what were then state-of-the-art language features: namespaces and anonymous functions. The authors also put a lot of effort into providing long-term support and minimizing changes that break compatibility between versions. Also, Symfony forced developers to use a few useful design concepts. The key one, introduced in Symfony, was dependency injection.

In most cases, the article will refer to the framework as Symfony2. If you want to search the Internet for this framework, apart from using the Symfony keyword you may also try the Symfony2 keyword. This was recommended some time ago by one of the creators to make searching for, or referring to, the specific framework version easier in the future.

Key reasons to choose Symfony2

Symfony2 is recognized in the PHP ecosystem as a very well-written and well-maintained framework. The design patterns that are recommended and enforced within the framework make teamwork more efficient, and they allow for better tests and the creation of reusable code. Knowledge of Symfony can also be verified through a certification system, which makes its developers easier to find and better recognized on the market.
Last but not least, the Symfony2 components are used as parts of other projects, for example:

- Drupal
- phpBB
- Laravel
- eZ Publish
- and more

Over time, there is a good chance that you will find parts of the Symfony2 components within other open source solutions. Bundles and an extendable architecture are also some of the key Symfony2 features. They not only make your work easier through the easy development of reusable code, but also allow you to find smaller or larger pieces of code that you can embed and use within your project to speed up your work. The standards of Symfony2 also make it easier to catch errors and to write high-quality code; its community is growing every year.

The history of Symfony

There are many Symfony versions around, and it's good to know the differences between them to learn how the framework has evolved over the years. The first stable Symfony version, 1.0, was released at the beginning of 2007 and was supported for three years. In mid-2008, version 1.1 was presented, which wasn't compatible with the previous release, and it was difficult to upgrade any old project to it. Symfony 1.2 was released shortly after this, at the end of 2008. Migrating between these versions was much easier, and there were no dramatic changes in the structure. The final versions of the legacy Symfony 1 family were released nearly one year later. There were two simultaneous releases, 1.3 and 1.4. Both were identical, but Symfony 1.4 did not have deprecated features, and it was recommended to start new projects with it. Version 1.4 had 3 years of support.

If you look into the code, version 1.x was very different from version 2. The company behind Symfony (the French company SensioLabs) made a bold move and decided to rewrite the whole framework from scratch. The first release of Symfony2 wasn't perfect, but it was very promising.
It relied on Git submodules (Composer did not exist back then). The 2.1 and 2.2 versions were closer to the one we use now, although migrating up to them required a lot of effort. Finally, Symfony 2.3 was released, the first long-term support version within the 2.x branch. After this version, the changes provided within the following releases (2.4, 2.5, and 2.6) were not so drastic, and usually they do not break compatibility. This article was written based on the latest stable Symfony 2.7.4 version and was tested with PHP 5.5. This Symfony version is marked as a so-called long-term support version, and updates for it will be released for 3 years from the first 2.7 release.

Installation

Prior to installing Symfony2, you don't need to have a configured web server. If you have at least PHP version 5.4, you can use the standalone server provided by Symfony2. This server is suitable for development purposes and should not be used for production. It is strongly recommended to work with a Linux/UNIX system for both development and production deployment of Symfony2 framework applications. While it is possible to install and operate on a Windows box, due to its different nature, working with Windows can sometimes force you to maintain a separate fragment of code for this system. Even if your primary OS is Windows, it is strongly recommended to configure a Linux system in a virtual environment. Also, there are solutions that will help you automate the whole process. As an example, see the https://www.vagrantup.com/ website.

To install Symfony2, you can use one of the following methods:

- Use the new Symfony2 installer script (currently, the only officially recommended method). Please note that the installer requires at least PHP 5.4.
- Use the Composer dependency manager to install a Symfony project.
- Download a zip or tgz package and unpack it.

It does not really matter which method you choose, as they all give you similar results.
Installing Symfony2 by using an installer

To install Symfony2 through an installer, go to the Symfony website at http://symfony.com/download, and install the Symfony2 installer by issuing the following commands:

    $ sudo curl -LsS http://symfony.com/installer -o /usr/local/bin/symfony
    $ sudo chmod +x /usr/local/bin/symfony

After this, you can install Symfony by just typing the following command:

    $ symfony new <new_project_folder>

To install the Symfony2 framework for a to-do application, execute the following command:

    $ symfony new todoapp

This command installs the latest stable Symfony2 version in the newly created todoapp folder, creates the Symfony2 application, and prepares some basic structure for you to work with. After the app creation, you can verify that your local PHP is properly configured for Symfony2 by typing the following command:

    $ php app/check.php

If everything goes fine, the script should complete with the following message:

    [OK] Your system is ready to run Symfony projects

Symfony2 is equipped with a standalone server, which makes development easier. If you want to run it, type the following command:

    $ php app/console server:run

If everything went alright, you will see a message that your server is working on the IP 127.0.0.1 and port 8000. If there is an error, make sure you are not running anything else that is listening on port 8000. It is also possible to run the server on a different port or IP, if you have such a requirement, by adding the address and port as a parameter, that is:

    $ php app/console server:run 127.0.0.1:8080

If everything works, you can now open the following address in your browser:

    http://127.0.0.1:8000/

You will now see Symfony's welcome page. This page presents you with some welcome information and a useful documentation link.

The Symfony2 directory structure

Let's dive into the initial directory structure of a typical Symfony application.
Here it is:

- app
- bin
- src
- vendor
- web

While Symfony2 is very flexible in terms of directory structure, it is recommended to keep the basic structure mentioned earlier. The following table describes the purpose of each directory:

Directory: Used for
app: This holds information about general configuration, routing, security configuration, database parameters, and many others. It is also the recommended place for putting new view files. This directory is a starting point.
bin: It holds some helper executables. It is not really important during the development process, and is rarely modified.
src: This directory holds the project PHP code (usually your bundles).
vendor: These are third-party libraries used within the project. Usually, this directory contains all the open source third-party bundles, libraries, and other resources. It is recommended to keep the files within this directory outside the versioning system, which also means that you should not modify them directly under any circumstances. Fortunately, there are ways to override this code if it does not suit your needs. This will be demonstrated when we implement user management within our to-do application.
web: This is the directory that is accessible through the web server. It holds the main entry point to the application (usually the app.php and app_dev.php files), CSS files, JavaScript files, and all the files that need to be available through the web server (such as user-uploaded files).

So, in most cases, you will be modifying and creating the PHP files within the src/ directory, the view and configuration files within the app/ directory, and the JS/CSS files within the web/ directory. The main directory also holds a few files, as follows:

- .gitignore
- README.md
- composer.json
- composer.lock

The .gitignore file's purpose is to provide some preconfigured settings for the Git repository, while the composer.json and composer.lock files are used by the Composer dependency manager.

What is a bundle?
Within a Symfony2 application, you will be using the term "bundle" quite often. A bundle is something similar to a plugin. It can hold literally any code: controllers, views, models, and services. A bundle can integrate other non-Symfony2 libraries and hold some JavaScript/CSS code as well. We can say that almost everything is a bundle in Symfony2; even some of the core framework features together form a bundle. A bundle usually implements a single feature or functionality. The code you are writing when you write a Symfony2 application is also a bundle.

There are two types of bundles. The first kind of bundle is the one you write within the application, which is project-specific and not reusable. For this purpose, there is a special bundle called AppBundle created for you when you install the Symfony2 project. Then there are reusable bundles that are shared across various projects, either written by you or your team, or provided by third-party vendors. Your own bundles are usually stored within the src/ directory, while the third-party bundles sit within the vendor/ directory. The vendor directory is used to store third-party libraries and is managed by Composer. As such, it should never be modified by you.

There are many reusable open source bundles, which help you to implement various features within the application. You can find many of them to help you with user management, writing RESTful APIs, making better documentation, connecting to Facebook and AWS, and even generating a whole admin panel. There are tons of bundles, and every day brings new ones. If you want to explore open source bundles and look around at what's available, I recommend that you start with the http://knpbundles.com/ website.

The bundle name is correlated with the PHP namespace. As such, it needs to follow some technical rules, and it needs to end with the Bundle suffix.
A few examples of correct names are AppBundle, AcmeDemoBundle, CompanyBlogBundle, and CompanySocialForumBundle.

Composer

Symfony2 is built from components, and it would be very difficult to manage the dependencies between them and the framework without a dependency manager. To make installing and managing these components easier, Symfony2 uses a manager called composer. You can get it from the https://getcomposer.org/ website. The composer makes it easy to check all dependencies, download them, and integrate them into your work. If you want to find additional packages that can be installed with the composer, you should visit https://packagist.org/. This site is the main composer repository, and contains information about most of the packages that are installable with the composer.

To install the composer, go to https://getcomposer.org/download/ and follow the download instructions. They should be similar to the following:

$ curl -sS https://getcomposer.org/installer | php

If the download was successful, you should see the composer.phar file in your directory. Move it to the project location, in the same place where you have the composer.json and composer.lock files. You can also install it globally, if you prefer, with these two commands:

$ curl -sS https://getcomposer.org/installer | php
$ sudo mv composer.phar /usr/local/bin/composer

You will usually need only three composer commands: require, install, and update. The require command is executed when you need to add a new dependency. The install command is used to install the dependencies that the project already declares. The update command is used when you need to fetch the latest versions of your dependencies, as allowed by the JSON file.

The difference between install and update is subtle, but very important. If you execute the update command, your composer.lock file gets updated with the versions of the code you just fetched and downloaded.
The install command uses the information stored in the composer.lock file and fetches exactly the versions recorded in this file.

When to use install? For example, if you deploy the code to the server, you should use install rather than update, as it will deploy the versions of the code stored in composer.lock, rather than download the latest versions (which may be untested by you). Also, if you work in a team and you just got an update through Git, you should use install to fetch the vendor code updated by other developers.

You should use the update command if you want to check whether there are updated versions of the packages you have installed. For example, when a new minor version of Symfony2 has been released, the update command will fetch it along with any other available updates.

As an example, let's install one extra package for user management called FOSUserBundle (FOS is short for Friends of Symfony). We will only install it here; we will not configure it.

To install FOSUserBundle, we need to know the correct package name and version. The easiest way is to look at the packagist site at https://packagist.org/ and search for the package there. If you type fosuserbundle, the search should return a package called friendsofsymfony/user-bundle as one of the top results. The download counts visible on the right-hand side may also be helpful in determining how popular the bundle is. If you click on this result, you will end up on a page with detailed information about the bundle, such as its homepage, versions, and requirements.
Type the following command:

$ php composer.phar require friendsofsymfony/user-bundle ^1.3
Using version ^1.3 for friendsofsymfony/user-bundle
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
  - Installing friendsofsymfony/user-bundle (v1.3.6)
    Loading from cache
friendsofsymfony/user-bundle suggests installing willdurand/propel-typehintable-behavior (Needed when using the propel implementation)
Writing lock file
Generating autoload files
...

Which version of the package you choose is up to you. If you are interested in package versioning standards, see the composer website at https://getcomposer.org/doc/01-basic-usage.md#package-versions for more information.

The composer holds all the configurable information about dependencies, and where to install them, in a special JSON file called composer.json. Let's take a look at it:

{
    "name": "wbancer/todoapp",
    "license": "proprietary",
    "type": "project",
    "autoload": {
        "psr-0": {
            "": "src/",
            "SymfonyStandard": "app/SymfonyStandard/"
        }
    },
    "require": {
        "php": ">=5.3.9",
        "symfony/symfony": "2.7.*",
        "doctrine/orm": "~2.2,>=2.2.3,<2.5",
        // [...]
        "incenteev/composer-parameter-handler": "~2.0",
        "friendsofsymfony/user-bundle": "^1.3"
    },
    "require-dev": {
        "sensio/generator-bundle": "~2.3"
    },
    "scripts": {
        "post-root-package-install": [
            "SymfonyStandard\\Composer::hookRootPackageInstall"
        ],
        "post-install-cmd": [
            // post installation steps
        ],
        "post-update-cmd": [
            // post update steps
        ]
    },
    "config": {
        "bin-dir": "bin"
    },
    "extra": {
        // [...]
    }
}

The most important section is the one with the require key. It holds all the information about the packages we want to use within the project. The scripts key contains a set of instructions to run after install and after update. The extra key in this case contains some settings specific to the Symfony2 framework. Note that one of the values in here points to the parameters.yml file.
This file is the main file holding the custom machine-specific parameters. The meaning of the other keys is rather obvious. If you look into the vendor/ directory, you will notice that our package has been installed in the vendor/friendsofsymfony/user-bundle directory.

The configuration files

Each application needs to hold some global and machine-specific parameters and configurations. Symfony2 holds configuration within the app/config directory, and it is split into a few files as follows:

config.yml
config_dev.yml
config_prod.yml
config_test.yml
parameters.yml
parameters.yml.dist
routing.yml
routing_dev.yml
security.yml
services.yml

All the files except the parameters.yml* files contain global configuration, while the parameters.yml file holds machine-specific information such as database host, database name, user, password, and SMTP configuration. The default configuration file generated by the new Symfony command will be similar to the following one. This file is auto-generated during the composer install:

parameters:
    database_driver: pdo_mysql
    database_host: 127.0.0.1
    database_port: null
    database_name: symfony
    database_user: root
    database_password: null
    mailer_transport: smtp
    mailer_host: 127.0.0.1
    mailer_user: null
    mailer_password: null
    secret: 93b0eebeffd9e229701f74597e10f8ecf4d94d7f

As you can see, it mostly holds the parameters related to the database, SMTP, locale settings, and the secret key that are used internally by Symfony2. Here, you can add your custom parameters using the same syntax. It is a good practice to keep machine-specific data such as passwords, tokens, API keys, and access keys within this file only. Putting passwords in the general config.yml file is considered a security risk.

The global configuration is split across several files besides config.yml. The routing*.yml files contain the routing information for the development and production configurations.
The security.yml file holds information related to authentication and securing access to the application. Note that some files contain information for the development, production, or test mode. You can define your mode when you run Symfony through the command-line console and when you run it through the web server. In most cases, while developing, you will be using the dev mode.

The Symfony2 console

To finish, let's take a look at the Symfony console script. We used it before to fire up the development server, but it offers more. Execute the following:

$ php app/console

You will see a list of supported commands. Each command has a short description. Each of the standard commands comes with help, so I will not describe each of them here, but it is worth mentioning a few commonly used ones:

Command: app/console cache:clear
Description: Symfony in production uses a lot of caching. Therefore, if you need to change values within a template (twig) or within configuration files while in production mode, you will need to clear the cache. Caching is also one of the reasons why it is worth working in development mode.

Command: app/console container:debug
Description: Displays all configured public services.

Command: app/console router:debug
Description: Displays all routing configuration along with method, scheme, host, and path.

Command: app/console security:check
Description: Checks the versions of your installed packages against known security vulnerabilities. You should run this command regularly.

Summary

In this article, we have demonstrated how to use the Symfony2 installer, test the configuration, run the development server, and play around with the Symfony2 command line. We have also installed the composer and learned how to install a package using it. To demonstrate how Symfony2 enables you to make web applications faster, we will try to learn through examples that can be found in real life. To make this task easier, we will try to produce a real to-do web application with a modern look and a few working features.
If you are interested in the other Symfony books that Packt has in store for you, see the following titles:

Symfony 1.3 Web Application Development, Tim Bowler, Wojciech Bancer
Extending Symfony2 Web Application Framework, Sébastien Armand

Resources for Article: Further resources on this subject:

A Command-line Companion Called Artisan [article]
Creating and Using Composer Packages [article]
Services [article]

Packt
08 Sep 2015
30 min read

The NetBeans Developer's Life Cycle

In this article by David Salter, the author of Mastering NetBeans, we'll cover the following topics: Running applications Debugging applications Profiling applications Testing applications On a day-to-day basis, developers spend much of their time writing and running applications. While writing applications, they typically debug, test, and profile them to ensure that they provide the best possible application to customers. Running, debugging, profiling, and testing are all integral parts of the development life cycle, and NetBeans provides excellent tooling to help us in all these areas. (For more resources related to this topic, see here.) Running applications Executing applications from within NetBeans is as simple as either pressing the F6 button on the keyboard or selecting the Run menu item or Project Context menu item. Choosing either of these options will launch your application without specifying any additional Java command-line parameters using the default platform JDK that NetBeans is currently using. Sometimes we want to change the options that are used for launching applications. NetBeans allows these options to be easily specified by a project's properties. Right-clicking on a project in the Projects window and selecting the Properties menu option opens the Project Properties dialog. Selecting the Run category allows the configuration options to be defined for launching an application. From this dialog, we can define and select multiple run configurations for the project via the Configuration dropdown. Selecting the New… button to the right of the Configuration dropdown allows us to enter a name for a new configuration. Once a new configuration is created, it is automatically selected as the active configuration. The Delete button can be used for removing any unwanted configurations. The preceding screenshot shows the Project Properties dialog for a standard Java project. 
Different project types (for example, web or mobile projects) have different options in the Project Properties window. As can be seen from the preceding Project Properties dialog, several pieces of information can be defined for a standard Java project, which together make up the launch configuration for a project: Runtime Platform: This option allows us to define which Java platform we will use when launching the application. From here, we can select from all the Java platforms that are configured within NetBeans. Selecting the Manage Platforms… button opens the Java Platform Manager dialog, allowing full configuration of the different Java platforms available (both Java Standard Edition and Remote Java Standard Edition). Selecting this button has the same effect as selecting the Tools and then Java Platforms menu options. Main Class: This option defines the main class that is used to launch the application. If the project has more than one main class, selecting the Browse… button will cause the Browse Main Classes dialog to be displayed, listing all the main classes defined in the project. Arguments: Different command-line arguments can be passed to the main class as defined in this option. Working Directory: This option allows the working directory for the application to be specified. VM Options: If different VM options (such as heap size) require setting, they can be specified by this option. Selecting the Customize button displays a dialog listing the different standard VM options available which can be selected (ticked) as required. Custom VM properties can also be defined in the dialog. For more information on the different VM properties for Java, check out http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html. From here, the VM properties for Java 7 (and earlier versions) and Java 8 for Windows, Solaris, Linux, and Mac OS X can be referenced. 
Run with Java Web Start: Selecting this option allows the application to be executed using Java Web Start technologies. This option is only available if Web Start is enabled in the Application | Web Start category. When running a web application, the project properties are different from those of a standalone Java application. In fact, the project properties for a Maven web application are different from those of a standard NetBeans web application. The following screenshot shows the properties for a Maven-based web application; as discussed previously, Maven is the standard project management tool for Java applications, and the recommended tool for creating and managing web applications: Debugging applications In the previous section, we saw how NetBeans provides the easy-to-use features to allow developers to launch their applications, but then it also provides more powerful additional features. The same is true for debugging applications. For simple debugging, NetBeans provides the standard facilities you would expect, such as stepping into or over methods, setting line breakpoints, and monitoring the values of variables. When debugging applications, NetBeans provides several different windows, enabling different types of information to be displayed and manipulated by the developer: Breakpoints Variables Call stack Loaded classes Sessions Threads Sources Debugging Analyze stack All of these windows are accessible from the Window and then Debugging main menu within NetBeans. Breakpoints NetBeans provides a simple approach to set breakpoints and a more comprehensive approach that provides many more useful features. Breakpoints can be easily added into Java source code by clicking on the gutter on the left-hand side of a line of Java source code. When a breakpoint is set, a small pink square is shown in the gutter and the entire line of source code is also highlighted in the same color. 
Clicking on the breakpoint square in the gutter toggles the breakpoint on and off. Once a breakpoint has been created, instead of removing it altogether, it can be disabled by right-clicking on the breakpoint in the gutter and selecting the Breakpoint and then Enabled menu options. This keeps the breakpoint within your codebase, but execution of the application does not stop when the disabled breakpoint is reached.

Creating a simple breakpoint like this can be a very powerful way of debugging applications. It allows you to stop the execution of an application when a line of code is hit. If we want a bit more control over a simple breakpoint, we can edit the breakpoint's properties by right-clicking on the breakpoint in the gutter and selecting the Breakpoint and then Properties menu options. This causes the Breakpoint Properties dialog to be displayed.

In this dialog, we can see the line number and the file that the breakpoint belongs to. The line number can be edited to move the breakpoint if it has been created on the wrong line. What is more interesting, however, is the conditions that we can apply to the breakpoint. The Condition entry allows us to define a condition that has to be met for the breakpoint to stop the code execution. For example, we can stop the code when the variable i is equal to 20 by adding the condition i==20. When we add a condition to a breakpoint, it becomes known as a conditional breakpoint, and the icon in the gutter changes to a square with the lower-right quadrant removed.

We can also cause the execution of the application to halt at a breakpoint when the breakpoint has been hit a certain number of times. The Break when hit count is condition can be set to Equal to, Greater than, or Multiple of to halt the execution of the application when the breakpoint has been hit the requisite number of times. Finally, we can specify what actions occur when a breakpoint is hit.
The Suspend dropdown allows us to define what threads are suspended when a breakpoint is hit. NetBeans can suspend All threads, Breakpoint thread, or no threads at all. The text that is displayed in the Output window can be defined via the Print Text edit box and different breakpoint groups can be enabled or disabled via the Enable Group and Disable Group drop-down boxes. But what exactly is a breakpoint group? Simply put, a breakpoint group is a collection of breakpoints that can all be set or unset at the same time. It is a way of categorizing breakpoints into similar collections, for example, all the breakpoints in a particular file, or all the breakpoints relating to exceptions or unit tests. Breakpoint groups are created in the Breakpoints window. This is accessible by selecting the Debugging and then Breakpoints menu options from within the main NetBeans Window menu. To create a new breakpoint group, simply right-click on an existing breakpoint in the Breakpoints window and select the Move Into Group… and then New… menu options. The Set the Name of Breakpoints Group dialog is displayed in which the name of the new breakpoint group can be entered. After creating a breakpoint group and assigning one or more breakpoints into it, the entire group of breakpoints can be enabled or disabled, or even deleted by right-clicking on the group in the Breakpoints window and selecting the appropriate option. Any newly created breakpoint groups will also be available in the Breakpoint Properties window. So far, we've seen how to create breakpoints that stop on a single line of code, and also how to create conditional breakpoints so that we can cause an application to stop when certain conditions occur for a breakpoint. These are excellent techniques to help debug applications. NetBeans, however, also provides the ability to create more advanced breakpoints so that we can get even more control of when the execution of applications is halted by breakpoints. 
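The line, conditional, and hit-count breakpoints described so far are easiest to try on a loop. The following is a minimal sketch of such a target; the class name ConditionalBreakpointDemo and the sumBelow helper are hypothetical and do not come from the article. Only the variable name i and the condition i==20 match the earlier example.

```java
// Hypothetical practice target for the breakpoint settings described above.
class ConditionalBreakpointDemo {

    // Sums the integers from 0 up to (but not including) limit.
    static int sumBelow(int limit) {
        int total = 0;
        for (int i = 0; i < limit; i++) {
            // Place a line breakpoint on the next statement. With the
            // Condition i==20 in Breakpoint Properties, execution should
            // pause only on the iteration where i equals 20. A hit count
            // setting of "Multiple of" 10 should instead pause on every
            // tenth time this line is reached.
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        // Run under the debugger (Debug Project) rather than Run Project.
        System.out.println(sumBelow(100));
    }
}
```

Without any condition, the breakpoint inside the loop would stop the program 100 times, which is exactly the situation that conditions and hit counts are meant to avoid. The more advanced breakpoint types mentioned above go beyond a single line of code and offer further control.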
So, how do we create these breakpoints? These different types of breakpoints are all created from the Breakpoints window by right-clicking and selecting the New Breakpoint… menu option. In the New Breakpoint dialog, we can create different types of breakpoints by selecting the appropriate entry from the Breakpoint Type drop-down list. The preceding screenshot shows an example of creating a Class breakpoint. The following types of breakpoints can be created:

Class: This creates a breakpoint that halts execution when a class is loaded, unloaded, or either event occurs.
Exception: This stops execution when the specified exception is caught, uncaught, or either event occurs.
Field: This creates a breakpoint that halts execution when a field on a class is accessed, modified, or either event occurs.
Line: This stops execution when the specified line of code is executed. It acts the same way as creating a breakpoint by clicking on the gutter of the Java source code editor window.
Method: This creates a breakpoint that halts execution when a method is entered, exited, or when either event occurs. Optionally, the breakpoint can be created for all methods inside a specified class rather than a single method.
Thread: This creates a breakpoint that stops execution when a thread is started, finished, or either event occurs.
AWT/Swing Component: This creates a breakpoint that stops execution when a GUI component is accessed.

For each of these different types of breakpoints, conditions and actions can be specified in the same way as on simple line-based breakpoints.

The Variables debug window

The Variables debug window lists all the variables that are currently within the scope of execution of the application. This is therefore thread-specific, so if multiple threads are running at one time, the Variables window will only display variables in scope for the currently selected thread.
In the Variables window, we can see the variables currently in scope for the selected thread, along with their type and value. To display variables for a thread other than the currently selected one, we must select an alternative thread via the Debugging window. Using the triangle button to the left of each variable, we can expand variables and drill down into the properties within them. When a variable holds a simple value (for example, an integer or a string), we can modify it, or any such property within it, by altering the value in the Value column in the Variables window. The variable's value will then be changed within the running application to the newly entered value.

By default, the Variables window shows three columns (Name, Type, and Value). We can modify which columns are visible by pressing the selection icon at the top-right of the window. Selecting this displays the Change Visible Columns dialog, from which we can choose among the Name, String value, Type, and Value columns.

The Watches window

The Watches window allows us to see the contents of variables and expressions during a debugging session, as can be seen in the following screenshot. In this screenshot, we can see that the variable i is being displayed along with the expressions 10+10 and i+20. New expressions can be watched by clicking on the <Enter new watch> option or by right-clicking on the Java source code editor and selecting the New Watch… menu option.

Evaluating expressions

In addition to watching variables in a debugging session, NetBeans also provides the facility to evaluate expressions. Expressions can contain any Java code that is valid for the running scope of the application. So, for example, local variables, class variables, or new instances of classes can be evaluated. To evaluate expressions, open the Evaluate Expression window by selecting the Debug and then Evaluate Expression menu options.
Enter an expression to be evaluated in this window and press the Evaluate Code Fragment button at the bottom-right corner of the window. As a shortcut, pressing the Ctrl + Enter keys will also evaluate the code fragment. Once an expression has been evaluated, it is shown in the Evaluation Result window. The Evaluation Result window shows a history of each expression that has previously been evaluated. Expressions can be added to the list of watched variables by right-clicking on the expression and selecting the Create Fixed Watch option.

The Call Stack window

The Call Stack window displays the call stack for the currently executing thread. The call stack is displayed from top to bottom, with the currently executing frame at the top of the list. Double-clicking on any entry in the call stack opens up the corresponding source code in the Java editor within NetBeans. Right-clicking on an entry in the call stack displays a pop-up menu with the following choices:

Make Current: This makes the selected thread the current thread
Pop To Here: This pops the execution of the call stack to the selected location
Go To Source: This displays the selected code within the Java source editor
Copy Stack: This copies the stack trace to the clipboard for use elsewhere

When debugging, it can be useful to change the stack frame of the currently executing thread by selecting the Pop To Here option from within the stack trace window. Imagine the following code:

    // Get some magic
    int magic = getSomeMagicNumber();

    // Perform calculation
    performCalculation(magic);

During a debugging session, if after stepping over the getSomeMagicNumber() method we decide that the method has not worked as expected, our course of action would probably be to debug into the getSomeMagicNumber() method. But we've just stepped over the method, so what can we do?
Well, we can stop the debugging session and start again or repeat the operation that called this section of code and hope there are no changes to the application state that affect the method we want to debug. A better solution, however, would be to select the line of code that calls the getSomeMagicNumber() method and pop the stack frame using the Pop To Here option. This would have the effect of rewinding the code execution so that we can then step into the method and see what is happening inside it. As well as using the Pop To Here functionality, NetBeans also offers several menu options for manipulating the stack frame, namely: Make Callee Current: This makes the callee of the current method the currently executing stack frame Make Caller Current: This makes the caller of the current method the currently executing stack frame Pop Topmost Call: This pops one stack frame, making the calling method the currently executing stack frame When moving around the call stack using these techniques, any operations performed by the currently executing method are not undone. So, for example, strange results may be seen if global or class-based variables are altered within a method and then an entry is popped from the call stack. Popping entries in the call stack is safest when no state changes are made within a method. The call stack displayed in the Debugging window for each thread behaves in the same way as in the Call Stack window itself. The Loaded Classes window The Loaded Classes window displays a list of all the classes that are currently loaded, showing how many instances there are of each class as a number and as a percentage of the total number of classes loaded. Depending upon the number of external libraries (including the standard Java runtime libraries) being used, you may find it difficult to locate instances of your own classes in this window. 
Fortunately, the filter at the bottom of the window allows the list of classes to be filtered, based upon an entered string. So, for example, entering the filter value String will show all the currently loaded classes with String in the fully qualified class name, including java.lang.String and java.lang.StringBuffer. Since the filter works on the fully qualified name of a class, entering a package name will show all the classes listed in that package and its subpackages. So, for example, entering the filter value com.davidsalter.multithread would show only the classes listed in that package and its subpackages.

The Sessions window

Within NetBeans, it is possible to perform multiple debugging sessions, where either one project is being debugged multiple times or, more commonly, multiple projects are being debugged at the same time, with one acting as a client application and the other acting as a server application. The Sessions window displays a list of the currently running debug sessions, giving the developer control over which one is the current session. Right-clicking on any of the sessions listed in the window provides the following options:

Make Current: This makes the selected session the currently active debugging session
Scope: This debugs the current thread or all the threads in the selected session
Language: This option shows the language of the application being debugged (Java)
Finish: This finishes the selected debugging session
Finish All: This finishes all the debugging sessions

The Sessions window shows the name of the debug session (for example, the main class being executed), its state (whether the application is Stopped or Running), and the language being debugged. Clicking the selection icon at the top-right of the window allows the user to choose which columns are displayed in the window. The default choice is to display all columns except for the Host Name column, which displays the name of the computer the session is running on.
The Threads window

The Threads window displays a hierarchical list of threads in use by the application currently being debugged. The current thread is displayed in bold. Double-clicking on any of the threads in the hierarchy makes the thread current. Similar to the Debugging window, threads can be made current, suspended, or interrupted by right-clicking on the thread and selecting the appropriate option. The default display for the Threads window is to show the thread's name and its state (Running, Waiting, or Sleeping). Clicking the selection icon at the top-right of the window allows the user to choose which columns are displayed in the window.

The Sources window

The Sources window simply lists all of the source roots that NetBeans considers for the selected project. These are the only locations that NetBeans will search when looking for source code while debugging an application. If you find that you are debugging an application and you cannot step into code, the most likely cause is that the source root for the code you wish to debug is not included in the Sources window. To add a new source root, right-click in the Sources window and select the Add Source Root option.

The Debugging window

The Debugging window allows us to see which threads are running while debugging our application. This window is, therefore, particularly useful when debugging multithreaded applications. In this window, we can see the different threads that are running within our application. For each thread, we can see the name of the thread and the call stack leading to the breakpoint. The current thread is highlighted with a green band along the left-hand side edge of the window. Other threads created within our application are denoted with a yellow band along the left-hand side edge of the window. System threads are denoted with a gray band. We can make any of the threads the current thread by right-clicking on it and selecting the Make Current menu option.
When we do this, the Variables and Call Stack windows are updated to show new information for the selected thread. The current thread can also be selected by clicking on the Debug and then Set Current Thread… menu options. Upon selecting this, a list of running threads is shown from which the current thread can be selected. Right-clicking on a thread and selecting the Resume option will cause the selected thread to continue execution until it hits another breakpoint. For each thread that is running, we can also Suspend, Interrupt, and Resume the thread by right-clicking on the thread and choosing the appropriate action. In each thread listing, the call stack of the current method is displayed. This can be manipulated in the same way as from the Call Stack window.

When debugging multithreaded applications, new breakpoints can be hit within different threads at any time. NetBeans helps us with multithreaded debugging by not automatically switching the user interface to a different thread when a breakpoint is hit on a non-current thread. When a breakpoint is hit on any thread other than the current thread, an indication is displayed at the bottom of the Debugging window, stating New Breakpoint Hit (an example of this can be seen in the previous screenshot). Clicking on the icon to the right of the message shows all the breakpoints that have been hit, together with the name of the thread in which each occurred. Selecting the alternate thread will cause the relevant breakpoint to be opened within NetBeans and highlighted in the appropriate Java source code file.

NetBeans provides several filters on the Debugging window so that we can show more or less information as appropriate.
From left to right, these images allow us to: Show less (suspended and current threads only) Show thread groups Show suspend/resume table Show system threads Show monitors Show qualified names Sort by suspended/resumed state Sort by name Sort by default Debugging multithreaded applications can be a lot easier if you give your threads names. The thread's name is displayed in the Debugging window, and it's a lot easier to understand what a thread with a proper name is doing as opposed to a thread called Thread-1. Deadlock detection When debugging multithreaded applications, one of the problems that we can see is that a deadlock occurs between executing threads. A deadlock occurs when two or more threads become blocked forever because they are both waiting for a shared resource to become available. In Java, this typically occurs when the synchronized keyword is used. NetBeans allows us to easily check for deadlocks using the Check for Deadlock tool available on the Debug menu. When a deadlock is detected using this tool, the state of the deadlocked threads is set to On Monitor in the Threads window. Additionally, the threads are marked as deadlocked in the Debugging window. Each deadlocked thread is displayed with a red band on the left-hand side border and the Deadlock detected warning message is displayed. Analyze Stack Window When running an application within NetBeans, if an exception is thrown and not caught, the stack trace will be displayed in the Output window, allowing the developer to see exactly where errors have occurred. From the following screenshot, we can easily see that a NullPointerException was thrown from within the FaultyImplementation class in the doUntestedOperation() method at line 16. 
Looking at the previous frame in the stack trace (that is, the entry underneath), we can see that the doUntestedOperation() method was called from within the main() method of the Main class at line 21.

In the preceding example, the FaultyImplementation class is defined as follows:

    public class FaultyImplementation {
        public void doUntestedOperation() {
            throw new NullPointerException();
        }
    }

Java provides an invaluable feature to developers here, allowing us to easily see where exceptions are thrown and what sequence of events led to the exception being thrown. NetBeans, however, enhances the display of the stack traces by making the class and line numbers clickable hyperlinks which, when clicked on, will navigate to the appropriate line in the code. This allows us to easily delve into a stack trace and view the code at all the levels of the stack trace. In the previous screenshot, we can click on the hyperlinks FaultyImplementation.java:16 and Main.java:21 to take us to the appropriate line in the appropriate Java file.

This is an excellent time-saving feature when developing applications, but what do we do when someone e-mails us a stack trace to look at an error in a production system? How do we manage stack traces that are stored in log files? Fortunately, NetBeans provides an easy way to turn a stack trace into clickable hyperlinks so that we can browse through the stack trace without running the application.

To load a stack trace into NetBeans, the first step is to copy the stack trace onto the system clipboard. Once the stack trace has been copied onto the clipboard, Analyze Stack Window can be opened within NetBeans by selecting the Window and then Debugging and then Analyze Stack menu options (the default installation for NetBeans has no keyboard shortcut assigned to this operation). Analyze Stack Window will default to showing the stack trace that is currently in the system clipboard.
If no stack trace is in the clipboard, or any other data is in the system's clipboard, Analyze Stack Window will be displayed with no contents. To populate the window, copy a stack trace into the system's clipboard and select the Insert StackTrace From Clipboard button. Once a stack trace has been displayed in Analyze Stack Window, clicking on the hyperlinks in it will navigate to the appropriate location in the Java source files just as it does from the Output window when an exception is thrown from a running application.

You can only navigate to source code from a stack trace if the project containing the relevant source code is open in the selected project group.

Variable formatters

When debugging an application, the NetBeans debugger can display the values of simple primitives in the Variables window. As we saw previously, we can also display the toString() representation of a variable if we select the appropriate columns to display in the window. Sometimes when debugging, however, the toString() representation is not the best way to display formatted information in the Variables window. In this window, we are showing the value of a complex number class that we have used in high school math.

The ComplexNumber class being debugged in this example is defined as:

    public class ComplexNumber {
        private double realPart;
        private double imaginaryPart;

        public ComplexNumber(double realPart, double imaginaryPart) {
            this.realPart = realPart;
            this.imaginaryPart = imaginaryPart;
        }

        @Override
        public String toString() {
            return "ComplexNumber{" + "realPart=" + realPart + ", imaginaryPart=" + imaginaryPart + '}';
        }

        // Getters and Setters omitted for brevity…
    }

Looking at this class, we can see that it essentially holds two members: realPart and imaginaryPart.
The toString() method outputs a string, detailing the name of the object and its parameters which would be very useful when writing ComplexNumbers to log files, for example: ComplexNumber{realPart=1.0, imaginaryPart=2.0} When debugging, however, this is a fairly complicated string to look at and comprehend—particularly, when there is a lot of debugging information being displayed. NetBeans, however, allows us to define custom formatters for classes that detail how an object will be displayed in the Variables window when being debugged. To define a custom formatter, select the Java option from the NetBeans Options dialog and then select the Java Debugger tab. From this tab, select the Variable Formatters category. On this screen, all the variable formatters that are defined within NetBeans are shown. To create a new variable formatter, select the Add… button to display the Add Variable Formatter dialog. In the Add Variable Formatter dialog, we need to enter Formatter Name and a list of Class types that NetBeans will apply the formatting to when displaying values in the debugger. To apply the formatter to multiple classes, enter the different classes, separated by commas. The value that is to be formatted is entered in the Value formatted as a result of code snippet field. This field takes the scope of the object being debugged. So, for example, to output the ComplexNumber class, we can enter the custom formatter as: "("+realPart+", "+imaginaryPart+"i)" We can see that the formatter is built up from concatenating static strings and the values of the members realPart and imaginaryPart. We can see the results of debugging variables using custom formatters in the following screenshot: Debugging remote applications The NetBeans debugger provides rapid access for debugging local applications that are executing within the same JVM as NetBeans. What happens though when we want to debug a remote application? 
A remote application isn't necessarily hosted on a separate server to your development machine, but is defined as any application running outside of the local JVM (that is the one that is running NetBeans). To debug a remote application, the NetBeans debugger can be "attached" to the remote application. Then, to all intents, the application can be debugged in exactly the same way as a local application is debugged, as described in the previous sections of this article. To attach to a remote application, select the Debug and then Attach Debugger… menu options. On the Attach dialog, the connector (SocketAttach, ProcessAttach, or SocketListen) must be specified to connect to the remote application. The appropriate connection details must then be entered to attach the debugger. For example, the process ID must be entered for the ProcessAttach connector and the host and port must be specified for the SocketAttach connector. Profiling applications Learning how to debug applications is an essential technique in software development. Another essential technique that is often overlooked is profiling applications. Profiling applications involves measuring various metrics such as the amount of heap memory used or the number of loaded classes or running threads. By profiling applications, we can gain an understanding of what our applications are actually doing and as such we can optimize them and make them function better. NetBeans provides first class profiling tools that are easy to use and provide results that are easy to interpret. The NetBeans profiler allows us to profile three specific areas: Application monitoring Performance monitoring Memory monitoring Each of these monitoring tools is accessible from the Profile menu within NetBeans. To commence profiling, select the Profile and then Profile Project menu options. After instructing NetBeans to profile a project, the profiler starts providing the choice of the type of profiling to perform. 
Testing applications Writing tests for applications is probably one of the most important aspects of modern software development. NetBeans provides the facility to write and run both JUnit and TestNG tests and test suites. In this section, we'll provide details on how NetBeans allows us to write and run these types of tests, but we'll assume that you have some knowledge of either JUnit or TestNG. TestNG support is provided by default with NetBeans, however, due to license concerns, JUnit may not have been installed when you installed NetBeans. If JUnit support is not installed, it can easily be added through the NetBeans Plugins system. In a project, NetBeans creates two separate source roots: one for application sources and the other for test sources. This allows us to keep tests separate from application source code so that when we ship applications, we do not need to ship tests with them. This separation of application source code and test source code enables us to write better tests and have less coupling between tests and applications. The best situation is for the test source root to have a dependency on application classes and the application classes to have no dependency on the tests that we have written. To write a test, we must first have a project. Any type of Java project can have tests added into it. To add tests into a project, we can use the New File wizard. In the Unit Tests category, there are templates for: JUnit Tests Tests for Existing Class (this is for JUnit tests) Test Suite (this is for JUnit tests) TestNG Test Case TestNG Test Suite When creating classes for these types of tests, NetBeans provides the option to automatically generate code; this is usually a good starting point for writing classes. When executing tests, NetBeans iterates through the test packages in a project looking for the classes that are suffixed with the word Test. It is therefore essential to properly name tests to ensure they are executed correctly. 
Once tests have been created, NetBeans provides several methods for running the tests. The first method is to run all the tests that we have defined for an application. Selecting the Run and then Test Project menu options runs all of the tests defined for a project. The type of the project doesn't matter (Java SE or Java EE), nor does it matter whether a project uses Maven or the NetBeans project build system (Ant projects are even supported if they have a valid test activity); all tests for the project will be run when selecting this option. After running the tests, the Test Results window will be displayed, highlighting successful tests in green and failed tests in red.

In the Test Results window, we have several options to help categorize and manage the tests:

- Rerun all of the tests
- Rerun the failed tests
- Show only the passed tests
- Show only the failed tests
- Show errors
- Show aborted tests
- Show skipped tests
- Locate previous failure
- Locate next failure
- Always open test result window
- Always open test results in a new tab

The second option within NetBeans for running tests is to run all the tests in a package or class. To perform these operations, simply right-click on a package in the Projects window and select Test Package, or right-click on a Java class in the Projects window and select Test File.

The final option for running tests is to execute a single test in a class. To perform this operation, right-click on a test in the Java source code editor and select the Run Focussed Test Method menu option.

After creating tests, how do we keep them up to date when we add new methods to application code? We can keep test suites up to date by manually editing them and adding new methods corresponding to new application code, or we can use the Create/Update Tests menu. Selecting the Tools and then Create/Update Tests menu options displays the Create Tests dialog that allows us to edit the existing test classes and add new methods into them, based upon the existing application classes.
Summary In this article, we looked at the typical tasks that a developer does on a day-to-day basis when writing applications. We saw how NetBeans can help us to run and debug applications and how to profile applications and write tests for them. Finally, we took a brief look at TDD, and saw how the Red-Green-Refactor cycle can be used to help us develop more stable applications. Resources for Article: Further resources on this subject: Contexts and Dependency Injection in NetBeans [article] Creating a JSF composite component [article] Getting to know NetBeans [article]
Packt
08 Sep 2015
15 min read

Application Development Workflow

 In this article by Ivan Turkovic, author of the book PhoneGap Essentials, you will learn some of the basics on how to work with the PhoneGap application development and how to start building the application. We will go over some useful steps and tips to get the most out of your PhoneGap application. In this article, you will learn the following topics: An introduction to a development workflow Best practices Testing (For more resources related to this topic, see here.) An introduction to a development workflow PhoneGap solves a great problem of developing mobile applications for multiple platforms at the same time, but still it is pretty much open about how you want to approach the creation of an application. You do not have any predefined frameworks that come out of-the-box by default. It just allows you to use the standard web technologies such as the HTML5, CSS3, and JavaScript languages for hybrid mobile application development. The applications are executed in wrappers that are custom-built to work on every platform and the underlying web view behaves in the same way on all the platforms. For accessing device APIs, it relies on the standard API bindings to access every device's sensors or the other features. The developers who start using PhoneGap usually come from different backgrounds, as shown in the following list: Mobile developers who want to expand the functionality of their application on other platforms but do not want to learn a new language for each platform Web developers who want to port their existing desktop web application to a mobile application; if they are using a responsive design, it is quite simple to do this Experienced mobile developers who want to use both the native and web components in their application, so that the web components can communicate with the internal native application code as well The PhoneGap project itself is pretty simple. 
By default, it can open an index.html page and load the initial CSS file, JavaScript, and other resources needed to run it. Besides the user's resources, it needs to refer the cordova.js file, which provides the API bindings for all the plugins. From here onwards, you can take different steps but usually the process falls in two main workflows: web development workflow and native platform development. Web project development A web project development workflow can be used when you want to create a PhoneGap application that runs on many mobile operating systems with as little as possible changes to a specific one. So there is a single codebase that is working along with all the different devices. It has become possible with the latest versions since the introduction of the command-line interface (CLI). This automates the tedious work involved in a lot of the functionalities while taking care of each platform, such as building the app, copying the web assets in the correct location for every supported platform, adding platform-specific changes, and finally running build scripts to generate binaries. This process can be automated even more with build system automating tasks such as Gulp or Grunt. You can run these tasks before running PhoneGap commands. This way you can optimize the assets before they are used. Also you can run JSLint automatically for any change or doing automatic builds for every platform that is available. Native platform development A native platform development workflow can be imagined as a focus on building an application for a single platform and the need to change the lower-level platform details. The benefit of using this approach is that it gives you more flexibility and you can mix the native code with a WebView code and impose communication between them. 
This is appropriate for applications in which some features are hard to reproduce with web views alone; for example, a video app where the video editing is done in native code and all the social features and interaction are done with web views. Even if you want to end up with this approach, it is better to start the new project with the web project development workflow and then separate out the code for your specific needs. One thing to keep in mind is that, with this approach, it is better to develop the application in the more advanced IDE environments that you would usually use for building native applications.

Best practices

Running hybrid mobile applications requires some sacrifices in terms of performance and functionality, so it is good to go over some useful tips for new PhoneGap developers.

Use local assets for the UI

As mobile devices are limited by connection speeds, and mobile data plans are not generous with bandwidth, you need to prepare all the UI components in the application before deploying to the app store. Nobody will want to use an application that takes a few seconds to load a server-rendered UI when the same thing could be done on the client. For example, Google Fonts or other non-UI assets that web applications usually load from a server are good enough during development, but for production you need to store all the assets in the application's container and not download them while the application is running. You do not want the application to wait while an important part is being loaded.

The best advice on the UI that I can give you is to adopt the Single Page Application (SPA) design; it is a client-side application that is run from one request from a web page.
Initial loading means taking care of loading all the assets that the application requires in order to function, and any further updates are done via AJAX (such as loading data). When you use SPA, not only do you minimize the amount of interaction with the server, you also organize your application in a more efficient manner. One of the benefits is that the application doesn't need to wait for a deviceready event for each additional page that it loads from the start.

Network access for data

As you have seen in the previous section, mobile applications face many limitations with the network connection, from mobile data plans to network latency. So you do not want the application to rely on the network for crucial elements, unless real-time communication is required. Try to limit network access to crucial data; everything else that is used frequently can be packed into assets. If the received data does not change often, it is advisable to cache it for offline use. There are many ways to achieve this, such as localStorage, sessionStorage, WebSQL, or a file. When loading data, try to load only the data you need at that moment. If you have a comment section, it will not make sense to load all thousand comments; the first twenty comments should be enough to start with.

Non-blocking UI

When you are loading additional data to show in the application, don't pause the application until you receive all the data that you need. You can add some animation or a spinner to show the progress. Do not let the user stare at an unchanged screen after pressing a button. Try to disable actions once they are in motion in order to prevent sending the same action multiple times.

CSS animations

As most of the modern mobile platforms now support CSS3 with a more or less consistent feature set, it is better to make the animations and transitions with CSS rather than with the plain JavaScript DOM manipulation, which was done before CSS3.
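As a rough sketch of the CSS-driven approach, the script's only job can be to declare the target state and let the browser animate toward it. The helper name slideIn, the 250 ms duration, and the easing value are illustrative assumptions, not part of PhoneGap; older WebViews may also need the vendor-prefixed equivalents of these properties.

```javascript
// Sketch: animate with a CSS transition instead of a JavaScript timer loop.
// `slideIn` is a hypothetical helper. It works on any object that has a
// `style` property, so it can be exercised outside a browser as well.
function slideIn(element, durationMs) {
  // Tell the browser which property to animate and for how long.
  element.style.transition = "transform " + durationMs + "ms ease-out";
  // Set the end state; the browser interpolates the frames in between.
  element.style.transform = "translateX(0)";
  return element;
}

// Stand-in for a DOM node that starts off-screen. In an app this would be
// a real element whose CSS initially sets transform: translateX(100%).
var panel = { style: { transform: "translateX(100%)" } };
slideIn(panel, 250);
console.log(panel.style.transition); // transform 250ms ease-out
```

The point of the design is that no per-frame JavaScript runs at all: the script states where the element should end up, and the rendering engine handles the motion.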
CSS3 is much faster as the browser engine supports the hardware acceleration of CSS animations and is more fluid than the JavaScript animations. CSS3 supports translations and full keyframe animations as well, so you can be really creative in making your application more interactive. Click events You should avoid click events at any cost and use only touch events. They work in the same way as they do in the desktop browser. They take a longer time to process as the mobile browser engine needs to process the touch or touchhold events before firing a click event. This usually takes 300 ms, which is more than enough to give an additional impression of slow responses. So try to start using touchstart or touchend events. There is a solution for this called FastClick.js. It is a simple, easy-to-use library for eliminating the 300 ms delay between a physical tap and the firing of a click event on mobile browsers. Performance The performance that we get on the desktops isn't reflected in mobile devices. Most of the developers assume that the performance doesn't change a lot, especially as most of them test the applications on the latest mobile devices and a vast majority of the users use mobile devices that are 2-3 years old. You have to keep in mind that even the latest mobile devices have a slower CPU, less RAM, and a weaker GPU. Recently, mobile devices are catching up in the sheer numbers of these components but, in reality, they are slower and the maximum performance is limited due to the battery life that prevents it from using the maximum performance for a prolonged time. Optimize the image assets We are not limited any more by the app size that we need to deploy. However, you need to optimize the assets, especially images, as they take a large part of the assets, and make them appropriate for the device. You should prepare images in the right size; do not add the biggest size of the image that you have and force the mobile device to scale the image in HTML. 
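One way to act on this advice, sketched below, is to ship a few pre-scaled versions of each image and pick the smallest one that still covers the slot on the current screen. The function name pickImageWidth and the list of prepared widths are assumptions made for illustration; in a real app the pixel ratio would typically come from the standard window.devicePixelRatio property.

```javascript
// Sketch: choose the smallest prepared image that still covers the slot.
// `pickImageWidth` and the width list are hypothetical, for illustration.
function pickImageWidth(cssWidth, pixelRatio, preparedWidths) {
  // Physical pixels the image must cover on this screen.
  var needed = Math.ceil(cssWidth * pixelRatio);
  // Sort a copy in ascending order so the first match is the smallest fit.
  var sorted = preparedWidths.slice().sort(function (a, b) { return a - b; });
  for (var i = 0; i < sorted.length; i++) {
    if (sorted[i] >= needed) {
      return sorted[i];
    }
  }
  // Nothing is large enough: fall back to the largest prepared version.
  return sorted[sorted.length - 1];
}

var widths = [320, 640, 960, 1280];
console.log(pickImageWidth(300, 2, widths)); // 640
console.log(pickImageWidth(300, 1, widths)); // 320
console.log(pickImageWidth(800, 3, widths)); // 1280
```

The chosen width could then be used to build the file name of the asset to load, so that low-density devices never decode an image larger than they can display.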
Choosing the right image size is not an easy task if you are developing an application that should support a wide array of screens, especially for Android that has a very fragmented market with different screen sizes. The scaled images might have additional artifacts on the screen and they might not look so crisp. You will be hogging additional memory just for an image that could leave a smaller memory footprint. You should remember that mobile devices still have limited resources and the battery doesn't last forever. If you are going to use PhoneGap Build, you will need to make sure you do not exceed the limit as the service still has a limited size. Offline status As we all know, the network access is slow and limited, but the network coverage is not perfect so it is quite possible that your application will be working in the offline mode even in the usual locations. Bad reception can be caused by being inside a building with thick walls or in the basement. Some weather conditions can affect the reception too. The application should be able to handle this situation and respond to it properly, such as by limiting the parts of the application that require a network connection or caching data and syncing it when you are online once again. This is one of the aspects that developers usually forget to test in the offline mode to see how the app behaves under certain conditions. You should have a plugin available in order to detect the current state and the events when it passes between these two modes. Load only what you need There are a lot of developers that do this, including myself. We need some part of the library or a widget from a framework, which we don't need for anything other than this, and yet we are a bit lazy about loading a specific element and the full framework. This can load an immense amount of resources that we will never need but they will still run in the background. 
It might also be the root cause of some problems, as some libraries do not mix well and we can spend hours trying to track the conflict down.

Transparency

You should use as few elements with transparent parts as possible; they are quite processor-intensive because the screen must be updated on every change behind them. The same applies to other processor-intensive visual elements such as shadows or gradients. The great thing is that all the major platforms have moved away from flashy graphical elements and started using the flat UI design.

JSHint

If you use JSHint throughout the development, it will save you a lot of time when developing things in JavaScript. It is a static code analysis tool for checking whether the JavaScript source code complies with the coding rules. It will detect many of the common mistakes made in JavaScript; this matters because JavaScript is not a compiled language and you otherwise can't see an error until you run the code. At the same time, JSHint can be a very restrictive and demanding tool. Many beginners in JavaScript, PhoneGap, or mobile programming could be overwhelmed by the number of errors or bad practices that JSHint will point out.

Testing

Testing is an important aspect of building applications, and mobile applications are no exception. For most of the development that doesn't require native device APIs, you can use the platform simulators and see the results. However, if you are using native device APIs that are not supported through simulators, then you need to have a real device in order to run a test on it.

It is not unusual to use desktop browsers resized to a mobile device's screen resolution to emulate its screen while you are developing the application, just to test the UI screens, since it is much faster and easier than building and running the application on a simulator or real device for every small change.
There is a great plugin for the Google Chrome browser called Apache Ripple. It can be run without any additional tools. The Apache Ripple simulator runs as a web app in the Google Chrome browser. In Cordova, it can be used to simulate your app on a number of iOS and Android devices and it provides basic support for the core Cordova plugins such as Geolocation and Device Orientation. You can run the application in a real device browser or use the PhoneGap developer app. This simplifies the workflow as you can test the application on your mobile device without the need to re-sign, recompile, or reinstall your application to test the code. The only disadvantage is that with simulators, you cannot access the device APIs that aren't available in the regular web browsers. The PhoneGap developer app allows you to access device APIs as long as you are using one of the supplied APIs. It is good if you remember to always test the application on real devices at least before deploying to the app store. Computers have almost unlimited resources as compared to mobile devices, so the application that runs flawlessly on the computer might fail on mobile devices due to low memory. As simulators are faster than the real device, you might get the impression that it will work on every device equally fast, but it won't—especially with older devices. So, if you have an older device, it is better to test the response on it. Another reason to use the mobile device instead of the simulator is that it is hard to get a good usability experience from clicking on the interface on the computer screen without your fingers interfering and blocking the view on the device. Even though it is rare that you would get some bugs with the plain PhoneGap that was introduced with the new version, it might still happen. If you use the UI framework, it is good if you try it on the different versions of the operating systems as they might not work flawlessly on each of them. 
Even though hybrid mobile application development has been available for some time, it is still evolving, and as yet there are no default UI frameworks to use. Even the PhoneGap itself is still evolving. As with the UI, the same thing applies to the different plugins. Some of the features might get deprecated or might not be supported, so it is good if you implement alternatives or give feedback to the users about why this will not work. From experience, the average PhoneGap application will use at least ten plugins or different libraries for the final deployment. Every additional plugin or library installed can cause conflicts with another one. Summary In this article, we learned more advanced topics that any PhoneGap developer should get into more detail once he/she has mastered the essential topics. Resources for Article: Further resources on this subject: Building the Middle-Tier[article] Working with the sharing plugin[article] Getting Ready to Launch Your PhoneGap App in the Real World [article]
Packt
08 Sep 2015
30 min read

Commands (Where the Wild Things Are)

In this article by Maxwell Dayvson Da Silva and Hugo Lopes Tavares, the authors of Redis Essentials, we will get an overview of many different Redis commands and features, from techniques to reduce network latency to extending Redis with Lua scripting. At the end of this article, we will explain optimizations further. (For more resources related to this topic, see here.)

Pub/Sub

Pub/Sub stands for Publish-Subscribe, which is a pattern where messages are not sent directly to specific receivers. Publishers send messages to channels, and subscribers receive these messages if they are listening to a given channel. Redis supports the Pub/Sub pattern and provides commands to publish messages and subscribe to channels.

Here are some examples of Pub/Sub applications:

- News and weather dashboards
- Chat applications
- Push notifications, such as subway delay alerts
- Remote code execution, similar to what the SaltStack tool supports

The following examples implement a remote command execution system, where a command is sent to a channel and the server that is subscribed to that channel executes the command.

The command PUBLISH sends a message to the Redis channel, and it returns the number of clients that received that message. A message gets lost if there are no clients subscribed to the channel when it comes in.

Create a file called publisher.js and save the following code into it:

    var redis = require("redis");
    var client = redis.createClient();
    var channel = process.argv[2]; // 1
    var command = process.argv[3]; // 2
    client.publish(channel, command); // 3
    client.quit();

1. Assign the third argument from the command line to the variable channel (the first argument is node and the second is publisher.js).
2. Assign the fourth argument from the command line to the variable command.
3. Execute the command PUBLISH, passing the variables channel and command.

The command SUBSCRIBE subscribes a client to one or many channels.
The command UNSUBSCRIBE unsubscribes a client from one or many channels.

The commands PSUBSCRIBE and PUNSUBSCRIBE work the same way as the SUBSCRIBE and UNSUBSCRIBE commands, but they accept glob-style patterns as channel names.

Once a Redis client executes the command SUBSCRIBE or PSUBSCRIBE, it enters the subscribe mode and stops accepting commands, except for the commands SUBSCRIBE, PSUBSCRIBE, UNSUBSCRIBE, and PUNSUBSCRIBE.

Create a file called subscriber.js and save the following:

    var os = require("os"); // 1
    var redis = require("redis");
    var client = redis.createClient();

    var COMMANDS = {}; // 2

    COMMANDS.DATE = function() { // 3
      var now = new Date();
      console.log("DATE " + now.toISOString());
    };

    COMMANDS.PING = function() { // 4
      console.log("PONG");
    };

    COMMANDS.HOSTNAME = function() { // 5
      console.log("HOSTNAME " + os.hostname());
    };

    client.on("message", function(channel, commandName) { // 6
      if (COMMANDS.hasOwnProperty(commandName)) { // 7
        var commandFunction = COMMANDS[commandName]; // 8
        commandFunction(); // 9
      } else { // 10
        console.log("Unknown command: " + commandName);
      }
    });

    client.subscribe("global", process.argv[2]); // 11

1. Require the Node.js module os.
2. Create the variable COMMANDS, which is a JavaScript object. All command functions in this module will be added to this object. This object is intended to act as a namespace.
3. Create the function DATE, which displays the current date.
4. Then create the function PING, which displays PONG.
5. Create the function HOSTNAME, which displays the server hostname.
6. Register a channel listener, which is a function that executes commands based on the channel message.
7. Check whether the variable commandName is a valid command.
8. Create the variable commandFunction and assign the function to it.
9. Execute commandFunction.
10. Display an error message if the variable commandName contains a command that is not available.
11. Execute the command SUBSCRIBE, passing "global", which is the channel that all clients subscribe to, and a channel name from the command line.

Open three terminal windows and run the previous files, as shown in the following screenshot (from left to right and top to bottom):

- terminal-1: A subscriber that listens to the global channel and channel-1
- terminal-2: A subscriber that listens to the global channel and channel-2
- terminal-3: A publisher that publishes the message PING to the global channel (both subscribers receive the message), the message DATE to channel-1 (the first subscriber receives it), and the message HOSTNAME to channel-2 (the second subscriber receives it)

The command PUBSUB introspects the state of the Redis Pub/Sub system. This command accepts three subcommands: CHANNELS, NUMSUB, and NUMPAT.

The CHANNELS subcommand returns all active channels (channels with at least one subscriber). This command accepts an optional parameter, which is a glob-style pattern. If the pattern is specified, all channel names that match the pattern are returned; if no pattern is specified, all channel names are returned. The command syntax is as follows:

    PUBSUB CHANNELS [pattern]

The NUMSUB subcommand returns the number of clients connected to channels via the SUBSCRIBE command. This command accepts many channel names as arguments. Its syntax is as follows:

    PUBSUB NUMSUB [channel-1 … channel-N]

The NUMPAT subcommand returns the number of clients connected to channels via the PSUBSCRIBE command. This command does not accept channel patterns as arguments. Its syntax is as follows:

    PUBSUB NUMPAT

Redis contributor Pieter Noordhuis created a web chat implementation in Ruby using Redis and Pub/Sub. It can be found at https://gist.github.com/pietern/348262.

Transactions

A transaction in Redis is a sequence of commands executed in order and atomically. The command MULTI marks the beginning of a transaction, and the command EXEC marks its end.
Any commands between the MULTI and EXEC commands are serialized and executed as an atomic operation. Redis does not serve any other client in the middle of a transaction.

All commands in a transaction are queued in the client and are only sent to the server when the EXEC command is executed. It is possible to prevent a transaction from being executed by using the DISCARD command instead of EXEC. Usually, Redis clients prevent a transaction from being sent to Redis if it contains command syntax errors.

Unlike in traditional SQL databases, transactions in Redis are not rolled back if they produce failures. Redis executes the commands in order, and if any of them fail, it proceeds to the next command. Another downside of Redis transactions is that it is not possible to make any decisions inside the transaction, since all the commands are queued.

For example, the following code simulates a bank transfer. Here, money is transferred from a source account to a destination account inside a Redis transaction. If the source account has enough funds, the transaction is executed. Otherwise, it is discarded.

Save the following code in a file called bank-transaction.js:

    var redis = require("redis");
    var client = redis.createClient();

    function transfer(from, to, value, callback) { // 1
      client.get(from, function(err, balance) { // 2
        var multi = client.multi(); // 3
        multi.decrby(from, value); // 4
        multi.incrby(to, value); // 5
        if (balance >= value) { // 6
          multi.exec(function(err, reply) { // 7
            callback(null, reply[0]); // 8
          });
        } else {
          multi.discard(); // 9
          callback(new Error("Insufficient funds"), null); // 10
        }
      });
    }

1. Create the function transfer, which receives an account ID from which to withdraw money, another account ID into which to deposit the money, the monetary value to transfer, and a callback function to call after the transfer.
2. Retrieve the current balance of the source account.
3. Create a Multi object, which represents the transaction. All commands sent to it are queued and executed after the EXEC command is issued.
4. Enqueue the command DECRBY into the Multi object.
5. Then enqueue the command INCRBY into the Multi object.
6. Check whether the source account has sufficient funds.
7. Execute the EXEC command, which triggers sequential execution of the queued transaction commands.
8. Execute the callback function and pass the value null as an error, and the balance of the source account after the command DECRBY is executed.
9. Execute the DISCARD command to discard the transaction. No commands from the transaction will be executed in Redis.
10. Execute the function callback and pass an error object if the source account has insufficient funds.

The following code uses the previous example, transferring $40 from Max's account to Hugo's account (both accounts had $100 before the transfer). Append the following to the file bank-transaction.js:

    client.mset("max:checkings", 100, "hugo:checkings", 100, function(err, reply) { // 1
      console.log("Max checkings: 100");
      console.log("Hugo checkings: 100");
      transfer("max:checkings", "hugo:checkings", 40, function(err, balance) { // 2
        if (err) {
          console.log(err);
        } else {
          console.log("Transferred 40 from Max to Hugo");
          console.log("Max balance:", balance);
        }
        client.quit();
      });
    });

1. Set the initial balance of each account to $100.
2. Execute the function transfer to transfer $40 from max:checkings to hugo:checkings.

Then execute the file using the following command:

    $ node bank-transaction.js
    Max checkings: 100
    Hugo checkings: 100
    Transferred 40 from Max to Hugo
    Max balance: 60

It is possible to make the execution of a transaction conditional using the WATCH command, which implements an optimistic lock on a group of keys. The WATCH command marks keys as being watched so that EXEC executes the transaction only if the keys being watched were not changed. Otherwise, it returns a null reply and the operation needs to be repeated; this is the reason it is called an optimistic lock.
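The retry behavior behind an optimistic lock can be sketched in plain JavaScript before looking at real Redis calls. The following is only an in-memory model under stated assumptions: a single value with a version counter stands in for a watched key, and the functions write, watch, execIfUnchanged, and decrementWithRetry are hypothetical names for this illustration, not Redis or node_redis APIs.

```javascript
// In-memory model of an optimistic lock. No Redis server is involved.
// All names here are hypothetical and exist only for this sketch.
var store = { value: 100, version: 0 };

function write(newValue) {
  store.value = newValue;
  store.version += 1; // every write bumps the version
}

// Plays the role of WATCH: remember the version seen at read time.
function watch() {
  return store.version;
}

// Plays the role of EXEC after WATCH: apply the change only if nothing
// was written since watch(). Returns null otherwise, like a null reply.
function execIfUnchanged(watchedVersion, newValue) {
  if (store.version !== watchedVersion) {
    return null;
  }
  write(newValue);
  return "OK";
}

function decrementWithRetry(amount, interfere) {
  var attempts = 0;
  var reply = null;
  while (reply === null) {
    attempts += 1;
    var watched = watch();
    var current = store.value;
    if (interfere && attempts === 1) {
      interfere(); // simulate another client writing in between
    }
    reply = execIfUnchanged(watched, current - amount);
  }
  return attempts;
}

var attempts = decrementWithRetry(40, function () {
  write(90);
});

console.log(attempts);    // 2
console.log(store.value); // 50
```

The first attempt is rejected because the value changed after it was read, so the whole read-and-write sequence is repeated against the fresh value. No lock is held while reading, which is why the approach is called optimistic.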
The command UNWATCH removes keys from the watch list.

The following code implements a zpop function, which removes the first element of a Sorted Set and passes it to a callback function, using a transaction with WATCH. A race condition could exist if the WATCH command is not used.

Create a file called watch-transaction.js with the following code:

    var redis = require("redis");
    var client = redis.createClient();

    function zpop(key, callback) { // 1
      client.watch(key, function(watchErr, watchReply) { // 2
        client.zrange(key, 0, 0, function(zrangeErr, zrangeReply) { // 3
          var multi = client.multi(); // 4
          multi.zrem(key, zrangeReply); // 5
          multi.exec(function(transactionErr, transactionReply) { // 6
            if (transactionReply) {
              callback(zrangeReply[0]); // 7
            } else {
              zpop(key, callback); // 8
            }
          });
        });
      });
    }

    client.zadd("presidents", 1732, "George Washington");
    client.zadd("presidents", 1809, "Abraham Lincoln");
    client.zadd("presidents", 1858, "Theodore Roosevelt");

    zpop("presidents", function(member) {
      console.log("The first president in the group is:", member);
      client.quit();
    });

1. Create the function zpop, which receives a key and a callback function as arguments.
2. Execute the WATCH command on the key passed as an argument.
3. Then execute the ZRANGE command to retrieve the first element of the Sorted Set.
4. Create a multi object.
5. Enqueue the ZREM command in the transaction.
6. Execute the transaction.
7. Execute the callback function if the key being watched has not been changed.
8. Execute the function zpop again with the same parameters if the key being watched has been changed (the transaction returned a null reply).

Then execute the file using the following command:

    $ node watch-transaction.js
    The first president in the group is: George Washington

Pipelines

In Redis, a pipeline is a way to send multiple commands together to the Redis server without waiting for individual replies. The replies are read all at once by the client.
The time taken for a Redis client to send a command and obtain a reply from the Redis server is called Round Trip Time (RTT). When multiple commands are sent, there are multiple RTTs. Pipelines can decrease the number of RTTs because commands are grouped, so a pipeline with 10 commands will have only one RTT. This can improve the network's performance significantly. For instance, if the network link between a client and server has an RTT of 100 ms, the maximum number of commands that can be sent per second is 10, no matter how many commands can be handled by the Redis server. Usually, a Redis server can handle hundreds of thousands of commands per second, and not using pipelines may be a waste of resources. When Redis is used without pipelines, each command needs to wait for a reply. Assume the following: var redis = require("redis"); var client = redis.createClient(); client.set("key1", "value1"); client.set("key2", "value2"); client.set("key3", "value3"); Three separate commands are sent to Redis, and each command waits for its reply. The following diagram shows what happens when Redis is used without pipelines: Redis commands sent in a pipeline must be independent. They run sequentially in the server (the order is preserved), but they do not run as a transaction. Even though pipelines are neither transactional nor atomic (this means that different Redis commands may occur between the ones in the pipeline), they are still useful because they can save a lot of network time, preventing the network from becoming a bottleneck as it often does with heavy load applications. By default, node_redis, the Node.js library we are using, sends commands in pipelines and automatically chooses how many commands will go into each pipeline. Therefore, you don't need to worry about this. However, other Redis clients may not use pipelines by default; you will need to check out the client documentation to see how to take advantage of pipelines. 
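The round-trip arithmetic from earlier in this section can be written down as a back-of-the-envelope calculation. The sketch below is only an estimate under simplifying assumptions: the client waits for each reply before sending the next batch, and server processing time and bandwidth are ignored. The function names maxCommandsPerSecond and totalNetworkTimeMs are hypothetical.

```javascript
// Round-trip arithmetic only. Nothing is sent over a network.
// Assumes one outstanding batch at a time and ignores server time.
function maxCommandsPerSecond(rttMs, commandsPerPipeline) {
  var roundTripsPerSecond = 1000 / rttMs;
  return roundTripsPerSecond * commandsPerPipeline;
}

function totalNetworkTimeMs(rttMs, commandCount, commandsPerPipeline) {
  var roundTrips = Math.ceil(commandCount / commandsPerPipeline);
  return roundTrips * rttMs;
}

// RTT of 100 ms, one command per round trip (no pipeline).
console.log(maxCommandsPerSecond(100, 1));  // 10
// Same link, 10 commands grouped into each pipeline.
console.log(maxCommandsPerSecond(100, 10)); // 100

// The three SET commands from the example above.
console.log(totalNetworkTimeMs(100, 3, 1)); // 300
console.log(totalNetworkTimeMs(100, 3, 3)); // 100
```

With a 100 ms link, the three SET commands cost about 300 ms of waiting when sent one by one and about 100 ms when grouped, which is the saving that pipelines aim for.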
The PHP, Python, and Ruby clients do not use pipelines by default. This is what happens when commands are sent to Redis in a pipeline: When sending many commands, it might be a good idea to use multiple pipelines rather than one big pipeline. Pipelines are not a new idea or an exclusive feature or command in Redis; they are just a technique of sending a group of commands to a server at once. Commands inside a transaction may not be sent as a pipeline by default. This will depend on the Redis client you are using. For example, node_redis sends everything automatically in pipelines (as we mentioned before), but different clients may require additional configuration. It is a good idea to send transactions in a pipeline to avoid an extra round trip. Scripting Redis 2.6 introduced the scripting feature, and the language that was chosen to extend Redis was Lua. Before Redis 2.6, there was only one way to extend Redis—changing its source code, which was written in C. Lua was chosen because it is very small and simple, and its C API is very easy to integrate with other libraries. Although it is lightweight, Lua is a very powerful language (it is commonly used in game development). Lua scripts are atomically executed, which means that the Redis server is blocked during script execution. Because of this, Redis has a default timeout of 5 seconds to run any script, although this value can be changed through the configuration lua-time-limit. Redis will not automatically terminate a Lua script when it times out. Instead, it will start to reply with a BUSY message to every command, stating that a script is running. The only way to make the server return to normalcy is by aborting the script execution with the command SCRIPT KILL or SHUTDOWN NOSAVE. Ideally, scripts should be simple, have a single responsibility, and run fast. The popular games Civilization V, Angry Birds, and World of Warcraft use Lua as their scripting language. 
Lua syntax basics Lua is built around basic types such as booleans, numbers, strings, tables (the only composite data type), and functions. Let's see some basics of Lua's syntax: Comments: -- this is a comment Global variable declaration: x = 123 Local variable declaration: local y = 456 Function definition: function hello_world() return "Hello World" end Iteration: for i = 1, 10 do print(i) end Conditionals: if x == 123 then print("x is the magic number") else print("I have no idea what x is") end String concatenation: print("Hello" .. " World") Using a table as an array — arrays in Lua start indexing at 1, not at 0 (as in most languages): data_types = {1.0, 123, "redis", true, false, hello_world} print(data_types[3]) -- the output is "redis" Using a table as a hash: languages = {lua = 1993, javascript = 1995, python = 1991, ruby = 1995} print("Lua was created in " .. languages["lua"]) print("JavaScript was created in " .. languages.javascript) Redis meets Lua A Redis client must send Lua scripts as strings to the Redis server. Therefore, this section will have JavaScript strings that contain Lua code. Redis can evaluate any valid Lua code, and a few libraries are available (for example, bitop, cjson, math, and string). There are also two functions that execute Redis commands: redis.call and redis.pcall. The function redis.call requires the command name and all its parameters, and it returns the result of the executed command. If there are errors, redis.call aborts the script. The function redis.pcall is similar to redis.call, but in the event of an error, it returns the error as a Lua table and continues the script execution. Every script can return a value through the keyword return, and if there is no explicit return, the value nil is returned. It is possible to pass Redis key names and parameters to a Lua script, and they will be available inside the Lua script through the variables KEYS and ARGV, respectively. 
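As a rough illustration of how key names and parameters end up in KEYS and ARGV, the following sketch models the split in plain JavaScript. It assumes that the caller states how many of the leading arguments are key names (the numkeys count of the EVAL command); splitEvalArguments is a hypothetical helper written for this example and is not part of Redis or node_redis.

```javascript
// Model of how script arguments are divided into KEYS and ARGV.
// It does not talk to Redis; splitEvalArguments is a hypothetical helper.
function splitEvalArguments(numkeys, args) {
  if (numkeys < 0 || numkeys > args.length) {
    throw new Error("numkeys must be between 0 and the number of arguments");
  }
  return {
    KEYS: args.slice(0, numkeys), // the first numkeys arguments are key names
    ARGV: args.slice(numkeys)     // everything after them is a plain argument
  };
}

var split = splitEvalArguments(1, ["presidents", "0", "0"]);

// Lua tables are indexed from 1, so KEYS[1] in a script corresponds to
// split.KEYS[0] in this JavaScript model.
console.log(split.KEYS); // [ 'presidents' ]
console.log(split.ARGV); // [ '0', '0' ]
```

Keeping key names in KEYS and other values in ARGV makes it explicit which keys a script touches.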
Both redis.call and redis.pcall automatically convert the result of a Redis command to a Lua type, which means that if the Redis command returns an integer, it will be converted into a Lua number. The same thing happens to commands that return a string or an array. Since every script will return a value, this value will be converted from a Lua type to a Redis type. There are two commands for running Lua scripts: EVAL and EVALSHA. The next example will use EVAL, and its syntax is the following: EVAL script numkeys key [key ...] arg [arg ...] The parameters are as follows: script: The Lua script itself, as a string numkeys: The number of Redis keys being passed as parameters to the script key: The key name that will be available through the variable KEYS inside the script arg: An additional argument that will be available through the variable ARGV inside the script The following code uses Lua to run the command GET and retrieve a key value. Create a file called intro-lua.js with the following code: var redis = require("redis"); var client = redis.createClient(); client.set("mykey", "myvalue"); // 1 var luaScript = 'return redis.call("GET", KEYS[1])'; // 2 client.eval(luaScript, 1, "mykey", function(err, reply) { // 3 console.log(reply); // 4 client.quit(); }); Execute the command SET to create a key called mykey. Create the variable luaScript and assign the Lua code to it. This Lua code uses the redis.call function to execute the Redis command GET, passing a parameter. The KEYS variable is an array with all key names passed to the script. Execute the command EVAL to execute a Lua script. Display the return of the Lua script execution. Then execute it: $ node intro-lua.js myvalue Avoid using hardcoded key names inside a Lua script; pass all key names as parameters to the commands EVAL/EVALSHA. Previously in this article, in the Transactions section, we presented an implementation of a zpop function using WATCH/MULTI/EXEC. 
That implementation was based on an optimistic lock, which meant that the entire operation had to be retried if a client changed the Sorted Set before the MULTI/EXEC was executed.

The same zpop function can be implemented as a Lua script, and it will be simpler and atomic, which means that retries will not be necessary. Redis will always guarantee that there are no parallel changes to the Sorted Set during script execution.

Create a file called zpop-lua.js and save the following code into it:

    var redis = require("redis");
    var client = redis.createClient();

    client.zadd("presidents", 1732, "George Washington");
    client.zadd("presidents", 1809, "Abraham Lincoln");
    client.zadd("presidents", 1858, "Theodore Roosevelt");

    var luaScript = [
      'local elements = redis.call("ZRANGE", KEYS[1], 0, 0)',
      'redis.call("ZREM", KEYS[1], elements[1])',
      'return elements[1]'
    ].join('\n'); // 1

    client.eval(luaScript, 1, "presidents", function(err, reply) { // 2
      console.log("The first president in the group is:", reply);
      client.quit();
    });

1. Create the variable luaScript and assign the Lua code to it. This Lua code uses the redis.call function to execute the Redis command ZRANGE to retrieve an array with only the first element in the Sorted Set. Then, it executes the command ZREM to remove the first element of the Sorted Set, before returning the removed element.
2. Execute the command EVAL to execute a Lua script.

Then, execute the file using the following command:

    $ node zpop-lua.js
    The first president in the group is: George Washington

Many Redis users have replaced their transactional code in the form of WATCH/MULTI/EXEC with Lua scripts.

It is possible to save network bandwidth usage by using the commands SCRIPT LOAD and EVALSHA instead of EVAL when executing the same script multiple times. The command SCRIPT LOAD caches a Lua script and returns an identifier (which is the SHA1 hash of the script). The command EVALSHA executes a Lua script based on an identifier returned by SCRIPT LOAD.
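Because the identifier is the SHA1 hash of the script, its size does not depend on how long the script is. The sketch below computes such a digest locally with the Node.js crypto module to show this; it does not contact Redis, scriptSha1 is a hypothetical helper name, and the assumption that the local digest equals the value returned by SCRIPT LOAD holds only if exactly the same script bytes are sent.

```javascript
var crypto = require("crypto");

// Computes the SHA1 hex digest of a script body locally.
// scriptSha1 is a hypothetical helper for this sketch.
function scriptSha1(script) {
  return crypto.createHash("sha1").update(script, "utf8").digest("hex");
}

var shortScript = "return 1";
// A much longer script: 1,000 comment lines followed by the same return.
var longScript = new Array(1001).join("-- padding comment\n") + "return 1";

var shortId = scriptSha1(shortScript);
var longId = scriptSha1(longScript);

console.log(shortId.length); // 40
console.log(longId.length);  // 40
console.log(shortId === longId); // false
```

A 40-character identifier is what travels over the network on each call, so the saving grows with the size of the script and the number of times it is executed.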
With EVALSHA, only a small identifier is transferred over the network, rather than a Lua code snippet: var redis = require("redis"); var client = redis.createClient(); var luaScript = 'return "Lua script using EVALSHA"'; client.script("load", luaScript, function(err, reply) { var scriptId = reply; client.evalsha(scriptId, 0, function(err, reply) { console.log(reply); client.quit(); }) }); Then execute the script: $ node zpop-lua-evalsha.js Lua script using EVALSHA In order to make scripts play nicely with Redis replication, you should write scripts that do not change Redis keys in non-deterministic ways (that is, do not use random values). Well-written scripts behave the same way when they are re-executed with the same data. Miscellaneous commands This section covers the most important Redis commands that we have not previously explained. These commands are very helpful in a variety of situations, including obtaining a list of clients connected to the server, monitoring the health of a Redis server, expiring keys, and migrating keys to a remote server. All the examples in this section use redis-cli. INFO The INFO command returns all Redis server statistics, including information about the Redis version, operating system, connected clients, memory usage, persistence, replication, and keyspace. By default, the INFO command shows all available sections: memory, persistence, CPU, command, cluster, clients, and replication. 
You can also restrict the output by specifying the section name as a parameter:

    127.0.0.1:6379> INFO memory
    # Memory
    used_memory:354923856
    used_memory_human:338.48M
    used_memory_rss:468979712
    used_memory_peak:423014496
    used_memory_peak_human:403.42M
    used_memory_lua:33792
    mem_fragmentation_ratio:1.32
    mem_allocator:libc
    127.0.0.1:6379> INFO cpu
    # CPU
    used_cpu_sys:3.71
    used_cpu_user:40.36
    used_cpu_sys_children:0.00
    used_cpu_user_children:0.00

DBSIZE

The DBSIZE command returns the number of existing keys in a Redis server:

    127.0.0.1:6379> DBSIZE
    (integer) 50

DEBUG SEGFAULT

The DEBUG SEGFAULT command crashes the Redis server process by performing an invalid memory access. It can be useful for simulating bugs during the development of your application:

    127.0.0.1:6379> DEBUG SEGFAULT

MONITOR

The command MONITOR shows all the commands processed by the Redis server in real time. It can be helpful for seeing how busy a Redis server is:

    127.0.0.1:6379> MONITOR

The following screenshot shows the MONITOR command output (left side) after running the leaderboard.js example (right side):

While the MONITOR command is very helpful for debugging, it has a cost. In the Redis documentation page for MONITOR, an unscientific benchmark test says that MONITOR could reduce Redis's throughput by over 50%.

CLIENT LIST and CLIENT SETNAME

The CLIENT LIST command returns a list of all clients connected to the server, as well as relevant information and statistics about the clients (for example, IP address, name, and idle time).

The CLIENT SETNAME command changes a client name; it is only useful for debugging purposes.

CLIENT KILL

The CLIENT KILL command terminates a client connection.
It is possible to terminate client connections by IP, port, ID, or type:

    127.0.0.1:6379> CLIENT KILL ADDR 127.0.0.1:51167
    (integer) 1
    127.0.0.1:6379> CLIENT KILL ID 22
    (integer) 1
    127.0.0.1:6379> CLIENT KILL TYPE slave
    (integer) 0

FLUSHALL

The FLUSHALL command deletes all keys from Redis; this cannot be undone:

    127.0.0.1:6379> FLUSHALL
    OK

RANDOMKEY

The command RANDOMKEY returns a random existing key name. This may help you get an overview of the available keys in Redis. The alternative would be to run the KEYS command, but it analyzes all the existing keys in Redis. If the keyspace is large, it may block the Redis server entirely during its execution:

    127.0.0.1:6379> RANDOMKEY
    "mykey"

EXPIRE and EXPIREAT

The command EXPIRE sets a timeout in seconds for a given key. The key will be deleted after the specified number of seconds. A negative timeout will delete the key instantaneously (just like running the command DEL).

The command EXPIREAT sets a timeout for a given key based on a Unix timestamp. A timestamp in the past will delete the key instantaneously.

These commands return 1 if the key timeout is set successfully or 0 if the key does not exist:

    127.0.0.1:6379> MSET key1 value1 key2 value2
    OK
    127.0.0.1:6379> EXPIRE key1 30
    (integer) 1
    127.0.0.1:6379> EXPIREAT key2 1435717600
    (integer) 1

TTL and PTTL

The TTL command returns the remaining time to live (in seconds) of a key that has an associated timeout. If the key does not have an associated TTL, it returns -1, and if the key does not exist, it returns -2. The PTTL command does the same thing, but the return value is in milliseconds rather than seconds:

    127.0.0.1:6379> SET redis-essentials:authors "By Maxwell Dayvson da Silva, Hugo Lopes Tavares" EX 30
    OK
    127.0.0.1:6379> TTL redis-essentials:authors
    (integer) 18
    127.0.0.1:6379> PTTL redis-essentials:authors
    (integer) 13547

The SET command has optional parameters, and these were not shown before. The complete command syntax is as follows:

    SET key value [EX seconds|PX milliseconds] [NX|XX]

The parameters are explained as follows:

- EX: Set an expiration time in seconds
- PX: Set an expiration time in milliseconds
- NX: Only set the key if it does not exist
- XX: Only set the key if it already exists

PERSIST

The PERSIST command removes the existing timeout of a given key. Such a key will never expire, unless a new timeout is set. It returns 1 if the timeout is removed or 0 if the key does not have an associated timeout:

    127.0.0.1:6379> SET mykey value
    OK
    127.0.0.1:6379> EXPIRE mykey 30
    (integer) 1
    127.0.0.1:6379> PERSIST mykey
    (integer) 1
    127.0.0.1:6379> TTL mykey
    (integer) -1

SETEX

The SETEX command sets a value to a given key and also sets an expiration atomically. It is a combination of the commands SET and EXPIRE:

    127.0.0.1:6379> SETEX mykey 30 value
    OK
    127.0.0.1:6379> GET mykey
    "value"
    127.0.0.1:6379> TTL mykey
    (integer) 29

DEL

The DEL command removes one or many keys from Redis and returns the number of removed keys; this command cannot be undone:

    127.0.0.1:6379> MSET key1 value1 key2 value2
    OK
    127.0.0.1:6379> DEL key1 key2
    (integer) 2

EXISTS

The EXISTS command returns 1 if a certain key exists and 0 if it does not:

    127.0.0.1:6379> SET mykey myvalue
    OK
    127.0.0.1:6379> EXISTS mykey
    (integer) 1

PING

The PING command returns the string PONG. It is useful for testing a server/client connection and verifying that Redis is able to exchange data:

    127.0.0.1:6379> PING
    PONG

MIGRATE

The MIGRATE command moves a given key to a destination Redis server. This is an atomic command, and during the key migration, both Redis servers are blocked. If the key already exists in the destination, this command fails (unless the REPLACE parameter is specified).
The command syntax is as follows: MIGRATE host port key destination-db timeout [COPY] [REPLACE] There are two optional parameters for the command MIGRATE, which can be used separately or combined: COPY: Keep the key in the local Redis server and create a copy in the destination Redis server REPLACE: Replace the existing key in the destination server SELECT Redis has a concept of multiple databases, each of which is identified by a number from 0 to 15 (there are 16 databases by default). It is not recommended to use multiple databases with Redis. A better approach would be to use multiple redis-server processes rather than a single one, because multiple processes are able to use multiple CPU cores and give better insights into bottlenecks. The SELECT command changes the current database that the client is connected to. The default database is 0: 127.0.0.1:6379> SELECT 7 OK 127.0.0.1:6379[7]> AUTH The AUTH command is used to authorize a client to connect to Redis. If authorization is enabled on the Redis server, clients are allowed to run commands only after executing the AUTH command with the right authorization key: 127.0.0.1:6379> GET mykey (error) NOAUTH Authentication required. 127.0.0.1:6379> AUTH mysecret OK 127.0.0.1:6379> GET mykey "value" SCRIPT KILL The SCRIPT KILL command terminates the running Lua script if no write operations have been performed by the script. If the script has performed any write operations, the SCRIPT KILL command will not be able to terminate it; in that case, the SHUTDOWN NOSAVE command must be executed. There are three possible return values for this command: OK NOTBUSY No scripts in execution right now. UNKILLABLE Sorry the script already executed write commands against the dataset. You can either wait the script termination or kill the server in a hard way using the SHUTDOWN NOSAVE command. 
127.0.0.1:6379> SCRIPT KILL OK SHUTDOWN The SHUTDOWN command stops all clients, causes data to persist if enabled, and shuts down the Redis server. This command accepts one of the following optional parameters: SAVE: Forces Redis to save all of the data to a file called dump.rdb, even if persistence is not enabled NOSAVE: Prevents Redis from persisting data to the disk, even if persistence is enabled 127.0.0.1:6379> SHUTDOWN SAVE not connected> 127.0.0.1:6379> SHUTDOWN NOSAVE not connected> OBJECT ENCODING The OBJECT ENCODING command returns the encoding used by a given key: 127.0.0.1:6379> HSET myhash field value (integer) 1 127.0.0.1:6379> OBJECT ENCODING myhash "ziplist" Data type optimizations In Redis, all data types can use different encodings to save memory or improve performance. For instance, a String that has only digits (for example, 12345) uses less memory than a string of letters (for example, abcde) because they use different encodings. Data types will use different encodings based on thresholds defined in the Redis server configuration. The redis-cli will be used in this section to inspect the encodings of each data type and to demonstrate how configurations can be tweaked to optimize for memory. When Redis is downloaded, it comes with a file called redis.conf. This file is well documented and has all the Redis configuration directives, although some of them are commented out. Usually, the default values in this file are sufficient for most applications. The Redis configurations can also be specified via the command-line option or the CONFIG command; the most common approach is to use a configuration file. For this section, we have decided to not use a Redis configuration file. The configurations are passed via the command line for simplicity. 
Start redis-server with low values for all configurations:

    $ redis-server --hash-max-ziplist-entries 3 --hash-max-ziplist-value 5 --list-max-ziplist-entries 3 --list-max-ziplist-value 5 --set-max-intset-entries 3 --zset-max-ziplist-entries 3 --zset-max-ziplist-value 5

The default redis.conf file is well documented, and we recommend that you read it and discover new directive configurations.

String

The following are the available encodings for Strings:

- int: This is used when the string is represented by a 64-bit signed integer
- embstr: This is used for strings with fewer than 40 bytes
- raw: This is used for strings with 40 bytes or more

These encodings are not configurable. The following redis-cli examples show how the different encodings are chosen:

    127.0.0.1:6379> SET str1 12345
    OK
    127.0.0.1:6379> OBJECT ENCODING str1
    "int"
    127.0.0.1:6379> SET str2 "An embstr is small"
    OK
    127.0.0.1:6379> OBJECT ENCODING str2
    "embstr"
    127.0.0.1:6379> SET str3 "A raw encoded String is anything greater than 39 bytes"
    OK
    127.0.0.1:6379> OBJECT ENCODING str3
    "raw"

List

These are the available encodings for Lists:

- ziplist: This is used when the List has fewer elements than the configuration list-max-ziplist-entries and each List element has fewer bytes than the configuration list-max-ziplist-value
- linkedlist: This is used when the previous limits are exceeded

    127.0.0.1:6379> LPUSH list1 a b
    (integer) 2
    127.0.0.1:6379> OBJECT ENCODING list1
    "ziplist"
    127.0.0.1:6379> LPUSH list2 a b c d
    (integer) 4
    127.0.0.1:6379> OBJECT ENCODING list2
    "linkedlist"
    127.0.0.1:6379> LPUSH list3 "only one element"
    (integer) 1
    127.0.0.1:6379> OBJECT ENCODING list3
    "linkedlist"

Set

The following are the available encodings for Sets:

- intset: This is used when all elements of a Set are integers and the Set cardinality is smaller than the configuration set-max-intset-entries
- hashtable: This is used when any element of a Set is not an integer or the Set cardinality exceeds the configuration set-max-intset-entries

    127.0.0.1:6379> SADD set1 1 2
    (integer) 2
    127.0.0.1:6379> OBJECT ENCODING set1
    "intset"
    127.0.0.1:6379> SADD set2 1 2 3 4 5
    (integer) 5
    127.0.0.1:6379> OBJECT ENCODING set2
    "hashtable"
    127.0.0.1:6379> SADD set3 a
    (integer) 1
    127.0.0.1:6379> OBJECT ENCODING set3
    "hashtable"

Hash

The following are the available encodings for Hashes:

- ziplist: Used when the number of fields in the Hash does not exceed the configuration hash-max-ziplist-entries and each field name and value of the Hash is smaller than the configuration hash-max-ziplist-value (in bytes)
- hashtable: Used when the Hash size or any of its values exceeds the configurations hash-max-ziplist-entries and hash-max-ziplist-value, respectively

    127.0.0.1:6379> HMSET myhash1 a 1 b 2
    OK
    127.0.0.1:6379> OBJECT ENCODING myhash1
    "ziplist"
    127.0.0.1:6379> HMSET myhash2 a 1 b 2 c 3 d 4 e 5 f 6
    OK
    127.0.0.1:6379> OBJECT ENCODING myhash2
    "hashtable"
    127.0.0.1:6379> HMSET myhash3 a 1 b 2 c 3 d 4 e 5 f 6
    OK
    127.0.0.1:6379> OBJECT ENCODING myhash3
    "hashtable"

Sorted Set

The following are the available encodings:

- ziplist: Used when a Sorted Set has fewer entries than the configuration zset-max-ziplist-entries and each of its values is smaller than zset-max-ziplist-value (in bytes)
- skiplist and hashtable: These are used when the number of entries in the Sorted Set or the size of any of its values exceeds the configurations zset-max-ziplist-entries and zset-max-ziplist-value

    127.0.0.1:6379> ZADD zset1 1 a
    (integer) 1
    127.0.0.1:6379> OBJECT ENCODING zset1
    "ziplist"
    127.0.0.1:6379> ZADD zset2 1 abcdefghij
    (integer) 1
    127.0.0.1:6379> OBJECT ENCODING zset2
    "skiplist"
    127.0.0.1:6379> ZADD zset3 1 a 2 b 3 c 4 d
    (integer) 4
    127.0.0.1:6379> OBJECT ENCODING zset3
    "skiplist"

Measuring memory usage

Previously, redis-server was configured to use a ziplist for Hashes with a maximum of three elements, in which each element was smaller than 5 bytes.
With that configuration, it was possible to check how much memory Redis would use to store 500 field-value pairs. The total used memory was approximately 68 kB (1,076,864 – 1,008,576 = 68,288 bytes).

If redis-server was started with its default configuration of 512 elements and 64 bytes for hash-max-ziplist-entries and hash-max-ziplist-value, respectively, the same 500 field-value pairs would use less memory. The total used memory is approximately 16 kB (1,025,104 – 1,008,624 = 16,480 bytes). The default configuration in this case was more than four times more memory-efficient.

Forcing a Hash to be a ziplist has a trade-off: the more elements a Hash has, the slower the performance. A ziplist is a doubly linked list designed to be memory-efficient, and lookups are performed in linear time (O(n), where n is the number of fields in a Hash). On the other hand, a hashtable's lookup runs in constant time (O(1)), no matter how many elements exist.

If you have a large dataset and need to optimize for memory, tweak these configurations until you find a good trade-off between memory and performance. Instagram tweaked their Hash configurations and found that 1,000 elements per Hash was a good trade-off for them. You can learn more about the Instagram solution in the blog post at http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value.

The same logic for tweaking configurations and trade-offs applies to all other data type encodings presented previously. Algorithms that run in linear time (O(n)) are not always bad. If the input size is very small, they can run in near-constant time.

Summary

This article introduced the concepts behind Pub/Sub, transactions, and pipelines. It also showed the basics of the Lua language syntax, along with explanations on how to extend Redis with Lua. A good variety of Redis commands was presented, such as commands that are used to monitor and debug a Redis server.
This article also showed how to perform data type optimizations by tweaking the redis-server configuration. Resources for Article: Further resources on this subject: Transactions in Redis[article] Redis in Autosuggest[article] Using Redis in a hostile environment (Advanced) [article]
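The lookup trade-off described under Measuring memory usage can be pictured with a small Python sketch. It is only a model and not Redis code: FlatHash is a hypothetical class that keeps field-value pairs in one flat list, so lookups scan linearly as they would in a ziplist, while a plain Python dict stands in for the hashtable encoding.

```python
class FlatHash(object):
    """Hypothetical ziplist-like container: compact, but lookups are O(n)."""

    def __init__(self):
        self._items = []  # laid out as field1, value1, field2, value2, ...

    def hset(self, field, value):
        # Overwrite the value if the field exists, otherwise append the pair
        for i in range(0, len(self._items), 2):
            if self._items[i] == field:
                self._items[i + 1] = value
                return
        self._items.extend([field, value])

    def hget(self, field):
        # Return (value, number of fields compared) to make the cost visible
        steps = 0
        for i in range(0, len(self._items), 2):
            steps += 1
            if self._items[i] == field:
                return self._items[i + 1], steps
        return None, steps


flat = FlatHash()
table = {}
for n in range(500):
    flat.hset('field%d' % n, n)
    table['field%d' % n] = n

value, steps = flat.hget('field499')
print('flat list: value=%s after %s comparisons' % (value, steps))
print('dict: value=%s, found by hashing without a scan' % table['field499'])
```

With only a handful of fields the scan is short, which is why small Hashes can afford the compact encoding.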

Packt
07 Sep 2015
26 min read

Leveraging Python in the World of Big Data

We are generating more and more data day by day. We have generated more data this century than in the previous century, and we are currently only 15 years into this century. Big data is the new buzzword, and everyone is talking about it. It brings new possibilities. Google Translate is able to translate between many languages, thanks to big data. We are able to decode the human genome due to it. We can predict the failure of a turbine and do the required maintenance on it because of big data.

There are three Vs of big data, and they are defined as follows:

Volume: This defines the size of the data. Facebook has petabytes of data on its users.
Velocity: This is the rate at which data is generated.
Variety: Data is not only in a tabular form. We can get data from text, images, and sound. Data comes in the form of JSON, XML, and other types as well.

In this article by Samir Madhavan, author of Mastering Python for Data Science, we'll learn how to use Python in the world of big data by doing the following:

Understanding Hadoop
Writing a MapReduce program in Python
Using a Hadoop library

(For more resources related to this topic, see here.)

What is Hadoop?

According to the Apache Hadoop website, Hadoop stores data in a distributed manner and helps in computing it. It has been designed to scale easily to any number of machines, each of which contributes computing power and storage. Hadoop was created by Doug Cutting and Mike Cafarella in the year 2005. It was named after Doug Cutting's son's toy elephant.

The programming model

MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on large datasets of key-value pairs. The MapReduce framework makes use of a cluster of machines and executes MapReduce jobs across these machines. There are two phases in MapReduce: a mapping phase and a reduce phase. The input data to MapReduce is a set of key-value pairs.
During the mapping phase, Hadoop splits the data into smaller pieces, which are then fed to the mappers. These mappers are distributed across machines within the cluster. Each mapper takes the input key-value pairs and generates intermediate key-value pairs by invoking a user-defined function on them. After the mapping phase, Hadoop sorts the intermediate dataset by key and generates a set of key-value tuples so that all the values belonging to a particular key are together. During the reduce phase, the reducer takes in the intermediate key-value pairs and invokes a user-defined function, which then generates an output key-value pair. Hadoop distributes the reducers across the machines and assigns a set of key-value pairs to each of the reducers.

[Figure: Data processing through MapReduce]

The MapReduce architecture

MapReduce has a master-slave architecture, where the JobTracker is the master and the TaskTracker is the slave. When a MapReduce program is submitted to Hadoop, the JobTracker assigns the mapping and reducing tasks to the TaskTrackers, which take over the task of executing the program.

The Hadoop DFS

Hadoop's distributed filesystem has been designed to store very large datasets in a distributed manner. It was inspired by the Google File System, which is a proprietary distributed filesystem designed by Google. The data in HDFS is stored in a sequence of blocks, and all blocks are of the same size except for the last block. The block sizes are configurable in Hadoop.

Hadoop's DFS architecture

HDFS also has a master/slave architecture, where the NameNode is the master machine and the DataNodes are the slave machines. The actual data is stored in the DataNodes. The NameNode keeps track of where the data is stored and whether it has the required replication. It also helps in managing the filesystem by creating, deleting, and moving directories and files.

Python MapReduce

Hadoop can be downloaded and installed from https://hadoop.apache.org/.
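Before turning to Hadoop streaming, the mapping, sorting, and reduce phases described above can be sketched in a few lines of plain Python. This is a single-process illustration under simplifying assumptions, not how Hadoop is implemented; the function names map_phase, shuffle_phase, and reduce_phase are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Apply the user-defined mapper to every input record."""
    for record in records:
        for pair in mapper(record):
            yield pair


def shuffle_phase(pairs):
    """Sort by key so that all the values of a key end up together."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped, reducer):
    """Apply the user-defined reducer to each key and its values."""
    for key, values in grouped:
        yield reducer(key, values)


def word_mapper(line):
    # User-defined map function: one (word, 1) pair per word
    return [(word, 1) for word in line.split()]


def count_reducer(word, counts):
    # User-defined reduce function: total count per word
    return word, sum(counts)


lines = ['dolly dolly max', 'max jack tim max']
mapped = map_phase(lines, word_mapper)
result = list(reduce_phase(shuffle_phase(mapped), count_reducer))
print(result)
```

On a cluster, the same three steps run on different machines, and the sort between them is what lets each reducer see all the values of a key in one run.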
We'll be using the Hadoop streaming API to execute our Python MapReduce program in Hadoop. The Hadoop streaming API lets any program that reads from standard input and writes to standard output act as a MapReduce program. We'll be writing three MapReduce programs using Python, and they are as follows:

A basic word count
Getting the sentiment score of each review
Getting the overall sentiment score from all the reviews

The basic word count

We'll start with the word count MapReduce. Save the following code in a word_mapper.py file:

    #!/usr/bin/env python
    import sys

    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # The line is split into words
        word_tokens = l.split()
        # A tab-separated key-value pair is output for each word
        for w in word_tokens:
            print('%s\t%s' % (w, 1))

In the preceding mapper code, each line of the file is stripped of the leading and trailing white spaces. The line is then divided into word tokens, and each token is output as a key with a value of 1, separated by a tab. Because the script is executed directly later on, it starts with a shebang line and must be made executable (chmod +x).

Save the following code in a word_reducer.py file:

    #!/usr/bin/env python
    import sys

    current_word_token = None
    current_counter = 0

    # STDIN input
    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # Input from the mapper is parsed
        word_token, counter = l.split('\t', 1)
        # The count is converted to int
        try:
            counter = int(counter)
        except ValueError:
            # If the count is not a number, then ignore the line
            continue
        # Since Hadoop sorts the mapper output by key, the following
        # if/else statement works
        if current_word_token == word_token:
            current_counter += counter
        else:
            if current_word_token:
                print('%s\t%s' % (current_word_token, current_counter))
            current_counter = counter
            current_word_token = word_token

    # The last word is output
    if current_word_token is not None:
        print('%s\t%s' % (current_word_token, current_counter))

In the preceding code, we use the current_word_token variable to keep track of the current word that is being counted.
In the for loop, we split each line into word_token and counter to get the key and the value out of the key-value pair. We then convert the counter to an int type. In the if/else statement, if the word_token value is the same as the previous one, which is held in current_word_token, then we keep adding to its count. If a new word arrives, then we output the previous word and its count and start counting the new word. The last if statement outputs the last word.

We can check whether the mapper is working fine by using the following command:

    $ echo 'dolly dolly max max jack tim max' | ./BigData/word_mapper.py

The output of the preceding command is shown as follows (the key and the count are separated by a tab):

    dolly    1
    dolly    1
    max      1
    max      1
    jack     1
    tim      1
    max      1

Now, we can check whether the reducer is also working fine by piping the sorted mapper output to the reducer:

    $ echo "dolly dolly max max jack tim max" | ./BigData/word_mapper.py | sort -k1,1 | ./BigData/word_reducer.py

The output of the preceding command is shown as follows:

    dolly    2
    jack     1
    max      3
    tim      1

Now, let's try to apply the same code to a local file containing a summary of Moby Dick:

    $ cat ./Data/mobydick_summary.txt | ./BigData/word_mapper.py | sort -k1,1 | ./BigData/word_reducer.py

The output of the preceding command is shown as follows:

    a            28
    A            2
    abilities    1
    aboard       3
    about        2

A sentiment score for each review

We'll extend this to write a MapReduce program to determine the sentiment score for each review.
Write the following code in the senti_mapper.py file:

    #!/usr/bin/env python
    import sys

    positive_words = open('positive-words.txt').read().split('\n')
    negative_words = open('negative-words.txt').read().split('\n')

    def sentiment_score(text, pos_list, neg_list):
        positive_score = 0
        negative_score = 0
        # The text is split on white space
        for w in text.split():
            if w in pos_list:
                positive_score += 1
            if w in neg_list:
                negative_score += 1
        return positive_score - negative_score

    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # Convert to lower case
        l = l.lower()
        # Getting the sentiment score
        score = sentiment_score(l, positive_words, negative_words)
        # A tab-separated key-value pair is output
        print('%s\t%s' % (l, score))

In the preceding code, we used the sentiment_score function, which returns the number of positive words minus the number of negative words in a text. For each line, we strip the leading and trailing white spaces, convert it to lowercase, and then get the sentiment score for the review. Finally, we output the review and its score.

For this program, we don't require a reducer, as we can calculate the sentiment in the mapper itself and we just have to output the sentiment score.

Let's test whether the mapper is working fine locally with a file containing the reviews for Jurassic World:

    $ cat ./Data/jurassic_world_review.txt | ./BigData/senti_mapper.py
    there is plenty here to divert, but little to leave you enraptored. such is the fate of the sequel: bigger. louder. fewer teeth.    0
    if you limit your expectations for jurassic world to "more teeth," it will deliver on that promise. if you dare to hope for anything more-relatable characters, narrative coherence-you'll only set yourself up for disappointment.    -1
    there's a problem when the most complex character in a film is the dinosaur    -2
    not so much another bloated sequel as it is the fruition of dreams deferred in the previous films. too bad the genre dictates that those dreams are once again destined for disaster.    -2

We can see that our program is able to calculate the sentiment score well.
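The mapper above looks each word up in a Python list and splits only on white space, so a token such as "disaster." (with its trailing period) cannot match a dictionary entry such as "disaster". One possible refinement is sketched here with tiny made-up word sets in place of the real dictionary files: the words are kept in sets, which give constant-time membership tests, and surrounding punctuation is stripped before the lookup. This is an assumption about how the scoring could be tightened, and it will not necessarily reproduce the scores shown above.

```python
import string

# Hypothetical stand-ins for positive-words.txt and negative-words.txt
POSITIVE = {'fine', 'entertaining', 'promise'}
NEGATIVE = {'problem', 'bloated', 'disaster'}


def sentiment_score(text, pos_set, neg_set):
    """Return the count of positive words minus the count of negative words."""
    score = 0
    for token in text.lower().split():
        # Drop punctuation at both ends, e.g. "disaster." -> "disaster"
        word = token.strip(string.punctuation)
        if word in pos_set:
            score += 1
        if word in neg_set:
            score -= 1
    return score


good = 'A perfectly fine movie and entertaining enough.'
bad = "There's a problem: a bloated sequel, destined for disaster."
print(sentiment_score(good, POSITIVE, NEGATIVE))
print(sentiment_score(bad, POSITIVE, NEGATIVE))
```

Loading the real files into sets would only need set(open(path).read().split()) in place of the hard-coded sets.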
The overall sentiment score

To calculate the overall sentiment score, we require a reducer, and we'll use the same mapper with slight modifications. Here is the mapper code, which is stored in the overall_senti_mapper.py file:

    #!/usr/bin/env python
    import sys
    import hashlib

    positive_words = open('./Data/positive-words.txt').read().split('\n')
    negative_words = open('./Data/negative-words.txt').read().split('\n')

    def sentiment_score(text, pos_list, neg_list):
        positive_score = 0
        negative_score = 0
        # The text is split on white space
        for w in text.split():
            if w in pos_list:
                positive_score += 1
            if w in neg_list:
                negative_score += 1
        return positive_score - negative_score

    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # Convert to lower case
        l = l.lower()
        # Getting the sentiment score
        score = sentiment_score(l, positive_words, negative_words)
        # Hashing the review to use the digest as the key
        # (on Python 3, pass l.encode('utf-8') to md5 instead)
        hash_object = hashlib.md5(l)
        # A tab-separated key-value pair is output
        print('%s\t%s' % (hash_object.hexdigest(), score))

This mapper code is similar to the previous mapper code, but here we compute the MD5 hash of the review and output it as the key instead of the review text.

Here is the reducer code that is used to determine the overall sentiment score of the movie. Store the following code in the overall_senti_reducer.py file:

    #!/usr/bin/env python
    import sys

    total_score = 0

    # STDIN input
    for l in sys.stdin:
        # Input from the mapper is parsed
        key, score = l.split('\t', 1)
        # The score is converted to int
        try:
            score = int(score)
        except ValueError:
            # If the score is not a number, then ignore the line
            continue
        # Updating the total score
        total_score += score

    print('%s' % (total_score,))

In the preceding code, we parse the score out of each line and keep adding it to the total_score variable. Finally, we output the total_score variable, which shows the sentiment of the movie.
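The mapper and reducer above can also be exercised in a single process before any shell pipeline is involved. The following sketch mirrors their logic as two functions; toy_score is a hypothetical scorer used only so that the example does not depend on the dictionary files, and sorted() stands in for the sort step between the two scripts.

```python
import hashlib


def overall_mapper(reviews, score_fn):
    """Emit 'md5-of-review<TAB>score' lines, as overall_senti_mapper.py does."""
    for review in reviews:
        review = review.strip().lower()
        key = hashlib.md5(review.encode('utf-8')).hexdigest()
        yield '%s\t%s' % (key, score_fn(review))


def overall_reducer(lines):
    """Sum the scores and skip malformed lines, as overall_senti_reducer.py does."""
    total = 0
    for line in lines:
        try:
            _, score = line.split('\t', 1)
            total += int(score)
        except ValueError:
            # Either the tab is missing or the score is not a number
            continue
    return total


def toy_score(review):
    # Hypothetical scorer for illustration only
    return review.count('good') - review.count('bad')


reviews = ['Good fun, good cast', 'Bad plot', 'Nothing special']
mapped = sorted(overall_mapper(reviews, toy_score))
print(overall_reducer(mapped))
```

Because the reducer only sums, the order of the hashed keys does not affect the total; the hash simply keeps the key short and free of tabs.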
Let's locally test the overall sentiment on Jurassic World, which was received well, and then test the sentiment for the movie Unfinished Business, which was critically deemed poor:

    $ cat ./Data/jurassic_world_review.txt | ./BigData/overall_senti_mapper.py | sort -k1,1 | ./BigData/overall_senti_reducer.py
    19
    $ cat ./Data/unfinished_business_review.txt | ./BigData/overall_senti_mapper.py | sort -k1,1 | ./BigData/overall_senti_reducer.py
    -8

We can see that our code is working well. Jurassic World has a positive score, which means that people liked it. By contrast, Unfinished Business has a negative score, which shows that people didn't like it much.

Deploying the MapReduce code on Hadoop

We'll create directories for the data on Moby Dick, Jurassic World, and Unfinished Business in the HDFS tmp folder:

    $ hadoop fs -mkdir /tmp/moby_dick
    $ hadoop fs -mkdir /tmp/jurassic_world
    $ hadoop fs -mkdir /tmp/unfinished_business

Let's check whether the folders have been created:

    $ hadoop fs -ls /tmp/
    Found 6 items
    drwxrwxrwx   - mapred hadoop          0 2014-11-14 15:42 /tmp/hadoop-mapred
    drwxr-xr-x   - samzer hadoop          0 2015-06-18 18:31 /tmp/jurassic_world
    drwxrwxrwx   - hdfs   hadoop          0 2014-11-14 15:41 /tmp/mapred
    drwxr-xr-x   - samzer hadoop          0 2015-06-18 18:31 /tmp/moby_dick
    drwxr-xr-x   - samzer hadoop          0 2015-06-16 18:17 /tmp/temp635459726
    drwxr-xr-x   - samzer hadoop          0 2015-06-18 18:31 /tmp/unfinished_business

Once the folders are created, let's copy the data files to the respective folders.
    $ hadoop fs -copyFromLocal ./Data/mobydick_summary.txt /tmp/moby_dick
    $ hadoop fs -copyFromLocal ./Data/jurassic_world_review.txt /tmp/jurassic_world
    $ hadoop fs -copyFromLocal ./Data/unfinished_business_review.txt /tmp/unfinished_business

Let's verify that the files have been copied:

    $ hadoop fs -ls /tmp/moby_dick
    $ hadoop fs -ls /tmp/jurassic_world
    $ hadoop fs -ls /tmp/unfinished_business
    Found 1 items
    -rw-r--r--   3 samzer hadoop       5973 2015-06-18 18:34 /tmp/moby_dick/mobydick_summary.txt
    Found 1 items
    -rw-r--r--   3 samzer hadoop       3185 2015-06-18 18:34 /tmp/jurassic_world/jurassic_world_review.txt
    Found 1 items
    -rw-r--r--   3 samzer hadoop       2294 2015-06-18 18:34 /tmp/unfinished_business/unfinished_business_review.txt

We can see that the files have been copied successfully.

With the following command, we'll execute our mapper and reducer scripts in Hadoop. In this command, we define the mapper, reducer, input, and output file locations, and then use Hadoop streaming to execute our scripts. Let's execute the word count program first:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar -file ./BigData/word_mapper.py -mapper word_mapper.py -file ./BigData/word_reducer.py -reducer word_reducer.py -input /tmp/moby_dick/* -output /tmp/moby_output

Let's verify that the word count MapReduce program is working successfully:

    $ hadoop fs -cat /tmp/moby_output/*

The output of the preceding command is shown as follows:

    (Queequeg    1
    A            2
    Africa       1
    Africa,      1
    After        1
    Ahab         13
    Ahab,        1
    Ahab's       6
    All          1
    American     1
    As           1
    At           1
    Bedford,     1
    Bildad       1
    Bildad,      1
    Boomer,      2
    Captain      1
    Christmas    1
    Day          1
    Delight,     1
    Dick         6
    Dick,        2

The program is working as intended. Now, we'll deploy the program that calculates the sentiment score for each of the reviews.
Note that we can ship the positive and negative dictionary files with the Hadoop streaming job by adding them with the -file option:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar -file ./BigData/senti_mapper.py -mapper senti_mapper.py -file ./Data/positive-words.txt -file ./Data/negative-words.txt -input /tmp/jurassic_world/* -output /tmp/jurassic_output

In the preceding command, we use the hadoop command with the Hadoop streaming JAR file, then define the mapper and the files that are shipped with it, and finally, the input and output directories in HDFS.

Let's check the sentiment scores of the movie reviews:

    $ hadoop fs -cat /tmp/jurassic_output/*

The output of the preceding command is shown as follows:

    "jurassic world," like its predecessors, fills up the screen with roaring, slathering, earth-shaking dinosaurs, then fills in mere humans around the edges. it's a formula that works as well in 2015 as it did in 1993.    3
    a perfectly fine movie and entertaining enough to keep you watching until the closing credits.    4
    an angry movie with a tragic moral ... meta-adoration and criticism ends with a genetically modified dinosaur fighting off waves of dinosaurs.    -3
    if you limit your expectations for jurassic world to "more teeth," it will deliver on that promise. if you dare to hope for anything more-relatable characters, narrative coherence-you'll only set yourself up for disappointment.    -1

This program is also working as intended. Now, we'll try out the overall sentiment of a movie:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar -file ./BigData/overall_senti_mapper.py -mapper

Let's verify the result:

    $ hadoop fs -cat /tmp/unfinished_business_output/*

The output of the preceding command is shown as follows:

    -8

We can see that the overall sentiment score comes out correctly from MapReduce. Here is a screenshot of the JobTracker status page:

[Figure: The JobTracker status page]

The preceding image shows a portal where the jobs submitted to the JobTracker can be viewed and their status can be seen.
This can be seen on port 50030 of the master system, which is the default port of the JobTracker web interface. From the preceding image, we can see that a job is running, and the output shown before the image confirms that the job has been completed successfully.

File handling with Hadoopy

Hadoopy is a Python library that provides an API to interact with Hadoop in order to manage files and perform MapReduce on them. Hadoopy can be downloaded from http://www.hadoopy.com/en/latest/tutorial.html#installing-hadoopy.

Let's try to put a few files in Hadoop through Hadoopy in a directory created within HDFS, called data:

    $ hadoop fs -mkdir data

Here is the code that puts the data into HDFS:

    import hadoopy
    import os

    def read_local_dir(local_path):
        # Yield the path of every regular file in local_path
        for fn in os.listdir(local_path):
            path = os.path.join(local_path, fn)
            if os.path.isfile(path):
                yield path

    def main():
        local_path = './BigData/dummy_data'
        for file in read_local_dir(local_path):
            # Copy the local file into the data directory in HDFS
            hadoopy.put(file, 'data')
            print("The file %s has been put into hdfs" % (file,))

    if __name__ == '__main__':
        main()

Running the script prints the following output:

    The file ./BigData/dummy_data/test9 has been put into hdfs
    The file ./BigData/dummy_data/test7 has been put into hdfs
    The file ./BigData/dummy_data/test1 has been put into hdfs
    The file ./BigData/dummy_data/test8 has been put into hdfs
    The file ./BigData/dummy_data/test6 has been put into hdfs
    The file ./BigData/dummy_data/test5 has been put into hdfs
    The file ./BigData/dummy_data/test3 has been put into hdfs
    The file ./BigData/dummy_data/test4 has been put into hdfs
    The file ./BigData/dummy_data/test2 has been put into hdfs

In the preceding code, we list all the files in a directory and then put each of the files into Hadoop using the put() method of Hadoopy.
Let's check whether all the files have been put into HDFS:

    $ hadoop fs -ls data

The output of the preceding command is shown as follows:

    Found 9 items
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test1
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test2
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test3
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test4
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test5
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test6
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test7
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test8
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test9

So, we have successfully been able to put files into HDFS.

Pig

Pig is a platform that has a very expressive language to perform data transformations and querying. Pig code is written as a script, which gets compiled to MapReduce programs that execute on Hadoop.

[Figure: The Pig logo]

Pig helps in reducing the complexity of raw-level MapReduce programs, and enables the user to perform fast transformations. Pig Latin is the textual language of Pig, and it can be learned from http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html.

We'll cover how to find the top 10 most occurring words with Pig, and then we'll see how you can create a function in Python that can be used in Pig. Let's start with the word count.
Here is the Pig Latin code, which you can save in the pig_wordcount.pig file:

    data = load '/tmp/moby_dick/';
    word_token = foreach data generate flatten(TOKENIZE((chararray)$0)) as word;
    group_word_token = group word_token by word;
    count_word_token = foreach group_word_token generate COUNT(word_token) as cnt, group;
    sort_word_token = ORDER count_word_token by cnt DESC;
    top10_word_count = LIMIT sort_word_token 10;
    DUMP top10_word_count;

In the preceding code, we load the summary of Moby Dick, which is then tokenized line by line, that is, split into individual words. The flatten function converts the collection of word tokens in a line into one row per word. We then group by word and count the occurrences of each word. Finally, we sort the counts in descending order and limit the result to the first 10 rows to get the top 10 most occurring words.

Let's execute the preceding Pig script:

    $ pig ./BigData/pig_wordcount.pig

The output of the preceding command is shown as follows:

    (83,the)
    (36,and)
    (28,a)
    (25,of)
    (24,to)
    (15,his)
    (14,Ahab)
    (14,Moby)
    (14,is)
    (14,in)

We are able to get our top 10 words. Let's now create a user-defined function (UDF) with Python, which will be used in Pig. We'll define two user-defined functions to score the positive and negative sentiments of a sentence.

The following code is the UDF used to score the positive sentiment, and it's available in the positive_sentiment.py file:

    positive_words = [ 'a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'acco$ ]

    @outputSchema("pnum:int")
    def sentiment_score(text):
        positive_score = 0
        # The text is split on spaces
        for w in text.split(' '):
            if w in positive_words:
                positive_score += 1
        return positive_score

In the preceding code, we define the positive word list (shown truncated here), which is used by the sentiment_score() function. The function checks for the positive words in a sentence and finally outputs their total count.
There is an outputSchema() decorator that is used to tell Pig what type of data is being output, which in our case is int.

Here is the code to score the negative sentiment, and it's available in the negative_sentiment.py file. The code is almost identical to the positive sentiment code, except that it counts downward:

    negative_words = ['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted', 'ab$....]

    @outputSchema("nnum:int")
    def sentiment_score(text):
        negative_score = 0
        # The text is split on spaces
        for w in text.split(' '):
            if w in negative_words:
                negative_score -= 1
        return negative_score

The following code is used by Pig to score the sentiments of the Jurassic World reviews, and it's available in the pig_sentiment.pig file:

    register 'positive_sentiment.py' using org.apache.pig.scripting.jython.JythonScriptEngine as positive;
    register 'negative_sentiment.py' using org.apache.pig.scripting.jython.JythonScriptEngine as negative;
    data = load '/tmp/jurassic_world/*';
    feedback_sentiments = foreach data generate LOWER((chararray)$0) as feedback, positive.sentiment_score(LOWER((chararray)$0)) as psenti, negative.sentiment_score(LOWER((chararray)$0)) as nsenti;
    average_sentiments = foreach feedback_sentiments generate feedback, psenti + nsenti;
    dump average_sentiments;

In the preceding Pig script, we first register the Python UDF scripts using the register command and give them appropriate names. We then load our Jurassic World reviews. We convert the reviews to lowercase and score the positive and negative sentiments of each review. Finally, we add the two scores to get the overall sentiment of a review.

Let's execute the Pig script and see the results:

    $ pig ./BigData/pig_sentiment.pig

The output of the preceding command is shown as follows:

    (there is plenty here to divert, but little to leave you enraptored. such is the fate of the sequel: bigger. louder. fewer teeth.,0)
    (if you limit your expectations for jurassic world to "more teeth," it will deliver on that promise. if you dare to hope for anything more-relatable characters, narrative coherence-you'll only set yourself up for disappointment.,-1)
    (there's a problem when the most complex character in a film is the dinosaur,-2)
    (not so much another bloated sequel as it is the fruition of dreams deferred in the previous films. too bad the genre dictates that those dreams are once again destined for disaster.,-2)
    (a perfectly fine movie and entertaining enough to keep you watching until the closing credits.,4)
    (this fourth installment of the jurassic park film series shows some wear and tear, but there is still some gas left in the tank. time is spent to set up the next film in the series. they will keep making more of these until we stop watching.,0)

We have successfully scored the sentiments of the Jurassic World reviews using the Python UDFs in Pig.

Python with Apache Spark

Apache Spark is a computing framework that works on top of HDFS and provides an alternative way of computing that is similar to MapReduce. It was developed by the AMPLab at UC Berkeley. Spark does its computation mostly in memory, because of which it is much faster than MapReduce, and it is well suited for machine learning, as it's able to handle iterative workloads really well.

Spark uses the programming abstraction of RDDs (Resilient Distributed Datasets), in which data is logically distributed into partitions, and transformations can be performed on top of this data.

Python is one of the languages that is used to interact with Apache Spark, and we'll create a program to perform the sentiment scoring for each review of Jurassic World as well as the overall sentiment.

You can install Apache Spark by following the instructions at https://spark.apache.org/docs/1.0.1/spark-standalone.html.
Scoring the sentiment

Here is the Python code to score the sentiment:

    from __future__ import print_function
    import sys
    from pyspark import SparkContext

    positive_words = open('positive-words.txt').read().split('\n')
    negative_words = open('negative-words.txt').read().split('\n')

    def sentiment_score(text, pos_list, neg_list):
        positive_score = 0
        negative_score = 0
        # The text is split on white space
        for w in text.split():
            if w in pos_list:
                positive_score += 1
            if w in neg_list:
                negative_score += 1
        return positive_score - negative_score

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: sentiment <file>", file=sys.stderr)
            sys.exit(-1)
        sc = SparkContext(appName="PythonSentiment")
        lines = sc.textFile(sys.argv[1], 1)
        # Each line is mapped to a (review, score) pair
        scores = lines.map(lambda x: (x, sentiment_score(x.lower(), positive_words, negative_words)))
        output = scores.collect()
        for (key, score) in output:
            print("%s: %i" % (key, score))
        sc.stop()

In the preceding code, we define our standard sentiment_score() function, which we'll be reusing. The if statement checks that a text file has been passed as an argument to the script. The sc variable is a SparkContext object with the PythonSentiment app name. The filename in the argument is passed into Spark through the textFile() method of the sc variable. In the map() function of Spark, we define a lambda function to which each line of the text file is passed, and we obtain the line and its respective sentiment score. The output variable gets the result, and finally, we print the result on the screen.

Let's score the sentiment of each of the reviews of Jurassic World. Replace <hostname> with your hostname; this should suffice:

    $ ~/spark-1.3.0-bin-cdh4/bin/spark-submit --master spark://<hostname>:7077 ./BigData/spark_sentiment.py hdfs://localhost:8020/tmp/jurassic_world/*

We'll get the following output for the preceding command:

    There is plenty here to divert but little to leave you enraptured. Such is the fate of the sequel: Bigger, Louder, Fewer teeth: 0
    If you limit your expectations for Jurassic World to more teeth, it will deliver on this promise. If you dare to hope for anything more—relatable characters or narrative coherence—you'll only set yourself up for disappointment:-1

We can see that our Spark program was able to score the sentiment for each of the reviews. The number at the end of each output line is the sentiment score: the higher the score, the more positive the review, and the more negative the score, the more negative the review.

We use the spark-submit command with the following parameters:

A master node of the Spark system
A Python script containing the transformation commands
An argument to the Python script

The overall sentiment

Here is a Spark program to score the overall sentiment of all the reviews:

    from __future__ import print_function
    import sys
    from operator import add
    from pyspark import SparkContext

    positive_words = open('positive-words.txt').read().split('\n')
    negative_words = open('negative-words.txt').read().split('\n')

    def sentiment_score(text, pos_list, neg_list):
        positive_score = 0
        negative_score = 0
        # The text is split on white space
        for w in text.split():
            if w in pos_list:
                positive_score += 1
            if w in neg_list:
                negative_score += 1
        return positive_score - negative_score

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: Overall Sentiment <file>", file=sys.stderr)
            sys.exit(-1)
        sc = SparkContext(appName="PythonOverallSentiment")
        lines = sc.textFile(sys.argv[1], 1)
        # Every score is emitted under the single key "Total" and then summed
        scores = lines.map(lambda x: ("Total", sentiment_score(x.lower(), positive_words, negative_words))) \
                      .reduceByKey(add)
        output = scores.collect()
        for (key, score) in output:
            print("%s: %i" % (key, score))
        sc.stop()

In the preceding code, we have added a reduceByKey() method, which reduces the values by adding them together, and we have also defined the key as Total, so that all the scores are reduced based on a single key.

Let's try out the preceding code to get the overall sentiment of Jurassic World. Replace <hostname> with your hostname; this should suffice:

    $ ~/spark-1.3.0-bin-cdh4/bin/spark-submit --master spark://<hostname>:7077 ./BigData/spark_overall_sentiment.py hdfs://localhost:8020/tmp/jurassic_world/*

The output of the preceding command is shown as follows:

    Total: 19

We can see that Spark has given an overall sentiment score of 19.

The applications that get executed on Spark can be viewed in the browser on port 8080 of the Spark master. Here is a screenshot of it:

[Figure: The Spark master web interface]

We can see the number of Spark nodes, the applications that are currently being executed, and the applications that have been executed.

Summary

In this article, you were introduced to big data, and you learned how the Hadoop software works and about the architecture associated with it. You then learned how to create a mapper and a reducer for a MapReduce program, how to test it locally, and how to deploy it on Hadoop. You were then introduced to the Hadoopy library, and using this library, you were able to put files into Hadoop. You also learned about Pig and how to create a user-defined function with it. Finally, you learned about Apache Spark, which is an alternative to MapReduce, and how to use it to perform distributed computing.

With this article, we have come to the end of our journey, and you should now be able to perform data science tasks with Python. From here on, you can participate in Kaggle competitions at https://www.kaggle.com/ to improve your data science skills with real-world problems. This will fine-tune your skills and help you understand how to solve analytical problems. Also, you can sign up for the Andrew Ng course on Machine Learning at https://www.coursera.org/learn/machine-learning to understand the nuances behind machine learning algorithms.
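The reduceByKey() step used in the overall sentiment program can be pictured without a cluster. The sketch below is a plain-Python model and not the PySpark API: reduce_by_key is a hypothetical helper that merges the values of each key with a binary function, which is the effect the Spark method has across partitions. The per-review scores are made up for the illustration.

```python
from functools import reduce
from operator import add


def reduce_by_key(pairs, func):
    """Merge the values of each key with func (plain-Python model)."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return dict((key, reduce(func, values)) for key, values in grouped.items())


# Per-review scores keyed by the review, as in the first Spark program
per_review = [('review a', 0), ('review b', -1), ('review c', 4)]

# Re-keying every score as "Total" funnels all of them into a single sum
overall = reduce_by_key([('Total', score) for _, score in per_review], add)
print(overall)
```

Using one constant key is a simple way to get a global sum out of a per-key operation; with distinct keys, the same helper returns one total per key.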
Resources for Article: Further resources on this subject: Bizarre Python[article] Predicting Sports Winners with Decision Trees and pandas[article] Optimization in Python [article]
Packt
07 Sep 2015
20 min read

Modules and Templates

 In this article by Thomas Uphill, author of the book Troubleshooting Puppet, we will look at how the various parts of a module may cause issues. As a Puppet developer or a system administrator, modules are how you deliver your code to the nodes. Modules are great for organizing your code into manageable chunks, but modules are also where you'll see most of your problems when troubleshooting. Most modules contain classes in a manifests directory, but modules can also include custom facts, functions, types, providers, as well as files and templates. Each of these components can be a source of error. We will address each of these components in the following sections, starting with classes. (For more resources related to this topic, see here.) In Puppet, the namespace of classes is referred to as the scope. Classes can have multiple nested levels of subclasses. Each class and subclass defines a scope. Each scope is separate. To refer to variables in a different scope, you must refer to the fully scoped variable name. 
For instance, in the following example, we have a class and two subclasses with similar names defined within each of the classes: class leader { notify {'Leader-1': } } class autobots { include leader } class autobots::leader { notify {'Optimus Prime': } } class decepticons { include leader } class decepticons::leader { notify {'Megatron': } } We then include the leader, autobots, and decepticons classes in our node, as follows: include leader include autobots include decepticons When we run Puppet, we see the following output: t@mylaptop ~ $ puppet apply leaders.pp Notice: Compiled catalog for mylaptop.example.net in environment production in 0.03 seconds Notice: Optimus Prime Notice: /Stage[main]/Autobots::Leader/Notify[Optimus Prime]/message: defined 'message' as 'Optimus Prime' Notice: Leader-1 Notice: /Stage[main]/Leader/Notify[Leader-1]/message: defined 'message' as 'Leader-1' Notice: Megatron Notice: /Stage[main]/Decepticons::Leader/Notify[Megatron]/message: defined 'message' as 'Megatron' Notice: Finished catalog run in 0.06 seconds If this is the output that you expected, you can safely move on. If you are a little surprised, then read on. The problem here is the scope. Although we have a top scope class named leader, when we include leader from within the autobots and decepticons classes, the local scope is searched first. In both cases, a local match is found first and used. Instead of the three 'Leader-1' notifications, we see only one 'Leader-1', one 'Megatron', and one 'Optimus Prime'. If your normal procedure is to have the leader class defined and you forgot to do so, then you can end up being slightly confused. 
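The lookup order seen in this output can be modelled in a few lines of Python. This is only a sketch of the relative name resolution that the example demonstrates, not Puppet's actual implementation; resolve_class and the defined set are hypothetical names introduced for illustration.

```python
def resolve_class(name, current_scope, defined):
    """Return the fully qualified class that `include name` would select.

    Mimics the behaviour shown above: a class nested under the current
    scope (for example autobots::leader) is preferred over a top-scope
    class with the same short name.
    """
    parts = current_scope.split("::") if current_scope else []
    # Try the innermost namespace first, then walk outwards.
    while parts:
        candidate = "::".join(parts + [name])
        if candidate in defined:
            return candidate
        parts.pop()
    # Fall back to the top scope.
    return name if name in defined else None


defined = {
    "leader",
    "autobots", "autobots::leader",
    "decepticons", "decepticons::leader",
}

print(resolve_class("leader", "autobots", defined))
print(resolve_class("leader", "decepticons", defined))
print(resolve_class("leader", "", defined))
```

In this model, the include inside autobots resolves to autobots::leader and the include at top scope resolves to leader, which matches the three notifications in the output.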
Consider the following modified example: class leader { notify {'Leader-1': } } class autobots { include leader } include autobots Now, when we apply this manifest, we see the following output: t@mylaptop ~ $ puppet apply leaders2.pp Notice: Compiled catalog for mylaptop.example.net in environment production in 0.02 seconds Notice: Leader-1 Notice: /Stage[main]/Leader/Notify[Leader-1]/message: defined 'message' as 'Leader-1' Notice: Finished catalog run in 0.04 seconds Since the leader class was not available in the scope within the autobot class, the top scope leader class was used. Knowing how Puppet evaluates scope can save you time when your issues turn out to be namespace-related. This example is contrived. The usual situation where people run into this problem is when they have multiple modules organized in the same way. The problem manifests itself when you have many different modules with subclasses in different modules with the same names. For example, two modules named myappa and myappb with config subclasses, myappa::config and myappb::config. This problem occurs when the developer forgets to write the myappc::config subclass and there is a top scope config module available. Metaparameters Metaparameters are parameters that are used by Puppet to compile the catalog but are not used when modifying the target system. Some metaparameters, such as tag, are used to specify or mark resources. Other metaparameters, such as before, require, notify, and subscribe, are used to specify the order in which the resources should be applied to a node. When the catalog is compiled, the resources are evaluated based on their dependencies as opposed to how they are defined in the manifests. The order in which the resources are evaluated can be a little confusing for a person who is new to Puppet. A common paradigm when creating files is to create the containing directory before creating the resource. 
Consider the following code:

    class apps {
      file {'/apps':
        ensure => 'directory',
        mode   => '0755',
      }
    }
    class myapp {
      file {'/apps/myapp/config':
        content => 'on = true',
        mode    => '0644',
      }
      file {'/apps/myapp':
        ensure => 'directory',
        mode   => '0755',
      }
    }
    include myapp
    include apps

When we apply this manifest, even though the resources are not listed in the correct order in the manifest, the catalog applies correctly, as follows:

    [root@trouble ~]# puppet apply order.pp
    Notice: Compiled catalog for trouble.example.com in environment production in 0.13 seconds
    Notice: /Stage[main]/Apps/File[/apps]/ensure: created
    Notice: /Stage[main]/Myapp/File[/apps/myapp]/ensure: created
    Notice: /Stage[main]/Myapp/File[/apps/myapp/config]/ensure: defined content as '{md5}1090eb22d3caa1a3efae39cdfbce5155'
    Notice: Finished catalog run in 0.05 seconds

Recent versions of Puppet will automatically use the require metaparameter for certain resources. In the case of the preceding code, the '/apps/myapp' file has an implied require of the '/apps' file because directories autorequire their parents. We can safely rely on this autorequire mechanism but, when debugging, it is useful to know how to specify the resource order precisely. To ensure that the /apps directory exists before we try to create the /apps/myapp directory, we can use the require metaparameter to have the myapp directory require the /apps directory, as follows:

    class myapp {
      file {'/apps/myapp/config':
        content => 'on = true',
        mode    => '0644',
        require => File['/apps/myapp'],
      }
      file {'/apps/myapp':
        ensure  => 'directory',
        mode    => '0755',
        require => File['/apps'],
      }
    }

The preceding require lines specify that each of the file resources requires its parent directory.

Autorequires

Certain resource relationships are ubiquitous. For relationships that are implied in this way, a mechanism was developed to reduce resource ordering errors. This mechanism is called autorequire.
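One way to picture how parent-directory autorequires turn an unordered set of file resources into an apply order is a small dependency graph. The sketch below is an illustration in Python, not Puppet's implementation: parent_autorequires is a hypothetical helper that adds an edge from each managed path to its parent directory only when the parent is also in the catalog, and the standard-library graphlib.TopologicalSorter (Python 3.9 or later) produces an order that satisfies every edge.

```python
import posixpath
from graphlib import TopologicalSorter


def parent_autorequires(paths):
    """Map each managed path to the set of managed paths it must follow.

    Hypothetical helper: a path depends on its parent directory only if
    that parent is itself one of the managed paths.
    """
    managed = set(paths)
    deps = {}
    for path in paths:
        parent = posixpath.dirname(path)
        if parent in managed and parent != path:
            deps[path] = {parent}
        else:
            deps[path] = set()
    return deps


# Same paths as the manifest above, deliberately listed in the "wrong" order.
catalog = ["/apps/myapp/config", "/apps/myapp", "/apps"]

deps = parent_autorequires(catalog)
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Because each path here depends only on its parent, the only valid order is /apps, then /apps/myapp, then /apps/myapp/config, regardless of the order in which the resources were declared.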
A list of autorequire relationships is given in the type reference documentation at https://docs.puppetlabs.com/references/latest/type.html. When troubleshooting, you should know that the following autorequire relationships exist: A cron resource will autorequire the specified user. An exec resource will autorequire both the working directory of the exec as a file resource and the user as which the exec runs. A file resource will autorequire its owner and group. A mount will autorequire the mounts that it depends on (a mount resource of /apps/myapp will autorequire a mount resource of /apps). A user resource will autorequire its primary group. Autorequire relationships only work when the resources within the relationship are specified within the catalog. If your catalog does not specify the required resources, then your catalog will fail if those resources are not found on the node. For instance, if you have a mount resource of /apps/myapp but the /apps directory or mount does not exist, then the mount resource will fail. If the /apps mount is specified, then the autorequire mechanism will ensure that the /apps mount is mounted before the /apps/myapp mount. Explicit ordering When you are trying to determine an error in the evaluation of your class, it can be helpful to use the chaining arrow syntax to force your resources to evaluate in the order that you specified. For instance, if you have an exec resource that is failing, you can create another exec resource that outputs the information used within your failing exec. 
For example, we have the following exec code:

    file {'arrow':
      path   => '/tmp/arrow',
      ensure => 'directory',
    }
    exec {'arrow_debug_before':
      command => 'echo debug_before',
      path    => '/usr/bin:/bin',
    }
    exec {'arrow_example':
      command => 'echo arrow',
      path    => '/usr/bin:/bin',
      require => File['arrow'],
    }
    exec {'arrow_debug_after':
      command => 'echo debug_after',
      path    => '/usr/bin:/bin',
    }

Now, when you apply this catalog, you will see that the arrow_debug_before and arrow_debug_after resources are not applied in the order that we were expecting:

    [root@trouble ~]# puppet agent -t
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Loading facts
    Info: Caching catalog for trouble.example.com
    Info: Applying configuration version '1431872398'
    Notice: /Stage[main]/Main/Node[default]/Exec[arrow_debug_before]/returns: executed successfully
    Notice: /Stage[main]/Main/Node[default]/Exec[arrow_debug_after]/returns: executed successfully
    Notice: /Stage[main]/Main/Node[default]/File[arrow]/ensure: created
    Notice: /Stage[main]/Main/Node[default]/Exec[arrow_example]/returns: executed successfully
    Notice: Finished catalog run in 0.23 seconds

To enforce the sequence that we were expecting, you can use the chaining arrow syntax, as follows:

    exec {'arrow_debug_before':
      command => 'echo debug_before',
      path    => '/usr/bin:/bin',
    }->
    exec {'arrow_example':
      command => 'echo arrow',
      path    => '/usr/bin:/bin',
      require => File['arrow'],
    }->
    exec {'arrow_debug_after':
      command => 'echo debug_after',
      path    => '/usr/bin:/bin',
    }

Now, when we run the agent again, the order is what we expected:

    [root@trouble ~]# puppet agent -t
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Loading facts
    Info: Caching catalog for trouble.example.com
    Info: Applying configuration version '1431872778'
    Notice: /Stage[main]/Main/Node[default]/Exec[arrow_debug_before]/returns: executed successfully
    Notice: /Stage[main]/Main/Node[default]/Exec[arrow_example]/returns: executed successfully
    Notice: /Stage[main]/Main/Node[default]/Exec[arrow_debug_after]/returns: executed successfully
    Notice: Finished catalog run in 0.23 seconds

A good way to use this sort of arrangement is to create an exec resource that outputs the environment information before your failing resource is applied. For example, you can create a class that runs a debug script and then use chaining arrows to have it applied before your failing resource. If your resource uses variables, then creating a notify that outputs the values of the variables can also help with debugging.

Defined types

Defined types are great for reducing the complexity and improving the readability of your code. However, they can lead to some interesting problems that may be difficult to diagnose. In the following code, we create a defined type that creates a host entry:

    define myhost ($short,$ip) {
      host {"$short":
        ip           => $ip,
        host_aliases => [
          "$title.example.com",
          "$title.example.net",
          "$short"
        ],
      }
    }

In this define, the title of the host resource, which also serves as its namevar, is an argument of the define, the $short variable. In Puppet, there are two important attributes of any resource: the namevar and the title. The confusion lies in the fact that, sometimes, both of these attributes have the same value. Both values must be unique, but they are used differently. The title is used to uniquely identify the resource to the compiler and need not be related to the actual resource. The namevar uniquely identifies the resource to the agent after the catalog is compiled. The namevar is specific to each resource type. For example, the namevar for a package is the package name and the namevar for a file is the full path to the file.

The problem with the preceding defined type is that you can end up with a duplicate resource that is difficult to find. The resource is declared within the defined type, so when Puppet reports the duplicate declaration, it reports both declarations at the same file and line inside the define rather than at the places where the define was used.
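The clash between the title of the define and the title of the resource inside it can be sketched as follows. This is a simplified model in Python, not how the Puppet compiler is implemented: Catalog.declare is a hypothetical method that rejects a second resource with the same type and title, myhost mirrors the defined type above by titling the inner host resource with the short argument, and the titles and addresses are made-up sample values.

```python
class DuplicateDeclaration(Exception):
    """Raised when a (type, title) pair is declared twice."""


class Catalog:
    def __init__(self):
        self.resources = {}

    def declare(self, rtype, title, **attrs):
        # Uniqueness is checked per resource type and title.
        key = (rtype, title)
        if key in self.resources:
            raise DuplicateDeclaration(
                "%s[%s] is already declared" % (rtype.capitalize(), title))
        self.resources[key] = attrs


def myhost(catalog, title, short, ip):
    # The define itself is unique by its own title...
    catalog.declare("myhost", title, short=short, ip=ip)
    # ...but the inner host resource is titled with `short`, as in the define above.
    catalog.declare("host", short, ip=ip)


catalog = Catalog()
myhost(catalog, "alpha", short="db", ip="192.0.2.1")
try:
    # Different define title, same `short`: the inner host title collides.
    myhost(catalog, "beta", short="db", ip="192.0.2.2")
except DuplicateDeclaration as err:
    print("Error:", err)
```

Both calls pass the uniqueness check for the define itself; the failure comes from the second inner declaration, which is why the reported location points inside the define.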
Let's create the following node definition with two myhost resources:

    node default {
      $short = "trb"
      myhost {'trouble': short => 'trb', ip => '192.168.50.1' }
      myhost {'tribble': short => "$short", ip => '192.168.50.2' }
    }

Even though the two myhost resources have different titles, when we run Puppet, we see a duplicate declaration, as follows:

    [root@trouble ~]# puppet agent -t
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Loading facts
    Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Host[trb] is already declared in file /etc/puppet/environments/production/modules/myhost/manifests/init.pp:5; cannot redeclare at /etc/puppet/environments/production/modules/myhost/manifests/init.pp:5 on node trouble.example.com
    Warning: Not using cache on failed catalog
    Error: Could not retrieve catalog; skipping run

Tracking down this issue can be difficult if we have several myhost declarations throughout the node definition. To make this problem a lot easier to solve, we should use the title of the defined type as the title of the resources within the define. The following rewrite shows this difference:

    define myhost ($short,$ip) {
      host {"$title":
        ip           => $ip,
        host_aliases => [
          "$title.example.com",
          "$title.example.net",
          "$short"
        ],
      }
    }

Custom facts

When you define custom facts within your modules (in the lib/facter directory), they are automatically transferred to your node via pluginsync. The issue here is that the facts from all modules are synced to the same directory. So, if you create two facts with the same filename, it can be difficult to determine which fact will be synced down to your node.

Facter is run at the beginning of a Puppet agent run. The results of Facter are used to compile the catalog. If any of your facts take longer than the configured timeout (configtimeout in the [agent] section of puppet.conf), then the agent run will fail.
Instead of increasing this timeout, design your custom facts to be simple enough that they take no longer than a few seconds to run.

You can debug Facter from the command line using the -d switch. To load custom facts that are synced from Puppet, add the -p option as well. If you are having trouble with the output of your fact, then you can also have the output formatted as a JSON document by adding the -j option. Combining all of these options, the following is a good starting point for debugging your Facter output:

    [root@puppet ~]# facter -p -d -j |more
    Found no suitable resolves of 1 for ec2_metadata
    value for ec2_metadata is still nil
    Found no suitable resolves of 1 for gce
    value for gce is still nil
    ...
    {
      "lsbminordistrelease": "6",
      "puppetversion": "3.7.5",
      "blockdevice_sda_vendor": "ATA",
      "ipaddress_lo": "127.0.0.1",
    ...

Having Facter output JSON is helpful because the returned values are wrapped in quotes, so any trailing spaces or control characters will be visible.

The easiest way to debug custom facts is to run them through Ruby directly. To run a custom fact through Ruby, change to the directory that contains the custom fact and use the irb command to run interactive Ruby, as follows:

    [root@puppet facter]# irb -r facter -r iptables_version.rb
    irb(main):001:0> puts Facter.value("iptables_version")
    1.4.7
    => nil

This displays the value of the iptables_version fact. From within irb, you can check the code line by line to figure out your problem.

The preceding command was executed on a Linux host. Doing this on a Windows host is not so easy, but it is possible. Locate the irb executable on your system. For the Puppet Enterprise installation, this should be in C:/Program Files (x86)/Puppet Labs/Puppet Enterprise/sys/ruby/bin. Run irb and then alter the $LOAD_PATH variable to add the path to facter.rb (the Facter library), as follows:

    irb(main):001:0> $LOAD_PATH.push("C:/Program Files (x86)/Puppet Labs/Puppet Enterprise/facter/lib")

Now require the Facter library, as follows:

    irb(main):002:0> require 'facter'
    => true

Finally, run Facter.value with the name of a fact, which is similar to what we did in the previous example:

    irb(main):003:0> Facter.value("uptime")
    => "0:08 hours"

Pry

When debugging any Ruby code, the Pry library allows you to inspect the running Ruby environment at any breakpoint that you define. In the earlier iptables_version example, we could use the Pry library to inspect the calculation of the fact. To do so, modify the fact definition and comment out the setcode section (the breakpoint definition will not work within a setcode block). Then define a breakpoint by adding binding.pry to the fact at the point that you wish to inspect, as follows:

    Facter.add(:iptables_version) do
      confine :kernel => :linux
      #setcode do
        version = Facter::Util::Resolution.exec('iptables --version')
        if version
          version.match(/\d+\.\d+\.\d+/).to_s
        else
          nil
        end
        binding.pry
      #end
    end

Now run Ruby with the Pry and Facter libraries on the iptables_version fact definition, as follows:

    root@mylaptop # ruby -r pry -r facter iptables_version.rb

    From: /var/lib/puppet/lib/facter/iptables_version.rb @ line 10 :

         5:     if version
         6:       version.match(/\d+\.\d+\.\d+/).to_s
         7:     else
         8:       nil
         9:     end
     => 10:     binding.pry
        11:   #end
        12: end

This will cause the evaluation of the iptables_version fact to halt at the binding.pry line. We can then inspect the value of the version variable and execute the regular expression matching ourselves to verify that it is working correctly, as follows:

    [1] pry(#<Facter::Util::Resolution>)> version
    => "iptables v1.4.21"
    [2] pry(#<Facter::Util::Resolution>)> version.match(/\d+\.\d+\.\d+/).to_s
    => "1.4.21"

Environment

When developing custom facts, it is useful to make your Ruby fact file executable and run the Ruby script from the command line. When you run custom facts from the command line, the environment variables defined in your current shell can affect how the fact is calculated. This can result in a different value being returned for the fact when it is run through the Puppet agent. One of the most common variables that cause this sort of problem is JAVA_HOME. This can also be a problem when testing exec resources. Environment variables and shell aliases will be available to an exec when it is run interactively. When run through the agent, these customizations will not be available, which has the potential to cause inconsistency.

Files

Files are transferred between the master and the node via Puppet's internal fileserver. When working with files, it is important to remember that all the files that are served via Puppet are read into memory by the Puppet Server. Transferring large files via Puppet is inefficient. You should avoid transferring large and/or binary files. Most of the problems with files are related to path and URL syntax errors. The source parameter contains a URL with the following syntax:

    source => "puppet:///path/to/file"

In the preceding syntax, the three slashes mean that no server name is given between the second and third slash, so the agent's default Puppet Server is contacted. The following, which names the server explicitly, is also valid:

    source => "puppet://myserver/path/to/file"

The path from which a file can be downloaded depends on the context of the manifest.
If the manifest is found within the manifest directory or the manifest is the site.pp manifest, then the path to the file is relative to this location, starting at the files subdirectory. If the manifest is found within a module, then the path should start with the modules path; the files will then be found within the files directory of the module.

Templates

ERB templates are written in Ruby. The current releases of Puppet also support EPP templates, which are written in the Puppet language. ERB templates can be debugged by running them through Ruby. To simply check the syntax, use the following command:

    $ erb -P -x -T '-' template.erb |ruby -c
    Syntax OK

If your template does not pass the preceding test, then you know that your syntax is incorrect. The usual error type that you will see is as follows:

    -:8: syntax error, unexpected end-of-input, expecting keyword_end

The problem with the preceding command is that the line number refers to the evaluated code that is returned by the erb script, not to the original file. When checking for the syntax error, you will have to inspect the intermediate code that is generated by the erb command.

Unfortunately, doing anything more than checking simple syntax is a problem. Although the ERB templates can be evaluated using the ERB library, the <%= block markers that are used in the Puppet ERB templates break the normal evaluation. The simplest way to evaluate Ruby templates is by creating a simple manifest with a file resource that applies the template. As an example, the resolv.conf template is shown in the following code:

    # resolv.conf built by Puppet
    domain <%= @domain %>
    search<% searchdomains.each do |domain| -%> <%= domain -%><% end -%> <%= @domain %>
    <% nameservers.each do |server| -%>
    nameserver <%= server %>
    <% end -%>

This template is then saved into a file named template.erb. We then create a file resource using this template.erb file, as shown in the following code:

    $searchdomains = ['trouble.example.com','packt.example.com']
    $nameservers = ['8.8.8.8','8.8.4.4']
    $domains = 'example.com'
    file {'/tmp/test':
      content => template('/tmp/template.erb')
    }

We then use puppet apply to apply this manifest and create the /tmp/test file, as follows:

    $ puppet apply file.pp
    Notice: Compiled catalog for mylaptop.example.net in environment production in 0.20 seconds
    Notice: /Stage[main]/Main/File[/tmp/test]/ensure: defined content as '{md5}4d1c547c40a27c06726ecaf784b99e84'
    Notice: Finished catalog run in 0.04 seconds

The following are the contents of the /tmp/test file:

    # resolv.conf built by Puppet
    domain example.net
    search trouble.example.com packt.example.com example.net
    nameserver 8.8.8.8
    nameserver 8.8.4.4

Note that the template reads @domain, not the $domains variable set in the manifest. Here @domain appears to resolve to the domain fact of the node (mylaptop.example.net), which is why the output shows example.net rather than example.com.

Debugging templates

Templates can also be used in debugging. You can create a file resource that uses a template that outputs all the defined variables and their values. You can include the following resource in your node definition:

    file { "/tmp/puppet-debug.txt":
      content => inline_template("<% vars = scope.to_hash.reject { |k,v| !( k.is_a?(String) && v.is_a?(String) ) }; vars.sort.each do |k,v| %><%= k %>=<%= v %>\n<% end %>"),
    }

This uses an inline template, which may make it slightly hard to read. The template loops through the hash returned by scope.to_hash and prints each key and value when both are strings. Focusing only on the inner loop, this can be shown as follows:

    vars = scope.to_hash.reject { |k,v| !( k.is_a?(String) && v.is_a?(String) ) }
    vars.sort.each do |k,v|
      k=v\n
    end

Summary

In this article, we examined metaparameters and how to deal with resource ordering issues. We built custom facts and defines and discussed the issues that may arise when using them. We then moved on to templates and showed how to use templates as an aid in debugging.
Resources for Article: Further resources on this subject: My First Puppet Module[article] Puppet Language and Style[article] Puppet and OS Security Tools [article]
Packt
07 Sep 2015
17 min read

Working with PowerShell

In this article, you will cover: Retrieving system information – Configuration Service cmdlets Administering hosts and machines – Host and MachineCreation cmdlets Managing additional components – StoreFront Admin and Logging cmdlets (For more resources related to this topic, see here.) Introduction With hundreds or thousands of hosts to configure and machines to deploy, configuring all the components manually could be difficult. As for the previous XenDesktop releases, and also with the XenDesktop 7.6 version, you can find an integrated set of PowerShell modules. With its use, IT technicians are able to reduce the time required to perform management tasks by the creation of PowerShell scripts, which will be used to deploy, manage, and troubleshoot at scale the greatest part of the XenDesktop components. Working with PowerShell instead of the XenDesktop GUI will give you more flexibility in terms of what kind of operations to execute, having a set of additional features to use during the infrastructure creation and configuration phases. Retrieving system information – Configuration Service cmdlets In this recipe, we will use and explain a general-purpose PowerShell cmdlet: the Configuration Service category. This is used to retrieve general configuration parameters, and to obtain information about the implementation of the XenDesktop Configuration Service. Getting ready No preliminary tasks are required. You have already installed the Citrix XenDesktop PowerShell SDK during the installation of the Desktop Controller role machine(s). 
To be able to run a PowerShell script (.ps1 format), you have to enable script execution from the PowerShell prompt as follows:

    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force

How to do it…

In this section, we will explain and execute the commands associated with the XenDesktop System and Services configuration area:

1. Connect to one of the Desktop Broker servers, for instance by using a Remote Desktop connection.

2. Right-click on the PowerShell icon on the Windows taskbar and select the Run as Administrator option.

3. Load the Citrix PowerShell modules by typing the following command and then pressing the Enter key:

    Asnp Citrix*

   Asnp is an alias; as an alternative, you can use the full Add-PSSnapin command name.

4. Retrieve the active and configured Desktop Controller features by running the following command:

    Get-ConfigEnabledFeature

5. To retrieve the current status of the Config Service, run the following command. The output will be OK in the absence of configuration issues:

    Get-ConfigServiceStatus

6. To get the connection string used by the Configuration Service to connect to the XenDesktop database, run the following command:

    Get-ConfigDBConnection

7. Starting from the previously received output, it's possible to configure the connection string to let the Configuration Service use the system DB. For this command, you have to specify the Server, Initial Catalog, and Integrated Security parameters:

    Set-ConfigDBConnection -DBConnection "Server=<ServerName\InstanceName>; Initial Catalog=<DatabaseName>; Integrated Security=<True | False>"

8. Starting from an existing Citrix database, you can generate a SQL procedure file to use as a backup to recreate the database. Run the following command to complete this task, specifying the DatabaseName and ServiceGroupName parameters and redirecting the output to a .sql file:

    Get-ConfigDBSchema -DatabaseName <DatabaseName> -ServiceGroupName <ServiceGroupName> > Path:FileName.sql

   You need to configure a destination database with the same name as that of the source DB, otherwise the script will fail!

9. To retrieve information about the active Configuration Service objects (Instance, Service, and Service Group), run the following three commands respectively:

    Get-ConfigRegisteredServiceInstance
    Get-ConfigService
    Get-ConfigServiceGroup

10. To test a set of operations to check the status of the Configuration Service, run the following script:

    #------------ Script - Configuration Service
    #------------ Define Variables
    $Server_Conn="SqlDatabaseServer.xdseven.local\CITRIX,1434"
    $Catalog_Conn="CitrixXD7-Site-First"
    #------------
    Write-Host "XenDesktop - Configuration Service CmdLets"
    #---------- Clear the existing Configuration Service DB connection
    $Clear = Set-ConfigDBConnection -DBConnection $null
    Write-Host "Clearing any previous DB connection - Status: " $Clear
    #---------- Set the Configuration Service DB connection string
    $New_Conn = Set-ConfigDBConnection -DBConnection "Server=$Server_Conn; Initial Catalog=$Catalog_Conn; Integrated Security=$true"
    Write-Host "Configuring the DB string connection - Status: " $New_Conn
    $Configured_String = Get-ConfigDBConnection
    Write-Host "The new configured DB connection string is: " $Configured_String

   You have to save this script with the .ps1 extension in order to invoke it with PowerShell. Be sure to change the parameters that are specific to your infrastructure before running the script.

How it works...
The Configuration Service cmdlets of XenDesktop PowerShell permit the managing of the Configuration Service and its related information: the Metadata for the entire XenDesktop infrastructure, the Service instances registered within the VDI architecture, and the collections of these services, called Service Groups. This set of commands offers the ability to retrieve and check the DB connection string to contact the configured XenDesktop SQL Server database. These operations are permitted by the Get-ConfigDBConnection command (to retrieve the current configuration) and the Set-ConfigDBConnection command (to configure the DB connection string); both the commands use the DB Server Name with the Instance name, DB name, and Integrated Security as information fields. In the attached script, we have regenerated a database connection string. To be sure to be able to recreate it, first of all we have cleared any existing connection, setting it to null (verify the command associated with the $Clear variable), then we have defined the $New_Conn variable, using the Set-ConfigDBConnection command; all the parameters are defined at the top of the script, in the form of variables. Use the Write-Host command to echo results on the standard output. There's more... In some cases, you may need to retrieve the state of the registered services, in order to verify their availability. You can use the Test-ConfigServiceInstanceAvailability cmdlet, retrieving whether the service is responding or not and its response time. Run the following example to test the use of this command: Get-ConfigRegisteredServiceInstance | Test-ConfigServiceInstanceAvailability | more Use the –ForceWaitForOneOfEachType parameter to stop the check for a service category, when one of its services responds. 
Administering hosts and machines – Host and MachineCreation cmdlets In this recipe, we will describe how to create the connection between the Hypervisor and the XenDesktop servers, and the way to generate machines to assign to the end users, all by using Citrix PowerShell. Getting ready No preliminary tasks are required. You have already installed the Citrix XenDesktop PowerShell SDK during the installation of the Desktop Controller role machine(s). To be sure to be able to run a PowerShell script (the.ps1 format), you have to enable the script execution from the PowerShell prompt in this way: Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force How to do it… In this section, we will discuss the PowerShell commands used to connect XenDesktop with the supported hypervisors plus the creation of the machines from the command line: Connect to one of the Desktop Broker servers. Click on the PowerShell icon installed on the Windows taskbar. Load the Citrix PowerShell modules by typing the following command, and then press the Enter key: Asnp Citrix* To list the available Hypervisor types, execute this task: Get-HypHypervisorPlugin –AdminAddress<BrokerAddress> To list the configured properties for the XenDesktop root-level location (XDHyp:), execute the following command: Get-ChildItemXDHyp:HostingUnits Please refer to the PSPath, Storage, and PersonalvDiskStorage output fields to retrieve information on the storage configuration. Execute the following cmdlet to add a storage resource to the XenDesktop Controller host: Add-HypHostingUnitStorage –LiteralPath<HostPathLocation> -StoragePath<StoragePath> -StorageType<OSStorage|PersonalvDiskStorage> - AdminAddress<BrokerAddress> To generate a snapshot for an existing VM, perform the following task: New-HypVMSnapshot –LiteralPath<HostPathLocation> -SnapshotDescription<Description> Use the Get-HypVMMacAddress -LiteralPath<HostPathLocation> command to list the MAC address of specified desktop VMs. 
To provision machine instances starting from the Desktop base image template, run the following command:

    New-ProvScheme -ProvisioningSchemeName <SchemeName> -HostingUnitName <HypervisorServer> -IdentityPoolName <PoolName> -MasterImageVM <BaseImageTemplatePath> -VMMemoryMB <MemoryAssigned> -VMCpuCount <NumberofCPU>

To specify the creation of instances with the Personal vDisk technology, use the following option: -UsePersonalVDiskStorage.

After the creation process, retrieve the provisioning scheme information by running the following command:

    Get-ProvScheme -ProvisioningSchemeName <SchemeName>

To modify the resources assigned to desktop instances in a provisioning scheme, use the Set-ProvScheme cmdlet. The permitted parameters are -ProvisioningSchemeName, -VMCpuCount, and -VMMemoryMB.

To update the desktop instances to the latest version of the Desktop base image template, run the following cmdlet:

    Publish-ProvMasterVmImage -ProvisioningSchemeName <SchemeName> -MasterImageVM <BaseImageTemplatePath>

If you do not want to keep the pre-update instance version as a restore checkpoint, use the -DoNotStoreOldImage option.

To create machine instances based on the previously configured provisioning scheme for an MCS architecture, run this command:

    New-ProvVM -ProvisioningSchemeName <SchemeName> -ADAccountName "Domain\MachineAccount"

Use the -FastBuild option to make the machine creation process faster. On the other hand, you cannot start up the machines until the process has been completed.
Retrieve the configured desktop instances by using the next cmdlet:

    Get-ProvVM -ProvisioningSchemeName <SchemeName> -VMName <MachineName>

To remove an existing virtual desktop, use the following command:

    Remove-ProvVM -ProvisioningSchemeName <SchemeName> -VMName <MachineName> -AdminAddress <BrokerAddress>

The next script combines some of the commands listed in this recipe:

    #------------ Script - Hosting + MCS
    #-----------------------------------
    #------------ Define Variables
    $LitPath = "XDHyp:\HostingUnits\VMware01"
    $StorPath = "XDHyp:\HostingUnits\VMware01\datastore1.storage"
    $Controller_Address = "192.168.110.30"
    $HostUnitName = "Vmware01"
    $IDPool = $(Get-AcctIdentityPool -IdentityPoolName VDI-DESKTOP)
    $BaseVMPath = "XDHyp:\HostingUnits\VMware01\VMXD7-W8MCS-01.vm"
    #------------ Creating a storage location
    Add-HypHostingUnitStorage -LiteralPath $LitPath -StoragePath $StorPath -StorageType OSStorage -AdminAddress $Controller_Address
    #---------- Creating a Provisioning Scheme
    New-ProvScheme -ProvisioningSchemeName Deploy_01 -HostingUnitName $HostUnitName -IdentityPoolName $IDPool.IdentityPoolName -MasterImageVM $BaseVMPath\T0-Post.snapshot -VMMemoryMB 4096 -VMCpuCount 2 -CleanOnBoot
    #---------- List the VM configured on the Hypervisor Host
    dir $LitPath\*.vm
    exit

How it works...

The Host and MachineCreation cmdlet groups manage the interfacing with the hypervisor hosts, in terms of machines and storage resources. This allows you to create the desktop instances to assign to the end user, starting from an existing and mapped Desktop virtual machine. The Get-HypHypervisorPlugin command retrieves and lists the hypervisors that are available for deploying virtual desktops and configuring the storage types. You need to configure an operating system storage area or a Personal vDisk storage zone. To map an existing storage location from the hypervisor to the XenDesktop controller, run the Add-HypHostingUnitStorage cmdlet.
In this case you have to specify the destination path on which the storage object will be created (LiteralPath), the source storage path on the hypervisor machine(s) (StoragePath), and the StorageType previously discussed. The storage paths are in the form XDHyp:\HostingUnits\<UnitName>. To list all the configured storage objects, execute the following command:

    dir XDHyp:\HostingUnits\<UnitName>\*.storage

After configuring the storage area, we discussed the Machine Creation Service (MCS) architecture. This cmdlet collection includes commands to generate the VM snapshots from which we can deploy desktop instances (New-HypVMSnapshot), specifying a name and a description for the generated disk snapshot. Starting from the available disk image, the New-ProvScheme command lets you create a resource provisioning scheme, in which you specify the desktop base image, the resources to assign to the desktop instances (CPU and RAM, through -VMCpuCount and -VMMemoryMB), whether to generate these virtual desktops in a non-persistent mode (the -CleanOnBoot option), and whether to use the Personal vDisk technology (-UsePersonalVDiskStorage). It is possible to update the deployed instances to the latest base image through the Publish-ProvMasterVmImage command.

In the generated script, we located the main storage locations (the LitPath and StorPath variables) needed to build a provisioning scheme, and then implemented a provisioning procedure for a desktop based on an existing base image snapshot, with two vCPUs and 4 GB of RAM for the delivered instances, which will be cleaned every time they stop and start (by using the -CleanOnBoot option).

You can navigate the local and remote storage paths configured with the XenDesktop Broker machine; to list an object category (such as VM or Snapshot), you can execute this command:

    dir XDHyp:\HostingUnits\<UnitName>\*.<category>

There's more...
The cmdlets discussed also offer a way to protect a virtual desktop from accidental deletion or unauthorized use. The Machine Creation cmdlets group includes a command that allows you to lock critical desktops: Lock-ProvVM. This cmdlet requires as parameters the name of the scheme to which the desktops belong (-ProvisioningSchemeName) and the ID of the virtual desktop to lock (-VMID). You can retrieve the virtual machine ID by running the Get-ProvVM command discussed previously. To remove the machine lock, so that the desktop instance can again be deleted or modified, execute the Unlock-ProvVM cmdlet, using the same parameters shown for the lock procedure.

Managing additional components – StoreFront admin and logging cmdlets

In this recipe, we will explain how to manage and configure the StoreFront component by using the available Citrix PowerShell cmdlets. Moreover, we will explain how to manage and check the configurations for the system logging activities.

Getting ready

No preliminary tasks are required. You have already installed the Citrix XenDesktop PowerShell SDK during the installation of the Desktop Controller role machine(s). To be able to run a PowerShell script (in the .ps1 format), you have to enable script execution from the PowerShell prompt in this way:

    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force

How to do it…

In this section, we will explain and execute the commands associated with the Citrix StoreFront system:

Connect to one of the Desktop Broker servers.

Click on the PowerShell icon installed on the Windows taskbar.

Load the Citrix PowerShell modules by typing the following command, and then press the Enter key:

    Asnp Citrix*

To execute a command, press the Enter key after typing the complete command syntax.
Retrieve the currently existing StoreFront service instances by running the following command:

    Get-SfService

To limit the number of rows in the output, you can add the -MaxRecordCount <value> parameter.

To list detailed information about the StoreFront service(s) currently configured, execute the following command:

    Get-SfServiceInstance -AdminAddress <ControllerAddress>

The status of the currently active StoreFront instances can be retrieved by using the Get-SfServiceStatus command. The OK output confirms that the service is running correctly.

To list the task history associated with the configured StoreFront instances, run the following command:

    Get-SfTask

You can filter the results by the ID of the task you are looking for (-taskid) and sort them by using the -sortby parameter.

To retrieve the installed database schema versions, you can execute the following command:

    Get-SfInstalledDBVersion

By applying the -Upgrade and -Downgrade filters, you will receive, respectively, the schemas for which the database version can be updated or reverted to a previous compatible one.

To modify the StoreFront configuration so that it registers its state on a different database, you can use the following command:

    Set-SfDBConnection -DBConnection <DBConnectionString> -AdminAddress <ControllerAddress>

Be careful when you specify the database connection string; if it is not specified, the existing database connections and configurations will be cleared!
To check that the database connection has been correctly configured, the following command is available:

    Test-SfDBConnection -DBConnection <DBConnectionString> -AdminAddress <ControllerAddress>

The second group of cmdlets discussed, the logging group, lets you retrieve information about the current status of the logging service. To do so, run the following command:

    Get-LogServiceStatus

To verify the language used and whether the logging service has been enabled, run the following command:

    Get-LogSite

The available configurable locales are en, ja, zh-CN, de, es, and fr. The available states are Enabled, Disabled, NotSupported, and Mandatory. The NotSupported state indicates an incorrect configuration for the listed parameters.

To retrieve detailed information about the running logging service, use the following command:

    Get-LogService

As discussed earlier for the StoreFront commands, you can limit the output by applying the -MaxRecordCount <value> parameter.

To get all the operations logged within a specified time range, run the following command; this will return the global operations count:

    Get-LogSummary -StartDateRange <StartDate> -EndDateRange <EndDate>

The date format must be the following: YYYY-MM-DD HH:MM:SS.

To list the collected operations per day in the specified time period, run the previous command in the following way:

    Get-LogSummary -StartDateRange <StartDate> -EndDateRange <EndDate> -IntervalSeconds 86400

The value 86400 is the number of seconds in a day.
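As a side illustration of the date format and the daily interval used with Get-LogSummary, the following minimal Python sketch formats a start and end date and counts how many interval buckets the range would span. It is only an aid for preparing the parameter values outside PowerShell; the helper names format_range and bucket_count are hypothetical, and the date layout assumes the YYYY-MM-DD HH:MM:SS format described above.

```python
import math
from datetime import datetime
from typing import Tuple

# Number of seconds in a day, the value passed to the interval parameter above.
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed layout of the date strings: YYYY-MM-DD HH:MM:SS.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_range(start: datetime, end: datetime) -> Tuple[str, str]:
    """Return the start and end dates as strings in the assumed format."""
    if end <= start:
        raise ValueError("end must be later than start")
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


def bucket_count(start: datetime, end: datetime,
                 interval_seconds: int = SECONDS_PER_DAY) -> int:
    """Count how many intervals are needed to cover the time range."""
    if end <= start:
        raise ValueError("end must be later than start")
    total_seconds = (end - start).total_seconds()
    return math.ceil(total_seconds / interval_seconds)


if __name__ == "__main__":
    first = datetime(2015, 9, 1)
    last = datetime(2015, 9, 8)
    print(format_range(first, last))
    print(bucket_count(first, last))
```

A one-week range with the default daily interval spans seven buckets, so a per-day summary for that range would be expected to cover seven intervals.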
To retrieve the connection string information about the database on which logging data is stored, execute the following command:

    Get-LogDataStore

To retrieve detailed information about the high level operations performed on the XenDesktop infrastructure, run the following command:

    Get-LogHighLevelOperation -Text <TextincludedintheOperation> -StartTime <FormattedDateandTime> -EndTime <FormattedDateandTime> -IsSuccessful <true | false> -User <DomainUserName> -OperationType <AdminActivity | ConfigurationChange>

The indicated filters are not mandatory. If you do not apply any filters, all the logged operations will be returned. This could be a very long output.

The same information can be retrieved for the low level system operations in the following way:

    Get-LogLowLevelOperation -StartTime <FormattedDateandTime> -EndTime <FormattedDateandTime> -IsSuccessful <true | false> -User <DomainUserName> -OperationType <AdminActivity | ConfigurationChange>

In the How it works section, we will explain the difference between the high and low level operations.

To log when a high level operation starts and stops, use the following two commands, respectively:

    Start-LogHighLevelOperation -Text <OperationDescriptionText> -Source <OperationSource> -StartTime <FormattedDateandTime> -OperationType <AdminActivity | ConfigurationChange>

    Stop-LogHighLevelOperation -HighLevelOperationId <OperationID> -IsSuccessful <true | false>

Stop-LogHighLevelOperation must refer to an existing, previously started high level operation, because the two are related tasks.

How it works...

Here, we have discussed two new PowerShell command collections for the XenDesktop 7 versions: the cmdlets related to the StoreFront platform, and the activity logging set of commands. The first collection is quite limited in terms of operations compared with the other cmdlets discussed.
In fact, the only actions permitted with the StoreFront PowerShell set of commands are retrieving configurations and settings about the configured stores and the linked database. More activities can be performed regarding the modification of existing StoreFront clusters, by using the Get-SfCluster, Add-SfServerToCluster, New-SfCluster, and Set-SfCluster set of operations.

More interesting is the PowerShell Logging collection. In this case, you can retrieve all the system-logged data, which falls into two principal categories:

High-level operations: These tasks group all the system configuration changes that are executed by using the Desktop Studio, the Desktop Director, or Citrix PowerShell.

Low-level operations: This category covers all the system configuration changes that are executed by a service and not through the system software's consoles. With the low level operations command, you can filter for the specific high level operation to which the low level operation refers, by specifying the -HighLevelOperationId parameter.

This cmdlet category also gives you the ability to track the start and stop of a high level operation, by using Start-LogHighLevelOperation and Stop-LogHighLevelOperation. In the second case, you have to specify the previously started high level operation.

There's more...

If too much information accumulates in the log store, you can clear it. To remove the log entries in a given time range, use the following command:

    Remove-LogOperation -UserName <DBAdministrativeCredentials> -Password <DBUserPassword> -StartDateRange <StartDate> -EndDateRange <EndDate>

The unencrypted -Password parameter can be replaced by -SecurePassword, which takes the password in secure string form. The credentials must be database administrative credentials, with delete permissions on the destination database.
This operation is irreversible, so make sure that you want to delete the logs in the specified time range, or verify that you have some form of data backup.

Resources for Article:

Further resources on this subject:
Working with Virtual Machines [article]
Virtualization [article]
Upgrading VMware Virtual Infrastructure Setups [article]

Packt
07 Sep 2015
21 min read

Storage Configurations

In this article by Wasim Ahmed, author of the book Proxmox Cookbook, we will cover topics such as local storage, shared storage, Ceph storage, and a recipe which shows you how to configure the Ceph RBD storage. (For more resources related to this topic, see here.) A storage is where virtual disk images of virtual machines reside. There are many different types of storage systems with many different features, performances, and use case scenarios. Whether it is a local storage configured with direct attached disks or a shared storage with hundreds of disks, the main responsibility of a storage is to hold virtual disk images, templates, backups, and so on. Proxmox supports different types of storages, such as NFS, Ceph, GlusterFS, and ZFS. Different storage types can hold different types of data. For example, a local storage can hold any type of data, such as disk images, ISO/container templates, backup files and so on. A Ceph storage, on the other hand, can only hold a .raw format disk image. In order to provide the right type of storage for the right scenario, it is vital to have a proper understanding of different types of storages. The full details of each storage is beyond the scope of this article, but we will look at how to connect them to Proxmox and maintain a storage system for VMs. Storages can be configured into two main categories: Local storage Shared storage Local storage Any storage that resides in the node itself by using directly attached disks is known as a local storage. This type of storage has no redundancy other than a RAID controller that manages an array. If the node itself fails, the storage becomes completely inaccessible. The live migration of a VM is impossible when VMs are stored on a local storage because during migration, the virtual disk of the VM has to be copied entirely to another node. 
A VM can only be live-migrated when there are several Proxmox nodes in a cluster and the virtual disk is stored on a shared storage accessed by all the nodes in the cluster. Shared storage A shared storage is one that is available to all the nodes in a cluster through some form of network media. In a virtual environment with shared storage, the actual virtual disk of the VM may be stored on a shared storage, while the VM actually runs on another Proxmox host node. With shared storage, the live migration of a VM becomes possible without powering down the VM. Multiple Proxmox nodes can share one shared storage, and VMs can be moved around since the virtual disk is stored on different shared storages. Usually, a few dedicated nodes are used to configure a shared storage with their own resources apart from sharing the resources of a Proxmox node, which could be used to host VMs. In recent releases, Proxmox has added some new storage plugins that allow users to take advantage of some great storage systems and integrating them with the Proxmox environment. Most of the storage configurations can be performed through the Proxmox GUI. Ceph storage Ceph is a powerful distributed storage system, which provides RADOS Block Device (RBD) object storage, Ceph filesystem (CephFS), and Ceph Object Storage. Ceph is built with a very high-level of reliability, scalability, and performance in mind. A Ceph cluster can be expanded to several petabytes without compromising data integrity, and can be configured using commodity hardware. Any data written to the storage gets replicated across a Ceph cluster. Ceph was originally designed with big data in mind. Unlike other types of storages, the bigger a Ceph cluster becomes, the higher the performance. However, it can also be used in small environments just as easily for data redundancy. A lower performance can be mitigated using SSD to store Ceph journals. Refer to the OSD Journal subsection in this section for information on journals. 
The built-in self-healing features of Ceph provide unprecedented resilience without a single point of failure. In a multinode Ceph cluster, the storage can tolerate not just a hard drive failure, but also an entire node failure without losing data. Currently, only an RBD block device is supported in Proxmox.

Ceph comprises a few components that are crucial for you to understand in order to configure and operate the storage. The following components are what Ceph is made of:

Monitor daemon (MON)
Object Storage Daemon (OSD)
OSD Journal
Metadata Server (MDS)
Controlled Replication Under Scalable Hashing map (CRUSH map)
Placement Group (PG)
Pool

MON

Monitor daemons form quorums for a Ceph distributed cluster. There must be a minimum of three monitor daemons configured on separate nodes for each cluster. Monitor daemons can also be configured as virtual machines instead of using physical nodes. Monitors require a very small amount of resources to function, so allocated resources can be very small. A monitor can be set up through the Proxmox GUI after the initial cluster creation.

OSD

Object Storage Daemons (OSDs) are responsible for the storage and retrieval of actual cluster data. Usually, each physical storage device, such as an HDD or SSD, is configured as a single OSD. Although several OSDs can be configured on a single physical disk, this is not recommended for any production environment. Each OSD requires a journal device where data first gets written and later gets transferred to the actual OSD. By storing journals on fast-performing SSDs, we can increase the Ceph I/O performance significantly.

Thanks to the Ceph architecture, as more and more OSDs are added into the cluster, the I/O performance also increases. An SSD journal works very well on small clusters with about eight OSDs per node. OSDs can be set up through the Proxmox GUI after the initial MON creation.
OSD Journal

Every single piece of data that is destined for a Ceph OSD first gets written to a journal. A journal lets OSD daemons accept writes in smaller chunks, which gives the actual drives more time to commit them. In simpler terms, all data gets written to journals first, then the journal filesystem sends the data to an actual drive for permanent writes. So, if the journal is kept on a fast-performing drive, such as an SSD, incoming data will be written at a much higher speed, while behind the scenes, slower performing SATA drives can commit the writes at a slower speed. Journals on SSD can really improve the performance of a Ceph cluster, especially if the cluster is small, with only a few terabytes of data.

It should also be noted that if a journal drive fails, it will take down all the OSDs whose journals are kept on that drive. In some environments, it may be necessary to put two SSDs in a mirrored RAID and use them for journaling. In a large environment with more than 12 OSDs per node, performance can actually be gained by collocating the journal on the same OSD drive instead of using an SSD for the journal.

MDS

The Metadata Server (MDS) daemon is responsible for providing the Ceph filesystem (CephFS) in a Ceph distributed storage system. MDS can be configured on separate nodes or coexist with already configured monitor nodes or virtual machines. Although CephFS has come a long way, it is still not fully recommended for use in a production environment. It is worth mentioning here that there are many virtual environments actively running MDS and CephFS without any issues. Currently, it is not recommended to configure more than two MDSs in a Ceph cluster.

CephFS is not currently supported by a Proxmox storage plugin. However, it can be configured as a local mount and then connected to a Proxmox cluster through the Directory storage. MDS cannot be set up through the Proxmox GUI as of version 3.4.
CRUSH map

A CRUSH map is the heart of the Ceph distributed storage. The algorithm for storing and retrieving user data in Ceph clusters is laid out in the CRUSH map. CRUSH allows a Ceph client to directly access an OSD. This eliminates a single point of failure and any physical limitations of scalability, since there are no centralized servers or controllers to manage data in and out. Throughout Ceph clusters, CRUSH maintains a map of all MONs and OSDs. CRUSH determines how data should be chunked and replicated among OSDs spread across several local nodes or even nodes located remotely.

A default CRUSH map is created on a freshly installed Ceph cluster. This can be further customized based on user requirements. For smaller Ceph clusters, the default map should work just fine. However, when Ceph is deployed with very big data in mind, this map should be customized. A customized map will allow better control of a massive Ceph cluster. To operate Ceph clusters of any size successfully, a clear understanding of the CRUSH map is mandatory. For more details on the Ceph CRUSH map, visit http://ceph.com/docs/master/rados/operations/crush-map/ and http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map.

As of Proxmox VE 3.4, we cannot customize the CRUSH map through the Proxmox GUI. It can only be viewed through the GUI and edited through a CLI.

PG

In a Ceph storage, data objects are aggregated in groups determined by CRUSH algorithms. Such a group is known as a Placement Group (PG), since CRUSH places the group in various OSDs depending on the replication level set in the CRUSH map and the number of OSDs and nodes. By tracking a group of objects instead of each object itself, a massive amount of hardware resources can be saved. It would be impossible to track millions of individual objects in a cluster.
The following diagram shows how objects are aggregated in groups and how PG relates to OSD:

To balance available hardware resources, it is necessary to assign the right number of PGs. The number of PGs should vary depending on the number of OSDs in a cluster. The following is a table of PG suggestions made by Ceph developers:

Number of OSDs | Number of PGs
Less than 5 OSDs | 128
Between 5-10 OSDs | 512
Between 10-50 OSDs | 4096

Selecting the proper number of PGs is crucial since each PG will consume node resources. Too many PGs for the number of OSDs will actually penalize the resource usage of an OSD node, while very few assigned PGs in a large cluster will put data at risk. A rule of thumb is to start with the lowest number of PGs possible, then increase them as the number of OSDs increases. For details on Placement Groups, visit http://ceph.com/docs/master/rados/operations/placement-groups/. There's a great PG calculator created by Ceph developers to calculate the recommended number of PGs for various sizes of Ceph clusters at http://ceph.com/pgcalc/.

Pools

Pools in Ceph are like partitions on a hard drive. We can create multiple pools on a Ceph cluster to separate stored data. For example, a pool named accounting can hold all the accounting department data, while another pool can store the human resources data of a company. When creating a pool, assigning the number of PGs is necessary. During the initial Ceph configuration, three default pools are created: data, metadata, and rbd. Deleting a pool will delete all stored objects permanently. For details on Ceph and its components, visit http://ceph.com/docs/master/.

The following diagram shows a basic Proxmox+Ceph cluster:

The preceding diagram shows four Proxmox nodes, three Monitor nodes, three OSD nodes, and two MDS nodes comprising a Proxmox+Ceph cluster. Note that Ceph is on a different network than the Proxmox public network.
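As an aside on the PG sizing guidance given earlier in this section, the following minimal Python sketch encodes the suggestion table as a lookup. For clusters larger than 50 OSDs, which the table does not cover, it falls back on the rule of thumb commonly cited in the Ceph documentation: roughly 100 PGs per OSD divided by the replica count, rounded up to the next power of two. The function name suggested_pg_count is hypothetical, the table's overlapping boundaries (10 OSDs) are resolved here by an assumption, and the PG calculator mentioned above remains the better source for real deployments.

```python
def suggested_pg_count(osd_count: int, replicas: int = 2,
                       pgs_per_osd: int = 100) -> int:
    """Suggest a PG count for a cluster with the given number of OSDs.

    Up to 50 OSDs, the values come from the suggestion table in the text.
    Above 50 OSDs, a rule of thumb is used (an assumption of this sketch):
    (OSDs * pgs_per_osd) / replicas, rounded up to a power of two.
    """
    if osd_count < 1 or replicas < 1:
        raise ValueError("osd_count and replicas must be at least 1")

    # Table values; a cluster with exactly 10 OSDs is treated as part
    # of the 5-10 row, since the table's ranges overlap at that point.
    if osd_count < 5:
        return 128
    if osd_count <= 10:
        return 512
    if osd_count <= 50:
        return 4096

    # Rule of thumb for larger clusters, rounded up to a power of two.
    raw = osd_count * pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    for osds in (3, 6, 30, 60):
        print(osds, suggested_pg_count(osds))
```

Because the PG value of a pool can later be increased but not decreased, a helper like this is best used to pick a conservative starting point.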
Depending on the set replication number, each incoming data object needs to be written more than once. This causes high bandwidth usage. By separating Ceph on a dedicated network, we can ensure that a Ceph network can fully utilize the bandwidth. On advanced clusters, a third network is created only between Ceph nodes for cluster replication, thus improving network performance even further.

As of Proxmox VE 3.4, the same node can be used for both Proxmox and Ceph. This provides a great way to manage all the nodes from the same Proxmox GUI. However, it is not advisable to put Proxmox VMs on a node that is also configured as Ceph. During day-to-day operations, Ceph nodes do not consume large amounts of resources, such as CPU or memory. However, when Ceph goes into rebalancing mode due to an OSD or node failure, a large amount of data replication occurs, which takes up lots of resources. Performance will degrade significantly if resources are shared by both VMs and Ceph.

Ceph RBD storage can only store .raw virtual disk image files. Ceph itself does not come with a GUI to manage, so having the option to manage Ceph nodes through the Proxmox GUI makes administrative tasks much easier. Refer to the Monitoring the Ceph storage subsection under the How to do it... section of the Connecting the Ceph RBD storage recipe later in this article to learn how to install a great read-only GUI to monitor Ceph clusters.

Connecting the Ceph RBD storage

In this recipe, we are going to see how to configure a Ceph block storage with a Proxmox cluster.

Getting ready

The initial Ceph configuration on a Proxmox cluster must be accomplished through a CLI. After the Ceph installation, the initial configuration, and the creation of one monitor, all other tasks can be accomplished through the Proxmox GUI.

How to do it...

We will now see how to configure the Ceph block storage with Proxmox.

Installing Ceph on Proxmox

Ceph is not installed by default.
Prior to configuring a Proxmox node for the Ceph role, Ceph needs to be installed and the initial configuration must be created through a CLI. The following steps need to be performed on all Proxmox nodes that will be part of the Ceph cluster:

Log in to each node through SSH or a console.

Configure a second network interface to create a separate Ceph network with a different subnet.

Reboot the nodes to initialize the network configuration.

Using the following command, install the Ceph package on each node:

    # pveceph install -version giant

Initializing the Ceph configuration

Before Ceph is usable, we have to create the initial Ceph configuration file on one Proxmox+Ceph node. The following steps need to be performed only on one Proxmox node that will be part of the Ceph cluster:

Log in to the node using SSH or a console.

Run the following command to create the initial Ceph configuration:

    # pveceph init --network <ceph_subnet>/CIDR

Run the following command to create the first monitor:

    # pveceph createmon

Configuring Ceph through the Proxmox GUI

After the initial Ceph configuration and the creation of the first monitor, we can continue with further Ceph configurations through the Proxmox GUI or simply run the Ceph Monitor creation command on other nodes. The following steps show how to create Ceph Monitors and OSDs from the Proxmox GUI:

Log in to the Proxmox GUI as root or as any other user with administrative privileges.

Select the node where the initial monitor was created in the previous steps, and then click on Ceph from the tabbed menu.
The following screenshot shows a Ceph cluster as it appears after the initial Ceph configuration:

Since no OSDs have been created yet, it is normal for a new Ceph cluster to show a PGs stuck and unclean error.

Click on Disks on the bottom tabbed menu under Ceph to display the disks attached to the node, as shown in the following screenshot:

Select an available attached disk, then click on the Create: OSD button to open the OSD dialog box, as shown in the following screenshot:

Click on the Journal Disk drop-down menu to select a different device, or collocate the journal on the same OSD by keeping it as the default.

Click on Create to finish the OSD creation.

Create additional OSDs on Ceph nodes as needed. The following screenshot shows a Proxmox node with three OSDs configured:

By default, Proxmox creates OSDs with an ext3 partition. However, sometimes, it may be necessary to create OSDs with different partition types due to a requirement or for performance improvement. Enter the following command format through the CLI to create an OSD with a different partition type:

    # pveceph createosd -fstype ext4 /dev/sdX

The following steps show how to create Monitors through the Proxmox GUI:

Click on Monitor from the tabbed menu under the Ceph feature. The following screenshot shows the Monitor status with the initial Ceph Monitor we created earlier in this recipe:

Click on Create to open the Monitor dialog box.

Select a Proxmox node from the drop-down menu.

Click on the Create button to start the monitor creation process.

Create a total of three Ceph monitors to establish a Ceph quorum. The following screenshot shows the Ceph status with three monitors and OSDs added:

Note that even with three OSDs added, the PGs are still stuck with errors. This is because by default, the Ceph CRUSH map is set up for two replicas. So far, we've only created OSDs on one node. For a successful replication, we need to add some OSDs on the second node so that data objects can be replicated twice.
Follow the steps described earlier to create three additional OSDs on the second node. After creating three more OSDs, the Ceph status should look like the following screenshot:

Managing Ceph pools

It is possible to perform basic tasks, such as creating and removing Ceph pools, through the Proxmox GUI. Besides these, we can check the list, status, number of PGs, and usage of the Ceph pools. The following steps show how to check, create, and remove Ceph pools through the Proxmox GUI:

Click on the Pools tabbed menu under Ceph in the Proxmox GUI. The following screenshot shows the status of the default rbd pool, which has replica 1, 256 PG, and 0% usage:

Click on Create to open the pool creation dialog box.

Fill in the required information, such as the name of the pool, replica size, and number of PGs. Unless the CRUSH map has been fully customized, the ruleset should be left at the default value 0.

Click on OK to create the pool.

To remove a pool, select the pool and click on Remove. Remember that once a Ceph pool is removed, all the data stored in this pool is deleted permanently.

To increase the number of PGs, run the following commands through the CLI:

    # ceph osd pool set <pool_name> pg_num <value>
    # ceph osd pool set <pool_name> pgp_num <value>

It is only possible to increase the PG value. Once increased, the PG value can never be decreased.

Connecting RBD to Proxmox

Once a Ceph cluster is fully configured, we can proceed to attach it to the Proxmox cluster. During the initial configuration file creation, Ceph also creates an authentication keyring at the path /etc/ceph/ceph.client.admin.keyring. This keyring needs to be copied and renamed to match the name of the storage ID to be created in Proxmox. Run the following commands to create a directory and copy the keyring:

    # mkdir /etc/pve/priv/ceph
    # cd /etc/ceph/
    # cp ceph.client.admin.keyring /etc/pve/priv/ceph/<storage>.keyring

For our storage, we are naming it rbd.keyring.
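The keyring copy step above can also be sketched in Python, which makes the naming rule explicit: the target file name is the Proxmox storage ID followed by .keyring. This is a minimal sketch that only mirrors the mkdir and cp commands; the function name install_keyring is hypothetical, the default paths are the ones quoted in the text, and on a real node the commands shown above are all that is needed.

```python
import shutil
from pathlib import Path


def install_keyring(storage_id: str,
                    source: str = "/etc/ceph/ceph.client.admin.keyring",
                    dest_dir: str = "/etc/pve/priv/ceph") -> Path:
    """Copy the Ceph admin keyring to <dest_dir>/<storage_id>.keyring."""
    if not storage_id:
        raise ValueError("storage_id must not be empty")

    src = Path(source)
    if not src.is_file():
        raise FileNotFoundError(f"keyring not found: {src}")

    # Equivalent of 'mkdir /etc/pve/priv/ceph' (no error if it exists).
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # The file name must match the storage ID that will be used in Proxmox.
    target = target_dir / f"{storage_id}.keyring"
    shutil.copy2(src, target)
    return target


if __name__ == "__main__":
    # With the storage ID 'rbd', this produces rbd.keyring, as in the text.
    print(install_keyring("rbd"))
```

Passing the storage ID as a parameter keeps the keyring name and the ID entered later in the storage dialog from drifting apart.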
After the keyring is copied, we can attach the Ceph RBD storage to Proxmox using the GUI:

1. Click on Datacenter, then click on Storage from the tabbed menu.
2. Click on the Add drop-down menu and select the RBD storage plugin.
3. Enter the information as described in the following table:

Item | Type of value | Entered value
ID | The name of the storage. | rbd
Pool | The name of the Ceph pool. | rbd
Monitor Host | The IP address and port number of the Ceph MONs. We can enter multiple MON hosts for redundancy. | 172.16.0.71:6789;172.16.0.72:6789;172.16.0.73:6789
User name | The default Ceph administrator. | admin
Nodes | The Proxmox nodes that will be able to use the storage. | All
Enable | The checkbox for enabling/disabling the storage. | Enabled

4. Click on Add to attach the RBD storage. The following screenshot shows the RBD storage under Summary:

Monitoring the Ceph storage

Ceph itself does not come with any GUI to manage or monitor the cluster. We can view the cluster status and perform various Ceph-related tasks through the Proxmox GUI. Several third-party tools also provide a Ceph-only GUI to manage and monitor the cluster. Some of them provide management features, while others provide read-only features for Ceph monitoring. Ceph Dash is one such tool: it provides an appealing read-only GUI to monitor the entire Ceph cluster without logging on to the Proxmox GUI. Ceph Dash is freely available through GitHub. There are also heavyweight Ceph GUI dashboards, such as Kraken and Calamari. In this section, we are only going to see how to set up the Ceph Dash cluster monitoring GUI.

The following steps can be used to download and start Ceph Dash to monitor a Ceph cluster using any browser:

1. Log in to any Proxmox node that is also a Ceph MON.
2. Run the following commands to download and start the dashboard:

    # mkdir /home/tools
    # apt-get install git
    # cd /home/tools
    # git clone https://github.com/Crapworks/ceph-dash
    # cd /home/tools/ceph-dash
    # ./ceph-dash.py

3. Ceph Dash will now start listening on port 5000 of the node. If the node is behind a firewall, open port 5000, or forward another port to it, in the firewall.
4. Open any browser and enter <node_ip>:5000 to open the dashboard. The following screenshot shows the dashboard of the Ceph cluster we have created:

We can also monitor the status of the Ceph cluster through the CLI using the following commands:

To check the Ceph status:
    # ceph -s
To view OSDs in different nodes:
    # ceph osd tree
To display real-time Ceph logs:
    # ceph -w
To display a list of Ceph pools:
    # rados lspools
To change the number of replicas of a pool:
    # ceph osd pool set <pool_name> size <value>

Besides the preceding commands, there are many more CLI commands to manage Ceph and perform advanced tasks. The official Ceph documentation has a wealth of information and how-to guides, along with the CLI commands to perform them. The documentation can be found at http://ceph.com/docs/master/.

How it works…

At this point, we have successfully integrated a Ceph cluster with a Proxmox cluster; the Ceph cluster comprises six OSDs, three MONs, and three nodes. By viewing the Ceph Status page, we can get a lot of information about a Ceph cluster at a quick glance. From the previous figure, we can see that there are 256 PGs in the cluster and that the total cluster storage space is 1.47 TB. A healthy cluster will have the PG status as active+clean. Depending on the nature of the issue, the PGs can have various states, such as active+unclean, inactive+degraded, active+stale, and so on. To learn details about all the states, visit http://ceph.com/docs/master/rados/operations/pg-states/.

By configuring a second network interface, we can separate the Ceph network from the main network.
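The health check described above can also be scripted, for example for a cron job or a monitoring agent. The sketch below is an illustration under stated assumptions rather than part of this recipe: the function name ceph_health_status is hypothetical, and it relies on the ceph health command (not used elsewhere in this article), whose output begins with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR.

```shell
#!/bin/sh
# Hypothetical helper: turn the text printed by `ceph health` into an exit
# status that scripts can act on.
#   $1  output of `ceph health` (for example: "HEALTH_OK")
# Returns 0 for HEALTH_OK, 1 for HEALTH_WARN, and 2 for anything else.
ceph_health_status() {
    case "$1" in
        HEALTH_OK*)   return 0 ;;
        HEALTH_WARN*) return 1 ;;
        *)            return 2 ;;
    esac
}

# Example use on a node with Ceph admin access (left commented out because it
# needs a running cluster):
#   ceph_health_status "$(ceph health)" || echo "Ceph needs attention"
```

Passing the command output in as an argument keeps the function free of side effects, so it can be exercised with sample strings before it is pointed at a live cluster.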
The pveceph init command creates the Ceph configuration file at /etc/pve/ceph.conf. A newly created Ceph configuration file looks similar to the following screenshot:

Since the ceph.conf configuration file is stored in pmxcfs, any changes made to it are immediately replicated to all the Proxmox nodes in the cluster.

As of Proxmox VE 3.4, Ceph RBD can only store images in the .raw format. No templates, containers, or backup files can be stored on the RBD block storage.

Here is the content of the storage configuration file after adding the Ceph RBD storage:

    rbd: rbd
        monhost 172.16.0.71:6789;172.16.0.72:6789;172.16.0.73:6789
        pool rbd
        content images
        username admin

If the IP address of any node has to change, we can simply edit this entry in the configuration file to update the IP addresses of the Ceph MON nodes manually.

See also

To learn about Ceph in greater detail, visit http://ceph.com/docs/master/ for the official Ceph documentation.
Also, visit https://indico.cern.ch/event/214784/session/6/contribution/68/material/slides/0.pdf to find out why Ceph is being used at CERN to store the massive data generated by the Large Hadron Collider (LHC).

Summary

In this article, we looked at different configurations for a variety of storage categories and got hands-on practice with the various stages of configuring the Ceph RBD storage.

Resources for Article:

Further resources on this subject:
Deploying New Hosts with vCenter [article]
Let's Get Started with Active Directory [article]
Basic Concepts of Proxmox Virtual Environment [article]