How-To Tutorials - Data

Hadoop Monitoring and its aspects

Packt
04 May 2015
8 min read
In this article by Gurmukh Singh, the author of the book Monitoring Hadoop, we look at why monitoring Hadoop is important. It also explains various other Hadoop-related concepts, such as its architecture and Ganglia (a tool used to monitor Hadoop).

In any enterprise, big or small, it is very important to monitor the health of all its components, such as servers, network devices, and databases, and make sure things are working as intended. Monitoring is critical for any business that depends on infrastructure; it provides the signals needed to take the necessary actions in case of failures. Monitoring can be very complex in a real production environment, with many components and configurations. There might be different security zones, different ways in which servers are set up, or the same database might be used in many different ways, with servers listening on various service ports.

Before diving into setting up monitoring and logging for Hadoop, it is very important to understand the basics of monitoring, how it works, and some commonly used tools in the market. In Hadoop, we can monitor resources and services, and also collect metrics from various Hadoop counters. There are many tools available in the market, and one of them is Nagios, which is widely used.

Nagios is a powerful monitoring system that provides you with instant awareness of your organization's mission-critical IT infrastructure. By using Nagios, you can:

- Plan release cycles and rollouts before things get outdated
- Detect problems early, before they cause an outage
- Automate and improve the response across the organization

Nagios architecture

Nagios is based on a simple server-client architecture, in which the server can execute checks remotely using NRPE agents on the Linux clients. The results of the execution are captured by the server, which raises alerts accordingly. The checks could be for memory, disk, CPU utilization, network, database connections, and many more. Nagios provides the flexibility to use either active or passive checks.
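As a rough illustration of how such checks are wired up on the Nagios side, here is a minimal sketch of Nagios object definitions for a NameNode host; the host name, address, and NRPE command names are hypothetical, and the referenced commands would have to be configured on the monitored node's NRPE agent.

  define host {
      use        linux-server
      host_name  namenode01            ; hypothetical NameNode host
      address    192.168.10.11
  }

  define service {
      use                  generic-service
      host_name            namenode01
      service_description  NameNode disk space
      check_command        check_nrpe!check_disk    ; run the standard check_disk plugin via NRPE
  }

  define service {
      use                  generic-service
      host_name            namenode01
      service_description  NameNode process
      check_command        check_nrpe!check_namenode_proc   ; hypothetical NRPE command that checks the daemon
  }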
Ganglia

Ganglia is a beautiful tool for aggregating stats and plotting them nicely. Nagios gives you events and alerts; Ganglia aggregates the data and presents it in a meaningful way. What if you want to look at total CPU or memory for a cluster of 2000 nodes, or total free disk space across 1000 nodes? Some of the key features of Ganglia are:

- View historical and real-time metrics of a single node or of the entire cluster
- Use the data to make decisions on cluster sizing and performance

Ganglia components

- Ganglia Monitoring Daemon (gmond): This runs on the nodes that need to be monitored, captures state changes, and sends updates using XDR to a central daemon.
- Ganglia Meta Daemon (gmetad): This collects data from gmond and other gmetad daemons. The data is indexed and stored on disk in a round-robin fashion. There is also a Ganglia frontend for a meaningful display of the information collected.

All these tools can be integrated with Hadoop to monitor it and capture its metrics.

Integration with Hadoop

There are many important components in Hadoop that need to be monitored, such as NameNode uptime, disk space, memory utilization, and heap size. Similarly, on the DataNode we need to monitor disk usage and memory utilization, and across the MapReduce components we need to monitor the job execution flow status. To know what to monitor, we must understand how the Hadoop daemons communicate with each other.

There are lots of ports used in Hadoop; some are for internal communication, such as job scheduling and replication, while others are for user interactions. They may be exposed using TCP or HTTP. The Hadoop daemons provide information over HTTP about logs, stacks, and metrics that can be used for troubleshooting. The NameNode can expose information about the file system, live or dead nodes, and block reports from the DataNodes, and the JobTracker can expose information for tracking running jobs. Hadoop uses TCP, HTTP, IPC, or sockets for communication among the nodes or daemons.

The YARN framework

YARN (Yet Another Resource Negotiator) is the new MapReduce framework. It is designed to scale for large clusters and performs much better compared to the old framework. There is a new set of daemons in the new framework, and it is good to understand how they communicate with each other.

Logging in Hadoop

In Hadoop, each daemon writes its own logs, and the severity of logging is configurable. The logs in Hadoop can be related to the daemons or to the jobs submitted. They are useful for troubleshooting slowness, issues with MapReduce tasks, connectivity problems, and platform bugs. The logs generated can be user level, like the TaskTracker logs on each node, or can be related to master daemons such as the NameNode and JobTracker. In the newer YARN platform, there is a feature to move the logs to HDFS after the initial logging. In Hadoop 1.x, user log management is done using UserLogManager, which cleans and truncates logs according to retention and size parameters such as mapred.userlog.retain.hours and mapreduce.cluster.map.userlog.retain-size, respectively. The tasks' standard output and error streams are piped to the Unix tail program, so only the required size is retained.

The following are some of the challenges of log management in Hadoop:

- Excessive logging: The truncation of logs is not done until the tasks finish. For many jobs, this can cause disk space issues, as the amount of data written is quite large.
- Truncation: We cannot always say what to log and how much is good enough. For some users, 500 KB of logs might be enough, but for others 10 MB might not suffice.
- Retention: How long to retain logs, 1 month or 6 months? There is no rule, but there are best practices and governance requirements. In many countries, there is regulation in place to keep data for 1 year. A good practice for any organization is to keep logs for at least 6 months.
- Analysis: What if we want to look at historical data, or aggregate logs onto a central system and analyze them? In Hadoop, logs are served over HTTP for a single node by default.

Some of the issues stated above have been addressed in the YARN framework. Rather than truncating logs on individual nodes, the logs can be moved to HDFS and processed using other tools. The logs are written at the per-application level into per-application directories. The user can access these logs through the command line or the web UI, for example, with $HADOOP_YARN_HOME/bin/yarn logs.
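How this log aggregation is switched on is not shown in the article; as a hedged sketch, the standard YARN properties below (set in yarn-site.xml) enable it, with an illustrative HDFS path and retention period rather than values taken from the book:

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <!-- HDFS directory for the aggregated container logs (illustrative path) -->
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <!-- keep aggregated logs for 7 days (illustrative value) -->
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>

The aggregated logs of a finished application can then be fetched from the command line with yarn logs -applicationId <application ID>.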
Hadoop metrics

In Hadoop, there are many daemons running, such as the DataNode, NameNode, and JobTracker, and each of these daemons captures a lot of information about the components it works on. Similarly, in the YARN framework we have the ResourceManager, NodeManager, and ApplicationMaster, each of which exposes metrics, explained in the following sections under Metrics2. For example, the DataNode collects metrics such as the number of blocks it has (for advertising to the NameNode), the number of replicated blocks, and metrics about the various reads and writes from clients. In addition to this, there can be metrics related to events, and so on. Hence, it is very important to gather these metrics for the smooth working of the Hadoop cluster; they also help in debugging if something goes wrong.

For this, Hadoop has a metrics system for collecting all this information. There are two versions of the metrics system: Metrics and Metrics2, for Hadoop 1.x and Hadoop 2.x respectively. They are configured through the files hadoop-metrics.properties and hadoop-metrics2.properties for the respective Hadoop versions.

Configuring Metrics2

For Hadoop version 2, which uses the YARN framework, the metrics can be configured using hadoop-metrics2.properties, under the $HADOOP_HOME directory:

  *.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
  *.period=10
  namenode.sink.file.filename=namenode-metrics.out
  datanode.sink.file.filename=datanode-metrics.out
  jobtracker.sink.file.filename=jobtracker-metrics.out
  tasktracker.sink.file.filename=tasktracker-metrics.out
  maptask.sink.file.filename=maptask-metrics.out
  reducetask.sink.file.filename=reducetask-metrics.out

Hadoop metrics configuration for Ganglia

Firstly, we need to define a sink class for Ganglia:

  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31

Secondly, we need to define how often the source should be polled for data. Here, we are polling every 30 seconds:

  *.sink.ganglia.period=30

Then we define the retention for the metrics:

  *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

Summary

In this article, we learned about Hadoop monitoring and its importance, and also covered related concepts such as Nagios, Ganglia, Hadoop logging, and metrics.
Machine Learning

Packt
30 Apr 2015
15 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Python Data Science Essentials, you will learn how to deal with big data and get an overview of Stochastic Gradient Descent (SGD).

Dealing with big data

Big data challenges data science projects from four points of view: volume (data quantity), velocity, variety, and veracity. The Scikit-learn package offers a range of classes and functions that will help you work effectively with data so big that it cannot entirely fit in the memory of your computer, no matter what its specifications may be. Before providing you with an overview of big data solutions, we have to create or import some datasets in order to give you a better idea of the scalability and performance of different algorithms. This will require about 1.5 gigabytes of your hard disk, which will be freed after the experiment.

Creating some big datasets as examples

As a typical example of big data analysis, we will use some textual data from the Internet and take advantage of the available fetch_20newsgroups dataset, which contains 11,314 posts, each one averaging about 206 words, that appeared in 20 different newsgroups:

  In:
  import numpy as np
  from sklearn.datasets import fetch_20newsgroups

  newsgroups_dataset = fetch_20newsgroups(shuffle=True,
      remove=('headers', 'footers', 'quotes'), random_state=6)
  print 'Posts inside the data: %s' % np.shape(newsgroups_dataset.data)
  print 'Average number of words for post: %0.0f' % np.mean(
      [len(text.split(' ')) for text in newsgroups_dataset.data])

  Out:
  Posts inside the data: 11314
  Average number of words for post: 206

Instead, to work out a generic classification example, we will create three synthetic datasets that contain from 100,000 up to 10 million cases. You can create and use any of them according to your computer's resources. We will always refer to the largest one for our experiments:

  In:
  from sklearn.datasets import make_classification

  X, y = make_classification(n_samples=10**5, n_features=5,
      n_informative=3, random_state=101)
  D = np.c_[y, X]
  np.savetxt('huge_dataset_10__5.csv', D, delimiter=",")
  # the saved file should be around 14.6 MB
  del(D, X, y)

  X, y = make_classification(n_samples=10**6, n_features=5,
      n_informative=3, random_state=101)
  D = np.c_[y, X]
  np.savetxt('huge_dataset_10__6.csv', D, delimiter=",")
  # the saved file should be around 146 MB
  del(D, X, y)

  X, y = make_classification(n_samples=10**7, n_features=5,
      n_informative=3, random_state=101)
  D = np.c_[y, X]
  np.savetxt('huge_dataset_10__7.csv', D, delimiter=",")
  # the saved file should be around 1.46 GB
  del(D, X, y)

After creating and using any of the datasets, you can remove them with the following commands:

  import os
  os.remove('huge_dataset_10__5.csv')
  os.remove('huge_dataset_10__6.csv')
  os.remove('huge_dataset_10__7.csv')

Scalability with volume

The trick to managing high volumes of data without loading too many megabytes or gigabytes into memory is to incrementally update the parameters of your algorithm until all the observations have been processed at least once by the machine learner. This is possible in Scikit-learn thanks to the .partial_fit() method, which is available for a certain number of supervised and unsupervised algorithms.
Using the .partial_fit() method and providing some basic information (for example, for classification, you should know beforehand the number of classes to be predicted), you can immediately start fitting your model, even if you have only a single observation or a few. This approach is called incremental learning. The chunks of data that you incrementally feed into the learning algorithm are called batches.

The critical points of incremental learning are as follows:

- Batch size
- Data preprocessing
- Number of passes with the same examples
- Validation and parameter fine-tuning

Batch size generally depends on your available memory. The principle is that the larger the data chunks, the better, since the data sample becomes more representative of the data distribution as its size grows.

Data preprocessing is also challenging. Incremental learning algorithms work well with data in the range of [-1,+1] or [0,+1] (for instance, Multinomial Naive Bayes won't accept negative values). However, to scale into such a precise range, you need to know the range of each variable beforehand. Alternatively, you have to either pass over all the data once and record the minimum and maximum values, or derive them from the first batch, trimming any following observations that exceed the initial maximum and minimum values.

The number of passes can become a problem. As you pass the same examples multiple times, you help the predictive coefficients converge to an optimum solution. However, if you pass too many of the same observations, the algorithm will tend to overfit; that is, it will adapt too much to the data repeated too many times. Some algorithms, like the SGD family, are also very sensitive to the order in which the examples are presented for learning. Therefore, you have to either set their shuffle option (shuffle=True) or shuffle the file rows before learning starts, keeping in mind that, for efficacy, the order of the rows proposed for learning should be random.

Validation is not easy either, especially if you have to validate against unseen chunks, validate in a progressive way, or hold out some observations from every chunk. The latter is also the best way to reserve a sample for grid search or some other optimization.

In our example, we entrust SGDClassifier with a log loss (basically a logistic regression) to learn how to predict a binary outcome given 10**7 observations:

  In:
  from sklearn.linear_model import SGDClassifier
  from sklearn.preprocessing import MinMaxScaler
  import pandas as pd
  import numpy as np

  streaming = pd.read_csv('huge_dataset_10__7.csv', header=None, chunksize=10000)
  learner = SGDClassifier(loss='log')
  minmax_scaler = MinMaxScaler(feature_range=(0, 1))
  cumulative_accuracy = list()

  for n, chunk in enumerate(streaming):
      if n == 0:
          minmax_scaler.fit(chunk.ix[:, 1:].values)
      X = minmax_scaler.transform(chunk.ix[:, 1:].values)
      X[X > 1] = 1
      X[X < 0] = 0
      y = chunk.ix[:, 0]
      if n > 8:
          cumulative_accuracy.append(learner.score(X, y))
      learner.partial_fit(X, y, classes=np.unique(y))

  print 'Progressive validation mean accuracy %0.3f' % np.mean(cumulative_accuracy)

  Out:
  Progressive validation mean accuracy 0.660

First, pandas' read_csv allows us to iterate over the file by reading batches of 10,000 observations (the number can be increased or decreased according to your computing resources). We use the MinMaxScaler to record the range of each variable on the first batch.
For the following batches, we apply the rule that if a value exceeds one of the limits of [0,+1], it is trimmed to the nearest limit. Eventually, starting from the 10th batch, we record the accuracy of the learning algorithm on each newly received batch before using it to update the training. In the end, the accumulated accuracy scores are averaged, offering a global performance estimate.

Keeping up with velocity

There are various algorithms that work with incremental learning. For classification, we will recall the following:

- sklearn.naive_bayes.MultinomialNB
- sklearn.naive_bayes.BernoulliNB
- sklearn.linear_model.Perceptron
- sklearn.linear_model.SGDClassifier
- sklearn.linear_model.PassiveAggressiveClassifier

For regression, we will recall the following:

- sklearn.linear_model.SGDRegressor
- sklearn.linear_model.PassiveAggressiveRegressor

As for velocity, they are all comparable in speed. You can try the following script for yourself:

  In:
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.naive_bayes import BernoulliNB
  from sklearn.linear_model import Perceptron
  from sklearn.linear_model import SGDClassifier
  from sklearn.linear_model import PassiveAggressiveClassifier
  from sklearn.preprocessing import MinMaxScaler
  import pandas as pd
  import numpy as np
  from datetime import datetime

  classifiers = {
      'SGDClassifier hinge loss': SGDClassifier(loss='hinge', random_state=101),
      'SGDClassifier log loss': SGDClassifier(loss='log', random_state=101),
      'Perceptron': Perceptron(random_state=101),
      'BernoulliNB': BernoulliNB(),
      'PassiveAggressiveClassifier': PassiveAggressiveClassifier(random_state=101)
  }
  huge_dataset = 'huge_dataset_10__6.csv'

  for algorithm in classifiers:
      start = datetime.now()
      minmax_scaler = MinMaxScaler(feature_range=(0, 1))
      streaming = pd.read_csv(huge_dataset, header=None, chunksize=100)
      learner = classifiers[algorithm]
      cumulative_accuracy = list()
      for n, chunk in enumerate(streaming):
          y = chunk.ix[:, 0]
          X = chunk.ix[:, 1:]
          if n > 50:
              cumulative_accuracy.append(learner.score(X, y))
          learner.partial_fit(X, y, classes=np.unique(y))
      elapsed_time = datetime.now() - start
      print algorithm + ' : mean accuracy %0.3f in %s secs' % (
          np.mean(cumulative_accuracy), elapsed_time.total_seconds())

  Out:
  BernoulliNB : mean accuracy 0.734 in 41.101 secs
  Perceptron : mean accuracy 0.616 in 37.479 secs
  SGDClassifier hinge loss : mean accuracy 0.712 in 38.43 secs
  SGDClassifier log loss : mean accuracy 0.716 in 39.618 secs
  PassiveAggressiveClassifier : mean accuracy 0.625 in 40.622 secs

As a general note, remember that smaller batches are slower, since they imply more disk access from a database or a file, which is always a bottleneck.

Dealing with variety

Variety is another big data characteristic. This is especially true when we are dealing with textual data or very large categorical variables (for example, variables storing website names in programmatic advertising). As you learn from batches of examples and unfold categories or words, each one becomes an appropriate and exclusive variable. You may find it difficult to handle the challenge of variety and the unpredictability of large streams of data. The Scikit-learn package provides you with a simple and fast way to implement the hashing trick and completely forget about the problem of defining a rigid variable structure in advance. The hashing trick uses hash functions and sparse matrices in order to save your time, resources, and hassle. Hash functions map any input they receive in a deterministic way.
It doesn't matter whether you feed them numbers or strings; they will always return an integer number in a certain range. Sparse matrices, instead, are arrays that record only the values that are not zero, since their default value is zero for any combination of row and column. Therefore, the hashing trick bounds every possible input, even one never seen before, to a certain range or position in a corresponding input sparse matrix, whose cell is loaded with a nonzero value.

For instance, if your input is Python, a hashing command like abs(hash('Python')) can transform it into the integer number 539294296 and then assign the value of 1 to the cell at column index 539294296. The hash function is a very fast and convenient way to always obtain the same column index given the same input. Using only absolute values ensures that each index corresponds to a column in our array (negative indexes just start from the last column; hence in Python, each column of an array can be addressed by both a positive and a negative number).

The example that follows uses the HashingVectorizer class, a convenient class that automatically takes documents, separates the words, and transforms them, thanks to the hashing trick, into an input matrix. The script aims to learn why posts are published in 20 distinct newsgroups on the basis of the words used in the existing posts of those newsgroups:

  In:
  import pandas as pd
  import numpy as np
  from sklearn.linear_model import SGDClassifier
  from sklearn.feature_extraction.text import HashingVectorizer

  def streaming():
      for response, item in zip(newsgroups_dataset.target, newsgroups_dataset.data):
          yield response, item

  hashing_trick = HashingVectorizer(stop_words='english', norm='l2', non_negative=True)
  learner = SGDClassifier(random_state=101)
  texts = list()
  targets = list()

  for n, (target, text) in enumerate(streaming()):
      texts.append(text)
      targets.append(target)
      if n % 1000 == 0 and n > 0:
          learning_chunk = hashing_trick.transform(texts)
          if n > 1000:
              last_validation_score = learner.score(learning_chunk, targets)
          learner.partial_fit(learning_chunk, targets, classes=[k for k in range(20)])
          texts, targets = list(), list()

  print 'Last validation score: %0.3f' % last_validation_score

  Out:
  Last validation score: 0.710

At this point, no matter what text you input, the predictive algorithm will always answer by pointing out a class. In our case, it points out a newsgroup suitable for the post to appear in. Let's try out this algorithm with a text taken from a classified ad:

  In:
  New_text = ['A 2014 red Toyota Prius v Five with fewer than 14K miles. Powered by a reliable 1.8L four cylinder hybrid engine that averages 44mpg in the city and 40mpg on the highway.']
  text_vector = hashing_trick.transform(New_text)
  print np.shape(text_vector), type(text_vector)
  print 'Predicted newsgroup: %s' % newsgroups_dataset.target_names[learner.predict(text_vector)[0]]

  Out:
  (1, 1048576) <class 'scipy.sparse.csr.csr_matrix'>
  Predicted newsgroup: rec.autos

Naturally, you may change the New_text variable and discover where your text would most likely be displayed in a newsgroup. Note that the HashingVectorizer class has transformed the text into a csr_matrix (which is quite an efficient sparse matrix) with about 1 million columns.
A quick overview of Stochastic Gradient Descent (SGD)

We will close this article on big data with a quick overview of the SGD family, comprising SGDClassifier (for classification) and SGDRegressor (for regression). Like other classifiers, they can be fit by using the .fit() method (passing the in-memory dataset row by row to the learning algorithm) or the previously seen .partial_fit() method based on batches. In the latter case, if you are classifying, you have to declare the predicted classes with the classes parameter. It accepts a list containing all the class codes that it should expect to meet during the training phase.

SGDClassifier behaves like a logistic regression when the loss parameter is set to log. It turns into a linear SVC if the loss is set to hinge. It can also take the form of other loss functions, or even of loss functions intended for regression. SGDRegressor mimics a linear regression using the squared_loss loss parameter. The huber loss, instead, transforms the squared loss into a linear loss beyond a certain distance epsilon (another parameter to be fixed). It can also act as a linear SVR using the epsilon_insensitive loss function, or the slightly different squared_epsilon_insensitive (which penalizes outliers more). As in other situations in machine learning, the performance of the different loss functions on your data science problem cannot be estimated a priori. Anyway, please take into account that if you are doing classification and you need an estimation of class probabilities, your choice is limited to log or modified_huber only.

Key parameters that require tuning for this algorithm to work best with your data are:

- n_iter: The number of iterations over the data; the more the passes, the better the optimization of the algorithm. However, there is a higher risk of overfitting. Empirically, SGD tends to converge to a stable solution after having seen 10**6 examples. Given your examples, set your number of iterations accordingly.
- penalty: You have to choose l1, l2, or elasticnet, which are all different regularization strategies, in order to avoid overfitting because of overparametrization (using too many unnecessary parameters leads to memorizing observations rather than learning patterns). Briefly, l1 tends to reduce unhelpful coefficients to zero, l2 just attenuates them, and elasticnet is a mix of the l1 and l2 strategies.
- alpha: This is a multiplier of the regularization term; the higher the alpha, the stronger the regularization. We advise you to find the best alpha value by performing a grid search ranging from 10**-7 to 10**-1 (see the sketch after this list).
- l1_ratio: The l1 ratio is used for the elastic net penalty.
- learning_rate: This sets how much the coefficients are affected by every single example. Usually, it is 'optimal' for classifiers and 'invscaling' for regression. If you want to use invscaling for classification, you'll have to set eta0 and power_t (invscaling = eta0 / (t**power_t)). With invscaling, you can start with a lower learning rate, which is less than the optimal rate, though it will decrease more slowly.
- epsilon: This should be used if your loss is huber, epsilon_insensitive, or squared_epsilon_insensitive.
- shuffle: If this is True, the algorithm will shuffle the order of the training data in order to improve the generalization of the learning.
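The following is a minimal sketch of such an alpha grid search on the smallest synthetic dataset created earlier; the parameter grid, the 3-fold cross-validation, and the use of the in-memory 100,000-row file are our own illustrative choices rather than the book's code (and on recent scikit-learn versions, GridSearchCV lives in sklearn.model_selection and n_iter has been renamed max_iter):

  In:
  import numpy as np
  import pandas as pd
  from sklearn.linear_model import SGDClassifier
  from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

  # load the smallest synthetic dataset entirely in memory for the search
  D = pd.read_csv('huge_dataset_10__5.csv', header=None).values
  y, X = D[:, 0], D[:, 1:]

  # explore alpha from 10**-7 to 10**-1, as suggested above
  param_grid = {'alpha': 10.0 ** -np.arange(1, 8)}
  search = GridSearchCV(
      SGDClassifier(loss='log', shuffle=True, random_state=101),
      param_grid, cv=3)
  search.fit(X, y)
  print 'Best alpha: %s' % search.best_params_['alpha']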
Summary

In this article, we introduced the essentials of machine learning for big data. We discussed creating some big data examples, scalability with volume, keeping up with velocity, dealing with variety, and a more advanced technique, SGD.
Algorithmic Trading

Packt
29 Apr 2015
16 min read
In this article by James Ma Weiming, author of the book Mastering Python for Finance, we will see how algorithmic trading automates the systematic trading process, where orders are executed at the best price possible based on a variety of factors, such as pricing, timing, and volume. Some brokerage firms may offer an application programming interface (API) as part of their service offering to customers who wish to deploy their own trading algorithms. An algorithmic trading system must be highly robust and handle any point of failure during order execution. Network configuration, hardware, memory management and speed, and user experience are some factors to be considered when designing a system for executing orders. Designing larger systems inevitably adds complexity to the framework.

As soon as a position in a market is opened, it is subject to various types of risk, such as market risk. To preserve the trading capital as much as possible, it is important to incorporate risk management measures into the trading system. Perhaps the most common risk measure used in the financial industry is the value-at-risk (VaR) technique. We will discuss the beauty and the flaws of VaR, and how it can be incorporated into the trading system that we will develop in this article.

In this article, we will cover the following topics:

- An overview of algorithmic trading
- A list of brokers and system vendors with a public API
- Choosing a programming language for a trading system
- Setting up API access on the Interactive Brokers (IB) trading platform
- Using the IbPy module to interact with the IB Trader WorkStation (TWS)

Introduction to algorithmic trading

In the 1990s, exchanges had already begun to use electronic trading systems. By 1997, 44 exchanges worldwide used automated systems for trading futures and options, with more exchanges in the process of developing automated technology. Exchanges such as the Chicago Board of Trade (CBOT) and the London International Financial Futures and Options Exchange (LIFFE) used their electronic trading systems as an after-hours complement to traditional open outcry trading in the pits, giving traders 24-hour access to the exchange's risk management tools. With improvements in technology, technology-based trading became less expensive, fueling the growth of trading platforms that are faster and more powerful. Higher reliability of order execution and lower rates of message transmission error deepened financial institutions' reliance on technology. The majority of asset managers, proprietary traders, and market makers have since moved from the trading pits to electronic trading floors.

As systematic or computerized trading became more commonplace, speed became the most important factor in determining the outcome of a trade. Quants utilizing sophisticated fundamental models are able to recompute fair values of trading products on the fly and execute trading decisions, enabling them to reap profits at the expense of fundamental traders using traditional tools. This gave rise to the term high-frequency trading (HFT), which relies on fast computers to execute trading decisions before anyone else can. HFT has evolved into a billion-dollar industry.

Algorithmic trading refers to the automation of the systematic trading process, where order execution is heavily optimized to give the best price possible. It is not part of the portfolio allocation process.
Banks, hedge funds, brokerage firms, clearing firms, and trading firms typically have their servers placed right next to the electronic exchange to receive the latest market prices and to perform the fastest order execution possible. They bring enormous trading volumes to the exchange. Anyone who wishes to participate in low-latency, high-volume trading activities, such as complex event processing or capturing fleeting price discrepancies, by acquiring exchange connectivity may do so in the form of co-location, where his or her server hardware can be placed on a rack right next to the exchange for a fee. The Financial Information Exchange (FIX) protocol is the industry standard for electronic communication with the exchange from a private server for direct market access (DMA) to real-time information. C++ is the common choice of programming language for trading over the FIX protocol, though other languages, such as the .NET Framework common languages and Java, can be used. Before creating an algorithmic trading platform, you would need to assess various factors, such as speed and ease of learning, before deciding on a specific language for the purpose.

Brokerage firms provide a trading platform of some sort to their customers so that they can execute orders on selected exchanges, in return for commission fees. Some brokerage firms may offer an API as part of their service offering to technically inclined customers who wish to run their own trading algorithms. In most circumstances, customers may also choose from a number of commercial trading platforms offered by third-party vendors. Some of these trading platforms may also offer API access to route orders electronically to the exchange. It is important to read the API documentation beforehand to understand the technical capabilities offered by your broker and to formulate an approach for developing an algorithmic trading system.

List of trading platforms with a public API

The following list shows some brokers and trading platform vendors who have their API documentation publicly available, together with the programming languages they support:

- Interactive Brokers (https://www.interactivebrokers.com/en/index.php?f=1325): C++, Posix C++, Java, and Visual Basic for ActiveX
- E*Trade (https://developer.etrade.com): Java, PHP, and C++
- IG (http://labs.ig.com/): REST, Java, FIX, and Microsoft .NET Framework 4.0
- Tradier (https://developer.tradier.com): Java, Perl, Python, and Ruby
- TradeKing (https://developers.tradeking.com): Java, Node.js, PHP, R, and Ruby
- Cunningham Trading Systems (http://www.ctsfutures.com/wiki/T4%20API%2040.MainPage.ashx): Microsoft .NET Framework 4.0
- CQG (http://cqg.com/Products/CQG-API.aspx): C#, C++, Excel, MATLAB, and VB.NET
- Trading Technologies (https://developer.tradingtechnologies.com): Microsoft .NET Framework 4.0
- OANDA (http://developer.oanda.com): REST, Java, FIX, and MT4

Which is the best programming language to use?

With many choices of programming languages available to interface with brokers or vendors, the question that comes naturally to anyone starting out in algorithmic trading platform development is: which language should I use? The short answer is that there is no single best programming language. How your product will be developed, the performance metrics to follow, the costs involved, the latency threshold, the risk measures, and the expected user interface are all pieces of the puzzle to be taken into consideration.
The risk manager, execution engine, and portfolio optimizer are some of the major components that will affect the design of your system. Your existing trading infrastructure, choice of operating system, programming language compiler capability, and available software tools pose further constraints on the system's design, development, and deployment.

System functionalities

It is important to define the outcomes of your trading system. One outcome could be a research-based system that is more concerned with obtaining high-quality data from data vendors, performing computations or running models, and evaluating a strategy through signal generation. Part of the research component might include a data-cleaning module or a backtesting interface to run a strategy with theoretical parameters over historical data. CPU speed, memory size, and bandwidth are factors to be considered while designing such a system.

Another outcome could be an execution-based system that is more concerned with risk management and order handling features to ensure the timely execution of multiple orders. The system must be highly robust and handle any point of failure during order execution. As such, network configuration, hardware, memory management and speed, and user experience are some factors to be considered when designing a system for executing orders. A system may contain one or more of these functionalities. Designing larger systems inevitably adds complexity to the framework. It is recommended that you choose one or more programming languages that can address and balance the development speed, ease of development, scalability, and reliability of your trading system.

Algorithmic trading with Interactive Brokers and IbPy

In this section, we will build a working algorithmic trading platform that will authenticate with Interactive Brokers (IB), log in, retrieve market data, and send orders. IB is one of the most popular brokers in the trading community and has a long history of API development, and there are plenty of articles on the use of its API available on the Web. IB serves clients ranging from hedge funds to retail traders. Although the API does not support Python directly, Python wrappers such as IbPy are available to make the API calls to the IB interface. The IB API is unique to its own implementation, and every broker has its own API handling methods. Nevertheless, the documents and sample applications provided by your broker demonstrate the core functionality of every API interface, which can be easily integrated into an algorithmic trading system if designed properly.

Getting Interactive Brokers' Trader WorkStation

The official page for IB is https://www.interactivebrokers.com. Here, you can find a wealth of information regarding trading and investing for retail and institutional traders. In this section, we will take a look at how to get the Trader WorkStation X (TWS) installed and running on your local workstation before setting up an algorithmic trading system using Python. Note that we will perform simulated trading on a demonstration account. If your trading strategy turns out to be profitable, head to the OPEN AN ACCOUNT section of the IB website to open a live trading account. Rules, regulations, market data fees, exchange fees, commissions, and other conditions depend on the broker of your choice. In addition, market conditions are vastly different from the simulated environment.
You are encouraged to perform extensive testing on your algorithmic trading system before running it on live markets. The following key steps describe how to install TWS on your local workstation, log in to the demonstration account, and set it up for API use:

1. From IB's official website, navigate to TRADING, and then select Standalone TWS. Choose the installation executable that is suitable for your local workstation. TWS runs on Java; therefore, ensure that the Java runtime plugin is already installed on your local workstation.
2. When prompted during the installation process, choose the Trader_WorkStation_X and IB Gateway options. The Trader WorkStation X (TWS) is the trading platform with full order management functionality. The IB Gateway program accepts and processes the API connections without any of the order management features of TWS. We will not cover the use of the IB Gateway, but you may find it useful later.
3. Select the destination directory on your local workstation where TWS will place all the required files.
4. When the installation is completed, a TWS shortcut icon will appear together with your list of installed applications. Double-click on the icon to start the TWS program.
5. When TWS starts, you will be prompted to enter your login credentials. To log in to the demonstration account, type edemo in the username field and demouser in the password field.
6. Once we have managed to load our demo account on TWS, we can set up its API functionality. On the toolbar, click on Configure.
7. Under the Configuration tree, open the API node to reveal further options and select Settings. Note that the Socket port is 7496, and we add the IP address of the workstation housing our algorithmic trading system to the list of trusted IP addresses, which in this case is 127.0.0.1. Ensure that the Enable ActiveX and Socket Clients option is selected to allow socket connections to TWS.
8. Click on OK to save all the changes. TWS is now ready to accept orders and market data requests from our algorithmic trading system.

Getting IbPy – the IB API wrapper

IbPy is an add-on module for Python that wraps the IB API. It is open source and can be found at https://github.com/blampe/IbPy. Head to this URL and download the source files. Unzip the source folder, use the Terminal to navigate to this directory, and type python setup.py install to install IbPy as part of the Python runtime environment. The use of IbPy is similar to the API calls documented on the IB website. The documentation for IbPy is at https://code.google.com/p/ibpy/w/list.

A simple order routing mechanism

In this section, we will start interacting with TWS using Python by establishing a connection and sending out a market order to the exchange. Once IbPy is installed, import the following necessary modules into our Python script:

  from ib.ext.Contract import Contract
  from ib.ext.Order import Order
  from ib.opt import Connection

Next, implement the logging functions to handle calls from the server. The error_handler method is invoked whenever the API encounters an error, which is accompanied by a message. The server_handler method is dedicated to handling all the other forms of returned API messages. The msg variable is an ib.opt.message object and references the method calls defined by the IB API EWrapper methods. The API documentation can be accessed at https://www.interactivebrokers.com/en/software/api/api.htm.
The following is the Python code for the error_handler and server_handler methods:

  def error_handler(msg):
      print "Server Error:", msg

  def server_handler(msg):
      print "Server Msg:", msg.typeName, "-", msg

We will place a sample order for the stock AAPL. The contract specifications of the order are defined by the Contract class object found in the ib.ext.Contract module. We will create a method called create_contract that returns a new instance of this object:

  def create_contract(symbol, sec_type, exch, prim_exch, curr):
      contract = Contract()
      contract.m_symbol = symbol
      contract.m_secType = sec_type
      contract.m_exchange = exch
      contract.m_primaryExch = prim_exch
      contract.m_currency = curr
      return contract

The Order class object is used to place an order with TWS. Let's define a method called create_order that will return a new instance of the object:

  def create_order(order_type, quantity, action):
      order = Order()
      order.m_orderType = order_type
      order.m_totalQuantity = quantity
      order.m_action = action
      return order

After the required methods are created, we can begin to script the main functionality. Let's initialize the required variables:

  if __name__ == "__main__":
      client_id = 100
      order_id = 1
      port = 7496
      tws_conn = None

Note that the client_id variable is our assigned integer that identifies the instance of the client communicating with TWS. The order_id variable is our assigned integer that identifies the order queue number sent to TWS; each new order requires this value to be incremented sequentially. The port number has the same value as defined in the API settings of TWS earlier. The tws_conn variable holds the connection to TWS; let's initialize it with an empty value for now.

Let's use a try block that encapsulates the Connection.create method to handle the socket connections to TWS in a graceful manner:

  try:
      # Establish connection to TWS.
      tws_conn = Connection.create(port=port, clientId=client_id)
      tws_conn.connect()

      # Assign error handling function.
      tws_conn.register(error_handler, 'Error')

      # Assign server messages handling function.
      tws_conn.registerAll(server_handler)
  finally:
      # Disconnect from TWS
      if tws_conn is not None:
          tws_conn.disconnect()

The port and clientId parameter fields define this connection. After the connection instance is created, the connect method will try to connect to TWS. When the connection to TWS has successfully opened, it is time to register listeners to receive notifications from the server. The register method associates a function handler with a particular event, while the registerAll method associates a handler with all the messages generated. This is where the error_handler and server_handler methods declared earlier are used.

Before sending our very first order of 100 shares of AAPL to the exchange, we will call the create_contract method to create a new contract object for AAPL. Then, we will call the create_order method to create a new Order object to go long 100 shares. Finally, we will call the placeOrder method of the Connection class to send this order to TWS:

  # Create a contract for AAPL stock using SMART order routing.
  aapl_contract = create_contract('AAPL', 'STK', 'SMART', 'SMART', 'USD')

  # Go long 100 shares of AAPL.
  aapl_order = create_order('MKT', 100, 'BUY')

  # Place order on IB TWS.
  tws_conn.placeOrder(order_id, aapl_contract, aapl_order)

That's it! Let's run our Python script.
We should get output similar to the following:

  Server Error: <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Response: error, <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Version: 75
  TWS Time at connection:20141210 23:14:17 CST
  Server Msg: managedAccounts - <managedAccounts accountsList=DU15200>
  Server Msg: nextValidId - <nextValidId orderId=1>
  Server Error: <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Msg: error - <error id=-1, errorCode=2104, errorMsg=Market data farm connection is OK:ibdemo>
  Server Error: <error id=-1, errorCode=2107, errorMsg=HMDS data farm connection is inactive but should be available upon demand.demohmds>
  Server Msg: error - <error id=-1, errorCode=2107, errorMsg=HMDS data farm connection is inactive but should be available upon demand.demohmds>

Basically, what these messages say is that there are no errors and the connections are OK. Should the simulated order be executed successfully during market trading hours, the trade will be reflected in TWS.

The full source code of our implementation is given as follows:

  """ A Simple Order Routing Mechanism """
  from ib.ext.Contract import Contract
  from ib.ext.Order import Order
  from ib.opt import Connection

  def error_handler(msg):
      print "Server Error:", msg

  def server_handler(msg):
      print "Server Msg:", msg.typeName, "-", msg

  def create_contract(symbol, sec_type, exch, prim_exch, curr):
      contract = Contract()
      contract.m_symbol = symbol
      contract.m_secType = sec_type
      contract.m_exchange = exch
      contract.m_primaryExch = prim_exch
      contract.m_currency = curr
      return contract

  def create_order(order_type, quantity, action):
      order = Order()
      order.m_orderType = order_type
      order.m_totalQuantity = quantity
      order.m_action = action
      return order

  if __name__ == "__main__":
      client_id = 1
      order_id = 119
      port = 7496
      tws_conn = None
      try:
          # Establish connection to TWS.
          tws_conn = Connection.create(port=port, clientId=client_id)
          tws_conn.connect()

          # Assign error handling function.
          tws_conn.register(error_handler, 'Error')

          # Assign server messages handling function.
          tws_conn.registerAll(server_handler)

          # Create AAPL contract and send order.
          aapl_contract = create_contract('AAPL', 'STK', 'SMART', 'SMART', 'USD')

          # Go long 100 shares of AAPL.
          aapl_order = create_order('MKT', 100, 'BUY')

          # Place order on IB TWS.
          tws_conn.placeOrder(order_id, aapl_contract, aapl_order)
      finally:
          # Disconnect from TWS.
          if tws_conn is not None:
              tws_conn.disconnect()

Summary

In this article, we were introduced to the evolution of trading from the pits to the electronic trading platform, and learned how algorithmic trading came about. We looked at some brokers offering API access to their trading services. To help us get started on our journey in developing an algorithmic trading system, we used the TWS of IB and the IbPy Python module. In our first trading program, we successfully sent an order to our broker through the TWS API using a demonstration account.
Apache Solr and Big Data – integration with MongoDB

Packt
27 Apr 2015
9 min read
In this article by Hrishikesh Vijay Karambelkar, author of the book Scaling Big Data with Hadoop and Solr - Second Edition, we will look at Apache Solr and MongoDB together. In an enterprise, data is generated by all the software that participates in day-to-day operations. This data has different formats, and bringing it in for big data processing requires a storage system that is flexible enough to accommodate data with varying data models. A NoSQL database, by its design, is best suited for this kind of storage requirement. One of the primary objectives of NoSQL is horizontal scaling, that is, the P in the CAP theorem, but this comes at the cost of sacrificing Consistency or Availability. Visit http://en.wikipedia.org/wiki/CAP_theorem to understand more about the CAP theorem.

What is NoSQL and how is it related to Big Data?

As we have seen, data models for NoSQL differ completely from those of a relational database. With the flexible data model, it becomes very easy for developers to quickly integrate with the NoSQL database and bring in large volumes of data from different data sources. This makes the NoSQL database ideal for big data storage, since big data demands that different data types be brought together under one umbrella. NoSQL also offers different data models, such as the key-value (KV) store, the document store, and Big Table storage.

In addition to a flexible schema, NoSQL offers scalability and high performance, which is again one of the most important factors to be considered while working with big data. NoSQL was designed to be a distributed type of database. While traditional relational stores rely on the high computing power of CPUs and large memory on a centralized system, NoSQL can run on low-cost, commodity hardware. These servers can be added to or removed from the cluster running NoSQL dynamically, making the NoSQL database easier to scale. NoSQL enables the most advanced features of a database, such as data partitioning, index sharding, distributed queries, caching, and so on.

Although NoSQL offers optimized storage for big data, it may not be able to replace the relational database. Relational databases offer transactions (ACID), high CRUD throughput, data integrity, and a structured database design approach, which are required in many applications and which NoSQL may not support. Hence, NoSQL is most suited for big data, where there is less need for the data to be transactional.

MongoDB at a glance

MongoDB is one of the popular NoSQL databases (just like Cassandra). MongoDB supports storing arbitrary schemas in its own document-oriented storage and uses a JSON-based information pipe for any communication with the server. This database is designed to work with heavy data. Today, many organizations are focusing on utilizing MongoDB for various enterprise applications.

MongoDB provides high availability and load balancing. Each data unit is replicated, and the combination of a data unit with its copies is called a replica set. Replicas in MongoDB can be either primary or secondary. The primary is the active replica, which is used for direct read-write operations, while a secondary replica works like a backup for the primary. MongoDB supports searches by field, range queries, and regular expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions. Any field in a MongoDB document can be indexed. More information about MongoDB can be read at https://www.mongodb.org/.
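To make these query capabilities concrete, here is a minimal mongo shell sketch; the logs collection and its fields are hypothetical and used only for illustration, not taken from the article:

  // query by field
  db.logs.find({ state: "AL" })

  // range query on a numeric field
  db.logs.find({ visitors: { $gt: 50 } })

  // regular expression search on a string field
  db.logs.find({ city: /^AC/ })

  // return only specific fields (projection)
  db.logs.find({ state: "AL" }, { city: 1, visitors: 1 })

  // index any field to speed up these queries (MongoDB 2.x syntax)
  db.logs.ensureIndex({ city: 1 })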
The data in MongoDB is eventually consistent. Apache Solr can be used with MongoDB to enable database searching capabilities on a MongoDB-based data store. Unlike Cassandra, where the Solr indexes are stored directly in Cassandra through Solandra, MongoDB's integration with Solr brings the indexes into Solr's own optimized storage. There are various ways in which the data residing in MongoDB can be analyzed and searched.

MongoDB's replication works by recording all operations made on a database in a log file, called the oplog (operation log). Mongo's oplog keeps a rolling record of all operations that modify the data stored in your databases. Many implementers suggest reading this log file using a standard file I/O program and pushing the data directly to Apache Solr using cURL or SolrJ. Since the oplog is a capped collection with an upper limit on storage, it is feasible to sync such queries with Apache Solr. The oplog also provides tailable cursors on the database. These cursors can provide a natural order for the documents loaded into MongoDB, thereby preserving their order. However, we are going to look at a different approach.

In this approach, MongoDB is exposed as a database to Apache Solr through a custom database driver. Apache Solr reads MongoDB data through the DataImportHandler, which in turn calls the JDBC-based MongoDB driver to connect to MongoDB and run the data import utilities. Since MongoDB supports replica sets, it manages the distribution of data across nodes. It also supports sharding, just like Apache Solr.

Installing MongoDB

To install MongoDB in your development environment, follow these steps:

1. Download the latest version of MongoDB from https://www.mongodb.org/downloads for your operating system.
2. Unzip the zipped folder. MongoDB comes with a default set of command-line components and utilities:
   - bin/mongod: The database process
   - bin/mongos: The sharding controller
   - bin/mongo: The database shell (uses interactive JavaScript)
3. Now, create a directory for MongoDB, which it will use for user data creation and management, and run the following command to start the single-node server:

     $ bin/mongod --dbpath <path to your data directory> --rest

   In this case, the --rest parameter enables support for simple REST APIs that can be used for getting the status. Once the server is started, access http://localhost:28017 from your favorite browser, and you should see the administration status page.
4. Now that you have successfully installed MongoDB, try loading a sample dataset from the book by opening a new command-line interface, changing the directory to $MONGODB_HOME, and running the following command. Please note that the database name is solr-test:

     $ bin/mongoimport --db solr-test --collection zips --file "<file-dir>/samples/zips.json"

5. You can see the stored data using the MongoDB-based CLI by running the following set of commands from your shell:

     $ bin/mongo
     MongoDB shell version: 2.4.9
     connecting to: test
     Welcome to the MongoDB shell.
     For interactive help, type "help".
     For more comprehensive documentation, see
            http://docs.mongodb.org/
     Questions? Try the support group
            http://groups.google.com/group/mongodb-user
     > use test
     Switched to db test
     > show dbs
     exampledb       0.203125GB
     local   0.078125GB
     test   0.203125GB
     > db.zips.find({city:"ACMAR"})
     { "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }

Congratulations! MongoDB is installed successfully.

Creating Solr indexes from MongoDB

To use MongoDB as a database with Solr, you will need a JDBC driver built for MongoDB. However, the Mongo-JDBC driver has certain limitations, and it does not work with the Apache Solr DataImportHandler. So, I have extended Mongo-JDBC to work with the Solr-based DataImportHandler. The project repository is available at https://github.com/hrishik/solr-mongodb-dih. Let's look at the setup procedure for enabling MongoDB-based Solr integration:

1. You may not require the complete package from the solr-mongodb-dih repository, but just the jar file. This can be downloaded from https://github.com/hrishik/solr-mongodb-dih/tree/master/sample-jar. You will also need the following additional jar files, which are available for download with the book Scaling Big Data with Hadoop and Solr, Second Edition:
   - jsqlparser.jar
   - mongo.jar
2. In your Solr setup, copy these jar files into the library path, that is, the $SOLR_WAR_LOCATION/WEB-INF/lib folder. Alternatively, point your container classpath variable to link them up.
3. Using the simple Java source code DataLoad.java (https://github.com/hrishik/solr-mongodb-dih/blob/master/examples/DataLoad.java), populate the database with a sample schema and tables that you will use to load into Apache Solr.
4. Now create a data source file (data-source-config.xml) as follows:

     <dataConfig>
       <dataSource name="mongod" type="JdbcDataSource"
           driver="com.mongodb.jdbc.MongoDriver"
           url="mongodb://localhost/solr-test"/>
       <document>
         <entity name="nameage" dataSource="mongod"
             query="select name, price from grocery">
           <field column="name" name="name"/>
           <field column="name" name="id"/>
           <!-- other fields -->
         </entity>
       </document>
     </dataConfig>

5. Copy solr-dataimporthandler-*.jar from your contrib directory to the container/application library path.
6. Modify $SOLR_COLLECTION_ROOT/conf/solr-config.xml with the DIH entry:

     <!-- DIH starts -->
     <requestHandler name="/dataimport"
         class="org.apache.solr.handler.dataimport.DataImportHandler">
       <lst name="defaults">
         <str name="config"><path to config>/data-source-config.xml</str>
       </lst>
     </requestHandler>
     <!-- DIH ends -->

Once this configuration is done, you are ready to test it. Access http://localhost:8983/solr/dataimport?command=full-import from your browser to run the full import on Apache Solr. You will see that your import handler has run successfully and has loaded the data into the Solr store. You can validate the content created by your new MongoDB DIH by accessing the Solr Admin page and running a query.

Using this connector, you can perform full-import operations on various data elements. Since MongoDB is not a relational database, it does not support join queries. However, it supports selects, order by, and so on.

Summary

In this article, we looked at Apache Solr and MongoDB together and covered the distributed aspects of an enterprise search.
Integrating a D3.js visualization into a simple AngularJS application

Packt
27 Apr 2015
19 min read
In this article by Christoph Körner, author of the book Data Visualization with D3 and AngularJS, we will apply the acquired knowledge to integrate a D3.js visualization into a simple AngularJS application. First, we will set up an AngularJS template that serves as a boilerplate for the examples and the application. We will see a typical directory structure for an AngularJS project and initialize a controller. Similar to the previous example, the controller will generate random data that we want to display in an autoupdating chart. Next, we will wrap D3.js in a factory and create a directive for the visualization. You will learn how to isolate the components from each other. We will create a simple AngularJS directive and write a custom compile function to create and update the chart. (For more resources related to this topic, see here.) Setting up an AngularJS application To get started with this article, I assume that you feel comfortable with the main concepts of AngularJS: the application structure, controllers, directives, services, dependency injection, and scopes. I will use these concepts without introducing them in great detail, so if you do not know about one of these topics, first try an intermediate AngularJS tutorial. Organizing the directory To begin with, we will create a simple AngularJS boilerplate for the examples and the visualization application. We will use this boilerplate during the development of the sample application. Let's create a project root directory that contains the following files and folders: bower_components/: This directory contains all third-party components src/: This directory contains all source files src/app.js: This file contains source of the application src/app.css: CSS layout of the application test/: This directory contains all test files (test/config/ contains all test configurations, test/spec/ contains all unit tests, and test/e2e/ contains all integration tests) index.html: This is the starting point of the application Installing AngularJS In this article, we use the AngularJS version 1.3.14, but different patch versions (~1.3.0) should also work fine with the examples. Let's first install AngularJS with the Bower package manager. Therefore, we execute the following command in the root directory of the project: bower install angular#1.3.14 Now, AngularJS is downloaded and installed to the bower_components/ directory. If you don't want to use Bower, you can also simply download the source files from the AngularJS website and put them in a libs/ directory. Note that—if you develop large AngularJS applications—you most likely want to create a separate bower.json file and keep track of all your third-party dependencies. Bootstrapping the index file We can move on to the next step and code the index.html file that serves as a starting point for the application and all examples of this section. We need to include the JavaScript application files and the corresponding CSS layouts, the same for the chart component. Then, we need to initialize AngularJS by placing an ng-app attribute to the html tag; this will create the root scope of the application. 
Here, we will call the AngularJS application myApp, as shown in the following code: <html ng-app="myApp"> <head>    <!-- Include 3rd party libraries -->    <script src="bower_components/d3/d3.js" charset="UTF-   8"></script>    <script src="bower_components/angular/angular.js"     charset="UTF-8"></script>      <!-- Include the application files -->    <script src="src/app.js"></script>    <link href="src/app.css" rel="stylesheet">      <!-- Include the files of the chart component -->    <script src="src/chart.js"></script>    <link href="src/chart.css" rel="stylesheet">   </head> <body>    <!-- AngularJS example go here --> </body> </html> For all the examples in this section, I will use the exact same setup as the preceding code. I will only change the body of the HTML page or the JavaScript or CSS sources of the application. I will indicate to which file the code belongs to with a comment for each code snippet. If you are not using Bower and previously downloaded D3.js and AngularJS in a libs/ directory, refer to this directory when including the JavaScript files. Adding a module and a controller Next, we initialize the AngularJS module in the app.js file and create a main controller for the application. The controller should create random data (that represent some simple logs) in a fixed interval. Let's generate some random number of visitors every second and store all data points on the scope as follows: /* src/app.js */ // Application Module angular.module('myApp', [])   // Main application controller .controller('MainCtrl', ['$scope', '$interval', function ($scope, $interval) {      var time = new Date('2014-01-01 00:00:00 +0100');      // Random data point generator    var randPoint = function() {      var rand = Math.random;      return { time: time.toString(), visitors: rand()*100 };    }      // We store a list of logs    $scope.logs = [ randPoint() ];      $interval(function() {     time.setSeconds(time.getSeconds() + 1);      $scope.logs.push(randPoint());    }, 1000); }]); In the preceding example, we define an array of logs on the scope that we initialize with a random point. Every second, we will push a new random point to the logs. The points contain a number of visitors and a timestamp—starting with the date 2014-01-01 00:00:00 (timezone GMT+01) and counting up a second on each iteration. I want to keep it simple for now; therefore, we will use just a very basic example of random access log entries. Consider to use the cleaner controller as syntax for larger AngularJS applications because it makes the scopes in HTML templates explicit! However, for compatibility reasons, I will use the standard controller and $scope notation. Integrating D3.js into AngularJS We bootstrapped a simple AngularJS application in the previous section. Now, the goal is to integrate a D3.js component seamlessly into an AngularJS application—in an Angular way. This means that we have to design the AngularJS application and the visualization component such that the modules are fully encapsulated and reusable. In order to do so, we will use a separation on different levels: Code of different components goes into different files Code of the visualization library goes into a separate module Inside a module, we divide logics into controllers, services, and directives Using this clear separation allows you to keep files and modules organized and clean. If at anytime we want to replace the D3.js backend with a canvas pixel graphic, we can just implement it without interfering with the main application. 
This means that we want to use a new module of the visualization component and dependency injection. These modules enable us to have full control of the separate visualization component without touching the main application and they will make the component maintainable, reusable, and testable. Organizing the directory First, we add the new files for the visualization component to the project: src/: This is the default directory to store all the file components for the project src/chart.js: This is the JS source of the chart component src/chart.css: This is the CSS layout for the chart component test/test/config/: This directory contains all test configurations test/spec/test/spec/chart.spec.js: This file contains the unit tests of the chart component test/e2e/chart.e2e.js: This file contains the integration tests of the chart component If you develop large AngularJS applications, this is probably not the folder structure that you are aiming for. Especially in bigger applications, you will most likely want to have components in separate folders and directives and services in separate files. Then, we will encapsulate the visualization from the main application and create the new myChart module for it. This will make it possible to inject the visualization component or parts of it—for example just the chart directive—to the main application. Wrapping D3.js In this module, we will wrap D3.js—which is available via the global d3 variable—in a service; actually, we will use a factory to just return the reference to the d3 variable. This enables us to pass D3.js as a dependency inside the newly created module wherever we need it. The advantage of doing so is that the injectable d3 component—or some parts of it—can be mocked for testing easily. Let's assume we are loading data from a remote resource and do not want to wait for the time to load the resource every time we test the component. Then, the fact that we can mock and override functions without having to modify anything within the component will become very handy. Another great advantage will be defining custom localization configurations directly in the factory. This will guarantee that we have the proper localization wherever we use D3.js in the component. Moreover, in every component, we use the injected d3 variable in a private scope of a function and not in the global scope. This is absolutely necessary for clean and encapsulated components; we should never use any variables from global scope within an AngularJS component. Now, let's create a second module that stores all the visualization-specific code dependent on D3.js. Thus, we want to create an injectable factory for D3.js, as shown in the following code: /* src/chart.js */ // Chart Module   angular.module('myChart', [])   // D3 Factory .factory('d3', function() {   /* We could declare locals or other D3.js      specific configurations here. */   return d3; }); In the preceding example, we returned d3 without modifying it from the global scope. We can also define custom D3.js specific configurations here (such as locals and formatters). We can go one step further and load the complete D3.js code inside this factory so that d3 will not be available in the global scope at all. However, we don't use this approach here to keep things as simple and understandable as possible. We need to make this module or parts of it available to the main application. 
In AngularJS, we can do this by injecting the myChart module into the myApp application as follows: /* src/app.js */   angular.module('myApp', ['myChart']); Usually, we will just inject the directives and services of the visualization module that we want to use in the application, not the whole module. However, for the start and to access all parts of the visualization, we will leave it like this. We can use the components of the chart module now on the AngularJS application by injecting them into the controllers, services, and directives. The boilerplate—with a simple chart.js and chart.css file—is now ready. We can start to design the chart directive. A chart directive Next, we want to create a reusable and testable chart directive. The first question that comes into one's mind is where to put which functionality? Should we create a svg element as parent for the directive or a div element? Should we draw a data point as a circle in svg and use ng-repeat to replicate these points in the chart? Or should we better create and modify all data points with D3.js? I will answer all these question in the following sections. A directive for SVG As a general rule, we can say that different concepts should be encapsulated so that they can be replaced anytime by a new technology. Hence, we will use AngularJS with an element directive as a parent element for the visualization. We will bind the data and the options of the chart to the private scope of the directive. In the directive itself, we will create the complete chart including the parent svg container, the axis, and all data points using D3.js. Let's first add a simple directive for the chart component: /* src/chart.js */ …   // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){      return {      restrict: 'E',      scope: {        },      compile: function( element, attrs, transclude ) {                   // Create a SVG root element        var svg = d3.select(element[0]).append('svg');          // Return the link function        return function(scope, element, attrs) { };      }    }; }]); In the preceding example, we first inject d3 to the directive by passing it as an argument to the caller function. Then, we return a directive as an element with a private scope. Next, we define a custom compile function that returns the link function of the directive. This is important because we need to create the svg container for the visualization during the compilation of the directive. Then, during the link phase of the directive, we need to draw the visualization. Let's try to define some of these directives and look at the generated output. We define three directives in the index.html file, as shown in the following code: <!-- index.html --> <div ng-controller="MainCtrl">   <!-- We can use the visualization directives here --> <!-- The first chart --> <my-scatter-chart class="chart"></my-scatter-chart>   <!-- A second chart --> <my-scatter-chart class="chart"></my-scatter-chart>   <!-- Another chart --> <my-scatter-chart class="chart"></my-scatter-chart>   </div> If we look at the output of the html page in the developer tools, we can see that for each base element of the directive, we created a svg parent element for the visualization: Output of the HTML page In the resulting DOM tree, we can see that three svg elements are appended to the directives. We can now start to draw the chart in these directives. Let's fill these elements with some awesome charts. 
Implementing a custom compile function First, let's add a data attribute to the isolated scope of the directive. This gives us access to the dataset, which we will later pass to the directive in the HTML template. Next, we extend the compile function of the directive to create a g group container for the data points and the axis. We will also add a watcher that checks for changes of the scope data array. Every time the data changes, we call a draw() function that redraws the chart of the directive. Let's get started: /* src/capp..js */ ... // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){        // we will soon implement this function    var draw = function(svg, width, height, data){ … };      return {      restrict: 'E',      scope: {        data: '='      },      compile: function( element, attrs, transclude ) {          // Create a SVG root element        var svg = d3.select(element[0]).append('svg');          svg.append('g').attr('class', 'data');        svg.append('g').attr('class', 'x-axis axis');        svg.append('g').attr('class', 'y-axis axis');          // Define the dimensions for the chart        var width = 600, height = 300;          // Return the link function        return function(scope, element, attrs) {            // Watch the data attribute of the scope          scope.$watch('data', function(newVal, oldVal, scope) {              // Update the chart            draw(svg, width, height, scope.data);          }, true);        };      }    }; }]); Now, we implement the draw() function in the beginning of the directive. Drawing charts So far, the chart directive should look like the following code. We will now implement the draw() function, draw axis, and time series data. We start with setting the height and width for the svg element as follows: /* src/chart.js */ ...   // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){      function draw(svg, width, height, data) {      svg        .attr('width', width)        .attr('height', height);      // code continues here }      return {      restrict: 'E',      scope: {        data: '='      },      compile: function( element, attrs, transclude ) { ... } }]); Axis, scale, range, and domain We first need to create the scales for the data and then the axis for the chart. The implementation looks very similar to the scatter chart. We want to update the axis with the minimum and maximum values of the dataset; therefore, we also add this code to the draw() function: /* src/chart.js --> myScatterChart --> draw() */   function draw(svg, width, height, data) { ... 
// Define a margin
var margin = 30;

// Define x-scale
var xScale = d3.time.scale()
   .domain([
     d3.min(data, function(d) { return d.time; }),
     d3.max(data, function(d) { return d.time; })
   ])
   .range([margin, width-margin]);

// Define x-axis
var xAxis = d3.svg.axis()
   .scale(xScale)
   .orient('top')
   .tickFormat(d3.time.format('%S'));

// Define y-scale
var yScale = d3.scale.linear()
   .domain([0, d3.max(data, function(d) { return d.visitors; })])
   .range([margin, height-margin]);

// Define y-axis
var yAxis = d3.svg.axis()
   .scale(yScale)
   .orient('left')
   .tickFormat(d3.format('f'));

// Draw x-axis
svg.select('.x-axis')
   .attr("transform", "translate(0, " + margin + ")")
   .call(xAxis);

// Draw y-axis
svg.select('.y-axis')
   .attr("transform", "translate(" + margin + ")")
   .call(yAxis);
}

In the preceding code, we create a time scale for the x axis and a linear scale for the y axis and adapt the domains of both axes to match the minimum and maximum values of the dataset (we can also use the d3.extent() function to return min and max at the same time). Then, we define the pixel range for our chart area. Next, we create two axis objects with the previously defined scales and specify the tick format of each axis. We want to display the number of seconds that have passed on the x axis and an integer value of the number of visitors on the y axis. In the end, we draw the axes by calling the axis generator on the axis selection.

Joining the data points

Now, we will draw the data points. We finish the draw() function with this code:

/* src/chart.js --> myScatterChart --> draw() */
function draw(svg, width, height, data) {
...
// Add the new data points
svg.select('.data')
   .selectAll('circle').data(data)
   .enter()
   .append('circle');

// Update all data points
svg.select('.data')
   .selectAll('circle').data(data)
   .attr('r', 2.5)
   .attr('cx', function(d) { return xScale(d.time); })
   .attr('cy', function(d) { return yScale(d.visitors); });
}

In the preceding code, we first create circle elements for the enter join, that is, for the data points that have no corresponding circle in the selection. Then, we update the attributes of the center point of all circle elements of the chart. Let's look at the generated output of the application:

Output of the chart directive

We notice that the axes and the whole chart scale as soon as new data points are added to the chart. In fact, this result looks very similar to the previous example, with the main difference that we used a directive to draw this chart. This means that the data of the visualization belongs to the application, where it is stored and updated, whereas the directive is completely decoupled from the data. To achieve a nice output like the one in the previous figure, we need to add some styles to the chart.css file, as shown in the following code:

/* src/chart.css */
.axis path, .axis line {
   fill: none;
   stroke: #999;
   shape-rendering: crispEdges;
}
.tick {
   font: 10px sans-serif;
}
circle {
   fill: steelblue;
}

We need to disable the filling of the axes and enable crisp-edges rendering; this will give the whole visualization a much better look.

Summary

In this article, you learned how to properly integrate a D3.js component into an AngularJS application—the Angular way. All files, modules, and components should be maintainable, testable, and reusable.
You learned how to set up an AngularJS application and how to structure the folder structure for the visualization component. We put different responsibilities in different files and modules. Every piece that we can separate from the main application can be reused in another application; the goal is to use as much modularization as possible. As a next step, we created the visualization directive by implementing a custom compile function. This gives us access to the first compilation of the element—where we can append the svg element as a parent for the visualization—and other container elements. Resources for Article: Further resources on this subject: AngularJS Performance [article] An introduction to testing AngularJS directives [article] Our App and Tool Stack [article]
Solr Indexing Internals

Packt
23 Apr 2015
9 min read
In this article by Jayant Kumar, author of the book Apache Solr Search Patterns, we will discuss use cases for Solr in e-commerce and job sites. We will look at the problems faced while providing search in an e-commerce or job site: The e-commerce problem statement The job site problem statement Challenges of large-scale indexing (For more resources related to this topic, see here.) The e-commerce problem statement E-commerce provides an easy way to sell products to a large customer base. However, there is a lot of competition among multiple e-commerce sites. When users land on an e-commerce site, they expect to find what they are looking for quickly and easily. Also, users are not sure about the brands or the actual products they want to purchase. They have a very broad idea about what they want to buy. Many customers nowadays search for their products on Google rather than visiting specific e-commerce sites. They believe that Google will take them to the e-commerce sites that have their product. The purpose of any e-commerce website is to help customers narrow down their broad ideas and enable them to finalize the products they want to purchase. For example, suppose a customer is interested in purchasing a mobile. His or her search for a mobile should list mobile brands, operating systems on mobiles, screen size of mobiles, and all other features as facets. As the customer selects more and more features or options from the facets provided, the search narrows down to a small list of mobiles that suit his or her choice. If the list is small enough and the customer likes one of the mobiles listed, he or she will make the purchase. The challenge is also that each category will have a different set of facets to be displayed. For example, searching for books should display their format, as in paperpack or hardcover, author name, book series, language, and other facets related to books. These facets were different for mobiles that we discussed earlier. Similarly, each category will have different facets and it needs to be designed properly so that customers can narrow down to their preferred products, irrespective of the category they are looking into. The takeaway from this is that categorization and feature listing of products should be taken care of. Misrepresentation of features can lead to incorrect search results. Another takeaway is that we need to provide multiple facets in the search results. For example, while displaying the list of all mobiles, we need to provide facets for a brand. Once a brand is selected, another set of facets for operating systems, network, and mobile phone features has to be provided. As more and more facets are selected, we still need to show facets within the remaining products. Example of facet selection on Amazon.com Another problem is that we do not know what product the customer is searching for. A site that displays a huge list of products from different categories, such as electronics, mobiles, clothes, or books, needs to be able to identify what the customer is searching for. A customer can be searching for samsung, which can be in mobiles, tablets, electronics, or computers. The site should be able to identify whether the customer has input the author name or the book name. Identifying the input would help in increasing the relevance of the result set by increasing the precision of the search results. Most e-commerce sites provide search suggestions that include the category to help customers target the right category during their search. 
Amazon, for example, provides search suggestions that include both latest searched terms and products along with category-wise suggestions: Search suggestions on Amazon.com It is also important that products are added to the index as soon as they are available. It is even more important that they are removed from the index or marked as sold out as soon as their stock is exhausted. For this, modifications to the index should be immediately visible in the search. This is facilitated by a concept in Solr known as Near Real Time Indexing and Search (NRT). The job site problem statement A job site serves a dual purpose. On the one hand, it provides jobs to candidates, and on the other, it serves as a database of registered candidates' profiles for companies to shortlist. A job search has to be very intuitive for the candidates so that they can find jobs suiting their skills, position, industry, role, and location, or even by the company name. As it is important to keep the candidates engaged during their job search, it is important to provide facets on the abovementioned criteria so that they can narrow down to the job of their choice. The searches by candidates are not very elaborate. If the search is generic, the results need to have high precision. On the other hand, if the search does not return many results, then recall has to be high to keep the candidate engaged on the site. Providing a personalized job search to candidates on the basis of their profiles and past search history makes sense for the candidates. On the recruiter side, the search provided over the candidate database is required to have a huge set of fields to search upon every data point that the candidate has entered. The recruiters are very selective when it comes to searching for candidates for specific jobs. Educational qualification, industry, function, key skills, designation, location, and experience are some of the fields provided to the recruiter during a search. In such cases, the precision has to be high. The recruiter would like a certain candidate and may be interested in more candidates similar to the selected candidate. The more like this search in Solr can be used to provide a search for candidates similar to a selected candidate. NRT is important as the site should be able to provide a job or a candidate for a search as soon as any one of them is added to the database by either the recruiter or the candidate. The promptness of the site is an important factor in keeping users engaged on the site. Challenges of large-scale indexing Let us understand how indexing happens and what can be done to speed it up. We will also look at the challenges faced during the indexing of a large number of documents or bulky documents. An e-commerce site is a perfect example of a site containing a large number of products, while a job site is an example of a search where documents are bulky because of the content in candidate resumes. During indexing, Solr first analyzes the documents and converts them into tokens that are stored in the RAM buffer. When the RAM buffer is full, data is flushed into a segment on the disk. When the numbers of segments are more than that defined in the MergeFactor class of the Solr configuration, the segments are merged. Data is also written to disk when a commit is made in Solr. Let us discuss a few points to make Solr indexing fast and to handle a large index containing a huge number of documents. 
Using multiple threads for indexing on Solr

We can divide our data into smaller chunks, and each chunk can be indexed in a separate thread. Ideally, the number of threads should be twice the number of processor cores to avoid a lot of context switching. However, we can increase the number of threads beyond that and check for performance improvements.

Using the Java binary format of data for indexing

Instead of using XML files, we can use the Java bin format for indexing. This removes a lot of the overhead of parsing an XML file and converting it into a usable binary format. The way to use the Java bin format is to write our own program for creating fields, adding fields to documents, and finally adding documents to the index. Here is some sample code:

//Create an instance of the Solr server
String SOLR_URL = "http://localhost:8983/solr";
SolrServer server = new HttpSolrServer(SOLR_URL);

//Create a collection of documents to add to the Solr server
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("desc", "description text for doc 1");

SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", 2);
doc2.addField("desc", "description text for doc 2");

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
docs.add(doc2);

//Add the collection of documents to the Solr server and commit.
server.add(docs);
server.commit();

Here is the API reference for the HttpSolrServer class: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html. Add all files from the <solr_directory>/dist folder to the classpath for compiling and running the HttpSolrServer program.

Using the ConcurrentUpdateSolrServer class for indexing

Using the ConcurrentUpdateSolrServer class instead of the HttpSolrServer class can provide performance benefits, as the former uses buffers to store processed documents before sending them to the Solr server. We can also specify the number of background threads used to empty the buffers. The API docs for ConcurrentUpdateSolrServer are available at the following link: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html

The constructor for the ConcurrentUpdateSolrServer class is defined as:

ConcurrentUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount)

Here, queueSize is the size of the buffer and threadCount is the number of background threads used to flush the buffers to the index on disk. Note that using too many threads can increase the context switching between threads and reduce performance. In order to optimize the number of threads, we should monitor performance (documents indexed per minute) after each increase and ensure that there is no decrease in performance.

Summary

In this article, we briefly looked at the problems faced by e-commerce and job websites during indexing and search. We discussed the challenges faced while indexing a large number of documents, and we saw some tips on improving the speed of indexing. Resources for Article: Further resources on this subject: Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article] Boost Your search [article]
Text Mining with R: Part 2

Robi Sen
16 Apr 2015
4 min read
In Part 1, we covered the basics of doing text mining in R by selecting data, preparing it, and cleaning it, then performing various operations on it to visualize that data. In this post, we look at a simple use case showing how we can derive real meaning and value from a visualization, by seeing how a simple word cloud can help you understand the impact of an advertisement.

Building the document matrix

A common technique in text mining is to use a matrix of documents and terms, called a document term matrix. A document term matrix is simply a matrix whose columns are terms and whose rows are documents, with each cell holding the number of occurrences of a term within a document. If you reverse the order and have terms as rows and documents as columns, it is called a term document matrix. For example, let's say we have two documents, D1 and D2:

D1 = "I like cats"
D2 = "I hate cats"

Then the document term matrix would look like this:

      I   like   hate   cats
D1    1   1      0      1
D2    1   0      1      1

For our project, to make a document term matrix in R, all you need to do is use DocumentTermMatrix() like this:

tdm <- DocumentTermMatrix(mycorpus)

You can see information on your document term matrix by using print():

print(tdm)
<<DocumentTermMatrix (documents: 4688, terms: 18363)>>
Non-/sparse entries: 44400/86041344
Sparsity           : 100%
Maximal term length: 65
Weighting          : term frequency (tf)

Next, we need to sum up all the values in each term column so that we can derive the frequency of each term's occurrence. We also want to sort those values from highest to lowest. You can use this code:

m <- as.matrix(tdm)
v <- sort(colSums(m), decreasing=TRUE)

Next, we will use names() to pull each term's name, which in our case is a word. Then we build a data frame of the words and their frequencies. Finally, we create our word cloud, removing any terms that occur fewer than 45 times to reduce clutter. You could also use max.words to limit the total number of words in your word cloud. So your final code should look like this:

words <- names(v)
d <- data.frame(word=words, freq=v)
wordcloud(d$word, d$freq, min.freq=45)

If you run this in RStudio, you should see something like the figure, which shows the words with the highest occurrence in our corpus. The wordcloud() function automatically scales the drawn words by their frequency value. From here you can do a lot with your word cloud, including changing the scale, associating colors with various values, and much more. You can read more about wordcloud here.

While word clouds are often used on the web for things like blogs, news sites, and other similar use cases, they have real value for data analysis beyond being visual indicators that help users find terms of interest. For example, if you look at the word cloud we generated, you will notice that one of the most popular terms mentioned in tweets is chocolate. Doing a short inspection of our CSV document for the term chocolate, we find a lot of people mentioning the word in a variety of contexts, but one of the most common is in relation to a specific Super Bowl ad. For example, here is a tweet:

Alexalabesky 41673.39 Chocolate chips and peanut butter 0 0 0 Unknown Unknown Unknown Unknown Unknown

This appeared after the airing of this advertisement from Butterfinger. So even with this simple R code, we can derive real meaning from social media: in this case, the measurable impact of an advertisement during the Super Bowl.
Summary

In this post, we looked at a simple use case showing how we can derive real meaning and value from a visualization, by seeing how a simple word cloud can help you understand the impact of an advertisement.

About the author

Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting-edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems, allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.
Visualization

Packt
15 Apr 2015
29 min read
Humans are visual creatures and have evolved to be able to quickly notice the meaning when information is presented in certain ways that cause the wiring in our brains to have the light bulb of insight turn on. This "aha" can often be performed very quickly, given the correct tools, instead of through tedious numerical analysis. Tools for data analysis, such as pandas, take advantage of being able to quickly and iteratively provide the user to take data, process it, and quickly visualize the meaning. Often, much of what you will do with pandas is massaging your data to be able to visualize it in one or more visual patterns, in an attempt to get to "aha" by simply glancing at the visual representation of the information. In this article by Michael Heydt, author of the book Learning pandas we will cover common patterns in visualizing data with pandas. It is not meant to be exhaustive in coverage. The goal is to give you the required knowledge to create beautiful data visualizations on pandas data quickly and with very few lines of code. (For more resources related to this topic, see here.) This article is presented in three sections. The first introduces you to the general concepts of programming visualizations with pandas, emphasizing the process of creating time-series charts. We will also dive into techniques to label axes and create legends, colors, line styles, and markets. The second part of the article will then focus on the many types of data visualizations commonly used in pandas programs and data sciences, including: Bar plots Histograms Box and whisker charts Area plots Scatter plots Density plots Scatter plot matrixes Heatmaps The final section will briefly look at creating composite plots by dividing plots into subparts and drawing multiple plots within a single graphical canvas. Setting up the IPython notebook The first step to plot with pandas data, is to first include the appropriate libraries, primarily, matplotlib. The examples in this article will all be based on the following imports, where the plotting capabilities are from matplotlib, which will be aliased with plt: In [1]:# import pandas, numpy and datetimeimport numpy as npimport pandas as pd# needed for representing dates and timesimport datetimefrom datetime import datetime# Set some pandas options for controlling outputpd.set_option('display.notebook_repr_html', False)pd.set_option('display.max_columns', 10)pd.set_option('display.max_rows', 10)# used for seeding random number sequencesseedval = 111111# matplotlibimport matplotlib as mpl# matplotlib plotting functionsimport matplotlib.pyplot as plt# we want our plots inline%matplotlib inline The %matplotlib inline line is the statement that tells matplotlib to produce inline graphics. This will make the resulting graphs appear either inside your IPython notebook or IPython session. All examples will seed the random number generator with 111111, so that the graphs remain the same every time they run. Plotting basics with pandas The pandas library itself performs data manipulation. It does not provide data visualization capabilities itself. The visualization of data in pandas data structures is handed off by pandas to other robust visualization libraries that are part of the Python ecosystem, most commonly, matplotlib, which is what we will use in this article. All of the visualizations and techniques covered in this article can be performed without pandas. These techniques are all available independently in matplotlib. 
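For instance, the following is a minimal sketch, not taken from the book, of what a comparable time-series plot looks like when you call matplotlib directly; the random-walk data and the date range here are illustrative stand-ins:

# matplotlib-only sketch (illustrative; dates and data are made up for this example)
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

np.random.seed(111111)
start = datetime.datetime(2012, 1, 1)
dates = [start + datetime.timedelta(days=i) for i in range(1096)]  # three years of days
walk = np.random.randn(1096).cumsum()                              # a random walk

fig, ax = plt.subplots()
ax.plot(dates, walk)                                          # x and y passed explicitly
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))   # date labels formatted by hand
fig.autofmt_xdate()
plt.show()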
pandas tightly integrates with matplotlib, and by doing this, it is very simple to go directly from pandas data to a matplotlib visualization without having to work with intermediate forms of data. pandas does not draw the graphs, but it will tell matplotlib how to draw graphs using pandas data, taking care of many details on your behalf, such as automatically selecting Series for plots, labeling axes, creating legends, and defaulting color. Therefore, you often have to write very little code to create stunning visualizations. Creating time-series charts with .plot() One of the most common data visualizations created, is of the time-series data. Visualizing a time series in pandas is as simple as calling .plot() on a DataFrame or Series object. To demonstrate, the following creates a time series representing a random walk of values over time, akin to the movements in the price of a stock: In [2]:# generate a random walk time-seriesnp.random.seed(seedval)s = pd.Series(np.random.randn(1096),index=pd.date_range('2012-01-01','2014-12-31'))walk_ts = s.cumsum()# this plots the walk - just that easy :)walk_ts.plot(); The ; character at the end suppresses the generation of an IPython out tag, as well as the trace information. It is a common practice to execute the following statement to produce plots that have a richer visual style. This sets a pandas option that makes resulting plots have a shaded background and what is considered a slightly more pleasing style: In [3]:# tells pandas plots to use a default style# which has a background fillpd.options.display.mpl_style = 'default'walk_ts.plot(); The .plot() method on pandas objects is a wrapper function around the matplotlib libraries' plot() function. It makes plots of pandas data very easy to create. It is coded to know how to use the data in the pandas objects to create the appropriate plots for the data, handling many of the details of plot generation, such as selecting series, labeling, and axes generation. In this situation, the .plot() method determines that as Series contains dates for its index that the x axis should be formatted as dates and it selects a default color for the data. This example used a single series and the result would be the same using DataFrame with a single column. As an example, the following produces the same graph with one small difference. It has added a legend to the graph, which charts by default, generated from a DataFrame object, will have a legend even if there is only one series of data: In [4]:# a DataFrame with a single column will produce# the same plot as plotting the Series it is created fromwalk_df = pd.DataFrame(walk_ts)walk_df.plot(); The .plot() function is smart enough to know whether DataFrame has multiple columns, and it should create multiple lines/series in the plot and include a key for each, and also select a distinct color for each line. 
This is demonstrated with the following example: In [5]:# generate two random walks, one in each of# two columns in a DataFramenp.random.seed(seedval)df = pd.DataFrame(np.random.randn(1096, 2),index=walk_ts.index, columns=list('AB'))walk_df = df.cumsum()walk_df.head()Out [5]:A B2012-01-01 -1.878324 1.3623672012-01-02 -2.804186 1.4272612012-01-03 -3.241758 3.1653682012-01-04 -2.750550 3.3326852012-01-05 -1.620667 2.930017In [6]:# plot the DataFrame, which will plot a line# for each column, with a legendwalk_df.plot(); If you want to use one column of DataFrame as the labels on the x axis of the plot instead of the index labels, you can use the x and y parameters to the .plot() method, giving the x parameter the name of the column to use as the x axis and y parameter the names of the columns to be used as data in the plot. The following recreates the random walks as columns 'A' and 'B', creates a column 'C' with sequential values starting with 0, and uses these values as the x axis labels and the 'A' and 'B' columns values as the two plotted lines: In [7]:# copy the walkdf2 = walk_df.copy()# add a column C which is 0 .. 1096df2['C'] = pd.Series(np.arange(0, len(df2)), index=df2.index)# instead of dates on the x axis, use the 'C' column,# which will label the axis with 0..1000df2.plot(x='C', y=['A', 'B']); The .plot() functions, provided by pandas for the Series and DataFrame objects, take care of most of the details of generating plots. However, if you want to modify characteristics of the generated plots beyond their capabilities, you can directly use the matplotlib functions or one of more of the many optional parameters of the .plot() method. Adorning and styling your time-series plot The built-in .plot() method has many options that you can use to change the content in the plot. We will cover several of the common options used in most plots. Adding a title and changing axes labels The title of the chart can be set using the title parameter of the .plot() method. Axes labels are not set with .plot(), but by directly using the plt.ylabel() and plt.xlabel() functions after calling .plot(): In [8]:# create a time-series chart with a title and specific# x and y axes labels# the title is set in the .plot() method as a parameterwalk_df.plot(title='Title of the Chart')# explicitly set the x and y axes labels after the .plot()plt.xlabel('Time')plt.ylabel('Money'); The labels in this plot were added after the call to .plot(). A question that may be asked, is that if the plot is generated in the call to .plot(), then how are they changed on the plot? The answer, is that plots in matplotlib are not displayed until either .show() is called on the plot or the code reaches the end of the execution and returns to the interactive prompt. At either of these points, any plot generated by plot commands will be flushed out to the display. In this example, although .plot() is called, the plot is not generated until the IPython notebook code section finishes completion, so the changes for labels and title are added to the plot. Specifying the legend content and position To change the text used in the legend (the default is the column name from DataFrame), you can use the ax object returned from the .plot() method to modify the text using its .legend() method. 
The ax object is an AxesSubplot object, which is a representation of the elements of the plot, that can be used to change various aspects of the plot before it is generated: In [9]:# change the legend items to be different# from the names of the columns in the DataFrameax = walk_df.plot(title='Title of the Chart')# this sets the legend labelsax.legend(['1', '2']); The location of the legend can be set using the loc parameter of the .legend() method. By default, pandas sets the location to 'best', which tells matplotlib to examine the data and determine the best place to put the legend. However, you can also specify any of the following to position the legend more specifically (you can use either the string or the numeric code): Text Code 'best' 0 'upper right' 1 'upper left' 2 'lower left' 3 'lower right' 4 'right' 5 'center left' 6 'center right' 7 'lower center' 8 'upper center' 9 'center' 10 In our last chart, the 'best' option actually had the legend overlap the line from one of the series. We can reposition the legend in the upper center of the chart, which will prevent this and create a better chart of this data: In [10]:# change the position of the legendax = walk_df.plot(title='Title of the Chart')# put the legend in the upper center of the chartax.legend(['1', '2'], loc='upper center'); Legends can also be turned off with the legend parameter: In [11]:# omit the legend by using legend=Falsewalk_df.plot(title='Title of the Chart', legend=False); There are more possibilities for locating and actually controlling the content of the legend, but we leave that for you to do some more experimentation. Specifying line colors, styles, thickness, and markers pandas automatically sets the colors of each series on any chart. If you would like to specify your own color, you can do so by supplying style code to the style parameter of the plot function. pandas has a number of built-in single character code for colors, several of which are listed here: b: Blue g: Green r: Red c: Cyan m: Magenta y: Yellow k: Black w: White It is also possible to specify the color using a hexadecimal RGB code of the #RRGGBB format. To demonstrate both options, the following example sets the color of the first series to green using a single digit code and the second series to red using the hexadecimal code: In [12]:# change the line colors on the plot# use character code for the first line,# hex RGB for the secondwalk_df.plot(style=['g', '#FF0000']); Line styles can be specified using a line style code. These can be used in combination with the color style codes, following the color code. The following are examples of several useful line style codes: '-' = solid '--' = dashed ':' = dotted '-.' = dot-dashed '.' = points The following plot demonstrates these five line styles by drawing five data series, each with one of these styles. Notice how each style item now consists of a color symbol and a line style code: In [13]:# show off different line stylest = np.arange(0., 5., 0.2)legend_labels = ['Solid', 'Dashed', 'Dotted','Dot-dashed', 'Points']line_style = pd.DataFrame({0 : t,1 : t**1.5,2 : t**2.0,3 : t**2.5,4 : t**3.0})# generate the plot, specifying color and line style for each lineax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k:'])# set the legendax.legend(legend_labels, loc='upper left'); The thickness of lines can be specified using the lw parameter of .plot(). This can be passed a thickness for multiple lines, by passing a list of widths, or a single width that is applied to all lines. 
The following redraws the graph with a line width of 3, making the lines a little more pronounced: In [14]:# regenerate the plot, specifying color and line style# for each line and a line width of 3 for all linesax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k:'], lw=3)ax.legend(legend_labels, loc='upper left'); Markers on a line can also be specified using abbreviations in the style code. There are quite a few marker types provided and you can see them all at http://matplotlib.org/api/markers_api.html. We will examine five of them in the following chart by having each series use a different marker from the following: circles, stars, triangles, diamonds, and points. The type of marker is also specified using a code at the end of the style: In [15]:# redraw, adding markers to the linesax = line_style.plot(style=['r-o', 'g--^', 'b:*','m-.D', 'k:o'], lw=3)ax.legend(legend_labels, loc='upper left'); Specifying tick mark locations and tick labels Every plot we have seen to this point, has used the default tick marks and labels on the ticks that pandas decides are appropriate for the plot. These can also be customized using various matplotlib functions. We will demonstrate how ticks are handled by first examining a simple DataFrame. We can retrieve the locations of the ticks that were generated on the x axis using the plt.xticks() method. This method returns two values, the location, and the actual labels: In [16]:# a simple plot to use to examine ticksticks_data = pd.DataFrame(np.arange(0,5))ticks_data.plot()ticks, labels = plt.xticks()ticksOut [16]:array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ]) This array contains the locations of the ticks in units of the values along the x axis. pandas has decided that a range of 0 through 4 (the min and max) and an interval of 0.5 is appropriate. If we want to use other locations, we can provide these by passing them to plt.xticks() as a list. The following demonstrates these using even integers from -1 to 5, which will both change the extents of the axis, as well as remove non integral labels: In [17]:# resize x axis to (-1, 5), and draw ticks# only at integer valuesticks_data = pd.DataFrame(np.arange(0,5))ticks_data.plot()plt.xticks(np.arange(-1, 6)); Also, we can specify new labels at these locations by passing them as the second parameter. Just as an example, we can change the y axis ticks and labels to integral values and consecutive alpha characters using the following: In [18]:# rename y axis tick labels to A, B, C, D, and Eticks_data = pd.DataFrame(np.arange(0,5))ticks_data.plot()plt.yticks(np.arange(0, 5), list("ABCDE")); Formatting axes tick date labels using formatters The formatting of axes labels whose underlying data types is datetime is performed using locators and formatters. Locators control the position of the ticks, and the formatters control the formatting of the labels. To facilitate locating ticks and formatting labels based on dates, matplotlib provides several classes in maptplotlib.dates to help facilitate the process: MinuteLocator, HourLocator, DayLocator, WeekdayLocator, MonthLocator, and YearLocator: These are specific locators coded to determine where ticks for each type of date field will be found on the axis DateFormatter: This is a class that can be used to format date objects into labels on the axis By default, the default locator and formatter are AutoDateLocator and AutoDateFormatter, respectively. 
You can change these by providing different objects to use the appropriate methods on the specific axis object. To demonstrate, we will use a subset of the random walk data from earlier, which represents just the data from January through February of 2014. Plotting this gives us the following output: In [19]:# plot January-February 2014 from the random walkwalk_df.loc['2014-01':'2014-02'].plot(); The labels on the x axis of this plot have two series of labels, the minor and the major. The minor labels in this plot contain the day of the month, and the major contains the year and month (the year only for the first month). We can set locators and formatters for each of the minor and major levels. This will be demonstrated by changing the minor labels to be located at the Monday of each week and to contain the date and day of the week (right now, the chart uses weekly and only Friday's date—without the day name). On the major labels, we will use the monthly location and always include both the month name and the year: In [20]:# this import styles helps us type lessfrom matplotlib.dates import WeekdayLocator, DateFormatter, MonthLocator# plot Jan-Feb 2014ax = walk_df.loc['2014-01':'2014-02'].plot()# do the minor labelsweekday_locator = WeekdayLocator(byweekday=(0), interval=1)ax.xaxis.set_minor_locator(weekday_locator)ax.xaxis.set_minor_formatter(DateFormatter("%dn%a"))# do the major labelsax.xaxis.set_major_locator(MonthLocator())ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y')); This is almost what we wanted. However, note that the year is being reported as 45. This, unfortunately, seems to be an issue between pandas and the matplotlib representation of values for the year. The best reference I have on this is this following link from Stack Overflow (http://stackoverflow.com/questions/12945971/pandas-timeseries-plot-setting-x-axis-major-and-minor-ticks-and-labels). So, it appears to create a plot with custom-date-based labels, we need to avoid the pandas .plot() and need to kick all the way down to using matplotlib. Fortunately, this is not too hard. The following changes the code slightly and renders what we wanted: In [21]:# this gets around the pandas / matplotlib year issue# need to reference the subset twice, so let's make a variablewalk_subset = walk_df['2014-01':'2014-02']# this gets the plot so we can use it, we can ignore figfig, ax = plt.subplots()# inform matplotlib that we will use the following as dates# note we need to convert the index to a pydatetime seriesax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')# do the minor labelsweekday_locator = WeekdayLocator(byweekday=(0), interval=1)ax.xaxis.set_minor_locator(weekday_locator)ax.xaxis.set_minor_formatter(DateFormatter('%dn%a'))# do the major labelsax.xaxis.set_major_locator(MonthLocator())ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y'));ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y')); To add grid lines for the minor axes ticks, you can use the .grid() method of the x axis object of the plot, the first parameter specifying the lines to use and the second parameter specifying the minor or major set of ticks. 
The following replots this graph without the major grid line and with the minor grid lines: In [22]:# this gets the plot so we can use it, we can ignore figfig, ax = plt.subplots()# inform matplotlib that we will use the following as dates# note we need to convert the index to a pydatetime seriesax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')# do the minor labelsweekday_locator = WeekdayLocator(byweekday=(0), interval=1)ax.xaxis.set_minor_locator(weekday_locator)ax.xaxis.set_minor_formatter(DateFormatter('%dn%a'))ax.xaxis.grid(True, "minor") # turn on minor tick grid linesax.xaxis.grid(False, "major") # turn off major tick grid lines# do the major labelsax.xaxis.set_major_locator(MonthLocator())ax.xaxis.set_major_formatter(DateFormatter('nnn%bn%Y')); The last demonstration of formatting will use only the major labels but on a weekly basis and using a YYYY-MM-DD format. However, because these would overlap, we will specify that they should be rotated to prevent the overlap. This is done using the fig.autofmt_xdate() function: In [23]:# this gets the plot so we can use it, we can ignore figfig, ax = plt.subplots()# inform matplotlib that we will use the following as dates# note we need to convert the index to a pydatetime seriesax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')ax.xaxis.grid(True, "major") # turn off major tick grid lines# do the major labelsax.xaxis.set_major_locator(weekday_locator)ax.xaxis.set_major_formatter(DateFormatter('%Y-%m-%d'));# informs to rotate date labelsfig.autofmt_xdate(); Common plots used in statistical analyses Having seen how to create, lay out, and annotate time-series charts, we will now look at creating a number of charts, other than time series that are commonplace in presenting statistical information. Bar plots Bar plots are useful in order to visualize the relative differences in values of non time-series data. Bar plots can be created using the kind='bar' parameter of the .plot() method: In [24]:# make a bar plot# create a small series of 10 random values centered at 0.0np.random.seed(seedval)s = pd.Series(np.random.rand(10) - 0.5)# plot the bar charts.plot(kind='bar'); If the data being plotted consists of multiple columns, a multiple series bar plot will be created: In [25]:# draw a multiple series bar chart# generate 4 columns of 10 random valuesnp.random.seed(seedval)df2 = pd.DataFrame(np.random.rand(10, 4),columns=['a', 'b', 'c', 'd'])# draw the multi-series bar chartdf2.plot(kind='bar'); If you would prefer stacked bars, you can use the stacked parameter, setting it to True: In [26]:# horizontal stacked bar chartdf2.plot(kind='bar', stacked=True); If you want the bars to be horizontally aligned, you can use kind='barh': In [27]:# horizontal stacked bar chartdf2.plot(kind='barh', stacked=True); Histograms Histograms are useful for visualizing distributions of data. The following shows you a histogram of generating 1000 values from the normal distribution: In [28]:# create a histogramnp.random.seed(seedval)# 1000 random numbersdfh = pd.DataFrame(np.random.randn(1000))# draw the histogramdfh.hist(); The resolution of a histogram can be controlled by specifying the number of bins to allocate to the graph. The default is 10, and increasing the number of bins gives finer detail to the histogram. 
The following increases the number of bins to 100:

In [29]:
# histogram again, but with more bins
dfh.hist(bins=100);

If the data has multiple series, the histogram function will automatically generate multiple histograms, one for each series:

In [30]:
# generate a multiple histogram plot
# create DataFrame with 4 columns of 1000 random values
np.random.seed(seedval)
dfh = pd.DataFrame(np.random.randn(1000, 4),
                   columns=['a', 'b', 'c', 'd'])

# draw the chart. There are four columns so pandas draws
# four histograms
dfh.hist();

If you want to overlay multiple histograms on the same graph (to give a quick visual comparison of the distributions), you can call the pyplot.hist() function multiple times before .show() is called to render the chart:

In [31]:
# directly use pyplot to overlay multiple histograms
# generate two distributions, each with a different
# mean and standard deviation
np.random.seed(seedval)
x = [np.random.normal(3, 1) for _ in range(400)]
y = [np.random.normal(4, 2) for _ in range(400)]

# specify the bins (-10 to 10 with 100 bins)
bins = np.linspace(-10, 10, 100)

# generate plot x using plt.hist, 50% transparent
plt.hist(x, bins, alpha=0.5, label='x')

# generate plot y using plt.hist, 50% transparent
plt.hist(y, bins, alpha=0.5, label='y')
plt.legend(loc='upper right');

Box and whisker charts

Box plots come from descriptive statistics and are a useful way of graphically depicting the distributions of categorical data using quartiles. Each box represents the values between the first and third quartiles of the data, with a line across the box at the median. Each whisker reaches out to show the extent of the data up to 1.5 times the interquartile range below the first quartile and above the third quartile:

In [32]:
# create a box plot
# generate the series
np.random.seed(seedval)
dfb = pd.DataFrame(np.random.randn(10, 5))

# generate the plot
dfb.boxplot(return_type='axes');

There are ways to overlay dots and show outliers, but for brevity, they will not be covered in this text.

Area plots

Area plots are used to represent cumulative totals over time, to demonstrate the change in trends over time among related attributes. They can also be "stacked" to demonstrate representative totals across all variables. Area plots are generated by specifying kind='area'. A stacked area chart is the default:

In [33]:
# create a stacked area plot
# generate a 4-column data frame of random data
np.random.seed(seedval)
dfa = pd.DataFrame(np.random.rand(10, 4),
                   columns=['a', 'b', 'c', 'd'])

# create the area plot
dfa.plot(kind='area');

To produce an unstacked plot, specify stacked=False:

In [34]:
# do not stack the area plot
dfa.plot(kind='area', stacked=False);

By default, unstacked plots have an alpha value of 0.5, so that it is possible to see how the data series overlap.

Scatter plots

A scatter plot displays the correlation between a pair of variables. A scatter plot can be created from a DataFrame using .plot() and specifying kind='scatter', as well as specifying the x and y columns from the DataFrame source:

In [35]:
# generate a scatter plot of two series of normally
# distributed random values
# we would expect this to cluster around 0,0
np.random.seed(111111)
sp_df = pd.DataFrame(np.random.randn(10000, 2),
                     columns=['a', 'b'])
sp_df.plot(kind='scatter', x='a', y='b')

We can easily create more elaborate scatter plots by dropping down a little lower into matplotlib.
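The key ingredients for these more elaborate plots are matplotlib's per-point color (c) and size (s) arrays accepted by plt.scatter(). Here is a minimal, self-contained sketch using made-up arrays (the data and seed are assumptions for illustration, not taken from the book's example):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)                  # arbitrary seed for the sketch
x = np.random.randn(100)
y = np.random.randn(100)
sizes = 100 * np.random.rand(100)  # marker area, one value per point
colors = np.random.rand(100)       # values mapped through the current colormap

# each point gets its own size and color; alpha makes overlaps visible
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5)
plt.colorbar();                    # show the scale used for the colors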
The following code gets Google stock data for the year of 2011 and calculates delta in the closing price per day, and renders close versus volume as bubbles of different sizes, derived on the size of the values in the data: In [36]:# get Google stock data from 1/1/2011 to 12/31/2011from pandas.io.data import DataReaderstock_data = DataReader("GOOGL", "yahoo",datetime(2011, 1, 1),datetime(2011, 12, 31))# % change per daydelta = np.diff(stock_data["Adj Close"])/stock_data["Adj Close"][:-1]# this calculates size of markersvolume = (15 * stock_data.Volume[:-2] / stock_data.Volume[0])**2close = 0.003 * stock_data.Close[:-2] / 0.003 * stock_data.Open[:-2]# generate scatter plotfig, ax = plt.subplots()ax.scatter(delta[:-1], delta[1:], c=close, s=volume, alpha=0.5)# add some labels and styleax.set_xlabel(r'$Delta_i$', fontsize=20)ax.set_ylabel(r'$Delta_{i+1}$', fontsize=20)ax.set_title('Volume and percent change')ax.grid(True); Note the nomenclature for the x and y axes labels, which creates a nice mathematical style for the labels. Density plot You can create kernel density estimation plots using the .plot() method and setting the kind='kde' parameter. A kernel density estimate plot, instead of being a pure empirical representation of the data, makes an attempt and estimates the true distribution of the data, and hence smoothes it into a continuous plot. The following generates a normal distributed set of numbers, displays it as a histogram, and overlays the kde plot: In [37]:# create a kde density plot# generate a series of 1000 random numbersnp.random.seed(seedval)s = pd.Series(np.random.randn(1000))# generate the plots.hist(normed=True) # shows the barss.plot(kind='kde'); The scatter plot matrix The final composite graph we'll look at in this article is one that is provided by pandas in its plotting tools subcomponent: the scatter plot matrix. A scatter plot matrix is a popular way of determining whether there is a linear correlation between multiple variables. The following creates a scatter plot matrix with random values, which then shows a scatter plot for each combination, as well as a kde graph for each variable: In [38]:# create a scatter plot matrix# import this classfrom pandas.tools.plotting import scatter_matrix# generate DataFrame with 4 columns of 1000 random numbersnp.random.seed(111111)df_spm = pd.DataFrame(np.random.randn(1000, 4),columns=['a', 'b', 'c', 'd'])# create the scatter matrixscatter_matrix(df_spm, alpha=0.2, figsize=(6, 6), diagonal='kde'); Heatmaps A heatmap is a graphical representation of data, where values within a matrix are represented by colors. This is an effective means to show relationships of values that are measured at the intersection of two variables, at each intersection of the rows and the columns of the matrix. A common scenario, is to have the values in the matrix normalized to 0.0 through 1.0 and have the intersections between a row and column represent the correlation between the two variables. Values with less correlation (0.0) are the darkest, and those with the highest correlation (1.0) are white. 
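A typical source for such a matrix is the pairwise correlation of a DataFrame's columns, which pandas can compute directly. The following is a small sketch with synthetic data (the frame and seed are assumptions for illustration, not the book's data):

import numpy as np
import pandas as pd

np.random.seed(0)   # arbitrary seed for the sketch
df = pd.DataFrame(np.random.randn(100, 4),
                  columns=['a', 'b', 'c', 'd'])

# pairwise Pearson correlations: a square matrix with values from -1.0 to 1.0,
# which is exactly the kind of matrix a heatmap renders well
corr = df.corr()
print(corr)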
Heatmaps are easily created with pandas and matplotlib using the .imshow() function: In [39]:# create a heatmap# start with data for the heatmaps = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4],['V', 'W', 'X', 'Y', 'Z'])heatmap_data = pd.DataFrame({'A' : s + 0.0,'B' : s + 0.1,'C' : s + 0.2,'D' : s + 0.3,'E' : s + 0.4,'F' : s + 0.5,'G' : s + 0.6})heatmap_dataOut [39]:A B C D E F GV 0.0 0.1 0.2 0.3 0.4 0.5 0.6W 0.1 0.2 0.3 0.4 0.5 0.6 0.7X 0.2 0.3 0.4 0.5 0.6 0.7 0.8Y 0.3 0.4 0.5 0.6 0.7 0.8 0.9Z 0.4 0.5 0.6 0.7 0.8 0.9 1.0In [40]:# generate the heatmapplt.imshow(heatmap_data, cmap='hot', interpolation='none')plt.colorbar() # add the scale of colors bar# set the labelsplt.xticks(range(len(heatmap_data.columns)), heatmap_data.columns)plt.yticks(range(len(heatmap_data)), heatmap_data.index); Multiple plots in a single chart It is often useful to contrast data by displaying multiple plots next to each other. This is actually quite easy to when using matplotlib. To draw multiple subplots on a grid, we can make multiple calls to plt.subplot2grid(), each time passing the size of the grid the subplot is to be located on (shape=(height, width)) and the location on the grid of the upper-left section of the subplot (loc=(row, column)). Each call to plt.subplot2grid() returns a different AxesSubplot object that can be used to reference the specific subplot and direct the rendering into. The following demonstrates this, by creating a plot with two subplots based on a two row by one column grid (shape=(2,1)). The first subplot, referred to by ax1, is located in the first row (loc=(0,0)), and the second, referred to as ax2, is in the second row (loc=(1,0)): In [41]:# create two sub plots on the new plot using a 2x1 grid# ax1 is the upper rowax1 = plt.subplot2grid(shape=(2,1), loc=(0,0))# and ax2 is in the lower rowax2 = plt.subplot2grid(shape=(2,1), loc=(1,0)) The subplots have been created, but we have not drawn into either yet. The size of any subplot can be specified using the rowspan and colspan parameters in each call to plt.subplot2grid(). This actually feels a lot like placing content in HTML tables. The following demonstrates a more complicated layout of five plots, specifying different row and column spans for each: In [42]:# layout sub plots on a 4x4 grid# ax1 on top row, 4 columns wideax1 = plt.subplot2grid((4,4), (0,0), colspan=4)# ax2 is row 2, leftmost and 2 columns wideax2 = plt.subplot2grid((4,4), (1,0), colspan=2)# ax3 is 2 cols wide and 2 rows high, starting# on second row and the third columnax3 = plt.subplot2grid((4,4), (1,2), colspan=2, rowspan=2)# ax4 1 high 1 wide, in row 4 column 0ax4 = plt.subplot2grid((4,4), (2,0))# ax4 1 high 1 wide, in row 4 column 1ax5 = plt.subplot2grid((4,4), (2,1)); To draw into a specific subplot using the pandas .plot() method, you can pass the specific axes into the plot function via the ax parameter. The following demonstrates this by extracting each series from the random walk we created at the beginning of this article, and drawing each into different subplots: In [43]:# demonstrating drawing into specific sub-plots# generate a layout of 2 rows 1 column# create the subplots, one on each rowax5 = plt.subplot2grid((2,1), (0,0))ax6 = plt.subplot2grid((2,1), (1,0))# plot column 0 of walk_df into top row of the gridwalk_df[[0]].plot(ax = ax5)# and column 1 of walk_df into bottom rowwalk_df[[1]].plot(ax = ax6); Using this technique, we can perform combinations of different series of data, such as a stock close versus volume graph. 
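When the stacked subplots share the same date axis, as in the close-versus-volume example that follows, it can also be convenient to let matplotlib link the x axes so that they stay aligned when zooming or resizing. A minimal sketch of that variant, using synthetic stand-in data rather than the book's stock data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# assumed stand-ins for a price series and a volume series
idx = pd.date_range('2014-01-01', periods=100)
price = pd.Series(np.random.randn(100).cumsum() + 100, index=idx)
volume = pd.Series(np.random.randint(1000, 5000, 100), index=idx)

# two rows, one column, with a shared (linked) x axis
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.plot(price.index, price.values)
ax_bottom.bar(volume.index, volume.values);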
Given the data we read during a previous example for Google, the following will plot the volume versus the closing price:

In [44]:
# draw the close on the top chart
top = plt.subplot2grid((4,4), (0, 0), rowspan=3, colspan=4)
top.plot(stock_data.index, stock_data['Close'], label='Close')
plt.title('Google Closing Stock Price 2011')

# draw the volume chart on the bottom
bottom = plt.subplot2grid((4,4), (3,0), rowspan=1, colspan=4)
bottom.bar(stock_data.index, stock_data['Volume'])
plt.title('Google Trading Volume')

# set the size of the plot
plt.gcf().set_size_inches(15,8)

Summary

Visualizing your data is one of the best ways to quickly understand the story being told by the data. Python, pandas, and matplotlib (and a few other libraries) provide a means of very quickly, and with a few lines of code, getting the gist of what you are trying to discover, as well as the underlying message (and displaying it beautifully too). In this article, we examined many of the most common means of visualizing data from pandas. There are also a lot of interesting visualizations that were not covered, and indeed, the concept of data visualization with pandas and/or Python is the subject of entire texts, but I believe this article provides a much-needed reference to get up and running with the visualizations that provide most of what is needed.

Resources for Article:

Further resources on this subject:
Prototyping Arduino Projects using Python [Article]
Classifying with Real-world Examples [Article]
Python functions – Avoid repeating code [Article]


Work Item Querying

Packt
07 Apr 2015
9 min read
In this article by Dipti Chhatrapati, author of Reporting in TFS, shows us that work items are the primary element project managers and team leaders focus on to track and identify the pending work to be completed. A team member uses work items to track their personal work queue. In order to achieve the current status of the project via work items, it's essential to query work items based on the requirements. This article will cover the following topics: Team project scenario Work item queries Search box queries Flat queries Direct link queries Tree queries (For more resources related to this topic, see here.) Team project scenario Here, we are considering a sports item website that helps user to buy sport items from an item gallery based on their category. The user has to register for membership in order to buy sport products such as footballs, tennis rackets, cricket bats, and so on. Moreover, a registered user can also view/add sport-related articles or news, which will be visible to everyone irrespective of whether they are anonymous or registered. This project is mapped with TFS and has a repository created in TFS Server with work items such as user stories, tasks, bugs, and test cases to plan and track the project's work. We have the following TFS configuration settings for the team project: Team Foundation Server: DIPSTFS Website project: SportsWeb Team project: SportsWebTeamProject Team Foundation Server URL: http://dipstfs:8080/tfs Team project collection URL: http://dipstfs:8080/tfs/DefaultCollection Team Project URL: http://dipstfs:8080/tfs/DefaultCollection/SportsWebTeamProject Team project administrators: DIPSTFSDipsAdministrator Team project members: DIPSTFSDipti Chhatrapati, DIPSTFSBjoern H Rapp, DIPSTFSEdric Taylor, DIPSTFSJohn Smith, DIPSTFSNelson Hall, DIPSTFSScott Harley The following figure shows the project with TFS configuration and setup: Work item queries Work item queries smoothen the process of identifying the status of the team project; this helps in creating a custom report in TFS. We can query work items by a search box or a query editor via Team Web Access. For more information on Work Item Queries, have a look at following links: http://msdn.microsoft.com/en-us/library/ms181308(v=vs.110).aspx http://msdn.microsoft.com/en-us/library/dd286638.aspx There are three types of queries: Flat queries Direct link queries Tree queries Search box queries We can find a work item using the search box available in the team project web portal, which is shown in the following screenshot: You can type in keywords in the search box located on top right of the team project web portal site; for example master, will result in the following work items: The search box content menu also has the ability to find work items based on assignment, status, created by, or work item type, as shown in the following screenshot: The search box finds items using shortcut filters or by specifying keywords or phrases, specific fields/field values, assignment or date modifications, or using the equals, contains, and not operators. For more information on search box filtering, have a look at http://msdn.microsoft.com/en-us/library/cc668120.aspx. 
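Besides the search box and the query editor, work item queries can also be issued programmatically against the team project collection. The following is a rough, hedged sketch only: it assumes a TFS version whose REST API exposes the WIQL endpoint (check your server's API documentation before relying on it), the credentials are placeholders, and on-premises servers typically require Windows/NTLM authentication rather than the basic authentication shown here. The collection URL and team project name are the ones listed earlier for this scenario:

import requests

collection_url = "http://dipstfs:8080/tfs/DefaultCollection"   # from the scenario above
project = "SportsWebTeamProject"

# a flat query expressed in WIQL: active tasks with their IDs, titles, and states
wiql = {
    "query": "SELECT [System.Id], [System.Title], [System.State] "
             "FROM WorkItems "
             "WHERE [System.WorkItemType] = 'Task' "
             "AND [System.State] = 'Active'"
}

response = requests.post(
    "{0}/{1}/_apis/wit/wiql?api-version=1.0".format(collection_url, project),
    json=wiql,
    auth=("DIPSTFS\\username", "password"),   # placeholder credentials
)
print(response.json())   # references to the matching work items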
Flat queries A flat query list of work items is used when you want to perform the following tasks: Finding a work item with an unknown ID Checking the status or other columns of work items Finding work items that you want to link to other work items Exporting work items to Microsoft Office, Microsoft Excel, and Office Project for bulk updates to column fields Generating a report about a set of work items As a general practice, to easily find work items, a team member can create Shared Queries, which are predefined queries shared across the team. They can be created, modified, and saved as a new query too. The following steps demonstrate how to open a flat query list and create a new query list: In the team project web portal, expand Shared Query List located on the left-hand side and click on the My Tasks query, as shown in the following screenshot: The resulting work items generated by the My Tasks query will be shown in the Work item pane, as shown in the following screenshot: As there are now three active tasks and two new tasks, we will create the My Active Tasks flat Query. To do so, click on Editor, as shown here: Add a clause to filter work items by Active State: Now click on the Save Query as… icon to save the query as My Active Task: Enter the query name and folder as appropriate. Here, we will save the query in the Shared Queries Folder and click on OK: Click on Results to view the work items for the My Active Tasks query and it will display the items, as shown in the following screenshot: Now let's have a look at how to create a query that represents all the work item details of different sprints/iterations. For example, you have a number of sprints in the Release 1 iteration and another release to test an application that's named Test Release 1 that you can find in Team Web Access site's settings page under the Iterations tab, as indicated in the following screenshot: In order to fetch the work item data of all the sprints to know which task is allocated to which team member in which sprint, go to the Backlogs tab and click on Create query: Specify the query name and folder location to store the query. Then click on OK: Then click on the link as indicated in the following screenshot, which will redirect you to the created query: Click on Flat list of work items and remove all the conditions except the iteration path, as shown in the following screenshot: Now save the query and run it. Add columns such as Work Item Type, State, Iteration Path, Title, and Assigned To as appropriate. As a result, this query will display the work items available under the team project for different sprints or releases, as indicated in the following screenshot: To filter work items based on the sprintreleaseiteration, change the iteration path condition for Value to Sprint 1, as indicated in the following screenshot: Finally, save and run the query, which will return the work items available under Sprint 1 of the Release 1 iteration: For more information on flat queries, have a look at http://msdn.microsoft.com/en-us/library/ms181308(v=vs.110).aspx. Direct link queries There are work items that are dependent on other work items such as tasks, bugs, and issues, and they can be tracked using direct links. They help determine risks and dependencies in order to collaborate among teams effectively. 
Direct link queries help perform the following tasks: Creating a custom view of linked work items Tracking dependencies across team projects and manage the commitments made to other project teams Assessing changes to work items that you do not own but that your work items depend on The following steps demonstrate how to generate a linked query list: Open My Tasks List from Shared Queries. Click on Editor. Click on Work items and direct links, as shown in the following screenshot: Specify the clause for the work item type: Task in Filters for linked work items: We can filter the first level work items by choosing the following option: The meanings of the filter options are described as follows: Only return work items that have the specified links: This option returns only the top-level work items that have links to work items. Return all top level work items: This option returns all the work items whether they have linked work items or not. This option also returns the second-level work items that are linked to the first-level work items. Only return work items that do not have the specified links: This option returns only the top-level work items those are not linked to any work items. Run the query, save it as My Linked Tasks and click on OK: Click on Results to view the linked tasks as configured previously. For more information on direct link queries, have a look at http://msdn.microsoft.com/en-us/library/dd286501(v=vs.110).aspx. Tree queries To view nested work items, tree queries are used by selecting the Tree of Work Items query type. Tree queries are used to execute following tasks: Viewing the hierarchy Finding parent or child work items Changing the tree hierarchy Exporting the tree view to Microsoft Excel for either bulk updates to column fields or to change the tree hierarchy The following steps demonstrate how to generate a tree query list: Open the My Tasks list from Shared Queries. Click on Editor. Click on Tree of work items, as shown in the following screenshot: Define the filter criteria for both parent and child work items. Specify the clause for work item type: Task in Filters for linked work items. Also, select Match top-level work items first. We can filter linked work items by choosing the following option: To find linked children, select Match top-level work items first and, to find linked parents, select Match linked work items first. Run the query, save it as My Tree Tasks, and click on OK. Click on Results to view the linked tasks as configured previously: For more information on Tree queries, have a look at: http://msdn.microsoft.com/en-us/library/dd286633(v=vs.110).aspx Summary In this article, we reviewed the team project scenario; and we also walked through the types of work item queries that produce work items we need in order to know the status of work progress. Resources for Article: Further resources on this subject: Creating a basic JavaScript plugin [article] Building Financial Functions into Excel 2010 [article] Team Foundation Server 2012 [article]


Working with Blender

Packt
06 Apr 2015
15 min read
In this article by Jos Dirksen, author of Learning Three.js – the JavaScript 3D Library for WebGL - Second Edition, we will learn about Blender and also about how to load models in Three.js using different formats. (For more resources related to this topic, see here.) Before we get started with the configuration, we'll show the result that we'll be aiming for. In the following screenshot, you can see a simple Blender model that we exported with the Three.js plugin and imported in Three.js with THREE.JSONLoader: Installing the Three.js exporter in Blender To get Blender to export Three.js models, we first need to add the Three.js exporter to Blender. The following steps are for Mac OS X but are pretty much the same on Windows and Linux. You can download Blender from www.blender.org and follow the platform-specific installation instructions. After installation, you can add the Three.js plugin. First, locate the addons directory from your Blender installation using a terminal window: On my Mac, it's located here: ./blender.app/Contents/MacOS/2.70/scripts/addons. For Windows, this directory can be found at the following location: C:UsersUSERNAMEAppDataRoamingBlender FoundationBlender2.7Xscriptsaddons. And for Linux, you can find this directory here: /home/USERNAME/.config/blender/2.7X/scripts/addons. Next, you need to get the Three.js distribution and unpack it locally. In this distribution, you can find the following folder: utils/exporters/blender/2.65/scripts/addons/. In this directory, there is a single subdirectory with the name io_mesh_threejs. Copy this directory to the addons folder of your Blender installation. Now, all we need to do is start Blender and enable the exporter. In Blender, open Blender User Preferences (File | User Preferences). In the window that opens, select the Addons tab, and in the search box, type three. This will show the following screen: At this point, the Three.js plugin is found, but it is still disabled. Check the small checkbox to the right, and the Three.js exporter will be enabled. As a final check to see whether everything is working correctly, open the File | Export menu option, and you'll see Three.js listed as an export option. This is shown in the following screenshot: With the plugin installed, we can load our first model. Loading and exporting a model from Blender As an example, we've added a simple Blender model named misc_chair01.blend in the assets/models folder, which you can find in the sources for this article. In this section, we'll load this model and show the minimal steps it takes to export this model to Three.js. First, we need to load this model in Blender. Use File | Open and navigate to the folder containing the misc_chair01.blend file. Select this file and click on Open. This will show you a screen that looks somewhat like this: Exporting this model to the Three.js JSON format is pretty straightforward. From the File menu, open Export | Three.js, type in the name of the export file, and select Export Three.js. This will create a JSON file in a format Three.js understands. A part of the contents of this file is shown next: {   "metadata" : {    "formatVersion" : 3.1,    "generatedBy"   : "Blender 2.7 Exporter",    "vertices"     : 208,    "faces"         : 124,    "normals"       : 115,    "colors"       : 0,    "uvs"          : [270,151],    "materials"     : 1,    "morphTargets" : 0,    "bones"         : 0 }, ... However, we aren't completely done. In the previous screenshot, you can see that the chair contains a wooden texture. 
If you look through the JSON export, you can see that the export for the chair also specifies a material, as follows: "materials": [{ "DbgColor": 15658734, "DbgIndex": 0, "DbgName": "misc_chair01", "blending": "NormalBlending", "colorAmbient": [0.53132, 0.25074, 0.147919], "colorDiffuse": [0.53132, 0.25074, 0.147919], "colorSpecular": [0.0, 0.0, 0.0], "depthTest": true, "depthWrite": true, "mapDiffuse": "misc_chair01_col.jpg", "mapDiffuseWrap": ["repeat", "repeat"], "shading": "Lambert", "specularCoef": 50, "transparency": 1.0, "transparent": false, "vertexColors": false }], This material specifies a texture, misc_chair01_col.jpg, for the mapDiffuse property. So, besides exporting the model, we also need to make sure the texture file is also available to Three.js. Luckily, we can save this texture directly from Blender. In Blender, open the UV/Image Editor view. You can select this view from the drop-down menu on the left-hand side of the File menu option. This will replace the top menu with the following: Make sure the texture you want to export is selected, misc_chair_01_col.jpg in our case (you can select a different one using the small image icon). Next, click on the Image menu and use the Save as Image menu option to save the image. Save it in the same folder where you saved the model using the name specified in the JSON export file. At this point, we're ready to load the model into Three.js. The code to load this into Three.js at this point looks like this: var loader = new THREE.JSONLoader(); loader.load('../assets/models/misc_chair01.js', function (geometry, mat) { mesh = new THREE.Mesh(geometry, mat[0]);   mesh.scale.x = 15; mesh.scale.y = 15; mesh.scale.z = 15;   scene.add(mesh);   }, '../assets/models/'); We've already seen JSONLoader before, but this time, we use the load function instead of the parse function. In this function, we specify the URL we want to load (points to the exported JSON file), a callback that is called when the object is loaded, and the location, ../assets/models/, where the texture can be found (relative to the page). This callback takes two parameters: geometry and mat. The geometry parameter contains the model, and the mat parameter contains an array of material objects. We know that there is only one material, so when we create THREE.Mesh, we directly reference that material. If you open the 05-blender-from-json.html example, you can see the chair we just exported from Blender. Using the Three.js exporter isn't the only way of loading models from Blender into Three.js. Three.js understands a number of 3D file formats, and Blender can export in a couple of those formats. Using the Three.js format, however, is very easy, and if things go wrong, they are often quickly found. In the following section, we'll look at a couple of the formats Three.js supports and also show a Blender-based example for the OBJ and MTL file formats. Importing from 3D file formats At the beginning of this article, we listed a number of formats that are supported by Three.js. In this section, we'll quickly walk through a couple of examples for those formats. Note that for all these formats, an additional JavaScript file needs to be included. You can find all these files in the Three.js distribution in the examples/js/loaders directory. The OBJ and MTL formats OBJ and MTL are companion formats and often used together. The OBJ file defines the geometry, and the MTL file defines the materials that are used. Both OBJ and MTL are text-based formats. 
A part of an OBJ file looks like this: v -0.032442 0.010796 0.025935 v -0.028519 0.013697 0.026201 v -0.029086 0.014533 0.021409 usemtl Material s 1 f 2731 2735 2736 2732 f 2732 2736 3043 3044 The MTL file defines materials like this: newmtl Material Ns 56.862745 Ka 0.000000 0.000000 0.000000 Kd 0.360725 0.227524 0.127497 Ks 0.010000 0.010000 0.010000 Ni 1.000000 d 1.000000 illum 2 The OBJ and MTL formats by Three.js are understood well and are also supported by Blender. So, as an alternative, you could choose to export models from Blender in the OBJ/MTL format instead of the Three.js JSON format. Three.js has two different loaders you can use. If you only want to load the geometry, you can use OBJLoader. We used this loader for our example (06-load-obj.html). The following screenshot shows this example: To import this in Three.js, you have to add the OBJLoader JavaScript file: <script type="text/javascript" src="../libs/OBJLoader.js"> </script> Import the model like this: var loader = new THREE.OBJLoader(); loader.load('../assets/models/pinecone.obj', function (loadedMesh) { var material = new THREE.MeshLambertMaterial({color: 0x5C3A21});   // loadedMesh is a group of meshes. For // each mesh set the material, and compute the information // three.js needs for rendering. loadedMesh.children.forEach(function (child) {    child.material = material;    child.geometry.computeFaceNormals();    child.geometry.computeVertexNormals(); });   mesh = loadedMesh; loadedMesh.scale.set(100, 100, 100); loadedMesh.rotation.x = -0.3; scene.add(loadedMesh); }); In this code, we use OBJLoader to load the model from a URL. Once the model is loaded, the callback we provide is called, and we add the model to the scene. Usually, a good first step is to print out the response from the callback to the console to understand how the loaded object is built up. Often with these loaders, the geometry or mesh is returned as a hierarchy of groups. Understanding this makes it much easier to place and apply the correct material and take any other additional steps. Also, look at the position of a couple of vertices to determine whether you need to scale the model up or down and where to position the camera. In this example, we've also made the calls to computeFaceNormals and computeVertexNormals. This is required to ensure that the material used (THREE.MeshLambertMaterial) is rendered correctly. The next example (07-load-obj-mtl.html) uses OBJMTLLoader to load a model and directly assign a material. 
The following screenshot shows this example: First, we need to add the correct loaders to the page: <script type="text/javascript" src="../libs/OBJLoader.js"> </script> <script type="text/javascript" src="../libs/MTLLoader.js"> </script> <script type="text/javascript" src="../libs/OBJMTLLoader.js"> </script> We can load the model from the OBJ and MTL files like this: var loader = new THREE.OBJMTLLoader(); loader.load('../assets/models/butterfly.obj', '../assets/ models/butterfly.mtl', function(object) { // configure the wings var wing2 = object.children[5].children[0]; var wing1 = object.children[4].children[0];   wing1.material.opacity = 0.6; wing1.material.transparent = true; wing1.material.depthTest = false; wing1.material.side = THREE.DoubleSide;   wing2.material.opacity = 0.6; wing2.material.depthTest = false; wing2.material.transparent = true; wing2.material.side = THREE.DoubleSide;   object.scale.set(140, 140, 140); mesh = object; scene.add(mesh);   mesh.rotation.x = 0.2; mesh.rotation.y = -1.3; }); The first thing to mention before we look at the code is that if you receive an OBJ file, an MTL file, and the required texture files, you'll have to check how the MTL file references the textures. These should be referenced relative to the MTL file and not as an absolute path. The code itself isn't that different from the one we saw for THREE.ObjLoader. We specify the location of the OBJ file, the location of the MTL file, and the function to call when the model is loaded. The model we've used as an example in this case is a complex model. So, we set some specific properties in the callback to fix some rendering issues, as follows: The opacity in the source files was set incorrectly, which caused the wings to be invisible. So, to fix that, we set the opacity and transparent properties ourselves. By default, Three.js only renders one side of an object. Since we look at the wings from two sides, we need to set the side property to the THREE.DoubleSide value. The wings caused some unwanted artifacts when they needed to be rendered on top of each other. We've fixed that by setting the depthTest property to false. This has a slight impact on performance but can often solve some strange rendering artifacts. But, as you can see, you can easily load complex models directly into Three.js and render them in real time in your browser. You might need to fine-tune some material properties though. Loading a Collada model Collada models (extension is .dae) are another very common format for defining scenes and models (and animations as well). In a Collada model, it is not just the geometry that is defined, but also the materials. It's even possible to define light sources. To load Collada models, you have to take pretty much the same steps as for the OBJ and MTL models. You start by including the correct loader: <script type="text/javascript" src="../libs/ColladaLoader.js"> </script> For this example, we'll load the following model: Loading a truck model is once again pretty simple: var mesh; loader.load("../assets/models/dae/Truck_dae.dae", function   (result) { mesh = result.scene.children[0].children[0].clone(); mesh.scale.set(4, 4, 4); scene.add(mesh); }); The main difference here is the result of the object that is returned to the callback. The result object has the following structure: var result = {   scene: scene, morphs: morphs, skins: skins, animations: animData, dae: {    ... } }; In this article, we're interested in the objects that are in the scene parameter. 
I first printed out the scene to the console to look where the mesh was that I was interested in, which was result.scene.children[0].children[0]. All that was left to do was scale it to a reasonable size and add it to the scene. A final note on this specific example—when I loaded this model for the first time, the materials didn't render correctly. The reason was that the textures used the .tga format, which isn't supported in WebGL. To fix this, I had to convert the .tga files to .png and edit the XML of the .dae model to point to these .png files. As you can see, for most complex models, including materials, you often have to take some additional steps to get the desired results. By looking closely at how the materials are configured (using console.log()) or replacing them with test materials, problems are often easy to spot. Loading the STL, CTM, VTK, AWD, Assimp, VRML, and Babylon models We're going to quickly skim over these file formats as they all follow the same principles: Include [NameOfFormat]Loader.js in your web page. Use [NameOfFormat]Loader.load() to load a URL. Check what the response format for the callback looks like and render the result. We have included an example for all these formats: Name Example Screenshot STL 08-load-STL.html CTM 09-load-CTM.html VTK 10-load-vtk.html AWD 11-load-awd.html Assimp 12-load-assimp.html VRML 13-load-vrml.html Babylon The Babylon loader is slightly different from the other loaders in this table. With this loader, you don't load a single THREE.Mesh or THREE.Geometry instance, but with this loader, you load a complete scene, including lights.   14-load-babylon.html If you look at the source code for these examples, you might see that for some of them, we need to change some material properties or do some scaling before the model is rendered correctly. The reason we need to do this is because of the way the model is created in its external application, giving it different dimensions and grouping than we normally use in Three.js. Summary In this article, we've almost shown all the supported file formats. Using models from external sources isn't that hard to do in Three.js. Especially for simple models, you only have to take a few simple steps. When working with external models, or creating them using grouping and merging, it is good to keep a couple of things in mind. The first thing you need to remember is that when you group objects, they still remain available as individual objects. Transformations applied to the parent also affect the children, but you can still transform the children individually. Besides grouping, you can also merge geometries together. With this approach, you lose the individual geometries and get a single new geometry. This is especially useful when you're dealing with thousands of geometries you need to render and you're running into performance issues. Three.js supports a large number of external formats. When using these format loaders, it's a good idea to look through the source code and log out the information received in the callback. This will help you to understand the steps you need to take to get the correct mesh and set it to the correct position and scale. Often, when the model doesn't show correctly, this is caused by its material settings. It could be that incompatible texture formats are used, opacity is incorrectly defined, or the format contains incorrect links to the texture images. 
It is usually a good idea to use a test material to determine whether the model itself is loaded correctly and log the loaded material to the JavaScript console to check for unexpected values. It is also possible to export meshes and scenes, but remember that GeometryExporter, SceneExporter, and SceneLoader of Three.js are still work in progress. Resources for Article: Further resources on this subject: Creating the maze and animating the cube [article] Mesh animation [article] Working with the Basic Components That Make Up a Three.js Scene [article]

Machine Learning Using Spark MLlib

Packt
01 Apr 2015
22 min read
This Spark machine learning tutorial is by Krishna Sankar, the author of Fast Data Processing with Spark Second Edition. One of the major attractions of Spark is the ability to scale computation massively, and that is exactly what you need for machine learning algorithms. But the caveat is that all machine learning algorithms cannot be effectively parallelized. Each algorithm has its own challenges for parallelization, whether it is task parallelism or data parallelism. Having said that, Spark is becoming the de-facto platform for building machine learning algorithms and applications. For example, Apache Mahout is moving away from Hadoop MapReduce and implementing the algorithms in Spark (see the first reference at the end of this article). The developers working on the Spark MLlib are implementing more and more machine algorithms in a scalable and concise manner in the Spark framework. For the latest information on this, you can refer to the Spark site at https://spark.apache.org/docs/latest/mllib-guide.html, which is the authoritative source. This article covers the following machine learning algorithms: Basic statistics Linear regression Classification Clustering Recommendations The Spark machine learning algorithm table The Spark machine learning algorithms implemented in Spark 1.1.0 org.apache.spark.mllib for Scala and Java, and in pyspark.mllib for Python is shown in the following table: Algorithm Feature Notes Basic statistics Summary statistics Mean, variance, count, max, min, and numNonZeros   Correlations Spearman and Pearson correlation   Stratified sampling sampleBykey, sampleByKeyExact—With and without replacement   Hypothesis testing Pearson's chi-squared goodness of fit test   Random data generation RandomRDDs Normal, Poisson, and so on Regression Linear models Linear regression—least square, Lasso, and ridge regression Classification Binary classification Logistic regression, SVM, decision trees, and naïve Bayes   Multi-class classification Decision trees, naïve Bayes, and so on Recommendation Collaborative filtering Alternating least squares Clustering k-means   Dimensionality reduction SVD PCA   Feature extraction TF-IDF Word2Vec StandardScaler Normalizer   Optimization SGD L-BFGS   Spark MLlib examples Now, let's look at how to use the algorithms. Naturally, we need interesting datasets to implement the algorithms; we will use appropriate datasets for the algorithms shown in the next section. The code and data files are available in the GitHub repository at https://github.com/xsankar/fdps-vii. We'll keep it updated with corrections. Basic statistics Let's read the car mileage data into an RDD and then compute some basic statistics. We will use a simple parse class to parse a line of data. This will work if you know the type and the structure of your CSV file. We will use this technique for the examples in this article: import org.apache.spark.SparkContext import org.apache.spark.mllib.stat. {MultivariateStatisticalSummary, Statistics} import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.rdd.RDD   object MLlib01 { // def getCurrentDirectory = new java.io.File( "." 
).getCanonicalPath // def parseCarData(inpLine : String) : Array[Double] = {    val values = inpLine.split(',')    val mpg = values(0).toDouble    val displacement = values(1).toDouble    val hp = values(2).toInt    val torque = values(3).toInt    val CRatio = values(4).toDouble    val RARatio = values(5).toDouble    val CarbBarrells = values(6).toInt    val NoOfSpeed = values(7).toInt    val length = values(8).toDouble    val width = values(9).toDouble    val weight = values(10).toDouble    val automatic = values(11).toInt    return Array(mpg,displacement,hp,    torque,CRatio,RARatio,CarbBarrells,    NoOfSpeed,length,width,weight,automatic) } // def main(args: Array[String]) {    println(getCurrentDirectory)    val sc = new SparkContext("local","Chapter 9")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/fdps-vii/data/car-     milage-no-hdr.csv")    val carRDD = dataFile.map(line => parseCarData(line))    //    // Let us find summary statistics    //    val vectors: RDD[Vector] = carRDD.map(v => Vectors.dense(v))    val summary = Statistics.colStats(vectors)    carRDD.foreach(ln=> {ln.foreach(no => print("%6.2f | "     .format(no))); println()})    print("Max :");summary.max.toArray.foreach(m => print("%5.1f |     ".format(m)));println    print("Min :");summary.min.toArray.foreach(m => print("%5.1f |     ".format(m)));println    print("Mean :");summary.mean.toArray.foreach(m => print("%5.1f     | ".format(m)));println    } } This program will produce the following output: Let's also run some correlations, as shown here: // // correlations // val hp = vectors.map(x => x(2)) val weight = vectors.map(x => x(10)) var corP = Statistics.corr(hp,weight,"pearson") // default println("hp to weight : Pearson Correlation = %2.4f".format(corP)) var corS = Statistics.corr(hp,weight,"spearman") // Need to   specify println("hp to weight : Spearman Correlation = %2.4f" .format(corS)) // val raRatio = vectors.map(x => x(5)) val width = vectors.map(x => x(9)) corP = Statistics.corr(raRatio,width,"pearson") // default println("raRatio to width : Pearson Correlation = %2.4f" .format(corP)) corS = Statistics.corr(raRatio,width,"spearman") // Need to   specify println("raRatio to width : Spearman Correlation = %2.4f" .format(corS)) // This will produce interesting results as shown in the next screenshot: While this might seem too much work to calculate the correlation of a tiny dataset, remember that this will scale to datasets consisting of 1,000,000 rows or even a billion rows! Linear regression Linear regression takes a little more work than statistics. We need the LabeledPoint class as well as a few more parameters such as the learning rate, that is, the step size. 
We will also split the dataset into training and test, as shown here:    //    // def carDataToLP(inpArray : Array[Double]) : LabeledPoint = {    return new LabeledPoint( inpArray(0),Vectors.dense (       inpArray(1), inpArray(2), inpArray(3),       inpArray(4), inpArray(5), inpArray(6), inpArray(7),       inpArray(8), inpArray(9), inpArray(10), inpArray(11) ) )    } // Linear Regression    //    val carRDDLP = carRDD.map(x => carDataToLP(x)) // create a     labeled point RDD    println(carRDDLP.count())    println(carRDDLP.first().label)    println(carRDDLP.first().features)    //    // Let us split the data set into training & test set using a     very simple filter    //    val carRDDLPTrain = carRDDLP.filter( x => x.features(9) <=     4000)    val carRDDLPTest = carRDDLP.filter( x => x.features(9) > 4000)    println("Training Set : " + "%3d".format     (carRDDLPTrain.count()))    println("Training Set : " + "%3d".format(carRDDLPTest.count()))    //    // Train a Linear Regression Model    // numIterations = 100, stepsize = 0.000000001    // without such a small step size the algorithm will diverge    //    val mdlLR = LinearRegressionWithSGD.train     (carRDDLPTrain,100,0.000000001)    println(mdlLR.intercept) // Intercept is turned off when using     LinearRegressionSGD object, so intercept will always be 0 for     this code      println(mdlLR.weights)    //    // Now let us use the model to predict our test set    //    val valuesAndPreds = carRDDLPTest.map(p => (p.label,     mdlLR.predict(p.features)))    val mse = valuesAndPreds.map( vp => math.pow( (vp._1 - vp._2),2     ) ).        reduce(_+_) / valuesAndPreds.count()    println("Mean Squared Error     = " + "%6.3f".format(mse))    println("Root Mean Squared Error = " + "%6.3f"     .format(math.sqrt(mse)))    // Let us print what the model predicted    valuesAndPreds.take(20).foreach(m => println("%5.1f | %5.1f |"     .format(m._1,m._2))) The run result will be as expected, as shown in the next screenshot: The prediction is not that impressive. There are a couple of reasons for this. There might be quadratic effects; some of the variables might be correlated (for example, length, width, and weight, and so we might not need all three to predict the mpg value). Finally, we might not need all the 10 features anyways. I leave it to you to try with different combinations of features. (In the parseCarData function, take only a subset of the variables; for example take hp, weight, and number of speed and see which combination minimizes the mse value.) Classification Classification is very similar to linear regression. The algorithms take labeled points, and the train process has various parameters to tweak the algorithm to fit the needs of an application. The returned model can be used to predict the class of a labeled point. Here is a quick example using the titanic dataset: For our example, we will keep the same structure as the linear regression example. First, we will parse the full dataset line and then later keep it simple by creating a labeled point with a set of selected features, as shown in the following code: import org.apache.spark.SparkContext import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.tree.DecisionTree   object Chapter0802 { // def getCurrentDirectory = new java.io.File( "."     
).getCanonicalPath // // 0 pclass,1 survived,2 l.name,3.f.name, 4 sex,5 age,6 sibsp,7       parch,8 ticket,9 fare,10 cabin, // 11 embarked,12 boat,13 body,14 home.dest // def str2Double(x: String) : Double = {    try {      x.toDouble    } catch {      case e: Exception => 0.0    } } // def parsePassengerDataToLP(inpLine : String) : LabeledPoint = {    val values = inpLine.split(',')    //println(values)    //println(values.length)    //    val pclass = str2Double(values(0))    val survived = str2Double(values(1))    // skip last name, first name    var sex = 0    if (values(4) == "male") {      sex = 1    }    var age = 0.0 // a better choice would be the average of all       ages    age = str2Double(values(5))    //    var sibsp = 0.0    age = str2Double(values(6))    //    var parch = 0.0    age = str2Double(values(7))    //    var fare = 0.0    fare = str2Double(values(9))    return new LabeledPoint(survived,Vectors.dense     (pclass,sex,age,sibsp,parch,fare)) } Now that we have setup the routines to parse the data, let's dive into the main program: // def main(args: Array[String]): Unit = {    println(getCurrentDirectory)    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/bdtc-2014     /titanic/titanic3_01.csv")    val titanicRDDLP = dataFile.map(_.trim).filter( _.length > 1).      map(line => parsePassengerDataToLP(line))    //    println(titanicRDDLP.count())    //titanicRDDLP.foreach(println)    //    println(titanicRDDLP.first().label)    println(titanicRDDLP.first().features)    //    val categoricalFeaturesInfo = Map[Int, Int]()    val mdlTree = DecisionTree.trainClassifier(titanicRDDLP, 2, //       numClasses        categoricalFeaturesInfo, // all features are continuous        "gini", // impurity        5, // Maxdepth        32) //maxBins    //    println(mdlTree.depth)    println(mdlTree) The tree is interesting to inspect. Check it out here:    //    // Let us predict on the dataset and see how well it works.    // In the real world, we should split the data to train & test       and then predict the test data:    //    val predictions = mdlTree.predict(titanicRDDLP.     map(x=>x.features))    val labelsAndPreds = titanicRDDLP.     map(x=>x.label).zip(predictions)    //    val mse = labelsAndPreds.map( vp => math.pow( (vp._1 -       vp._2),2 ) ).        reduce(_+_) / labelsAndPreds.count()    println("Mean Squared Error = " + "%6f".format(mse))    //    // labelsAndPreds.foreach(println)    //    val correctVals = labelsAndPreds.aggregate(0.0)((x, rec) => x       + (rec._1 == rec._2).compare(false), _ + _)    val accuracy = correctVals/labelsAndPreds.count()    println("Accuracy = " + "%3.2f%%".format(accuracy*100))    //    println("*** Done ***") } } The result obtained when you run the program is as expected. The printout of the tree is interesting, as shown here: Running Spark Version 1.1.1 14/11/28 18:41:27 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=2061647216 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: count at Chapter0802.scala:56, took 0.260993 s 1309 14/11/28 18:41:27 INFO SparkContext: Starting job: first at Chapter0802.scala:59 [..] 14/11/28 18:41:27 INFO SparkContext: Job finished: first at Chapter0802.scala:59, took 0.016479 s 1.0 14/11/28 18:41:27 INFO SparkContext: Starting job: first at Chapter0802.scala:60 [..] 
14/11/28 18:41:27 INFO SparkContext: Job finished: first at Chapter0802.scala:60, took 0.014408 s [1.0,0.0,0.0,0.0,0.0,211.3375] 14/11/28 18:41:27 INFO SparkContext: Starting job: take at DecisionTreeMetadata.scala:66 [..] 14/11/28 18:41:28 INFO DecisionTree: Internal timing for DecisionTree: 14/11/28 18:41:28 INFO DecisionTree:   init: 0.36408 total: 0.95518 extractNodeInfo: 7.3E-4 findSplitsBins: 0.249814 extractInfoForLowerLevels: 7.74E-4 findBestSplits: 0.565394 chooseSplits: 0.201012 aggregation: 0.362411 5 DecisionTreeModel classifier If (feature 1 <= 0.0)    If (feature 0 <= 2.0)    If (feature 5 <= 26.0)      If (feature 2 <= 1.0)      If (feature 0 <= 1.0)        Predict: 1.0      Else (feature 0 > 1.0)        Predict: 1.0      Else (feature 2 > 1.0)      Predict: 1.0    Else (feature 5 > 26.0)      If (feature 2 <= 1.0)      If (feature 5 <= 38.0021)        Predict: 1.0      Else (feature 5 > 38.0021)        Predict: 1.0      Else (feature 2 > 1.0)      If (feature 5 <= 79.42500000000001)        Predict: 1.0      Else (feature 5 > 79.42500000000001)        Predict: 1.0    Else (feature 0 > 2.0)    If (feature 5 <= 25.4667)      If (feature 5 <= 7.2292)      If (feature 5 <= 7.05)        Predict: 1.0      Else (feature 5 > 7.05)        Predict: 1.0      Else (feature 5 > 7.2292)      If (feature 5 <= 15.5646)        Predict: 0.0      Else (feature 5 > 15.5646)        Predict: 1.0    Else (feature 5 > 25.4667)      If (feature 5 <= 38.0021)      If (feature 5 <= 30.6958)        Predict: 0.0      Else (feature 5 > 30.6958)        Predict: 0.0      Else (feature 5 > 38.0021)      Predict: 0.0 Else (feature 1 > 0.0)    If (feature 0 <= 1.0)    If (feature 5 <= 26.0)      If (feature 5 <= 7.05)      If (feature 5 <= 0.0)        Predict: 0.0      Else (feature 5 > 0.0)        Predict: 0.0      Else (feature 5 > 7.05)      Predict: 0.0    Else (feature 5 > 26.0)      If (feature 5 <= 30.6958)      If (feature 2 <= 0.0)        Predict: 0.0      Else (feature 2 > 0.0)        Predict: 0.0      Else (feature 5 > 30.6958)      If (feature 2 <= 1.0)        Predict: 0.0      Else (feature 2 > 1.0)        Predict: 1.0    Else (feature 0 > 1.0)    If (feature 2 <= 0.0)      If (feature 5 <= 38.0021)      If (feature 5 <= 14.4583)        Predict: 0.0      Else (feature 5 > 14.4583)        Predict: 0.0      Else (feature 5 > 38.0021)      If (feature 0 <= 2.0)        Predict: 0.0      Else (feature 0 > 2.0)        Predict: 1.0    Else (feature 2 > 0.0)      If (feature 5 <= 26.0)      If (feature 2 <= 1.0)        Predict: 0.0      Else (feature 2 > 1.0)        Predict: 0.0      Else (feature 5 > 26.0)      If (feature 0 <= 2.0)        Predict: 0.0      Else (feature 0 > 2.0)        Predict: 0.0   14/11/28 18:41:28 INFO SparkContext: Starting job: reduce at Chapter0802.scala:79 [..] 14/11/28 18:41:28 INFO SparkContext: Job finished: count at Chapter0802.scala:79, took 0.077973 s Mean Squared Error = 0.200153 14/11/28 18:41:28 INFO SparkContext: Starting job: aggregate at Chapter0802.scala:84 [..] 14/11/28 18:41:28 INFO SparkContext: Job finished: count at Chapter0802.scala:85, took 0.042592 s Accuracy = 79.98% *** Done *** In the real world, one would create a training and a test dataset and train the model on the training dataset and then predict on the test dataset. Then we can calculate the mse and minimize it on various feature combinations, some of which could also be engineered features. Clustering Spark MLlib has implemented the k-means clustering algorithm. 
The model training and prediction interfaces are similar to other machine learning algorithms. Let's see how it works by going through an example. Let's use a sample data that has two dimensions x and y. The plot of the points would look like the following screenshot: From the preceding graph, we can see that four clusters form one solution. Let's try with k=2 and k=4. Let's see how the Spark clustering algorithm handles this dataset and the groupings: import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.{Vector,Vectors} import org.apache.spark.mllib.clustering.KMeans   object Chapter0803 { def parsePoints(inpLine : String) : Vector = {    val values = inpLine.split(',')    val x = values(0).toInt    val y = values(1).toInt    return Vectors.dense(x,y) } //   def main(args: Array[String]): Unit = {    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val dataFile = sc.textFile("/Users/ksankar/bdtc-2014/cluster-     points/cluster-points.csv")    val points = dataFile.map(_.trim).filter( _.length > 1).     map(line => parsePoints(line))    //  println(points.count())    //    var numClusters = 2    val numIterations = 20    var mdlKMeans = KMeans.train(points, numClusters,       numIterations)    //    println(mdlKMeans.clusterCenters)    //    var clusterPred = points.map(x=>mdlKMeans.predict(x))    var clusterMap = points.zip(clusterPred)    //    clusterMap.foreach(println)    //    clusterMap.saveAsTextFile("/Users/ksankar/bdtc-2014/cluster-     points/2-cluster.csv")    //    // Now let us try 4 centers:    //    numClusters = 4    mdlKMeans = KMeans.train(points, numClusters, numIterations)    clusterPred = points.map(x=>mdlKMeans.predict(x))    clusterMap = points.zip(clusterPred)    clusterMap.saveAsTextFile("/Users/ksankar/bdtc-2014/cluster-     points/4-cluster.csv")    clusterMap.foreach(println) } } The results of the run would be as shown in the next screenshot (your run could give slightly different results): The k=2 graph shown in the next screenshot looks as expected: With k=4 the results are as shown in the following screenshot: The plot shown in the following screenshot confirms that the clusters are obtained as expected. Spark does understand clustering! Bear in mind that the results could vary a little between runs because the clustering algorithm picks the centers randomly and grows from there. With k=4, the results are stable; but with k=2, there is room for partitioning the points in different ways. Try it out a few times and see the results. Recommendation The recommendation algorithms fall under five general mechanisms, namely, knowledge-based, demographic-based, content-based, collaborative filtering (item-based or user-based), and latent factor-based. Usually, the collaborative filtering is computationally intensive—Spark implements the Alternating Least Square (ALS) algorithm authored by Yehuda Koren, available at http://dl.acm.org/citation.cfm?id=1608614. It is user-based collaborative filtering using the method of learning latent factors, which can scale to a large dataset. Let's quickly use the movielens medium dataset to implement a recommendation using Spark. There are some interesting RDD transformations. 
Apart from that, the code is not that complex, as shown next: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ // for implicit   conversations import org.apache.spark.mllib.recommendation.Rating import org.apache.spark.mllib.recommendation.ALS   object Chapter0804 { def parseRating1(line : String) : (Int,Int,Double,Int) = {    //println(x)    val x = line.split("::")    val userId = x(0).toInt    val movieId = x(1).toInt    val rating = x(2).toDouble    val timeStamp = x(3).toInt/10    return (userId,movieId,rating,timeStamp) } // def parseRating(x : (Int,Int,Double,Int)) : Rating = {    val userId = x._1    val movieId = x._2    val rating = x._3    val timeStamp = x._4 // ignore    return new Rating(userId,movieId,rating) } // Now that we have the parsers in place, let's focus on the main program, as shown next: def main(args: Array[String]): Unit = {    val sc = new SparkContext("local","Chapter 8")    println(s"Running Spark Version ${sc.version}")    //    val moviesFile = sc.textFile("/Users/ksankar/bdtc-     2014/movielens/medium/movies.dat")    val moviesRDD = moviesFile.map(line => line.split("::"))    println(moviesRDD.count())    //    val ratingsFile = sc.textFile("/Users/ksankar/bdtc-     2014/movielens/medium/ratings.dat")    val ratingsRDD = ratingsFile.map(line => parseRating1(line))    println(ratingsRDD.count())    //    ratingsRDD.take(5).foreach(println) // always check the RDD    //    val numRatings = ratingsRDD.count()    val numUsers = ratingsRDD.map(r => r._1).distinct().count()    val numMovies = ratingsRDD.map(r => r._2).distinct().count()    println("Got %d ratings from %d users on %d movies.".          format(numRatings, numUsers, numMovies)) Split the dataset into training, validation, and test. We can use any random dataset. But here we will use the last digit of the timestamp: val trainSet = ratingsRDD.filter(x => (x._4 % 10) < 6) .map(x=>parseRating(x))    val validationSet = ratingsRDD.filter(x => (x._4 % 10) >= 6 &       (x._4 % 10) < 8).map(x=>parseRating(x))    val testSet = ratingsRDD.filter(x => (x._4 % 10) >= 8)     .map(x=>parseRating(x))    println("Training: "+ "%d".format(trainSet.count()) +      ", validation: " + "%d".format(validationSet.count()) + ",         test: " + "%d".format(testSet.count()) + ".")    //    // Now train the model using the training set:    val rank = 10    val numIterations = 20    val mdlALS = ALS.train(trainSet,rank,numIterations)    //    // prepare validation set for prediction    //    val userMovie = validationSet.map {      case Rating(user, movie, rate) =>(user, movie)    }    //    // Predict and convert to Key-Value PairRDD    val predictions = mdlALS.predict(userMovie).map {      case Rating(user, movie, rate) => ((user, movie), rate)    }    //    println(predictions.count())    predictions.take(5).foreach(println)    //    // Now convert the validation set to PairRDD:    //    val validationPairRDD = validationSet.map(r => ((r.user,       r.product), r.rating))    println(validationPairRDD.count())    validationPairRDD.take(5).foreach(println)    println(validationPairRDD.getClass())    println(predictions.getClass())    //    // Now join the validation set with predictions.    // Then we can figure out how good our recommendations are.    
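   //
   // (The join below is keyed on (user, movie); each joined element has the shape
   // ((user, movie), (actual rating, predicted rating)), which is what the MSE
   // calculation further down unpacks as r._2._1 and r._2._2.)
   //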
// Tip:    //   Need to import org.apache.spark.SparkContext._    //   Then MappedRDD would be converted implicitly to PairRDD    //    val ratingsAndPreds = validationPairRDD.join(predictions)    println(ratingsAndPreds.count())    ratingsAndPreds.take(3).foreach(println)    //    val mse = ratingsAndPreds.map(r => {      math.pow((r._2._1 - r._2._2),2)    }).reduce(_+_) / ratingsAndPreds.count()    val rmse = math.sqrt(mse)    println("MSE = %2.5f".format(mse) + " RMSE = %2.5f"     .format(rmse))    println("** Done **") } } The run results, as shown in the next screenshot, are obtained as expected: Check the following screenshot as well: Some more information is available at: The Goodby MapReduce article from Mahout News (https://mahout.apache.org/) https://spark.apache.org/docs/latest/mllib-guide.html A Collaborative Filtering ALS paper (http://dl.acm.org/citation.cfm?id=1608614) A good presentation on decision trees (http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf) A recommended hands-on exercise from Spark Summit 2014 (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html) Summary In this article, we looked at the most common machine learning algorithms. Naturally, ML is a vast subject and requires lot more study, experimentation, and practical experience on interesting data science problems. Two books that are relevant to Spark Machine Learning are Packt's own books Machine Learning with Spark, Nick Pentreath, and O'Reilly's Advanced Analytics with Spark, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. Both are excellent books that you can refer to. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [article] The Spark programming model [article] Using the Spark Shell [article]

Installing PostgreSQL

In this article by Hans-Jürgen Schönig, author of the book Troubleshooting PostgreSQL, we will cover what can go wrong during the installation process and what can be done to avoid those things from happening. At the end of the article, you should be able to avoid all of the pitfalls, traps, and dangers you might face during the setup process. (For more resources related to this topic, see here.) For this article, I have compiled some of the core problems that I have seen over the years, as follows: Deciding on a version during installation Memory and kernel issues Preventing problems by adding checksums to your database instance Wrong encodings and subsequent import errors Polluted template databases Killing the postmaster badly At the end of the article, you should be able to install PostgreSQL and protect yourself against the most common issues popping up immediately after installation. Deciding on a version number The first thing to work on when installing PostgreSQL is to decide on the version number. In general, a PostgreSQL version number consists of three digits. Here are some examples: 9.4.0, 9.4.1, or 9.4.2 9.3.4, 9.3.5, or 9.3.6 The last digit is the so-called minor release. When a new minor release is issued, it generally means that some bugs have been fixed (for example, some time zone changes, crashes, and so on). There will never be new features, missing functions, or changes of that sort in a minor release. The same applies to something truly important—the storage format. It won't change with a new minor release. These little facts have a wide range of consequences. As the binary format and the functionality are unchanged, you can simply upgrade your binaries, restart PostgreSQL, and enjoy your improved minor release. When the digit in the middle changes, things get a bit more complex. A changing middle digit is called a major release. It usually happens around once a year and provides you with significant new functionality. If this happens, we cannot just stop or start the database anymore to replace the binaries. If the first digit changes, something really important has happened. Examples of such important events were introductions of SQL (6.0), the Windows port (8.0), streaming replication (9.0), and so on. Technically, there is no difference between the first and the second digit—they mean the same thing to the end user. However, a migration process is needed. The question that now arises is this: if you have a choice, which version of PostgreSQL should you use? Well, in general, it is a good idea to take the latest stable release. In PostgreSQL, every version number following the design patterns I just outlined is a stable release. As of PostgreSQL 9.4, the PostgreSQL community provides fixes for versions as old as PostgreSQL 9.0. So, if you are running an older version of PostgreSQL, you can still enjoy bug fixes and so on. Methods of installing PostgreSQL Before digging into troubleshooting itself, the installation process will be outlined. The following choices are available: Installing binary packages Installing from source Installing from source is not too hard to do. However, this article will focus on installing binary packages only. Nowadays, most people (not including me) like to install PostgreSQL from binary packages because it is easier and faster. Basically, two types of binary packages are common these days: RPM (Red Hat-based) and DEB (Debian-based). Installing RPM packages Most Linux distributions include PostgreSQL. 
However, the shipped PostgreSQL version is somewhat ancient in many cases. Recently, I saw a Linux distribution that still featured PostgreSQL 8.4, a version already abandoned by the PostgreSQL community. Distributors tend to ship older versions to ensure that new bugs are not introduced into their systems. For high-performance production servers, outdated versions might not be the best idea, however. Clearly, for many people, it is not feasible to run long-outdated versions of PostgreSQL. Therefore, it makes sense to make use of repositories provided by the community. The Yum repository shows which distributions we can use RPMs for, at http://yum.postgresql.org/repopackages.php. Once you have found your distribution, the first thing is to install this repository information for Fedora 20 as it is shown in the next listing: yum install http://yum.postgresql.org/9.4/fedora/fedora-20-x86_64/pgdg-fedora94-9.4-1.noarch.rpm Once the repository has been added, we can install PostgreSQL: yum install postgresql94-server postgresql94-contrib /usr/pgsql-9.4/bin/postgresql94-setup initdb systemctl enable postgresql-9.4.service systemctl start postgresql-9.4.service First of all, PostgreSQL 9.4 is installed. Then a so-called database instance is created (initdb). Next, the service is enabled to make sure that it is always there after a reboot, and finally, the postgresql-9.4 service is started. The term database instance is an important concept. It basically describes an entire PostgreSQL environment (setup). A database instance is fired up when PostgreSQL is started. Databases are part of a database instance. Installing Debian packages Installing Debian packages is also not too hard. By the way, the process on Ubuntu as well as on some other similar distributions is the same as that on Debian, so you can directly use the knowledge gained from this article for other distributions. A simple file called /etc/apt/sources.list.d/pgdg.list can be created, and a line for the PostgreSQL repository (all the following steps can be done as root user or using sudo) can be added: deb http://apt.postgresql.org/pub/repos/apt/ YOUR_DEBIAN_VERSION_HERE-pgdg main So, in the case of Debian Wheezy, the following line would be useful: deb http://apt.postgresql.org/pub/repos/apt/ wheezy-pgdg main Once we have added the repository, we can import the signing key: $# wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - OK Voilà! Things are mostly done. In the next step, the repository information can be updated: apt-get update Once this has been done successfully, it is time to install PostgreSQL: apt-get install "postgresql-9.4" If no error is issued by the operating system, it means you have successfully installed PostgreSQL. The beauty here is that PostgreSQL will fire up automatically after a restart. A simple database instance has also been created for you. If everything has worked as expected, you can give it a try and log in to the database: root@chantal:~# su - postgres $ psql postgres psql (9.4.1) Type "help" for help. postgres=# Memory and kernel issues After this brief introduction to installing PostgreSQL, it is time to focus on some of the most common problems. Fixing memory issues Some of the most important issues are related to the kernel and memory. Up to version 9.2, PostgreSQL was using the classical system V shared memory to cache data, store locks, and so on. Since PostgreSQL 9.3, things have changed, solving most issues people had been facing during installation. 
However, in PostgreSQL 9.2 or before, you might have faced the following error message: FATAL: Could not create shared memory segment DETAIL: Failed system call was shmget (key=5432001, size=1122263040, 03600) HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 1122263040 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections. If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for. The PostgreSQL documentation contains more information about shared memory configuration. If you are facing a message like this, it means that the kernel does not provide you with enough shared memory to satisfy your needs. Where does this need for shared memory come from? Back in the old days, PostgreSQL stored a lot of stuff, such as the I/O cache (shared_buffers, locks, autovacuum-related information and a lot more), in the shared memory. Traditionally, most Linux distributions have had a tight grip on the memory, and they don't issue large shared memory segments; for example, Red Hat has long limited the maximum amount of shared memory available to applications to 32 MB. For most applications, this is not enough to run PostgreSQL in a useful way—especially not if performance does matter (and it usually does). To fix this problem, you have to adjust kernel parameters. Managing Kernel Resources of the PostgreSQL Administrator's Guide will tell you exactly why we have to adjust kernel parameters. For more information, check out the PostgreSQL documentation at http://www.postgresql.org/docs/9.4/static/kernel-resources.htm. This article describes all the kernel parameters that are relevant to PostgreSQL. Note that every operating system needs slightly different values here (for open files, semaphores, and so on). Adjusting kernel parameters for Linux In this article, parameters relevant to Linux will be covered. If shmget (previously mentioned) fails, two parameters must be changed: $ sysctl -w kernel.shmmax=17179869184 $ sysctl -w kernel.shmall=4194304 In this example, shmmax and shmall have been adjusted to 16 GB. Note that shmmax is in bytes while shmall is in 4k blocks. The kernel will now provide you with a great deal of shared memory. Also, there is more; to handle concurrency, PostgreSQL needs something called semaphores. These semaphores are also provided by the operating system. The following kernel variables are available: SEMMNI: This is the maximum number of semaphore identifiers. It should be at least ceil((max_connections + autovacuum_max_workers + 4) / 16). SEMMNS: This is the maximum number of system-wide semaphores. It should be at least ceil((max_connections + autovacuum_max_workers + 4) / 16) * 17, and it should have room for other applications in addition to this. SEMMSL: This is the maximum number of semaphores per set. It should be at least 17. SEMMAP: This is the number of entries in the semaphore map. SEMVMX: This is the maximum value of the semaphore. It should be at least 1000. Don't change these variables unless you really have to. Changes can be made with sysctl, as was shown for the shared memory. 
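Keep in mind that values set with sysctl -w do not survive a reboot. To make them permanent, the usual approach is to add them to /etc/sysctl.conf (or a file under /etc/sysctl.d/, depending on the distribution) and reload the file. The following sketch simply reuses the example values from above; they are not a recommendation for your system:
# /etc/sysctl.conf (example values from the preceding section)
kernel.shmmax = 17179869184
kernel.shmall = 4194304

# apply the settings without a reboot
sysctl -p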
Adjusting kernel parameters for Mac OS X If you happen to run Mac OS X and plan to run a large system, there are also some kernel parameters that need changes. Again, /etc/sysctl.conf has to be changed. Here is an example: kern.sysv.shmmax=4194304 kern.sysv.shmmin=1 kern.sysv.shmmni=32 kern.sysv.shmseg=8 kern.sysv.shmall=1024 Mac OS X is somewhat nasty to configure. The reason is that you have to set all five parameters to make this work. Otherwise, your changes will be silently ignored, and this can be really painful. In addition to that, it has to be assured that SHMMAX is an exact multiple of 4096. If it is not, trouble is near. If you want to change these parameters on the fly, recent versions of OS X provide a systcl command just like Linux. Here is how it works: sysctl -w kern.sysv.shmmax sysctl -w kern.sysv.shmmin sysctl -w kern.sysv.shmmni sysctl -w kern.sysv.shmseg sysctl -w kern.sysv.shmall Fixing other kernel-related limitations If you are planning to run a large-scale system, it can also be beneficial to raise the maximum number of open files allowed. To do that, /etc/security/limits.conf can be adapted, as shown in the next example: postgres   hard   nofile   1024 postgres   soft   nofile   1024 This example says that the postgres user can have up to 1,024 open files per session. Note that this is only important for large systems; open files won't hurt an average setup. Adding checksums to a database instance When PostgreSQL is installed, a so-called database instance is created. This step is performed by a program called initdb, which is a part of every PostgreSQL setup. Most binary packages will do this for you and you don't have to do this by hand. Why should you care then? If you happen to run a highly critical system, it could be worthwhile to add checksums to the database instance. What is the purpose of checksums? In many cases, it is assumed that crashes happen instantly—something blows up and a system fails. This is not always the case. In many scenarios, the problem starts silently. RAM may start to break, or the filesystem may start to develop slight corruption. When the problem surfaces, it may be too late. Checksums have been implemented to fight this very problem. Whenever a piece of data is written or read, the checksum is checked. If this is done, a problem can be detected as it develops. How can those checksums be enabled? All you have to do is to add -k to initdb (just change your init scripts to enable this during instance creation). Don't worry! The performance penalty of this feature can hardly be measured, so it is safe and fast to enable its functionality. Keep in mind that this feature can really help prevent problems at fairly low costs (especially when your I/O system is lousy). Preventing encoding-related issues Encoding-related problems are some of the most frequent problems that occur when people start with a fresh PostgreSQL setup. In PostgreSQL, every database in your instance has a specific encoding. One database might be en_US@UTF-8, while some other database might have been created as de_AT@UTF-8 (which denotes German as it is used in Austria). To figure out which encodings your database system is using, try to run psql -l from your Unix shell. What you will get is a list of all databases in the instance that include those encodings. So where can we actually expect trouble? Once a database has been created, many people would want to load data into the system. Let's assume that you are loading data into the aUTF-8 database. 
However, the data you are loading contains some ASCII characters such as ä, ö, and so on. The ASCII code for ö is 148. Binary 148 is not a valid Unicode character. In Unicode, U+00F6 is needed. Boom! Your import will fail and PostgreSQL will error out. If you are planning to load data into a new database, ensure that the encoding or character set of the data is the same as that of the database. Otherwise, you may face ugly surprises. To create a database using the correct locale, check out the syntax of CREATE DATABASE: test=# h CREATE DATABASE Command:     CREATE DATABASE Description: create a new database Syntax: CREATE DATABASE name    [ [ WITH ] [ OWNER [=] user_name ]            [ TEMPLATE [=] template ]            [ ENCODING [=] encoding ]            [ LC_COLLATE [=] lc_collate ]            [ LC_CTYPE [=] lc_ctype ]           [ TABLESPACE [=] tablespace_name ]            [ CONNECTION LIMIT [=] connlimit ] ] ENCODING and the LC* settings are used here to define the proper encoding for your new database. Avoiding template pollution It is somewhat important to understand what happens during the creation of a new database in your system. The most important point is that CREATE DATABASE (unless told otherwise) clones the template1 database, which is available in all PostgreSQL setups. This cloning has some important implications. If you have loaded a very large amount of data into template1, all of that will be copied every time you create a new database. In many cases, this is not really desirable but happens by mistake. People new to PostgreSQL sometimes put data into template1 because they don't know where else to place new tables and so on. The consequences can be disastrous. However, you can also use this common pitfall to your advantage. You can place the functions you want in all your databases in template1 (maybe for monitoring or whatever benefits). Killing the postmaster After PostgreSQL has been installed and started, many people wonder how to stop it. The most simplistic way is, of course, to use your service postgresql stop or /etc/init.d/postgresql stop init scripts. However, some administrators tend to be a bit crueler and use kill -9 to terminate PostgreSQL. In general, this is not really beneficial because it will cause some nasty side effects. Why is this so? The PostgreSQL architecture works like this: when you start PostgreSQL you are starting a process called postmaster. Whenever a new connection comes in, this postmaster forks and creates a so-called backend process (BE). This process is in charge of handling exactly one connection. In a working system, you might see hundreds of processes serving hundreds of users. The important thing here is that all of those processes are synchronized through some common chunk of memory (traditionally, shared memory, and in the more recent versions, mapped memory), and all of them have access to this chunk. What might happen if a database connection or any other process in the PostgreSQL infrastructure is killed with kill -9? A process modifying this common chunk of memory might die while making a change. The process killed cannot defend itself against the onslaught, so who can guarantee that the shared memory is not corrupted due to the interruption? This is exactly when the postmaster steps in. It ensures that one of these backend processes has died unexpectedly. To prevent the potential corruption from spreading, it kills every other database connection, goes into recovery mode, and fixes the database instance. 
Then new database connections are allowed again. While this makes a lot of sense, it can be quite disturbing to those users who are connected to the database system. Therefore, it is highly recommended not to use kill -9. A normal kill will be fine. Keep in mind that a kill -9 cannot corrupt your database instance, which will always start up again. However, it is pretty nasty to kick everybody out of the system just because of one process! Summary In this article we have learned how to install PostgreSQL using binary packages. Some of the most common problems and pitfalls, including encoding-related issues, checksums, and versioning were discussed. Resources for Article: Further resources on this subject: Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article] PostgreSQL – New Features [article]

Factor variables in R

This article by Jaynal Abedin and Kishor Kumar Das, authors of the book Data Manipulation with R Second Edition, will discuss factor variables in R. In any data analysis task, the majority of the time is dedicated to data cleaning and preprocessing. Sometimes, it is considered that about 80 percent of the effort is devoted to data cleaning before conducting the actual analysis. Also, in real-world data, we often work with categorical variables. A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, it is known as a factor. Working with categorical variables in R is a bit technical, and in this article, we have tried to demystify this process of dealing with categorical variables. (For more resources related to this topic, see here.) During data analysis, the factor variable sometimes plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, it inherits all its levels from the original factor levels. This feature sometimes creates confusion in understanding the results. Numeric variables are convenient during statistical analysis, but sometimes, we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. The split-apply-combine strategy Data manipulation is an integral part of data cleaning and analysis. For large data, it is always preferable to perform the operation within a subgroup of a dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large-scale data, it requires considerable amount of coding and eventually takes a longer time to process. In the case of big data, we can split the dataset, perform the manipulation or analysis, and then again combine the results into a single output. This type of split using base R is not efficient, and to overcome this limitation, Wickham developed an R package, plyr, where he efficiently implemented the split-apply-combine strategy. Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break down a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were not previously connected. 
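As a quick illustration of the idea in base R (using the built-in mtcars dataset purely as an example), we can split the cars by the factor cyl, apply a mean to each group, and combine the results into a single object; the dplyr equivalent of the same group-wise summary is shown as well:
# split by number of cylinders, apply mean mpg per group, combine into a named vector
with(mtcars, tapply(mpg, factor(cyl), mean))

# the same group-wise summary with dplyr (returns a data frame)
library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))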
This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator. The plyr package works on every type of data structure, whereas the dplyr package is designed to work only on data frames. The dplyr package offers a complete set of functions to perform every kind of data manipulation we would need in the process of analysis. These functions take a data frame as the input and also produce a data frame as the output, hence the name dplyr. There are two different types of functions in the dplyr package: single-table and aggregate. The single-table function takes a data frame as the input and an action such as subsetting a data frame, generating new columns in the data frame, or rearranging a data frame. The aggregate function takes a column as the input and produces a single value as the output, which is mostly used to summarize columns. These functions do not allow us to perform any group-wise operation, but a combination of these functions with the group_by() function allows us to implement the split-apply-combine approach. Reshaping a dataset Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we need to implement some reorientation to perform certain types of analyses. A dataset's layout could be long or wide. In a long layout, multiple rows represent a single subject's record, whereas in a wide layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this article, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient. A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset, and a column represents the information for each characteristic for all subjects. This means that rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier, and others are measured characteristics. For the purpose of reshaping, we can group the variables into two groups: identifier variables and measured variables: The identifier variables: These help us identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terminology, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables. 
The measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mixture of both. Now, beyond this typical structure of a dataset, we can think differently, where we will have only identification variables and a value. The identification variable identifies a subject along with which the measured variable the value represents. In this new paradigm, each row represents one observation of one variable. In the new paradigm, this is termed as melting and it produces molten data. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. Text processing Text data is one of the most important areas in the field of data analytics. Nowadays, we are producing a huge amount of text data through various media every day; for example, Twitter posts, blog writing, and Facebook posts are all major sources of text data. Text data can be used to retrieve information, in sentiment analysis and even entity recognition. Summary This article briefly explained the factor variables, the split-apply-combine strategy, reshaping a dataset in R, and text processing. Resources for Article: Further resources on this subject: Introduction to S4 Classes [Article] Warming Up [Article] Driving Visual Analyses with Automobile Data (Python) [Article]

PostgreSQL – New Features

In this article, Jayadevan Maymala, author of the book, PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log. This tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of PostgreSQL 9.4 release. We will cover a couple of useful extensions. (For more resources related to this topic, see here.) Interesting data types We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include: The number data types (smallint, integer, bigint, decimal, numeric, real, and double) The character data types (varchar, char, and text) The binary data types The date/time data types (including date, timestamp without timezone, and timestamp with timezone) BOOLEAN data types However, this is all standard fare. Let's start off by looking at the RANGE data type. RANGE This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases. Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range. For example, the price range of a category of cars can start from $15,000 at the lower end and the price range at the upper end can start from $40,000. We can have meeting rooms booked for different time slots. Each room is booked during different time slots and is available accordingly. Then, there are use cases that involve shift timings for employees. Each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out time for employees. These are some use cases where we can consider range types. Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the booking the meeting rooms and shifting use cases, we can consider tsrange or tstzrange (if we want to capture time zone as well). It makes sense to explore the possibility of using range data types in most scenarios, which involve the following features: From and to timestamps/dates for room reservations Lower and upper limit for price/discount ranges Scheduling jobs Timesheets Let's now look at an example. We have three meeting rooms. The rooms can be booked and the entries for reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15? 
We will look at this with and without the range data type: CREATE TABLE rooms(id serial, descr varchar(50));   INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3));   CREATE TABLE room_book (id serial , room_id integer, from_time timestamp, to_time timestamp , res tsrange);   INSERT INTO room_book (room_id,from_time,to_time,res) values(1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)');   INSERT INTO room_book (room_id,from_time,to_time,res) values(3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)'); PostgreSQL has the OVERLAPS operator. This can be used to get all the reservations that overlap with the period for which we wanted to book a room: SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00'); If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL: SELECT id FROM rooms EXCEPT We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement: SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)'; Do look up GIST indexes to improve the performance of queries that use range operators. Another way of achieving the same is to use the following command: SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time; Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing (] or )) has a similar effect on the upper values. When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that the lower bound is 3 and upper bound is 6 when we mention 3,5, as shown here: SELECT int4range(3,5,'[)') lowerincl ,int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl , int4range(3,5,'[)') upperexcl; lowerincl | bothincl | bothexcl | upperexcl -----------+----------+----------+----------- [3,5)       | [3,6)       | [4,5)       | [3,5) Using network address types The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and Mac addresses. Let's look at a few use cases. When we have a website that is open to public, a number of users from different parts of the world access it. We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, to analyze website-usage patterns, and so on. 
The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on. While data types (such as VARCHAR or BIGINT) can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows: inet: This data type can be used to store an IPV4 or IPV6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask. cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example: CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr); CREATE TABLE INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32'); INSERT 0 1 INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24'); ERROR: invalid cidr value: "192.168.64.2/24" LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6...                                                              ^ DETAIL: Value has bits set to right of mask. INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24'); INSERT 0 1 SELECT * FROM nettb; id |     intclmn     |   cidrclmn     ----+-----------------+----------------- 1 | 192.168.64.2   | 192.168.64.2/32 2 | 192.168.64.2/24 | 192.168.64.0/24 Let's also look at a couple of useful operators available within network address types. Does an IP fall in a subnet? This can be figured out using <<=, as shown here: SELECT id,intclmn FROM nettb ; id |   intclmn   ----+-------------- 1 | 192.168.64.2 3 | 192.168.12.2 4 | 192.168.13.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24'; id |   intclmn   3 | 192.168.12.2 5 | 192.168.12.4   SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32'; id |   intclmn   3 | 192.168.12.2 The operator used in the preceding command checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, that is, greater than or equal to, bitwise AND, bitwise OR, and other standard operators. The macaddr data type can be used to store Mac addresses in different formats. hstore for key-value pairs A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (Cassandra) or the simple and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose. It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead. 
Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of his nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern. There is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same. Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes: CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore); INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" '); Now, let's try retrieving all customers with their SSN, as shown here: SELECT first_name, dynamic_attributes FROM customer        WHERE dynamic_attributes ? 'ssn'; first_name | dynamic_attributes Michael   | "ssn"=>"123-465-798" Smith     | "ssn"=>"129-465-798" James     | "ssn"=>"No data" Also, those with a specific SSN: SELECT first_name,dynamic_attributes FROM customer        WHERE dynamic_attributes -> 'ssn'= '123-465-798'; first_name | dynamic_attributes - Michael   | "ssn"=>"123-465-798" If we want to get records that do not contain a specific SSN, just use the following command: WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798' Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following command: first_name |                          dynamic_attributes         ------------+----------------------------------------------------- Ram       | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers" As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on. We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type; hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently. json/jsonb JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years. PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. 
Version 9.4 adds one more data type: jsonb.json, which is very similar to JSONB. The jsonb data type stores data in binary format. It also removes white spaces (which are insignificant) and avoids duplicate object keys. As a result of these differences, JSONB has an overhead when data goes in, while JSON has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one should we use depends on the use case. If the data will be stored as it is and retrieved without any operations, JSON should suffice. However, if we plan to use operators extensively and want indexing support, JSONB is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, JSON is the right choice. Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the JSON data type to hold various items and their attributes: CREATE TABLE items (    item_id serial,    details json ); Now, we will add records. All DVDs go into one record, books go into another, and so on: INSERT INTO items (details) VALUES ('{                  "DVDs" :[                         {"Name":"The Making of Thunderstorms", "Types":"Educational",                          "Age-group":"5-10","Produced By":"National Geographic"                          },                          {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror",                          "Certificate":"A", "Director":"Dracula","Actors":                                [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}]                          },                          {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense",                          "Certificate":"A", "Director": "Jonathan Lynn","Actors":                          [{"Name":"Joe "},{"Name":"Marissa"}] }] }' ); A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions: SELECT   details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype      FROM items; SELECT   details->'DVDs' dvds ,pg_typeof(details->'DVDs') datatype      FROM items; Note the difference between ->> and -> in the following screenshot. We are using the pg_typeof function to clearly see the data type returned by the functions. Both return the JSON object field. The first function returns text and the second function returns JSON: Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement: WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items) SELECT * FROM tmp , json_array_elements(tmp.dvds#>'{Actors}') as a WHERE    a->>'Name'='Meena'; We get the record as shown here: We used one more function and a couple of operators. The json_array_elements expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also created a temporary table, which ceases to exist as soon as the query is over, using the WITH clause. In the next part, we extracted the elements of the array actors from DVDs. 
Then, we checked whether the Name element is equal to Meena. XML PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of data exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process. We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster. Inserting and verifying XML data Now, let's take a look at what we can do with the xml data type in PostgreSQL: CREATE TABLE tbl_xml(id serial, docmnt xml); INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml'); INSERT INTO tbl_xml (docmnt)        SELECT query_to_xml( 'SELECT now()',true,false,'') ; SELECT xml_is_well_formed_document(docmnt::text), docmnt        FROM tbl_xml; Then, take a look at the following screenshot: First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML. Generating XML files for table definitions and data We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema. PostgreSQL also provides the xpath function to extract data as follows: SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml        WHERE id = 2;                xpath               ------------------------------------ {2014-07-29T16:55:00.781533+05:30} Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data. pgbadger Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on. Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. 
For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right. For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right as follows: log_destination = 'stderr' logging_collector = on log_directory = 'pg_log' log_filename = 'postgresql-%Y-%m-%d.log' log_min_duration_statement = 0 log_connections = on log_disconnections = on log_duration = on log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d ' log_lock_waits = on track_activity_query_size = 2048 It might be necessary to restart the cluster for some of these changes to take effect. We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at shell prompt: pgbench -i pgp This creates a few tables on the pgp database. We can log in to psql (database pgp) and check: \dt              List of relations Schema |       Name      | Type | Owner   --------+------------------+-------+---------- public | pgbench_accounts | table | postgres public | pgbench_branches | table | postgres public | pgbench_history | table | postgres    public | pgbench_tellers | table | postgres Now, we can run pgbench to generate load on the database with the following command: pgbench -c 5 -T10 pgp The T option passes the duration for which pgbench should continue execution in seconds, c passes the number of clients, and pgp is the database. At shell prompt, execute: wget https://github.com/dalibo/pgbadger/archive/master.zip Once the file is downloaded, unzip the file using the following command: unzip master.zip Use cd to the directory pgbadger-master as follows: cd pgbadger-master Execute the following command: ./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log –o myoutput.html Replace the log file name in the command with the actual name. It will generate a myoutput.html file. The HTML file generated will have a wealth of information about what happened in the cluster with great charts/tables. In fact, it takes quite a bit of time to go through the report. Here is a sample chart that provides the distribution of queries based on execution time: The following screenshot gives an idea about the number of performance metrics provided by the report: If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the top drop-down list is the right place to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries. If the objective is to understand how busy the cluster is, the Overview section and Sessions are the right places to explore. The logging configuration used may create huge log files in systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen. With this, we covered most of the interesting data types, an interesting extension and a must-use tool from PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL Version 9.4. Features over time Applying filters in Versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database. 
Interesting features in 9.4 Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy). Keeping the buffer ready As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples. In a data warehouse, the Extract, Transform, Load (ETL) process, which may happen once a day usually, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in a drop in the response time. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve user experience. In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously, ensuring that the fares and availability data for the most frequently searched routes in the buffer can provide a better user experience. This applies to an e-commerce site selling products also. If the product/price/inventory data is always available in memory, it can be retrieved very fast. You must use PostgreSQL 9.4 for trying out the code in the following sections. So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table. We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes. We will repeat the exercise, but we will use pg_prewarm before executing SELECT count(*) at psql: CREATE EXTENSION pg_prewarm; CREATE TABLE myt(id SERIAL, name VARCHAR(40)); INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name'); Now, stop the server using pg_ctl at the shell prompt: pg_ctl stop -m immediate Clean OS buffers using the following command at the shell prompt (will need to use sudo to do this): echo 1 > /proc/sys/vm/drop_caches The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command: SELECT COUNT(*) FROM myt; Time: 333.115 ms We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay. 
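With the buffers prewarmed, the same count now comes back in a few milliseconds instead of a few hundred: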
SELECT COUNT(*) FROM myt; count ------- 10000 (1 row) Time: 7.002 ms Better recoverability A new parameter called recovery_min_apply_delay has been added in 9.4. This will go to the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to approximately 5 minutes and then the standby will replay the transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave. Easy-to-change parameters An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql: \! more postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by ALTER SYSTEM command. work_mem = '12MB' We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated. Logical decoding and consumption of changes Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode, and possibly replay on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes. Two parameter changes are necessary to set this up. These are as follows: The max_replication_slots parameter (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows: SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding'); The first parameter is the name we give to our slot and the second parameter is the plugin to be used. Test_decoding is the sample plugin available, which converts WAL entries into text representations as follows: INSERT INTO myt(id) values (4); INSERT INTO myt(name) values ('abc'); Now, we will try retrieving the entries: SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL); Then, check the following screenshot: This function lets us take a look at the changes without consuming them so that the changes can be accessed again: SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL); This is shown in the following screenshot: This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed. Summary In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects. Resources for Article: Further resources on this subject: PostgreSQL as an Extensible RDBMS [article] Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article]

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

Packt
30 Mar 2015
33 min read
In this article by Chandramani Tiwary, author of the book, Learning Apache Mahout, we will discuss some core concepts of machine learning and discuss the steps of building a logistic regression classifier in Mahout. (For more resources related to this topic, see here.) The purpose of this article is to understand the core concepts of machine learning. We will focus on understanding the steps involved in, resolving different types of problems and application areas in machine learning. In particular we will cover the following topics: Supervised learning Unsupervised learning The recommender system Model efficacy A wide range of software applications today try to replace or augment human judgment. Artificial Intelligence is a branch of computer science that has long been trying to replicate human intelligence. A subset of AI, referred to as machine learning, tries to build intelligent systems by using the data. For example, a machine learning system can learn to classify different species of flowers or group-related news items together to form categories such as news, sports, politics, and so on, and for each of these tasks, the system will learn using data. For each of the tasks, the corresponding algorithm would look at the data and try to learn from it. Supervised learning Supervised learning deals with training algorithms with labeled data, inputs for which the outcome or target variables are known, and then predicting the outcome/target with the trained model for unseen future data. For example, historical e-mail data will have individual e-mails marked as ham or spam; this data is then used for training a model that can predict future e-mails as ham or spam. Supervised learning problems can be broadly divided into two major areas, classification and regression. Classification deals with predicting categorical variables or classes; for example, whether an e-mail is ham or spam or whether a customer is going to renew a subscription or not, for example a postpaid telecom subscription. This target variable is discrete, and has a predefined set of values. Regression deals with a target variable, which is continuous. For example, when we need to predict house prices, the target variable price is continuous and doesn't have a predefined set of values. In order to solve a given problem of supervised learning, one has to perform the following steps. Determine the objective The first major step is to define the objective of the problem. Identification of class labels, what is the acceptable prediction accuracy, how far in the future is prediction required, is insight more important or is accuracy of classification the driving factor, these are the typical objectives that need to be defined. For example, for a churn classification problem, we could define the objective as identifying customers who are most likely to churn within three months. In this case, the class label from the historical data would be whether a customer has churned or not, with insights into the reasons for the churn and a prediction of churn at least three months in advance. Decide the training data After the objective of the problem has been defined, the next step is to decide what training data should be used. The training data is directly guided by the objective of the problem to be solved. For example, in the case of an e-mail classification system, it would be historical e-mails, related metadata, and a label marking each e-mail as spam or ham. 
For the problem of churn analysis, different data points collected about a customer such as product usage, support case, and so on, and a target label for whether a customer has churned or is active, together form the training data. Churn Analytics is a major problem area for a lot of businesses domains such as BFSI, telecommunications, and SaaS. Churn is applicable in circumstances where there is a concept of term-bound subscription. For example, postpaid telecom customers subscribe for a monthly term and can choose to renew or cancel their subscription. A customer who cancels this subscription is called a churned customer. Create and clean the training set The next step in a machine learning project is to gather and clean the dataset. The sample dataset needs to be representative of the real-world data, though all available data should be used, if possible. For example, if we assume that 10 percent of e-mails are spam, then our sample should ideally start with 10 percent spam and 90 percent ham. Thus, a set of input rows and corresponding target labels are gathered from data sources such as warehouses, or logs, or operational database systems. If possible, it is advisable to use all the data available rather than sampling the data. Cleaning data for data quality purposes forms part of this process. For example, training data inclusion criteria should also be explored in this step. An example of this in the case of customer analytics is to decide the minimum age or type of customers to use in the training set, for example including customers aged at least six months. Feature extraction Determine and create the feature set from the training data. Features or predictor variables are representations of the training data that is used as input to a model. Feature extraction involves transforming and summarizing that data. The performance of the learned model depends strongly on its input feature set. This process is primarily called feature extraction and requires good understanding of data and is aided by domain expertise. For example, for churn analytics, we use demography information from the CRM, product adoption (phone usage in case of telecom), age of customer, and payment and subscription history as the features for the model. The number of features extracted should neither be too large nor too small; feature extraction is more art than science and, optimum feature representation can be achieved after some iterations. Typically, the dataset is constructed such that each row corresponds to one variable outcome. For example, in the churn problem, the training dataset would be constructed so that every row represents a customer. Train the models We need to try out different supervised learning algorithms. This step is called training the model and is an iterative process where you might try building different training samples and try out different combinations of features. For example, we may choose to use support vector machines or decision trees depending upon the objective of the study, the type of problem, and the available data. Machine learning algorithms can be bucketed into groups based on the ability of a user to interpret how the predictions were arrived at. If the model can be interpreted easily, then it is called a white box, for example decision tree and logistic regression, and if the model cannot be interpreted easily, they belong to the black box models, for example support vector machine (SVM). 
If the objective is to gain insight, a white box model such as decision tree or logistic regression can be used, and if robust prediction is the criteria, then algorithms such as neural networks or support vector machines can be used. While training a model, there are a few techniques that we should keep in mind, like bagging and boosting. Bagging Bootstrap aggregating, which is also known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting an example from the original with replacement. By with replacement we mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won't be present in the new set. Bagging helps in reducing the variance of a model and can be used to train different models using the same datasets. The final conclusion is arrived at after considering the output of each model. For example, let's assume our data is a, b, c, d, e, f, g, and h. By sampling our data five times, we can create five different samples as follows: Sample 1: a, b, c, c, e, f, g, h Sample 2: a, b, c, d, d, f, g, h Sample 3: a, b, c, c, e, f, h, h Sample 4: a, b, c, e, e, f, g, h Sample 5: a, b, b, e, e, f, g, h As we sample with replacement, we get the same examples more than once. Now we can train five different models using the five sample datasets. Now, for the prediction; as each model will provide the output, let's assume classes are yes and no, and the final outcome would be the class with maximum votes. If three models say yes and two no, then the final prediction would be class yes. Boosting Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained, but gives greater weight to examples that were misclassified by the previous classifier. Boosting focuses new classifiers in the sequence on previously misclassified data. Boosting also differs from bagging in its approach of calculating the final prediction. The output is calculated from a weighted sum of all classifiers, as opposed to the method of equal weights used in bagging. The weights assigned to the classifier output in boosting are based on the performance of the classifier in the previous iteration. Validation After collecting the training set and extracting the features, you need to train the model and validate it on unseen samples. There are many approaches for creating the unseen sample called the validation set. We will be discussing a couple of them shortly. Holdout-set validation One approach to creating the validation set is to divide the feature set into train and test samples. We use the train set to train the model and test set to validate it. The actual percentage split varies from case to case but commonly it is split at 70 percent train and 30 percent test. It is also not uncommon to create three sets, train, test and validation set. Train and test set is created from data out of all considered time periods but the validation set is created from the most recent data. K-fold cross validation Another approach is to divide the data into k equal size folds or parts and then use k-1 of them for training and one for testing. 
The process is repeated k times so that each set is used as a validation set once and the metrics are collected over all the runs. The general standard is to use k as 10, which is called 10-fold cross-validation. Evaluation The objective of evaluation is to test the generalization of a classifier. By generalization, we mean how good the model performs on future data. Ideally, evaluation should be done on an unseen sample, separate to the validation sample or by cross-validation. There are standard metrics to evaluate a classifier against. There are a few things to consider while training a classifier that we should keep in mind. Bias-variance trade-off The first aspect to keep in mind is the trade-off between bias and variance. To understand the meaning of bias and variance, let's assume that we have several different, but equally good, training datasets for a specific supervised learning problem. We train different models using the same technique; for example, build different decision trees using the different training datasets available. Bias measures how far off in general a model's predictions are from the correct value. Bias can be measured as the average difference between a predicted output and its actual value. A learning algorithm is biased for a particular input X if, when trained on different training sets, it is incorrect when predicting the correct output for X. Variance is how greatly the predictions for a given point vary between different realizations of the model. A learning algorithm has high variance for a particular input X if it predicts different output values for X when trained on different training sets. Generally, there will be a trade-off between bias and variance. A learning algorithm with low bias must be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training dataset differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this trade-off between bias and variance. The plot on the top left is the scatter plot of the original data. The plot on the top right is a fit with high bias; the error in prediction in this case will be high. The bottom left image is a fit with high variance; the model is very flexible, and error on the training set is low but the prediction on unseen data will have a much higher degree of error as compared to the training set. The bottom right plot is an optimum fit with a good trade-off of bias and variance. The model explains the data well and will perform in a similar way for unseen data too. If the bias-variance trade-off is not optimized, it leads to problems of under-fitting and over-fitting. The plot shows a visual representation of the bias-variance trade-off. Over-fitting occurs when an estimator is too flexible and tries to fit the data too closely. High variance and low bias leads to over-fitting of data. Under-fitting occurs when a model is not flexible enough to capture the underlying trends in the observed data. Low variance and high bias leads to under-fitting of data. Function complexity and amount of training data The second aspect to consider is the amount of training data needed to properly represent the learning task. The amount of data required is proportional to the complexity of the data and learning task at hand. For example, if the features in the data have low interaction and are smaller in number, we could train a model with a small amount of data. 
In this case, a learning algorithm with high bias and low variance is better suited. But if the learning task at hand is complex and has a large number of features with higher degree of interaction, then a large amount of training data is required. In this case, a learning algorithm with low bias and high variance is better suited. It is difficult to actually determine the amount of data needed, but the complexity of the task provides some indications. Dimensionality of the input space A third aspect to consider is the dimensionality of the input space. By dimensionality, we mean the number of features the training set has. If the input feature set has a very high number of features, any machine learning algorithm will require a huge amount of data to build a good model. In practice, it is advisable to remove any extra dimensionality before training the model; this is likely to improve the accuracy of the learned function. Techniques like feature selection and dimensionality reduction can be used for this. Noise in data The fourth issue is noise. Noise refers to inaccuracies in data due to various issues. Noise can be present either in the predictor variables, or in the target variable. Both lead to model inaccuracies and reduce the generalization of the model. In practice, there are several approaches to alleviate noise in the data; first would be to identify and then remove the noisy training examples prior to training the supervised learning algorithm, and second would be to have an early stopping criteria to prevent over-fitting. Unsupervised learning Unsupervised learning deals with unlabeled data. The objective is to observe structure in data and find patterns. Tasks like cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems. As the tasks involved in unsupervised learning vary vastly, there is no single process outline that we can follow. We will follow the process of some of the most common unsupervised learning problems. Cluster analysis Cluster analysis is a subset of unsupervised learning that aims to create groups of similar items from a set of items. Real life examples could be clustering movies according to various attributes like genre, length, ratings, and so on. Cluster analysis helps us identify interesting groups of objects that we are interested in. It could be items we encounter in day-to-day life such as movies, songs according to taste, or interests of users in terms of their demography or purchasing patterns. Let's consider a small example so you understand what we mean by interesting groups and understand the power of clustering. We will use the Iris dataset, which is a standard dataset used for academic research and it contains five variables: sepal length, sepal width, petal length, petal width, and species with 150 observations. The first plot we see shows petal length against petal width. Each color represents a different species. The second plot is the groups identified by clustering the data. Looking at the plot, we can see that the plot of petal length against petal width clearly separates the species of the Iris flower and in the process, it clusters the group's flowers of the same species together. Cluster analysis can be used to identify interesting patterns in data. The process of clustering involves these four steps. We will discuss each of them in the section ahead. 
Objective Feature representation Algorithm for clustering A stopping criteria Objective What do we want to cluster? This is an important question. Let's assume we have a large customer base for some kind of an e-commerce site and we want to group them together. How do we want to group them? Do we want to group our users according to their demography, such as age, location, income, and so on or are we interested in grouping them together? A clear objective is a good start, though it is not uncommon to start without an objective and see what can be done with the available data. Feature representation As with any machine learning task, feature representation is important for cluster analysis too. Creating derived features, summarizing data, and converting categorical variables to continuous variables are some of the common tasks. The feature representation needs to represent the objective of clustering. For example, if the objective is to cluster users based upon purchasing behavior, then features should be derived from purchase transaction and user demography information. If the objective is to cluster documents, then features should be extracted from the text of the document. Feature normalization To compare the feature vectors, we need to normalize them. Normalization could be across rows or across columns. In most cases, both are normalized. Row normalization The objective of normalizing rows is to make the objects to be clustered, comparable. Let's assume we are clustering organizations based upon their e-mailing behavior. Now organizations are very large and very small, but the objective is to capture the e-mailing behavior, irrespective of size of the organization. In this scenario, we need to figure out a way to normalize rows representing each organization, so that they can be compared. In this case, dividing by user count in each respective organization could give us a good feature representation. Row normalization is mostly driven by the business domain and requires domain expertise. Column normalization The range of data across columns varies across datasets. The unit could be different or the range of columns could be different, or both. There are many ways of normalizing data. Which technique to use varies from case to case and depends upon the objective. A few of them are discussed here. Rescaling The simplest method is to rescale the range of features to make the features independent of each other. The aim is scale the range in [0, 1] or [−1, 1]: Here x is the original value and x', the rescaled valued. Standardization Feature standardization allows for the values of each feature in the data to have zero-mean and unit-variance. In general, we first calculate the mean and standard deviation for each feature and then subtract the mean in each feature. Then, we divide the mean subtracted values of each feature by its standard deviation: Xs = (X – mean(X)) / standard deviation(X). A notion of similarity and dissimilarity Once we have the objective defined, it leads to the idea of similarity and dissimilarity of object or data points. Since we need to group things together based on similarity, we need a way to measure similarity. Likewise to keep dissimilar things apart, we need a notion of dissimilarity. This idea is represented in machine learning by the idea of a distance measure. Distance measure, as the name suggests, is used to measure the distance between two objects or data points. 
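Before looking at the individual distance measures, here is a minimal Python sketch of the two column normalization approaches described above. The min-max rescaling formula is the standard one and is written out here as an assumption, since the original equation is not reproduced in this text; the standardization step follows the Xs = (X – mean(X)) / standard deviation(X) formula given above.

# Minimal sketch of column normalization, assuming each feature is a list of numbers.
def rescale(values):
    # Min-max rescaling to the [0, 1] range: x' = (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(x - lo) / float(hi - lo) for x in values]

def standardize(values):
    # Zero mean, unit variance: Xs = (X - mean(X)) / standard deviation(X)
    n = len(values)
    mean = sum(values) / float(n)
    std = (sum((x - mean) ** 2 for x in values) / float(n)) ** 0.5
    return [(x - mean) / std for x in values]

ages = [23, 35, 46, 58, 64]
print(rescale(ages))       # values now lie between 0 and 1
print(standardize(ages))   # values now have zero mean and unit variance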
Euclidean distance measure Euclidean distance measure is the most commonly used and intuitive distance measure: Squared Euclidean distance measure The standard Euclidean distance, when squared, places progressively greater weight on objects that are farther apart as compared to the nearer objects. The equation to calculate squared Euclidean measure is shown here: Manhattan distance measure Manhattan distance measure is defined as the sum of the absolute difference of the coordinates of two points. The distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|: Cosine distance measure The cosine distance measure measures the angle between two points. When this angle is small, the vectors must be pointing in the same direction, and so in some sense the points are close. The cosine of this angle is near one when the angle is small, and decreases as it gets larger. The cosine distance equation subtracts the cosine value from one in order to give a proper distance, which is 0 when close and larger otherwise. The cosine distance measure doesn't account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. Also note that the cosine distance measure ranges from 0.0, if the two vectors are along the same direction, to 2.0, when the two vectors are in opposite directions: Tanimoto distance measure The Tanimoto distance measure, like the cosine distance measure, measures the angle between two points, as well as the relative distance between the points: Apart from the standard distance measure, we can also define our own distance measure. Custom distance measure can be explored when existing ones are not able to measure the similarity between items. Algorithm for clustering The type of clustering algorithm to be used is driven by the objective of the problem at hand. There are several options and the predominant ones are density-based clustering, distance-based clustering, distribution-based clustering, and hierarchical clustering. The choice of algorithm to be used depends upon the objective of the problem. A stopping criteria We need to know when to stop the clustering process. The stopping criteria could be decided in different ways: one way is when the cluster centroids don't move beyond a certain margin after multiple iterations, a second way is when the density of the clusters have stabilized, and third way could be based upon the number of iterations, for example stopping the algorithm after 100 iterations. The stopping criteria depends upon the algorithm used, the goal being to stop when we have good enough clusters. Logistic regression Logistic regression is a probabilistic classification model. It provides the probability of a particular instance belonging to a class. It is used to predict the probability of binary outcomes. Logistic regression is computationally inexpensive, is relatively easier to implement, and can be interpreted easily. Logistic regression belongs to the class of discriminative models. The other class of algorithms is generative models. Let's try to understand the differences between the two. Suppose we have some input data represented by X and a target variable Y, the learning task obviously is P(Y|X), finding the conditional probability of Y occurring given X. 
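Before continuing with the generative versus discriminative discussion, here is a small Python sketch of the distance measures described above. The formulas are the standard textbook definitions (the original equations are not reproduced in this text), so treat this as an illustration rather than Mahout's own implementations.

import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same as above without the square root; weights far-apart points more heavily
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences: |x1 - x2| + |y1 - y2| + ...
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors; 0.0 same direction, 2.0 opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # Accounts for both the angle and the relative lengths of the vectors
    dot = sum(x * y for x, y in zip(a, b))
    na2 = sum(x * x for x in a)
    nb2 = sum(y * y for y in b)
    return 1.0 - dot / (na2 + nb2 - dot)

p1, p2 = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p1, p2), squared_euclidean(p1, p2), manhattan(p1, p2))
print(cosine_distance(p1, p2), tanimoto_distance(p1, p2))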
A generative model concerns itself with learning the joint probability P(Y, X), whereas a discriminative model directly learns the conditional probability P(Y|X) from the training set, which is the actual objective of classification. A generative model first learns P(Y, X) and then gets to P(Y|X) by conditioning on X using Bayes' theorem. In more intuitive terms, generative models first learn the distribution of the data and, in doing so, model how the data is actually generated. Discriminative models don't try to learn the underlying data distribution; they are concerned with finding the decision boundaries for the classification. Since generative models learn the distribution, it is possible to generate synthetic samples of X, Y; this is not possible with discriminative models. Some common examples of generative and discriminative models are as follows:

Generative: naïve Bayes, Latent Dirichlet allocation
Discriminative: logistic regression, SVM, neural networks

Logistic regression belongs to the family of statistical techniques called regression. For regression problems and a few other optimization problems, we first define a hypothesis, then define a cost function, and optimize it using an optimization algorithm such as gradient descent. The optimization algorithm tries to find the regression coefficients that best fit the data.

Let's assume that the target variable is Y and the predictor variable or feature is X. Any regression problem starts with defining the hypothesis function, for example an equation of the predictor variable such as h(x) = w0 + w1*x, then defines a cost function, and then tweaks the weights, in this case w0 and w1, to minimize or maximize the cost function by using an optimization algorithm.

For logistic regression, the predicted target needs to fall between zero and one. We start by defining the hypothesis function for it as h(x) = f(z) = 1 / (1 + e^-z), where z = xw. Here, f(z) is the sigmoid or logistic function that has a range of zero to one, x is the matrix of features, and w is the vector of weights. The next step is to define the cost function, which measures the difference between predicted and actual values. The objective of the optimization algorithm here is to find the weight vector w that fits the regression coefficients so that the difference between predicted and actual target values is minimized. We will discuss gradient descent as the choice for the optimization algorithm shortly. To find the local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of that function at the current point. This will give us the optimum value of the vector w once we achieve the stopping criteria. The stopping criteria is when the change in the weight vector falls below a certain threshold, although sometimes it could be set to a predefined number of iterations.

Logistic regression falls into the category of white box techniques and can be interpreted. Features or variables are of two major types, categorical and continuous, defined as follows:

Categorical variable: This is a variable or feature that can take on a limited, and usually fixed, number of possible values. For example, variables such as industry, zip code, and country are categorical variables.
Continuous variable: This is a variable that can take on any value between its minimum and maximum value, or range. For example, variables such as age and price are continuous variables.

Mahout logistic regression command line

Mahout employs a modified version of gradient descent called stochastic gradient descent (SGD).
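To make the hypothesis function and the idea of a stochastic update concrete, here is a minimal Python sketch. It is not Mahout's OnlineLogisticRegression; it is just the textbook logistic hypothesis plus a single SGD weight update, with an assumed learning rate and made-up feature vectors.

import math

def sigmoid(z):
    # Logistic function: maps any real z into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    # Hypothesis: f(w . x), interpreted as the probability of the positive class
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def sgd_update(weights, features, label, learning_rate=0.1):
    # One stochastic step: adjust the weights using a single (features, label) instance
    error = label - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]

# Tiny example: one positive and one negative instance, with 1.0 as a bias feature
w = [0.0, 0.0, 0.0]
for features, label in [([1.0, 2.0, 0.5], 1), ([1.0, 0.2, 3.0], 0)]:
    w = sgd_update(w, features, label)
print(w)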
Gradient descent, as described previously, uses the whole dataset on each update. This is fine for a small dataset of a few hundred examples, but with billions of data points containing thousands of features, it's unnecessarily expensive in terms of computational resources. An alternative is to update the weights using only one instance at a time; this is known as stochastic gradient descent. Stochastic gradient descent is an example of an online learning algorithm, so called because we can incrementally update the classifier as new data comes in, rather than all at once. The all-at-once method is known as batch processing.

We will now train and test a logistic regression model using Mahout, and discuss both command line and code examples. The first step is to get the data and explore it.

Getting the data

The dataset required for this article is included in the code repository that comes with this book, in the learningApacheMahout/data/chapter4 directory. If you wish to download the data yourself, it is available from the UCI Machine Learning Repository, which hosts many datasets for machine learning. You can check out the other datasets available for further practice via this link: http://archive.ics.uci.edu/ml/datasets.html.

Create a folder in your home directory with the following commands:

cd $HOME
mkdir bank_data
cd bank_data

Download the data into the bank_data directory:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Unzip the file using whichever utility you like; we use unzip:

unzip bank-additional.zip
cd bank-additional

We are interested in the file bank-additional-full.csv. Copy the file to the learningApacheMahout/data/chapter4 directory. The file is semicolon delimited, the values are enclosed by ", and it has a header line with column names. We will use sed to preprocess the data. The sed editor is a very powerful editor in Linux and the command to use it is as follows:

sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName

For in-place editing, the command is as follows:

sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName

The commands to replace ; with , and to remove " are as follows:

sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv

The dataset contains demographic and previous campaign-related data about a client, and the outcome of whether or not the client subscribed to the term deposit. We are interested in training a model that can predict whether a client will subscribe to a term deposit, given the input data.
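Before looking at the variables in detail, a quick sanity check of the cleaned file can save trouble later. The following sketch assumes pandas is installed and that input_bank_data.csv (with its header still intact at this point) is in the current directory.

import pandas as pd

# Load the comma-delimited file produced by the sed commands above
df = pd.read_csv("input_bank_data.csv")

print(df.shape)                 # number of rows and columns
print(df["y"].value_counts())   # class balance of the target variable
print(df.dtypes.head(10))       # a peek at the inferred column types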
The following table shows the various input variables along with their types:

Column name | Description | Variable type
age | The age of the client | Numeric
job | The type of job, for example, entrepreneur, housemaid, management | Categorical
marital | The marital status of the client | Categorical
education | The education level of the client | Categorical
default | Whether the client has defaulted on credit | Categorical
housing | Whether the client has a housing loan | Categorical
loan | Whether the client has a personal loan | Categorical
contact | The contact communication type | Categorical
month | The last contact month of the year | Categorical
day_of_week | The last contact day of the week | Categorical
duration | The last contact duration, in seconds | Numeric
campaign | The number of contacts performed during this campaign | Numeric
pdays | The number of days that passed since the last contact | Numeric
previous | The number of contacts performed before this campaign | Numeric
poutcome | The outcome of the previous marketing campaign | Categorical
emp.var.rate | The employment variation rate - quarterly indicator | Numeric
cons.price.idx | The consumer price index - monthly indicator | Numeric
cons.conf.idx | The consumer confidence index - monthly indicator | Numeric
euribor3m | The euribor three month rate - daily indicator | Numeric
nr.employed | The number of employees - quarterly indicator | Numeric

Model building via command line

Mahout provides a command line implementation of logistic regression, and we will first build a model using it. Logistic regression in Mahout does not have a MapReduce implementation, but as it uses stochastic gradient descent, it is pretty fast, even for large datasets. The Mahout Java class is OnlineLogisticRegression in the org.apache.mahout.classifier.sgd package.

Splitting the dataset

To split a dataset, we can use the Mahout split command. Let's look at the split command arguments as follows:

mahout split --help

We need to remove the first line before running the split command, because the file contains a header line and the split command doesn't make any special allowance for header lines; it would land on an arbitrary line in one of the split files. We first remove the header line from the input_bank_data.csv file:

sed -i '1d' input_bank_data.csv
mkdir input_bank
cp input_bank_data.csv input_bank

Logistic regression in Mahout is implemented for single-machine execution. We set the variable MAHOUT_LOCAL to instruct Mahout to execute in the local mode:

export MAHOUT_LOCAL=TRUE

mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30

This creates two datasets, with the split based on the number passed to the --randomSelectionPct argument. The split command can run on both Hadoop and the local file system. For the current execution, it runs in the local mode on the local file system and splits the data into two sets: 70 percent as train in the train_data directory and 30 percent as test in the test_data directory.
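If you just want to see what a 70/30 split looks like without running Mahout, a rough Python equivalent of --randomSelectionPct 30 is sketched below; it is only an illustration and not a replacement for the mahout split command, and the output file names are made up.

import random

# Illustrative 70/30 split of the header-less CSV, mirroring --randomSelectionPct 30
random.seed(42)
with open("input_bank/input_bank_data.csv") as f:
    lines = f.readlines()

train, test = [], []
for line in lines:
    (test if random.random() < 0.30 else train).append(line)

open("train_data_py.csv", "w").writelines(train)
open("test_data_py.csv", "w").writelines(test)
print(len(train), len(test))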
Next, we restore the header line to the train and test files as follows: sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,yn/' train_data/input_bank_data.csv sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,yn/' test_data/input_bank_data.csv Train the model command line option Let's have a look at some important and commonly used parameters and their descriptions: mahout trainlogistic ––help   --help print this list --quiet be extra quiet --input "input directory from where to get the training data" --output "output directory to store the model" --target "the name of the target variable" --categories "the number of target categories to be considered" --predictors "a list of predictor variables" --types "a list of predictor variables types (numeric, word or text)" --passes "the number of times to pass over the input data" --lambda "the amount of coeffiecient decay to use" --rate     "learningRate the learning rate" --noBias "do not include a bias term" --features "the number of internal hashed features to use"   mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2 We pass the input filename and the output folder name, identify the target variable name using --target option, the predictors using the --predictors option, and the variable or predictor type using --types option. Numeric predictors are represented using 'n', and categorical variables are predicted using 'w'. Learning rate passed using --rate is used by gradient descent to determine the step size for each descent. We pass the maximum number of passes over data as 100 and categories as 2. The output is given below, which represents 'y', the target variable, as a sum of predictor variables multiplied by coefficient or weights. As we have not included the --noBias option, we see the intercept term in the equation: y ~ -990.322*Intercept Term + -131.624*age + -11.436*campaign + -990.322*cons.conf.idx + -14.006*cons.price.idx + -15.447*contact=cellular + -9.738*contact=telephone + 5.943*day_of_week=fri + -988.624*day_of_week=mon + 10.551*day_of_week=thu + 11.177*day_of_week=tue + -131.624*day_of_week=wed + -8.061*default=no + 12.301*default=unknown + -131.541*default=yes + 6210.316*duration + -17.755*education=basic.4y + 4.618*education=basic.6y + 8.780*education=basic.9y + -11.501*education=high.school + 0.492*education=illiterate + 17.412*education=professional.course + 6202.572*education=university.degree + -979.771*education=unknown + -189.978*emp.var.rate + -6.319*euribor3m + -21.495*housing=no + -14.435*housing=unknown + 6210.316*housing=yes + -190.295*job=admin. 
+ 23.169*job=blue-collar + 6202.200*job=entrepreneur + 6202.200*job=housemaid + -3.208*job=management + -15.447*job=retired + 1.781*job=self-employed + 11.396*job=services + -6.637*job=student + 6202.572*job=technician + -9.976*job=unemployed + -4.575*job=unknown + -12.143*loan=no + -0.386*loan=unknown + -197.722*loan=yes + -12.308*marital=divorced + -9.185*marital=married + -1004.328*marital=single + 8.559*marital=unknown + -11.501*month=apr + 9.110*month=aug + -1180.300*month=dec + -189.978*month=jul + 14.316*month=jun + -124.764*month=mar + 6203.997*month=may + -0.884*month=nov + -9.761*month=oct + 12.301*month=sep + -990.322*nr.employed + -189.978*pdays + -14.323*poutcome=failure + 4.874*poutcome=nonexistent + -7.191*poutcome=success + 1.698*previous Interpreting the output The output of the trainlogistic command is an equation representing the sum of all predictor variables multiplied by their respective coefficient. The coefficients give the change in the log-odds of the outcome for one unit increase in the corresponding feature or predictor variable. Odds are represented as the ratio of probabilities, and they express the relative probabilities of occurrence or nonoccurrence of an event. If we take the base 10 logarithm of odds and multiply the results by 10, it gives us the log-odds. Let's take an example to understand it better. Let's assume that the probability of some event E occurring is 75 percent: P(E)=75%=75/100=3/4 The probability of E not happening is as follows: 1-P(A)=25%=25/100=1/4 The odds in favor of E occurring are P(E)/(1-P(E))=3:1 and odds against it would be 1:3. This shows that the event is three times more likely to occur than to not occur. Log-odds would be 10*log(3). For example, a unit increase in the age will decrease the log-odds of the client subscribing to a term deposit by 97.148 times, whereas a unit increase in cons.conf.idx will increase the log-odds by 1051.996. Here, the change is measured by keeping other variables at the same value. Testing the model After the model is trained, it's time to test the model's performance by using a validation set. Mahout has the runlogistic command for the same, the options are as follows: mahout runlogistic ––help We run the following command on the command line: mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model   AUC = 0.59 confusion: [[25189.0, 2613.0], [424.0, 606.0]] entropy: [[NaN, NaN], [-45.3, -7.1]] To get the scores for each instance, we use the --scores option as follows: mahout runlogistic --scores --input train_data/input_bank_data.csv --model model To test the model on the test data, we will pass on the test file created during the split process as follows: mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model   AUC = 0.60 confusion: [[10743.0, 1118.0], [192.0, 303.0]] entropy: [[NaN, NaN], [-45.2, -7.5]] Prediction Mahout doesn't have an out of the box command line for implementation of logistic regression for prediction of new samples. Note that the new samples for the prediction won't have the target label y, we need to predict that value. There is a way to work around this, though; we can use mahout runlogistic for generating a prediction by adding a dummy column as the y target variable and adding some random values. The runlogistic command expects the target variable to be present, hence the dummy columns are added. We can then get the predicted score using the --scores option. 
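To make the test-set results above a little more concrete, the following Python sketch derives accuracy, precision, and recall from the confusion matrix reported by runlogistic on the test data. The orientation of the matrix (which cells are true negatives, false positives, and so on) is an assumption here, so verify it against the Mahout documentation before relying on the derived numbers.

# Confusion matrix reported by runlogistic on the test data:
# [[10743.0, 1118.0], [192.0, 303.0]]
# Assumed layout: rows = actual class (no, yes), columns = predicted class (no, yes)
tn, fp = 10743.0, 1118.0
fn, tp = 192.0, 303.0

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("accuracy  = %.3f" % accuracy)
print("precision = %.3f" % precision)
print("recall    = %.3f" % recall)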
Summary In this article, we covered the basic machine learning concepts. We also saw the logistic regression example in Mahout. Resources for Article:   Further resources on this subject: Implementing the Naïve Bayes classifier in Mahout [article] Learning Random Forest Using Mahout [article] Understanding the HBase Ecosystem [article]